AIP-C01 Question #35: Real Exam Question with Answer & Explanation
Question
A publishing company is developing a chat assistant that uses a containerized large language model (LLM) that runs on Amazon SageMaker AI. The architecture consists of an Amazon API Gateway REST API that routes user requests to an AWS Lambda function. The Lambda function invokes a SageMaker AI real-time endpoint that hosts the LLM. Users report uneven response times. Analytics show that a high number of chats are abandoned after 2 seconds of waiting for the first token. The company wants a solution to ensure that p95 latency is under 800 ms for interactive requests to the chat assistant. Which combination of solutions will meet this requirement? (Select TWO.)
Options
- A. Enable model preload upon container startup. Implement dynamic batching to process multiple
- B. Select a larger GPU instance type for the SageMaker AI endpoint. Set the minimum number of
- C. Switch to a multi-model endpoint. Use lazy loading without request batching.
- D. Set the minimum number of instances to greater than 0. Enable response streaming.
- E. Switch to Amazon SageMaker Asynchronous Inference for all requests. Store requests in an
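The requirement in the question is stated as a p95 latency under 800 ms. As a minimal sketch of what that target means, the snippet below computes a nearest-rank p95 over a list of time-to-first-token measurements and checks it against the threshold. The function names `percentile` and `meets_target` and the sample values are hypothetical, chosen only for illustration; they do not come from the question or from any AWS API.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    # Rank is 1-based; clamp to 1 so pct=0 still returns the minimum.
    rank = max(math.ceil(pct * len(ordered) / 100), 1)
    return ordered[rank - 1]


def meets_target(samples_ms, target_ms=800, pct=95):
    """True when the chosen percentile is strictly under the target."""
    return percentile(samples_ms, pct) < target_ms


# Hypothetical time-to-first-token samples in milliseconds.
# Most requests are fast, but two slow outliers dominate the tail.
ttft_ms = [
    220, 310, 280, 450, 390, 2100, 260, 300, 340, 2400,
    270, 330, 410, 290, 360, 250, 380, 320, 305, 295,
]

if __name__ == "__main__":
    print("p95 (ms):", percentile(ttft_ms, 95))
    print("meets 800 ms target:", meets_target(ttft_ms))
```

In this made-up sample the median is low, yet the tail requests alone push p95 past the target, which matches the "uneven response times" described in the question.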