AIP-C01 Question #35: Real Exam Question with Answer & Explanation

Deployment, Operations, and Optimization

Question

A publishing company is developing a chat assistant that uses a containerized large language model (LLM) that runs on Amazon SageMaker AI. The architecture consists of an Amazon API Gateway REST API that routes user requests to an AWS Lambda function. The Lambda function invokes a SageMaker AI real-time endpoint that hosts the LLM. Users report uneven response times. Analytics show that a high number of chats are abandoned after 2 seconds of waiting for the first token. The company wants a solution to ensure that p95 latency is under 800 ms for interactive requests to the chat assistant. Which combination of solutions will meet this requirement? (Select TWO.)
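The requirement is stated as a p95 target, so it helps to be precise about what that number means. The sketch below is a minimal illustration, assuming latency samples are available in milliseconds and using the nearest-rank definition of a percentile. The function names and the 800 ms default are illustrative, not part of any AWS API.

```python
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], pct: int) -> float:
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range 1..100")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding issues.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_p95_target(latencies_ms: Sequence[float], target_ms: float = 800.0) -> bool:
    """True when at least 95% of requests finished at or under target_ms."""
    return percentile_nearest_rank(latencies_ms, 95) <= target_ms
```

Under this definition a few very slow requests (for example, cold starts) can breach the target even when the median looks healthy, because only 5% of requests are allowed to exceed 800 ms.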

Options

  • A. Enable model preload upon container startup. Implement dynamic batching to process multiple
  • B. Select a larger GPU instance type for the SageMaker AI endpoint. Set the minimum number of
  • C. Switch to a multi-model endpoint. Use lazy loading without request batching.
  • D. Set the minimum number of instances to greater than 0. Enable response streaming.
  • E. Switch to Amazon SageMaker Asynchronous Inference for all requests. Store requests in an
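Several options refer to response streaming and time to first token. As a rough illustration of what streaming looks like on the Lambda-to-endpoint hop, the sketch below reads a SageMaker response stream with boto3's `invoke_endpoint_with_response_stream` and records how long the first chunk took to arrive. The helper names, the JSON content type, and the payload format are assumptions for illustration. The sketch does not cover how the stream would be relayed through API Gateway to the user, and it is not presented as the answer to the question.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple


def iter_stream_chunks(events: Iterable[dict]) -> Iterator[bytes]:
    """Yield raw byte chunks from a SageMaker response event stream."""
    for event in events:
        part = event.get("PayloadPart")
        if part and part.get("Bytes"):
            yield part["Bytes"]


def time_to_first_chunk(
    events: Iterable[dict],
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[Optional[float], bytes]:
    """Consume the stream; return (seconds until first chunk, full body)."""
    start = clock()
    first_chunk_delay = None
    chunks = []
    for chunk in iter_stream_chunks(events):
        if first_chunk_delay is None:
            first_chunk_delay = clock() - start
        chunks.append(chunk)
    return first_chunk_delay, b"".join(chunks)


def invoke_streaming(endpoint_name: str, payload: bytes) -> Tuple[Optional[float], bytes]:
    """Call a streaming-capable endpoint (hypothetical name) and time the first chunk."""
    import boto3  # imported lazily so the helpers above run without AWS access

    client = boto3.client("sagemaker-runtime")
    response = client.invoke_endpoint_with_response_stream(
        EndpointName=endpoint_name,
        ContentType="application/json",  # assumed payload format
        Body=payload,
    )
    return time_to_first_chunk(response["Body"])
```

With streaming, the user-visible wait is the delay before the first chunk rather than the time to generate the whole reply, which is why the two numbers are measured separately here. The model container must itself support streamed output for this call to return partial results.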

Topics

#SageMaker Inference #LLM Deployment #Low Latency #Real-time AI