AIP-C01 Question #78: Real Exam Question with Answer & Explanation
The correct answer is B.
Question
A company is developing a generative AI (GenAI) application that analyzes customer service calls in real time and generates suggested responses for human customer service agents. The application must process 500,000 concurrent calls during peak hours with less than 200 ms end-to-end latency for each suggestion. The company uses existing architecture to transcribe customer call audio streams. The application must not exceed a predefined monthly compute budget and must maintain auto scaling capabilities. Which solution will meet these requirements?
Options
- A. Deploy a large, complex reasoning model on Amazon Bedrock. Purchase provisioned throughput
- B. Deploy a low-latency, real-time optimized model on Amazon Bedrock. Purchase provisioned
- C. Deploy a large language model (LLM) on an Amazon SageMaker real-time endpoint that uses
- D. Deploy a mid-sized language model on an Amazon SageMaker serverless endpoint that is
Explanation
Option B is correct because it follows AWS guidance for high-throughput, low-latency GenAI applications while keeping costs predictable and preserving automatic scaling. Amazon Bedrock provides access to foundation models that are optimized for real-time inference, including conversational and recommendation-style workloads that need responses within milliseconds. These low-latency models are designed to handle very high request rates with minimal per-request overhead. By contrast, a large, complex reasoning model (option A) adds inference time that works against the 200 ms end-to-end target.

Purchasing provisioned throughput reserves enough model capacity for peak load, which eliminates cold starts and reduces request queuing during traffic surges. This is critical when supporting up to 500,000 concurrent calls under a strict latency requirement.

Automatic scaling policies let the application adjust capacity to demand, which keeps costs down during off-peak hours while maintaining performance at peak. This directly supports the requirement to stay within a predefined monthly compute budget.
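The capacity and budget reasoning above can be made concrete with a rough sizing calculation. The sketch below is illustrative only: the 500,000 concurrent calls come from the question, but the suggestion rate, tokens per suggestion, per-unit throughput, hourly price, and budget are hypothetical placeholders, not AWS figures. The class and function names are also invented for this example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PeakLoad:
    concurrent_calls: int              # from the question: 500,000 at peak
    suggestions_per_call_per_min: int  # hypothetical request rate per call
    tokens_per_suggestion: int         # hypothetical input + output tokens


@dataclass(frozen=True)
class ModelUnit:
    tokens_per_minute: int    # hypothetical throughput of one reserved unit
    hourly_price_usd: float   # hypothetical price of one reserved unit


def required_units(load: PeakLoad, unit: ModelUnit) -> int:
    """Reserved units needed to absorb peak token throughput (rounded up)."""
    peak_tokens_per_min = (
        load.concurrent_calls
        * load.suggestions_per_call_per_min
        * load.tokens_per_suggestion
    )
    return math.ceil(peak_tokens_per_min / unit.tokens_per_minute)


def monthly_cost_usd(units: int, unit: ModelUnit, hours_per_month: int = 730) -> float:
    """Cost of keeping `units` reserved for a whole month."""
    return units * unit.hourly_price_usd * hours_per_month


def within_budget(units: int, unit: ModelUnit, monthly_budget_usd: float) -> bool:
    """True if the reserved capacity fits the predefined monthly budget."""
    return monthly_cost_usd(units, unit) <= monthly_budget_usd


if __name__ == "__main__":
    # All values except concurrent_calls are made-up placeholders.
    load = PeakLoad(concurrent_calls=500_000,
                    suggestions_per_call_per_min=2,
                    tokens_per_suggestion=300)
    unit = ModelUnit(tokens_per_minute=1_000_000, hourly_price_usd=20.0)

    units = required_units(load, unit)
    print(units)                                    # 300
    print(monthly_cost_usd(units, unit))            # 4380000.0
    print(within_budget(units, unit, 5_000_000.0))  # True
```

Sizing for the peak and pricing that capacity over a full month gives an upper bound on spend. Scaling capacity down during off-peak hours, as the explanation describes, would reduce the actual figure.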