A banking company uses an application to collect large volumes of transactional data. The company uses Amazon Kinesis Data Streams for real-time analytics. The company's application uses the PutRecord action to send data to Kinesis Data Streams. A data engineer has observed network outages during certain times of day. The data engineer wants to configure exactly-once delivery for the entire processing pipeline. Which solution will meet this requirement?

Question

Accepted Answer

A. Design the application so it can remove duplicates during processing by embedding a unique ID

Exactly-once processing in a streaming pipeline that suffers network outages depends on the consumer being idempotent. When the producer retries a PutRecord call after an outage, the same record can be written to the stream more than once. If each record carries a unique ID assigned at the source (a UUID or a combination of event details), the downstream processing application can detect and discard those duplicates, so each record is processed only once.
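The deduplication idea in option A can be sketched as follows. This is a minimal in-memory illustration, not an AWS API example: the names `make_record`, `DedupingConsumer`, and `event_id` are hypothetical, and a real pipeline would likely keep the seen IDs in a durable store rather than a Python set.

```python
import json
import uuid


def make_record(payload: dict) -> bytes:
    """Producer side: attach a unique ID once, before the first send attempt.

    Retries must resend these exact bytes so the ID stays the same.
    """
    record = {"event_id": str(uuid.uuid4()), "payload": payload}
    return json.dumps(record).encode("utf-8")


class DedupingConsumer:
    """Consumer side: process each event_id at most once (hypothetical helper)."""

    def __init__(self) -> None:
        self.seen_ids = set()   # assumption: in production this would be durable
        self.processed = []

    def handle(self, raw: bytes) -> bool:
        record = json.loads(raw)
        event_id = record["event_id"]
        if event_id in self.seen_ids:
            return False        # duplicate caused by a retry: discard
        self.seen_ids.add(event_id)
        self.processed.append(record["payload"])
        return True


# Simulate a retry: the same record bytes arrive twice, then a new record.
first = make_record({"account": "acct-1", "amount": 25})
second = make_record({"account": "acct-2", "amount": 40})
consumer = DedupingConsumer()
results = [consumer.handle(r) for r in (first, first, second)]
```

Here `results` records which deliveries were processed and which were discarded; the retried copy of `first` is dropped while both distinct events are kept.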

Answer

B. Update the checkpoint configuration of the Amazon Managed Service for Apache Flink

Checkpointing in Amazon Managed Service for Apache Flink helps a Flink application recover from failures and keep its state consistent, which supports at-least-once or exactly-once processing within Flink itself. It does not prevent the producer from writing the same record to Kinesis Data Streams more than once.

Answer

C. Design the data source so events are not ingested into Kinesis Data Streams multiple times.

This would be ideal, but it is difficult or impossible to guarantee when the network is unreliable or the application can fail. If an acknowledgement is lost, the producer cannot tell whether the write succeeded and must retry, so the solution still has to handle the duplicates that do occur.
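The reason a producer cannot fully avoid duplicates can be shown with a small simulation. `FlakyStream` and `send_with_retry` are hypothetical stand-ins written for this illustration; they do not model the actual Kinesis API, only the case where a write succeeds but its acknowledgement is lost.

```python
class FlakyStream:
    """Hypothetical stream endpoint that stores the record but loses the first ack."""

    def __init__(self) -> None:
        self.records = []
        self._lose_next_ack = True

    def put_record(self, data: str) -> None:
        self.records.append(data)          # the write itself succeeds
        if self._lose_next_ack:
            self._lose_next_ack = False
            raise TimeoutError("acknowledgement lost in a network outage")


def send_with_retry(stream: FlakyStream, data: str, attempts: int = 3) -> None:
    """Retry on timeout; the producer cannot know the first write landed."""
    for _ in range(attempts):
        try:
            stream.put_record(data)
            return
        except TimeoutError:
            continue
    raise RuntimeError("gave up after retries")


stream = FlakyStream()
send_with_retry(stream, "txn-42")
# stream.records now holds the same event twice
```

From the producer's point of view a single event was sent successfully, yet the stream contains two copies, which is why duplicates have to be handled downstream.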

Answer

D. Stop using Kinesis Data Streams. Use Amazon EMR instead. Use Apache Flink and Apache

Amazon EMR is a managed big data processing platform, not a streaming ingestion service. Switching to it and running Apache Flink and Apache Spark does not by itself solve the duplicate-delivery problem caused by producer retries during network outages, and it is an unnecessary platform change.

