A company is migrating a legacy application to an Amazon S3-based data lake. A data engineer reviewed data that is associated with the legacy application. The data engineer found that the legacy data contained some duplicate information. The data engineer must identify and remove duplicate information from the legacy application data. Which solution will meet these requirements with the LEAST operational overhead?

Question

Accepted Answer

B. Write an AWS Glue extract, transform, and load (ETL) job. Use the FindMatches machine…

AWS Glue's FindMatches machine learning transform is specifically designed to identify duplicate records, even when they are not exact matches (fuzzy matching). It is a managed, serverless solution, which minimizes the operational overhead of deduplication tasks.

Answer

A. Write a custom extract, transform, and load (ETL) job in Python. Use the…

Writing a custom Python ETL job involves significant operational overhead for development, deployment, and maintenance of the custom code and infrastructure. In addition, `hashlib` provides cryptographic hashes, which can flag only byte-identical records and not records that are semantically equivalent but formatted differently.
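To make the distinction between exact and fuzzy matching concrete, here is a minimal sketch using only the Python standard library. It is not how FindMatches works internally (FindMatches uses a trained ML model); `difflib.SequenceMatcher` is swapped in purely as a stand-in for similarity-based matching. The sample records, the function names `exact_dedupe` and `fuzzy_dedupe`, and the 0.85 threshold are all hypothetical and chosen for illustration.

```python
import hashlib
from difflib import SequenceMatcher

# Hypothetical legacy records. The second row is a near-duplicate of the
# first (abbreviated street name); the third is an exact duplicate.
records = [
    "Jane Doe,123 Main Street,Springfield",
    "Jane Doe,123 Main St.,Springfield",
    "Jane Doe,123 Main Street,Springfield",
    "John Smith,9 Elm Road,Shelbyville",
]


def exact_dedupe(rows):
    """Keep the first row seen for each distinct SHA-256 digest."""
    seen = set()
    unique = []
    for row in rows:
        digest = hashlib.sha256(row.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(row)
    return unique


def fuzzy_dedupe(rows, threshold=0.85):
    """Keep a row only if it is not too similar to any row already kept.

    The threshold is an assumed value for this example, not a recommendation.
    """
    unique = []
    for row in rows:
        if all(SequenceMatcher(None, row, kept).ratio() < threshold
               for kept in unique):
            unique.append(row)
    return unique


if __name__ == "__main__":
    print(len(exact_dedupe(records)))  # hash-based: near-duplicate survives
    print(len(fuzzy_dedupe(records)))  # similarity-based: near-duplicate dropped
```

Hash-based deduplication removes only the byte-identical row, while the similarity-based pass also drops the abbreviated variant. The pairwise comparison here is quadratic in the number of rows, which hints at why a hand-rolled approach carries more operational burden at data-lake scale than a managed transform.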

Answer

C. Write a custom extract, transform, and load (ETL) job in Python. Import the Python dedupe…

Writing a custom Python ETL job and importing the `dedupe` library requires managing custom infrastructure and library dependencies, leading to higher operational overhead compared to a fully managed service feature.

Answer

D. Write an AWS Glue extract, transform, and load (ETL) job. Import the Python dedupe library. Use…

Running the ETL job on AWS Glue removes the infrastructure burden, but importing and managing an external Python library such as `dedupe` within Glue is generally more complex, and carries higher operational overhead, than using a native Glue ML transform such as FindMatches for deduplication.
