DEA-C01 Question #64: Real Exam Question with Answer & Explanation

The correct answer is B: an AWS Glue extract, transform, and load (ETL) job that uses the FindMatches machine learning transform.

Data Ingestion and Transformation

Question

A company is migrating a legacy application to an Amazon S3 based data lake. A data engineer reviewed data that is associated with the legacy application. The data engineer found that the legacy data contained some duplicate information. The data engineer must identify and remove duplicate information from the legacy application data. Which solution will meet these requirements with the LEAST operational overhead?

Options

  • A. Write a custom extract, transform, and load (ETL) job in Python. Use the
  • B. Write an AWS Glue extract, transform, and load (ETL) job. Use the FindMatches machine
  • C. Write a custom extract, transform, and load (ETL) job in Python. Import the Python dedupe
  • D. Write an AWS Glue extract, transform, and load (ETL) job. Import the Python dedupe library. Use

Explanation

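An AWS Glue ETL job that uses the FindMatches machine learning transform identifies and removes duplicate information from the legacy application data with the least operational overhead during the migration to the Amazon S3 data lake. FindMatches is a built-in Glue ML transform that finds matching records even when they share no unique identifier and no fields match exactly, so there is no custom deduplication code or third-party library to build and maintain.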
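As a rough illustration of what such a job body might look like, the sketch below applies an existing FindMatches transform and keeps one row per match group. It assumes the transform has already been created and trained; the function name and all of its arguments (database, table, transform ID, output path) are hypothetical placeholders, and keeping an arbitrary row per `match_id` is one possible policy, not the only one.

```python
def dedupe_with_findmatches(glue_context, database, table, transform_id, output_path):
    """Sketch of a Glue job body; every argument value is a hypothetical placeholder."""
    # Imported lazily because these modules exist only in the AWS Glue runtime.
    from awsglue.dynamicframe import DynamicFrame
    from awsglueml.transforms import FindMatches

    # Read the legacy data from the Glue Data Catalog.
    source = glue_context.create_dynamic_frame.from_catalog(
        database=database, table_name=table
    )

    # Apply a FindMatches transform that was already created and trained.
    # It adds a match_id column; rows sharing a match_id are considered matches.
    matched = FindMatches.apply(frame=source, transformId=transform_id)

    # Keep one row per match_id (which row survives is arbitrary in this sketch).
    deduped_df = matched.toDF().dropDuplicates(["match_id"])
    deduped = DynamicFrame.fromDF(deduped_df, glue_context, "deduped")

    # Write the deduplicated data to the S3 data lake as Parquet.
    glue_context.write_dynamic_frame.from_options(
        frame=deduped,
        connection_type="s3",
        connection_options={"path": output_path},
        format="parquet",
    )
    return deduped
```

In a real job, `glue_context` would be the `GlueContext` created at the top of the script, and a survivorship rule (for example, keeping the most recently updated row) could replace the arbitrary `dropDuplicates` step.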

Common mistakes.

  • A. Writing a custom Python ETL job involves significant operational overhead for development, deployment, and maintenance of the custom code and infrastructure. In addition, hashlib provides cryptographic hashing, which can flag only byte-for-byte identical records, not semantically duplicate records that differ in spelling or formatting.
  • C. Writing a custom Python ETL job and importing the dedupe library requires managing custom infrastructure and library dependencies, leading to higher operational overhead compared to a fully managed service feature.
  • D. While using AWS Glue for the ETL job is good, importing and managing an external Python library like dedupe within Glue is generally more complex and has higher operational overhead than using native Glue ML transforms like FindMatches for deduplication.
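The limitation of a hash-based approach noted for option A can be seen in a small sketch. The records and helper names below are hypothetical and exist only for illustration; the point is that hashing removes exact copies but treats a near-duplicate as a distinct record.

```python
import hashlib


def record_fingerprint(record):
    """Hash the record's fields exactly as given, with no normalization."""
    joined = "|".join(str(record[key]) for key in sorted(record))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def drop_exact_duplicates(records):
    """Keep the first record seen for each distinct fingerprint."""
    seen = set()
    unique = []
    for record in records:
        fingerprint = record_fingerprint(record)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(record)
    return unique


# Hypothetical legacy rows: the first two are identical,
# the third is a near-duplicate (extra space, different letter case).
legacy_rows = [
    {"name": "Jane Doe", "city": "Seattle"},
    {"name": "Jane Doe", "city": "Seattle"},
    {"name": "Jane  Doe", "city": "seattle"},
]

remaining = drop_exact_duplicates(legacy_rows)
print(len(remaining))
```

Two rows remain: the exact copy is dropped, but the near-duplicate survives. Closing that gap requires fuzzy matching, which is what FindMatches provides as a managed feature.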

Concept tested. AWS Glue FindMatches for deduplication

Reference. https://docs.aws.amazon.com/glue/latest/dg/machine-learning-transforms.html

Topics

#AWS Glue, #ETL, #Data Deduplication, #Machine Learning Transforms
