Databricks

CERTIFIED-DATA-ENGINEER-PROFESSIONAL · Question #15

The correct answer is E: Replace the current overwrite logic with a merge statement to modify only those records that have changed, and write logic to make predictions on the changed records identified by the change data feed.

Building and Managing Production Data Pipelines

Question

A table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources. The churn prediction model used by the ML team is fairly stable in production. The team is only interested in making predictions on records that have changed in the past 24 hours. Which approach would simplify the identification of these changed records?

Options

  • A. Apply the churn model to all rows in the customer_churn_params table, but implement logic to
  • B. Convert the batch job to a Structured Streaming job using the complete output mode; configure a
  • C. Calculate the difference between the previous model predictions and the current
  • D. Modify the overwrite logic to include a field populated by calling
  • E. Replace the current overwrite logic with a merge statement to modify only those records that have

Explanation

Option E simplifies the identification of changed records: replace the current overwrite logic with a merge statement that modifies only the records that have changed, and write logic to make predictions on the changed records identified by the change data feed. This approach relies on two Delta Lake features, merge and change data feed, which handle upserts and track row-level changes in a Delta table. A nightly overwrite rewrites every row, so the table history cannot distinguish rows whose values actually changed from rows that were simply rewritten. With merge, the data engineering team updates or inserts only the records that changed in the source data. With change data feed enabled on customer_churn_params, the ML team can read the resulting change events and filter them by change type (insert or update) and commit timestamp. The model then runs only on the records that changed in the past 24 hours, and unchanged records are not reprocessed.
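
The pattern described above can be sketched in PySpark roughly as follows. This is a minimal illustration, not the exam's reference solution: the staging view churn_params_updates, the key customer_id, and the feature columns tenure_months and monthly_spend are all hypothetical names, and the functions assume a Spark session with Delta Lake is passed in. The change data feed must be enabled on the table before the first merge, otherwise no change events are recorded.

```python
from datetime import datetime, timedelta

TARGET = "customer_churn_params"               # table name from the question
SOURCE = "churn_params_updates"                # hypothetical staging view of tonight's values
KEY = "customer_id"                            # hypothetical primary key
FEATURES = ["tenure_months", "monthly_spend"]  # hypothetical feature columns


def build_enable_cdf_sql(target=TARGET):
    """Statement that turns on the change data feed for the target table."""
    return f"ALTER TABLE {target} SET TBLPROPERTIES (delta.enableChangeDataFeed = true)"


def build_merge_sql(target=TARGET, source=SOURCE, key=KEY, features=FEATURES):
    """MERGE that rewrites a matched row only when a feature value differs."""
    # <=> is Spark SQL's null-safe equality, so NULL vs NULL is not a change.
    changed = " OR ".join(f"NOT (t.{c} <=> s.{c})" for c in features)
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON t.{key} = s.{key}\n"
        f"WHEN MATCHED AND ({changed}) THEN UPDATE SET *\n"
        f"WHEN NOT MATCHED THEN INSERT *"
    )


def read_recent_changes(spark, target=TARGET, hours=24, now=None):
    """Return rows inserted or updated in `target` during the last `hours` hours."""
    now = now or datetime.now()
    start = (now - timedelta(hours=hours)).strftime("%Y-%m-%d %H:%M:%S")
    changes = (
        spark.read.option("readChangeFeed", "true")
        .option("startingTimestamp", start)
        .table(target)
    )
    # Keep new rows and the post-update image of modified rows.
    return changes.filter("_change_type IN ('insert', 'update_postimage')")
```

In a nightly job, the data engineering side would run spark.sql(build_enable_cdf_sql()) once and spark.sql(build_merge_sql()) each night, and the ML side would score the DataFrame returned by read_recent_changes(spark). The sketch assumes the source and target share the same columns (required by UPDATE SET * and INSERT *), and it does not delete rows that disappear upstream, which the original overwrite would have dropped.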

Topics

#Delta Lake MERGE  #Data Synchronization  #Change Data Capture  #Data Engineering Patterns
