Databricks

CERTIFIED-MACHINE-LEARNING-PROFESSIONAL Question #20: Real Exam Question with Answer & Explanation

The marked answer is E (Tuning the file size), but this appears to be an error. Based on the question's own wording, A (Z-Ordering) is the correct answer, and the question explicitly rules out E because the team has already tuned the file size.

Question

A machine learning engineering team has written predictions computed in a batch job to a Delta table for querying. However, the team has noticed that the querying is running slowly. The team has already tuned the size of the data files. Upon investigating, the team has concluded that the rows meeting the query condition are sparsely located throughout each of the data files. Based on the scenario, which of the following optimization techniques could speed up the query by colocating similar records while considering values in multiple columns?

Options

  • A. Z-Ordering
  • B. Bin-packing
  • C. Write as a Parquet file
  • D. Data skipping
  • E. Tuning the file size

Explanation

There appears to be an error in the marked correct answer. Based on the question's own wording, A (Z-Ordering) is the correct answer, and the question explicitly rules out E by stating that the file size has already been tuned.

Why A (Z-Ordering) is correct: Z-Ordering is a Delta Lake optimization that colocates similar records within the same data files by mapping multi-dimensional column values onto a single space-filling curve. It directly solves the stated problem - rows matching query conditions are spread sparsely across files - by physically reorganizing data so that rows with similar values in multiple specified columns land together.
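The idea of mapping several column values onto one space-filling curve can be sketched with plain bit interleaving. This is only an illustration of a Z-order (Morton) key for two small integer columns; the function name `interleave_bits` and the 4x4 grid are assumptions made for the example, and Delta Lake's internal implementation may differ.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into a single Z-order key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # bits of x go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # bits of y go to odd positions
    return z


# Sort the points of a 4x4 grid by their Z-order key.
points = [(x, y) for y in range(4) for x in range(4)]
z_sorted = sorted(points, key=lambda p: interleave_bits(*p))

# Neighbouring keys belong to points that are close in BOTH columns,
# so consecutive rows (and therefore the same file) hold similar values.
print(z_sorted[:4])  # the 2x2 block in the corner of the grid
```

Sorting rows by such a key before writing files is what keeps rows with similar values in both columns together. In Delta Lake itself the operation is requested with `OPTIMIZE <table> ZORDER BY (<col1>, <col2>)`, where the table and column names are placeholders.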

Why the distractors are wrong:

  • B. Bin-packing compacts small files into larger ones to reduce file count overhead - it addresses file quantity, not row placement within files.
  • C. Write as a Parquet file changes nothing useful: Delta tables already store their data as Parquet files under the hood, and switching format doesn't reorganize row placement.
  • D. Data skipping is a read-time optimization (using min/max column statistics to skip files) - it doesn't colocate records. When matching rows are scattered across every file, each file's min/max range still covers the queried values, so few files can be skipped, which is exactly the current problem.
  • E. Tuning file size is explicitly stated as already done and does not address intra-file row distribution.
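The interaction between row placement and data skipping can be illustrated with a toy simulation. The sketch below is not Delta Lake code: the "files" are plain Python dicts holding per-column min/max values, and the helper names (`z_key`, `write_files`, `files_to_scan`), the 8x8 value grid, and the four-rows-per-file size are all assumptions chosen for the example.

```python
import random


def z_key(a, b, bits=8):
    """Interleave the bits of two integer column values into one sort key."""
    key = 0
    for i in range(bits):
        key |= ((a >> i) & 1) << (2 * i)
        key |= ((b >> i) & 1) << (2 * i + 1)
    return key


def write_files(rows, rows_per_file):
    """Split rows into fixed-size 'files' and record min/max stats per column."""
    files = []
    for start in range(0, len(rows), rows_per_file):
        chunk = rows[start:start + rows_per_file]
        files.append({
            "a_min": min(a for a, _ in chunk), "a_max": max(a for a, _ in chunk),
            "b_min": min(b for _, b in chunk), "b_max": max(b for _, b in chunk),
        })
    return files


def files_to_scan(files, a, b):
    """Count files whose min/max ranges cannot rule out `a = .. AND b = ..`."""
    return sum(
        f["a_min"] <= a <= f["a_max"] and f["b_min"] <= b <= f["b_max"]
        for f in files
    )


rows = [(a, b) for a in range(8) for b in range(8)]

# Layout 1: rows scattered in arbitrary order (fixed seed for repeatability).
scattered = rows[:]
random.Random(0).shuffle(scattered)

# Layout 2: rows sorted by the Z-order key before being split into files.
clustered = sorted(rows, key=lambda r: z_key(*r))

scattered_scan = files_to_scan(write_files(scattered, 4), a=5, b=5)
clustered_scan = files_to_scan(write_files(clustered, 4), a=5, b=5)
print("files scanned (scattered):", scattered_scan)
print("files scanned (Z-ordered):", clustered_scan)
```

Both layouts use the same number of equally sized files, so file size and file count are identical; only the placement of rows differs. In the Z-ordered layout each file covers a small 2x2 block of (a, b) values, so the min/max statistics can exclude every file except the one holding the matching row.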

Memory tip: Think of Z-Ordering as "sorting in multiple dimensions at once" - the Z hints at the zigzag pattern of a space-filling curve traversing multiple column axes simultaneously.
