DP-300 Question #133: Real Exam Question with Answer & Explanation
This question tests knowledge of Delta Lake partitioning strategies in Azure Databricks to optimize storage and query performance for incremental load pipelines.
Question
Hotspot Question
You plan to develop a dataset named Purchases by using Azure Databricks. Purchases will contain the following columns:
- ProductID
- ItemPrice
- LineTotal
- Quantity
- StoreID
- Minute
- Month
- Hour
- Year
- Day
You need to store the data to support hourly incremental load pipelines that will vary for each StoreID. The solution must minimize storage costs.
How should you complete the code? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Explanation
Approach. Write the DataFrame in Delta format, partitioned by StoreID first and then by the time columns in hierarchical order: partitionBy('StoreID', 'Year', 'Month', 'Day', 'Hour'). The pipelines vary per StoreID and run hourly, so putting StoreID first lets each pipeline prune to a single store, and the Year > Month > Day > Hour hierarchy beneath it lets each hourly load touch only the partition directory for that store and hour. This addresses the storage-cost requirement because an incremental load writes or replaces only the relevant store/hour partition instead of scanning and rewriting the full table. Delta format also provides ACID transactions and partition overwrite, both of which make incremental pipelines more reliable.
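The partitioning described above can be sketched as follows. This is an illustrative sketch, not the exam's answer code: `PARTITION_COLS`, `partition_path`, and `write_purchases` are hypothetical names, `df` is assumed to be an existing Spark DataFrame with the Purchases columns, and the sample row values are invented. The helper only mimics the Hive-style `column=value` directory nesting that Spark produces for `partitionBy`, to show why one store-hour maps to one directory.

```python
# Partition columns in the order given in the explanation:
# store first, then the time hierarchy from coarse to fine.
PARTITION_COLS = ["StoreID", "Year", "Month", "Day", "Hour"]


def partition_path(row):
    """Return the relative Hive-style directory a row would land in."""
    return "/".join(f"{col}={row[col]}" for col in PARTITION_COLS)


def write_purchases(df, path):
    """Append a Spark DataFrame as a partitioned Delta table.

    Assumes `df` is a pyspark.sql.DataFrame and `path` is a storage
    location chosen by the caller (both are assumptions of this sketch).
    """
    (df.write
       .format("delta")
       .mode("append")
       .partitionBy(*PARTITION_COLS)
       .save(path))


# Hypothetical sample row, used only to illustrate the layout.
row = {"ProductID": 101, "StoreID": 7, "Year": 2024,
       "Month": 3, "Day": 15, "Hour": 9, "Minute": 30}
print(partition_path(row))  # StoreID=7/Year=2024/Month=3/Day=15/Hour=9
```

Because Minute is not a partition column, all rows for the same store and hour share one directory, which is the unit an hourly load for that store works on.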
Concept tested. Delta Lake partitioning strategy in Azure Databricks - specifically choosing the correct partition columns and their order to optimize incremental load pipelines that are scoped by StoreID and time (hourly), while minimizing storage overhead and unnecessary data reads/writes.
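To make the "scoped by StoreID and time" idea concrete, here is a minimal sketch of replacing a single store-hour slice using Delta Lake's `replaceWhere` write option. The function names and the example values are hypothetical, the partition columns are assumed to be stored as integers, and `df` is assumed to be a Spark DataFrame that contains only rows matching the predicate.

```python
def hourly_slice_predicate(store_id, year, month, day, hour):
    """Build a SQL predicate that selects one store's data for one hour.

    Assumes all five partition columns hold integer values.
    """
    parts = {"StoreID": store_id, "Year": year, "Month": month,
             "Day": day, "Hour": hour}
    return " AND ".join(f"{col} = {int(val)}" for col, val in parts.items())


def overwrite_hour(df, path, predicate):
    """Replace only the rows matching `predicate` in a Delta table.

    Assumes `df` is a pyspark.sql.DataFrame whose rows all satisfy
    the predicate, and `path` points at an existing Delta table.
    """
    (df.write
       .format("delta")
       .mode("overwrite")
       .option("replaceWhere", predicate)
       .save(path))


# Hypothetical values for one pipeline run.
print(hourly_slice_predicate(7, 2024, 3, 15, 9))
# StoreID = 7 AND Year = 2024 AND Month = 3 AND Day = 15 AND Hour = 9
```

Re-running a failed hourly load with the same predicate replaces that slice rather than appending duplicates, so the rest of the table is left untouched.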
Reference. Microsoft Learn: Optimize Delta Lake for Azure Databricks - https://learn.microsoft.com/en-us/azure/databricks/delta/best-practices#choose-the-right-partition-column