CERTIFIED-DATA-ANALYST-ASSOCIATE · Question #29
CERTIFIED-DATA-ANALYST-ASSOCIATE Question #29: Real Exam Question with Answer & Explanation
The correct answer is B: Partitioning the data into smaller chunks.
Question
A data analyst has been tasked with optimizing a Databricks SQL query for a large dataset. What should the analyst consider when trying to improve query performance?
Options
- A. Increasing the size of the cluster to handle the data
- B. Partitioning the data into smaller chunks
- C. Using a higher level of parallelism for the query
- D. Increasing the timeout for the query
Explanation
Partitioning the data (B) is the correct optimization strategy because it reduces the amount of data scanned per query. Organizing data into logical segments (e.g., by date or region) lets Databricks perform partition pruning: when a query filters on the partition column, irrelevant partitions are skipped entirely rather than scanned along with the rest of the dataset.
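To make the pruning idea concrete, here is a minimal Python sketch. It is a toy model, not Databricks code: the `partitions` dictionary, the row fields, and both function names are hypothetical, and each dictionary key stands in for one partition of a table partitioned by date.

```python
from datetime import date

# Hypothetical toy table: rows grouped by a partition key (the event date).
# In a real partitioned table, each key would map to its own set of files.
partitions = {
    date(2024, 1, 1): [{"region": "EU", "amount": 10}, {"region": "US", "amount": 20}],
    date(2024, 1, 2): [{"region": "EU", "amount": 5}],
    date(2024, 1, 3): [{"region": "US", "amount": 7}, {"region": "US", "amount": 3}],
}


def full_scan(partitions, wanted_date):
    """Read every row in every partition, then filter by date."""
    rows_read = 0
    total = 0
    for part_date, rows in partitions.items():
        for row in rows:
            rows_read += 1
            if part_date == wanted_date:
                total += row["amount"]
    return total, rows_read


def pruned_scan(partitions, wanted_date):
    """Use the filter on the partition key to skip all other partitions."""
    rows = partitions.get(wanted_date, [])
    return sum(row["amount"] for row in rows), len(rows)


print(full_scan(partitions, date(2024, 1, 3)))    # (10, 5)
print(pruned_scan(partitions, date(2024, 1, 3)))  # (10, 2)
```

Both functions return the same total, but the pruned version reads 2 rows instead of 5. The saving only appears because the filter is on the partition key; a filter on `region` alone would still have to touch every partition in this model.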
- A is wrong because scaling up the cluster addresses compute capacity, not query efficiency; a poorly structured query on a bigger cluster is still a poorly structured query.
- C is wrong because higher parallelism can actually hurt performance if data is skewed or if the overhead of coordinating more tasks outweighs the benefit; it is not a general-purpose fix.
- D is wrong because increasing the timeout only keeps slow queries from being cut off; it does nothing to make them run faster.
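The point about skew in option C can be illustrated with a rough Python sketch. It assumes a simplified scheduling model (each task goes to the least-loaded worker, and coordination overhead is ignored); the `makespan` function and the task sizes are hypothetical and do not reflect how Databricks actually schedules work.

```python
import heapq


def makespan(task_sizes, slots):
    """Finish time when each task goes to the currently least-loaded of `slots` workers."""
    loads = [0] * slots
    heapq.heapify(loads)
    # Assign the largest tasks first so the estimate is not distorted by ordering.
    for size in sorted(task_sizes, reverse=True):
        heapq.heappush(loads, heapq.heappop(loads) + size)
    return max(loads)


even = [25, 25, 25, 25]   # 100 units of work, spread evenly
skewed = [85, 5, 5, 5]    # the same 100 units, but one hot key holds most rows

print(makespan(even, 2), makespan(even, 4))      # 50 25
print(makespan(skewed, 2), makespan(skewed, 4))  # 85 85
```

With evenly sized tasks, doubling the workers halves the finish time. With skewed tasks, the largest task sets the finish time no matter how many workers are added, so any extra coordination cost is pure overhead.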
Memory tip: Think of partitioning like filing cabinets. Instead of searching every drawer for a file, you go straight to the labeled drawer. The goal is to read less data, not to throw more hardware or time at the problem.