DEA-C02 Question #95: Real Exam Question with Answer & Explanation
Question
A Data Engineer is using Snowpark to perform data transformations on large data sets in Snowflake. The Engineer wants to optimize the transformation process by minimizing the amount of data distributed across nodes, and leveraging Snowflake's compute resources most effectively. Which combination of strategies should the Engineer use to meet these requirements?
Options
- A. Load data using read, apply sorting operations before filtering, and use
- B. Perform joins using the default settings, apply transformations across all columns simultaneously,
- C. Use the method to store results temporarily, apply filtering operations
- D. Create data samples with session.createDataFrame(), merge datasets using unionAll(),
Explanation
Option C is correct because Snowpark's cacheResult() method materializes an intermediate result in a temporary table inside Snowflake, so later operations reuse that result instead of recomputing the same query. Applying filtering operations early (predicate pushdown) shrinks the working dataset before joins or aggregations, which directly reduces how much data is distributed across nodes. That reduction is the core of the optimization goal.
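The pattern can be sketched in Snowpark for Python, where the method is spelled cache_result() (cacheResult() is the Scala/Java spelling). This is a minimal sketch, not a definitive implementation: it assumes an already-created Snowpark Session is passed in, and the table name ORDERS and the columns ORDER_ID, REGION, AMOUNT, and STATUS are hypothetical.

```python
def build_reports(session):
    """Filter early, cache the reduced result once, then reuse it.

    `session` is assumed to be an existing snowflake.snowpark.Session.
    The ORDERS table and its columns are hypothetical.
    """
    # Filter and project first, so later steps only touch the rows
    # and columns they actually need.
    shipped = (
        session.table("ORDERS")
        .filter("STATUS = 'SHIPPED' AND AMOUNT > 100")  # SQL expression string
        .select("ORDER_ID", "REGION", "AMOUNT")
    )

    # Materialize the reduced result in a temporary table so the
    # filter is not re-evaluated for every downstream query.
    cached = shipped.cache_result()

    # Both reports read from the cached result.
    orders_per_region = cached.group_by("REGION").count()
    largest_orders = cached.sort("AMOUNT", ascending=False).limit(10)
    return orders_per_region, largest_orders
```

Calling cache_result() runs the filtered query once to create the temporary table. The two returned DataFrames stay lazy until an action such as collect() or show() is called on them.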
Why the distractors are wrong:
- A is wrong because sorting before filtering wastes compute - you sort a large dataset only to then discard much of it. Always filter first to reduce data volume, then sort the smaller result.
- B is wrong because default join settings make no guarantees about data distribution optimization, and transforming all columns simultaneously is wasteful when only a subset of columns is needed downstream.
- D is wrong because session.createDataFrame() is designed for small, local Python/Pandas data, not large datasets. Using unionAll() as a merging strategy adds rows rather than optimizing distribution.
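The filter-before-sort point made against option A can be seen in a plain-Python analogy. This is only an illustration of the ordering argument, not Snowpark code and not a model of Snowflake's optimizer; the data and the predicate are made up.

```python
import random
from functools import cmp_to_key


def sort_counting(rows):
    """Sort rows ascending and return (sorted_rows, comparison_count)."""
    count = 0

    def compare(a, b):
        nonlocal count
        count += 1
        return (a > b) - (a < b)

    return sorted(rows, key=cmp_to_key(compare)), count


def keep(value):
    """Hypothetical predicate that keeps roughly 10% of the rows."""
    return value >= 900


random.seed(0)
rows = [random.randint(0, 999) for _ in range(10_000)]

# Option A's order: sort everything, then discard most of it.
sorted_all, cost_sort_first = sort_counting(rows)
sort_then_filter = [v for v in sorted_all if keep(v)]

# Recommended order: filter first, then sort the smaller result.
filter_then_sort, cost_filter_first = sort_counting([v for v in rows if keep(v)])

print("same result:", sort_then_filter == filter_then_sort)
print("comparisons (sort first, filter first):", cost_sort_first, cost_filter_first)
```

Both orders produce the same rows, but the filter-first version sorts far fewer elements and therefore makes far fewer comparisons.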
Memory tip: Think "Cache and Filter First" - in Snowpark, cacheResult() saves your progress (like a checkpoint), and early filters act as a gate that keeps unnecessary data out of the pipeline entirely. If data doesn't enter the pipeline, it can't be distributed needlessly.