CERTIFIED-DATA-ENGINEER-PROFESSIONAL · Question #70
The correct answer is E: %sh executes shell code on the driver node, so the code does not take advantage of the worker nodes or of Databricks' optimized Spark.
Question
The following code has been migrated to a Databricks notebook from a legacy workload:

[code listing missing from source]

The code executes successfully and provides the logically correct results; however, it takes over 20 minutes to extract and load around 1 GB of data. Which statement is a possible explanation for this behavior?
Options
- A. %sh triggers a cluster restart to collect and install Git. Most of the latency is related to cluster...
- B. Instead of cloning, the code should use %sh pip install so that the Python code can get executed...
- C. %sh does not distribute file moving operations; the final line of code should be updated to...
- D. Python will always execute slower than Scala on Databricks. The run.py script should be...
- E. %sh executes shell code on the driver node. The code does not take advantage of the worker...
Explanation
The code uses %sh, which runs shell commands on the driver node only. None of the work is distributed to the worker nodes or handled by Databricks' optimized Spark runtime, so the whole extract-and-load step runs on a single machine no matter how large the cluster is. That is a plausible reason why roughly 1 GB of data takes over 20 minutes to process. The other options do not hold up: %sh does not restart the cluster, installing the code with pip would still run it on the driver, and Python is not always slower than Scala on Databricks. A better approach is to read and write the data with Spark DataFrame APIs so that the work is parallelized across the workers, and to manage the code with the workspace's Git integration (Databricks Repos) rather than cloning it in a shell cell.
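As a rough illustration of what "leveraging the parallelism of Spark" could look like, the sketch below reads and writes data through the DataFrame API instead of a shell cell. The legacy code is not shown in the source, so this is not a reconstruction of it: the function name `extract_and_load`, the CSV input format, the Delta output format, and the example paths are all assumptions made for illustration.

```python
def extract_and_load(spark, source_path, target_path):
    """Read CSV files with Spark and write them out as a Delta table.

    `spark` is a SparkSession; in a Databricks notebook one is already
    available under the name `spark`. Both paths are hypothetical.
    """
    # The read is split into tasks that run on the worker nodes,
    # unlike a %sh cell, which runs on the driver only.
    df = (
        spark.read
        .option("header", "true")  # assumption: files have a header row
        .csv(source_path)
    )

    # The write is also distributed: each task writes its own partition.
    df.write.format("delta").mode("overwrite").save(target_path)
    return df


# Hypothetical usage inside a notebook cell (paths are placeholders):
# extract_and_load(spark, "/mnt/raw/legacy_export/", "/mnt/curated/legacy_table")
```

Because the read and write are expressed as DataFrame operations, Spark can schedule them across the workers, whereas the same logic run through %sh would stay on the driver.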