nerdexam
Databricks

CERTIFIED-DATA-ENGINEER-PROFESSIONAL Question #81: Real Exam Question with Answer & Explanation

The correct answer is A: set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, apply the narrow transformations, and write to Parquet. This is the only approach that sizes the output files by controlling how the input is read, so no shuffle or repartition is needed.

Data Processing Optimization

Question

A data ingestion task requires a 1 TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto-Optimize and Auto-Compaction cannot be used. Which strategy will yield the best performance without shuffling data?

Options

  • A. Set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, execute the narrow
  • B. Set spark.sql.shuffle.partitions to 2,048 partitions (1TB*1024*1024/512), ingest the data, execute
  • C. Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB, ingest the data, execute the
  • D. Ingest the data, execute the narrow transformations, repartition to 2,048 partitions (1TB*
  • E. Set spark.sql.shuffle.partitions to 512, ingest the data, execute the narrow transformations, and
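
The 2,048 figure quoted in the options is simply the dataset size divided by the target file size, with 1 TB taken as 1024 * 1024 MB. A minimal sketch of that arithmetic follows; the function name `partitions_needed` is made up for illustration and is not part of Spark.

```python
# Illustrative arithmetic only: how many 512 MB partitions cover 1 TB.
MB_PER_TB = 1024 * 1024  # binary units, matching the 1TB*1024*1024/512 formula


def partitions_needed(dataset_tb: int, target_file_mb: int) -> int:
    """Return the number of partitions of target_file_mb needed, rounded up."""
    total_mb = dataset_tb * MB_PER_TB
    # Ceiling division with integers avoids floating-point rounding.
    return -(-total_mb // target_file_mb)


partitions = partitions_needed(1, 512)
print(partitions)
```

The same count applies whether the partitions are produced at read time or by a later repartition; the options differ in whether a shuffle is needed to get there.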

Explanation

The key to converting a large JSON dataset to Parquet files of a specific size without shuffling is to control the size of the input partitions, because each partition is written out as one part-file. Setting spark.sql.files.maxPartitionBytes to 512 MB tells Spark to pack at most 512 MB of input into each partition when it reads the files. Narrow transformations (which do not move data across partitions) preserve that partitioning, so writing to Parquet produces one part-file per read partition. The resulting files are only approximately the target size, since the setting limits the bytes read per partition rather than the bytes written. The other options involve unnecessary shuffles or repartitions (B, C, D) or an incorrect setting for this specific requirement (E).
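
The approach described above could look roughly like the following PySpark sketch. It is an illustration under assumptions, not a reference implementation: the helper names `reader_conf` and `json_to_parquet`, the application name, the `ingested_at` column, and the paths passed in are all hypothetical, and it assumes the JSON input is splittable (for example, uncompressed line-delimited JSON) so that the reader can actually form 512 MB partitions.

```python
def reader_conf(target_file_mb: int) -> dict:
    """Build the read-side setting (hypothetical helper); the value is in bytes."""
    return {"spark.sql.files.maxPartitionBytes": str(target_file_mb * 1024 * 1024)}


def json_to_parquet(src_path: str, dst_path: str, target_file_mb: int = 512) -> None:
    """Read JSON, apply a narrow transformation, write Parquet, with no shuffle."""
    # Imported lazily so the helper above can be used without Spark installed.
    from pyspark.sql import SparkSession, functions as F

    builder = SparkSession.builder.appName("json-to-parquet")  # name is arbitrary
    for key, value in reader_conf(target_file_mb).items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()

    df = spark.read.json(src_path)

    # Example narrow transformation: adds a column without moving rows
    # between partitions. The column name is made up for this sketch.
    df = df.withColumn("ingested_at", F.current_timestamp())

    # No repartition, coalesce, or sort: each read partition becomes one part-file.
    df.write.mode("overwrite").parquet(dst_path)


print(reader_conf(512))
```

Calling `json_to_parquet("/path/to/json", "/path/to/parquet")` with real locations would run the job. Because the setting caps the bytes read rather than the bytes written, the Parquet files may differ from 512 MB once columnar encoding and compression are applied.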

Topics

Spark Configuration, Parquet, Output File Sizing, Performance Tuning
