CERTIFIED-DATA-ENGINEER-PROFESSIONAL Practice Questions
123 real CERTIFIED-DATA-ENGINEER-PROFESSIONAL exam questions with expert-verified answers and explanations. Page 2 of 3.
- Question #51: Monitoring and Optimization
Where in the Spark UI can one diagnose a performance problem induced by not leveraging predicate push-down?
Spark UI, Query Optimization, Predicate Pushdown, Performance Tuning
- Question #52: Data Manipulation and Transformation with Apache Spark
Review the following error traceback: Which statement describes the error being raised?
Spark DataFrames, Error Handling, Debugging, Column Operations
- Question #53: Managing Libraries and Dependencies on Databricks
Which distribution does Databricks support for installing custom Python code packages?
Python Packaging, Databricks Libraries, Dependency Management, Custom Code Deployment
- Question #54: Python Programming Fundamentals
Which Python variable contains a list of directories to be searched when trying to locate required modules?
Python Modules, Module Import, sys module, Python Fundamentals
- Question #55: Developing Data Pipelines
Incorporating unit tests into a PySpark application requires upfront attention to the design of your jobs, or a potentially significant refactoring of existing code. Which statemen...
Unit Testing, PySpark, Software Engineering Best Practices, Debugging
- Question #56: Data Pipeline Development
Which statement describes integration testing?
Integration Testing, Software Testing, Data Pipeline Reliability, Testing Methodologies
- Question #57: Managing Databricks Workflows
Which REST API call can be used to review the notebooks configured to run as tasks in a multi-task job?
Databricks Jobs API, REST API, Job Configuration, Notebook Tasks
- Question #58: Data Orchestration
A Databricks job has been configured with 3 tasks, each of which is a Databricks notebook. Task A does not depend on other tasks. Tasks B and C run in parallel, with each having a...
Databricks Jobs, Task Dependencies, Job Execution, Failure Handling
- Question #59: Implementing and Managing Delta Lake Tables
A Delta Lake table was created with the below query: Realizing that the original query had a typographical error, the below code was executed: ALTER TABLE prod.sales_by_stor RENAME...
Delta Lake, Table Management, Metastore, DDL Operations
- Question #60: Designing and Implementing Data Ingestion and Transformation Pipelines
The data engineering team maintains a table of aggregate statistics through batch nightly updates. This includes total sales for the previous day alongside totals and averages for...
Batch Processing, Data Accuracy, Data Pipeline Design, SCD Type 1
- Question #61: Databricks Cluster Management and Monitoring
A distributed team of data analysts share computing resources on an interactive cluster with autoscaling configured. In order to better manage costs and query throughput, the works...
Databricks Clusters, Autoscaling, Cluster Monitoring, Cluster Event Log
- Question #62: Optimizing Spark Applications
Which statement describes the correct use of pyspark.sql.functions.broadcast?
PySpark, Broadcast Join, Spark Optimization, Spark DataFrames
- Question #63: Delta Lake Architecture and Performance Optimization
A Databricks SQL dashboard has been configured to monitor the total number of records present in a collection of Delta Lake tables using the following query pattern: SELECT COUNT (...
Delta Lake, Transaction Log, Performance Optimization, Databricks SQL
- Question #64: Cluster Monitoring and Optimization
When evaluating the Ganglia Metrics for a given cluster with 3 executor nodes, which indicator would signal proper utilization of the VM's resources?
Performance Monitoring, Resource Utilization, Cluster Management, Ganglia Metrics
- Question #65: Databricks Jobs and Orchestration
A member of the data engineering team has submitted a short notebook that they wish to schedule as part of a larger data pipeline. Assume that the commands provided below produce t...
Databricks Jobs, Notebook Best Practices, Job Scheduling
- Question #66: Delta Lake Table Operations
A Delta Lake table was created with the below query: Consider the following query: DROP TABLE prod.sales_by_store If this statement is executed by a workspace admin, which result w...
Delta Lake, DDL Operations, Table Management, Data Deletion
- Question #67: Spark Application Monitoring and Troubleshooting
Which of the following technologies can be used to identify key areas of text when parsing Spark Driver log4j output?
Spark Logs, Regex, Text Parsing, Troubleshooting Tools
- Question #68: Data Governance and Security
All records from an Apache Kafka producer are being ingested into a single Delta Lake table with the following schema: key BINARY, value BINARY, topic STRING, partition LONG, offse...
Delta Lake, Data Partitioning, Data Retention, Data Access Control
- Question #69: Data Definition and Management
The data architect has mandated that all tables in the Lakehouse should be configured as external (also known as "unmanaged") Delta Lake tables. Which approach will ensure that thi...
Delta Lake, External Tables, SQL DDL, Lakehouse Architecture
- Question #70: Optimize Data Workloads on Databricks
The following code has been migrated to a Databricks notebook from a legacy workload: The code executes successfully and provides the logically correct results, however, it takes o...
Databricks execution model, Driver node, Performance optimization, Distributed computing
- Question #71: Data Ingestion and Transformation
A data engineer is configuring a pipeline that will potentially see late-arriving, duplicate records. In addition to de-duplicating records within the batch, which of the following...
Delta Lake, Data Deduplication, MERGE INTO, Upsert
- Question #72: Managing Databricks Workflows
A Databricks job has been configured with 3 tasks, each of which is a Databricks notebook. Task A does not depend on other tasks. Tasks B and C run in parallel, with each having a...
Databricks Jobs, Task Dependencies, Error Handling, Data Persistence
- Question #73: Data Transformation and Processing
The data engineering team maintains the following code: Assuming that this code produces logically correct results and the data in the source table has been de-duplicated and valid...
Data Lakehouse Architecture, Batch Data Processing, Data Aggregation, Table Overwrite Operations
- Question #75: Databricks Platform Operations
Assuming that the Databricks CLI has been installed and configured correctly, which Databricks CLI command can be used to upload a custom Python Wheel to object storage mounted wit...
Databricks CLI, DBFS, File Upload, Python Wheels
- Question #76: Query Performance Optimization
The data science team has requested assistance in accelerating queries on free-form text from user reviews. The data is currently stored in Parquet with the below schema: item_id I...
Delta Lake, Query Optimization, Data Skipping, Text Data
- Question #77: Implement Data Governance and Security
The data architect has decided that once data has been ingested from external sources into the Databricks Lakehouse, table access controls will be leveraged to manage permissions f...
Databricks Unity Catalog, SQL Permissions, GRANT Statement, Table Access Control
- Question #78: Data Governance and Security
A junior data engineer has manually configured a series of jobs using the Databricks Jobs UI. Upon reviewing their work, the engineer realizes that they are listed as the "Owner" f...
Databricks Jobs, Job Ownership, Permissions, Access Control
- Question #79: Delta Lake Data Management
A CHECK constraint has been successfully added to the Delta table named activity_details using the following logic: A batch job is attempting to insert new records to the table, in...
Delta Lake, CHECK Constraint, Data Integrity, Batch Processing
- Question #80: Streaming Data Processing
A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and a...
Spark Structured Streaming, Watermarking, Late-arriving data, Event time processing
- Question #81: Data Processing Optimization
A data ingestion task requires a 1 TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, buil...
Spark Configuration, Parquet, Output File Sizing, Performance Tuning
- Question #82: Databricks Data Architecture
Two of the most common data locations on Databricks are the DBFS root storage and external object storage mounted with dbutils.fs.mount(). Which of the following statements is corr...
Databricks File System (DBFS), Object Storage, Data Storage Architecture, Mount Points
- Question #83: Data Transformation and Delivery for Analytics
The business intelligence team has a dashboard configured to track various summary metrics for retail stores. This includes total sales for the previous day alongside totals and a...
Batch Processing, Data Materialization, Performance Optimization, Cost Management
- Question #84: Managing Delta Lake Tables
A Delta Lake table in the Lakehouse named customer_parsams is used in churn prediction by the machine learning team. The table contains information about customers derived from a n...
Delta Lake, Time Travel, SQL Queries, Data Versioning
- Question #85: Streaming Data Processing
A data engineer is performing a join operation to combine values from a static userlookup table with a streaming DataFrame streamingDF. Which code block attempts to perform an inva...
Spark Structured Streaming, Stream-static join, DataFrame operations, Join types
- Question #86: Data Ingestion
Which statement describes the default execution mode for Databricks Auto Loader?
Databricks Auto Loader, Data Ingestion, Incremental Processing, File Listing Mode
- Question #87: Monitoring and Optimizing Spark Application Performance
Spill occurs as a result of executing various wide transformations. However, diagnosing spill requires one to proactively look for key indicators. Where in the Spark UI are two of...
Spark UI, Performance Tuning, Data Spill, Shuffle Operations
- Question #88: Optimizing Delta Lake Tables
A Delta Lake table representing metadata about content from users has the following schema: Based on the above schema, which column is a good candidate for partitioning the Delta Ta...
Delta Lake, Data Partitioning, Performance Optimization, Data Modeling
- Question #89: Implement Data Quality in Delta Live Tables
A team of data engineers is adding tables to a DLT pipeline that contain repetitive expectations for many of the same data quality checks. One member of the team suggests reusing t...
Delta Live Tables (DLT), Data Quality, Expectations, Code Reusability
- Question #90: Implement and Manage Security
The data engineering team has been tasked with configuring connections to an external database that does not have a supported native connector with Databricks. The external database al...
Databricks Secrets, Access Control, Security Management, Least Privilege
- Question #91: Implement Data Quality with Delta Live Tables
A user wants to use DLT expectations to validate that a derived table report contains all records from the source, included in the table validation_copy. The user attempts and fail...
DLT Expectations, Data Validation, SQL Views, Data Quality
- Question #92: Pipeline Orchestration and Management
A data engineer needs to capture pipeline settings from an existing pipeline in the workspace, and use them to create and version a JSON file to create a new pipeline. Which command should...
Databricks CLI, Delta Live Tables, Pipeline Management, JSON Configuration
- Question #93: Environment Management
What is a method of installing a Python package scoped at the notebook level to all nodes in the currently active cluster?
Databricks Libraries, Python Packages, Cluster Management, Notebooks
- Question #94: Software Development Practices
A data engineer is testing a collection of mathematical functions, one of which calculates the area under a curve as described by another function. Which kind of test does the...
Software Testing, Unit Testing, Data Engineering Best Practices, Code Quality
- Question #95: Data Quality and Testing
Which statement describes a key benefit of an end-to-end test?
End-to-End Testing, Testing Methodologies, Application Testing
- Question #96: Testing Data Solutions
A data engineer wants to run unit tests using common Python testing frameworks on Python functions defined across several Databricks notebooks currently used in production. How c...
Unit Testing, Production Environments, Test Data Management, Databricks Development
- Question #97: Delta Lake Table Management
In order to prevent accidental commits to production data, a senior data engineer has instituted a policy that all development work will reference clones of Delta Lake tables. Afte...
Delta Lake, Shallow Clone, VACUUM Operation, Data Retention
- Question #98: Performance Optimization
A junior data engineer is working to implement logic for a Lakehouse table named silver_device_recordings. The source data contains 100 unique fields in a highly nested JSON struct...
Delta Lake, Performance Optimization, Data Skipping, Schema Design
- Question #99: Data Governance and Security
The DevOps team has configured a production workload as a collection of notebooks scheduled to run daily using the Jobs UI. A new data engineering hire is onboarding to the team an...
Databricks Permissions, Notebook Access Control, Security Best Practices, Production Workloads
- Question #100: Data Transformation and Processing
The view updates represents an incremental batch of all newly ingested data to be inserted or updated in the customers table. The following logic is used to process these records....
Slowly Changing Dimensions (SCD), SQL MERGE, Data Versioning, Data Warehousing Concepts
- Question #101: Configure and Manage Databricks Clusters and Spark Runtimes
Which statement regarding Spark configuration on the Databricks platform is true?
Spark Configuration, Databricks Clusters, Configuration Scope, Environment Management
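
As a small study aid for Question #54 (module lookup, tagged "sys module"), the sketch below shows how the module search path can be inspected and extended using only the Python standard library. The helper names `module_search_path` and `prepend_search_dir`, and the directory `/tmp/hypothetical_pkgs`, are illustrative assumptions and do not come from the exam material.

```python
import sys


def module_search_path():
    """Return a copy of the directories Python searches when importing modules."""
    return list(sys.path)


def prepend_search_dir(directory):
    """Place `directory` at the front of the search path if it is not already listed.

    `directory` is a hypothetical location used only for illustration.
    """
    if directory not in sys.path:
        sys.path.insert(0, directory)
    return sys.path


# Example usage: add a hypothetical directory, then list the search path.
prepend_search_dir("/tmp/hypothetical_pkgs")
for entry in module_search_path():
    print(entry)
```

Because `sys.path` is an ordinary list, entries earlier in the list take precedence when two directories contain a module with the same name.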