CERTIFIED-DATA-ENGINEER-PROFESSIONAL Practice Questions
123 real CERTIFIED-DATA-ENGINEER-PROFESSIONAL exam questions with expert-verified answers and explanations. Page 1 of 3.
- Question #1: Databricks Job Orchestration and Parameterization
An upstream system has been configured to pass the date for a given batch of data to the Databricks Jobs API as a parameter. The notebook to be scheduled will use this parameter to...
Databricks Jobs, Notebook Parameterization, dbutils Widgets, Data Ingestion
- Question #2: Administering Databricks Workspaces
The Databricks workspace administrator has configured interactive clusters for each of the data engineering groups. To control costs, clusters are set to terminate after 30 minutes...
Databricks Permissions, Cluster Access, Access Control, Workspace Security
- Question #3: Deploying and Operating Data Pipelines
When scheduling Structured Streaming jobs for production, which configuration automatically recovers from query failures and keeps costs low?
Structured Streaming, Cluster Management, Cost Optimization, Fault Tolerance
- Question #4: Data Monitoring and Alerting
The data engineering team has configured a Databricks SQL query and alert to monitor the values in a Delta Lake table. The recent_sensor_recordings table contains an identifying se...
Databricks SQL, Monitoring & Alerting, SQL Aggregation, Data Interpretation
- Question #5: Databricks Repos and Version Control
A junior developer complains that the code in their notebook isn't producing the correct results in the development environment. A shared screenshot reveals that while they're usin...
Databricks Repos, Git Branching, Version Control, Remote Repositories
- Question #6: Databricks Security and Access Control
The security team is exploring whether or not the Databricks secrets module can be leveraged for connecting to an external database. After testing the code with all Python variable...
Databricks Secrets, Security, dbutils, Credential Management
- Question #7: Delta Lake Data Management
The data science team has created and logged a production model using MLflow. The following code correctly imports and applies the production model to output the predictions as a n...
Delta Lake, Spark DataFrame API, Data Ingestion, Data Persistence
- Question #8: Data Ingestion and Processing
An upstream source writes Parquet data as hourly batches to directories named with the current date. A nightly batch job runs the following code to ingest all data from the previou...
Data Ingestion, Deduplication, Delta Lake, Data Quality
- Question #9: Databricks Notebook Language Interoperability
A junior member of the data engineering team is exploring the language interoperability of Databricks notebooks. The intended outcome of the below code is to register a view of all...
Databricks Notebooks, Language Interoperability, SQL Views, Python Variables
- Question #10: Delta Lake Performance Optimization
A Delta table of weather records is partitioned by date and has the below schema: date DATE, device_id INT, temp FLOAT, latitude FLOAT, longitude FLOAT. To find all the records from...
Delta Lake, Data Skipping, Query Optimization, Metadata Management
- Question #11: Delta Lake Data Retention and Governance
The data engineering team has configured a job to process customer requests to be forgotten (have their data deleted). All user data that needs to be deleted is stored in Delta Lak...
Delta Lake, Data Retention, VACUUM Command, Time Travel
- Question #12: Orchestrating Production Workloads
A junior data engineer has configured a workload that posts the following JSON to the Databricks REST API endpoint 2.0/jobs/create. Assuming that all configurations and referenced...
Databricks Jobs, REST API, Job Creation, Workload Automation
- Question #13: Data Ingestion
An upstream system is emitting change data capture (CDC) logs that are being written to a cloud object storage directory. Each record in the log indicates the change type (insert,...
Change Data Capture (CDC), Delta Lake, Medallion Architecture, Data Ingestion
- Question #14: Data Transformation
An hourly batch job is configured to ingest data files from a cloud object storage container where each batch represents all records produced by the source system in a given hour. T...
Delta Lake, SCD Type 1, MERGE INTO, Data Transformation
- Question #15: Building and Managing Production Data Pipelines
A table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number...
Delta Lake MERGE, Data Synchronization, Change Data Capture, Data Engineering Patterns
- Question #16: Data Storage and Management
A table is registered with the following code: Both users and orders are Delta Lake tables. Which statement describes the results of querying recent_orders?
Materialized Views, Delta Lake, Data Persistence, Query Execution
- Question #17: Workload Management and Optimization
A production workload incrementally applies updates from an external Change Data Capture feed to a Delta Lake table as an always-on Structured Stream job. When data was initially m...
Delta Lake, Auto Optimize, MERGE Operations, File Compaction
- Question #18: Data Streaming
Which statement regarding stream-static joins and static Delta tables is correct?
Structured Streaming, Delta Lake, Stream-Static Join, Microbatch Processing
- Question #19: Streaming Data Processing
A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and a...
Spark Structured Streaming, Window Functions, Data Aggregation, DataFrame API
- Question #20: Streaming Data Processing
A data architect has designed a system in which two Structured Streaming jobs will concurrently write to a single bronze Delta table. Each job is subscribing to a different topic f...
Structured Streaming, Checkpointing, Delta Lake, Concurrency
- Question #21: Streaming Data Processing
A Structured Streaming job deployed to production has been experiencing delays during peak hours of the day. At present, during normal execution, each microbatch of data is process...
Structured Streaming, Performance Optimization, Trigger Interval, Microbatch Processing
- Question #22: Delta Lake Features and Performance Optimization
Which statement describes Delta Lake Auto Compaction?
Delta Lake, Auto Compaction, Data Optimization, File Management
- Question #23: Real-time Data Processing with Spark Structured Streaming
Which statement characterizes the general programming model used by Spark Structured Streaming?
Spark Structured Streaming, Programming Model, Stream Processing Concepts, Unbounded Tables
- Question #24: Optimizing Spark Workloads
Which configuration parameter directly affects the size of a spark-partition upon ingestion of data into Spark?
Spark Partitioning, Spark Configuration, Data Ingestion
- Question #25: Optimizing Spark Applications
A Spark job is taking longer than expected. Using the Spark UI, a data engineer notes that the Min, Median, and Max Durations for tasks in a particular stage show the minimum and m...
Spark Performance, Data Skew, Spark UI, Troubleshooting
- Question #26: Spark Performance Optimization
Each configuration below is identical to the extent that each cluster has 400 GB total of RAM, 160 total cores and only one Executor per VM. Given a job with at least one wide tran...
Spark Performance, Cluster Configuration, Data Shuffling, Executor Sizing
- Question #27: Data Manipulation with Delta Lake
A junior data engineer on your team has implemented the following code block. The view new_events contains a batch of records with the same schema as the events Delta table. The ev...
Delta Lake, MERGE Statement, Duplicate Handling, Data Ingestion
- Question #28: Ingesting and Transforming Data
A junior data engineer seeks to leverage Delta Lake's Change Data Feed functionality to create a Type 1 table representing all of the values that have ever been valid for all rows...
Delta Lake, Change Data Feed (CDF), Data Versioning, Data Ingestion
- Question #29: Designing and Implementing Data Pipelines
A new data engineer notices that a critical field was omitted from an application that writes its Kafka source to Delta Lake. This happened even though the critical field was in th...
Delta Lake, Data Reliability, Bronze Layer, Streaming Data
- Question #30: Building Data Pipelines with Delta Lake
A nightly job ingests data into a Delta Lake table using the following code: The next step in the pipeline requires a function that returns an object that can be used to manipulate...
Delta Lake, Structured Streaming, Data Pipelines, Data Ingestion
- Question #31: Ingesting and Transforming Data
A junior data engineer is working to implement logic for a Lakehouse table named silver_device_recordings. The source data contains 100 unique fields in a highly nested JSON struct...
Schema Inference, Delta Lake, JSON Data, Data Ingestion
- Question #32: Data Transformation and Loading
The data engineering team maintains the following code: Assuming that this code produces logically correct results and the data in the source tables has been de-duplicated and vali...
Batch Processing, Delta Lake, Table Overwrite, Data Write Modes
- Question #33: Data Governance and Security
The data engineering team is migrating an enterprise system with thousands of tables and views into the Lakehouse. They plan to implement the target architecture using a series of...
Data Security, Lakehouse Architecture, Data Governance, Access Control
- Question #34: Data Storage and Management
The data architect has mandated that all tables in the Lakehouse should be configured as external Delta Lake tables. Which approach will ensure that this requirement is met?
Delta Lake, External Tables, Table Management, SQL DDL
- Question #35: Data Management and Schema Evolution
To reduce storage and compute costs, the data engineering team has been tasked with curating a series of aggregate tables leveraged by business intelligence dashboards, customer-fa...
Schema Evolution, Data Management, Data Lakehouse, Change Management
- Question #36: Optimizing Data Lake Performance
A Delta Lake table representing metadata about content posts from users has the following schema: user_id LONG, post_text STRING, post_id STRING, longitude FLOAT, latitude FLOAT, p...
Delta Lake, Query Optimization, File Skipping, Column Statistics
- Question #37: Databricks Infrastructure Planning
A small company based in the United States has recently contracted a consulting firm in India to implement several new data engineering pipelines to power artificial intelligence a...
Cloud Cost Management, Data Locality, Databricks Workspace Deployment, Performance Optimization
- Question #38: Data Management and Quality
The downstream consumers of a Delta Lake table have been complaining about data quality issues impacting performance in their applications. Specifically, they have complained that...
Delta Lake, CHECK Constraints, Data Quality, Table Alteration
- Question #39: Optimizing Data Lakehouse Performance
Which of the following is true of Delta Lake and the Lakehouse?
Delta Lake, Data Skipping, Performance Optimization, Table Statistics
- Question #40: Data Modeling and Data Warehousing Concepts
The view updates represents an incremental batch of all newly ingested data to be inserted or updated in the customers table. The following logic is used to process these records....
Slowly Changing Dimensions, Data Warehousing, Data Modeling, ETL/ELT Concepts
- Question #41: Manage Security and Access Control
The DevOps team has configured a production workload as a collection of notebooks scheduled to run daily using the Jobs UI. A new data engineering hire is onboarding to the team an...
Databricks Permissions, Notebook Access Control, Workspace Security, Production Best Practices
- Question #42: Data Governance and Security
A table named user_ltv is being used to create a view that will be used by data analysts on various teams. Users in the workspace are configured into groups, which are used for set...
Dynamic Data Masking, Column-Level Security, Data Access Control, Views
- Question #43: Data Governance and Metadata Management
The data governance team has instituted a requirement that all tables containing Personally Identifiable Information (PII) must be clearly annotated. This includes adding column comme...
SQL Commands, Metadata Management, Data Governance, Databricks SQL
- Question #44: Delta Lake Data Governance and Management
The data governance team is reviewing code used for deleting records for compliance with GDPR. They note the following logic is used to delete records from the Delta Lake table nam...
Delta Lake, Data Deletion, Time Travel, Data Retention
- Question #45: Data Storage and Management on Databricks
An external object storage container has been mounted to the location /mnt/finance_eda_bucket. The following logic was executed to create a database for the finance team: After the...
Managed Tables, Database Location, Databricks Storage, External Mounts
- Question #46: Security and Governance
Although the Databricks Utilities Secrets module provides tools to store sensitive credentials and avoid accidentally displaying them in plain text, users should still be careful wi...
Databricks Secrets, Security, API Access, Credential Management
- Question #47: Managing Databricks Workflows and Jobs
What statement is true regarding the retention of job run history?
Databricks Jobs, Job Management, Data Retention, Platform Configuration
- Question #48: Security and Governance
A data engineer, User A, has promoted a new pipeline to production by using the REST API to programmatically create several jobs. A DevOps engineer, User B, has configured an exter...
Audit Logs, Databricks REST API, Personal Access Tokens, Job Management
- Question #49: Optimizing Spark Applications
A user new to Databricks is trying to troubleshoot long execution times for some pipeline logic they are working on. Presently, the user is executing code cell-by-cell, using displ...
Spark Lazy Evaluation, Databricks Notebooks, Performance Tuning, Spark Actions
- Question #50: Performance Optimization and Monitoring
A production cluster has 3 executor nodes and uses the same virtual machine type for the driver and executor. When evaluating the Ganglia Metrics for this cluster, which indicator...
Spark Performance Tuning, Cluster Monitoring, Driver Bottleneck, Databricks Ganglia Metrics