CERTIFIED-DATA-ENGINEER-PROFESSIONAL Practice Questions
123 real CERTIFIED-DATA-ENGINEER-PROFESSIONAL exam questions with expert-verified answers and explanations. Page 3 of 3.
- Question #102: Orchestrating Data Pipelines
The business reporting team requires that data for their dashboards be updated every hour. The total processing time for the pipeline that extracts, transforms, and loads the data for...
Databricks Jobs, Cost Optimization, Cluster Configuration, SLA Management
- Question #103: Code Management and Version Control
A developer has successfully configured credentials for Databricks Repos and cloned a remote Git repository. They do not have privileges to make changes to the main branch, which is...
Databricks Repos, Git Branching, Git Workflow, Version Control
- Question #104: ML Model Integration in Spark Data Pipelines
The data science team has created and logged a production model using MLflow. The model accepts a list of column names and returns a new column of type DOUBLE. The following code c...
PySpark DataFrames, MLflow, UDFs, Model Inference
- Question #105: Schema Management in Delta Lake
The following table consists of items found in user carts within an e-commerce website. The following MERGE statement is used to update this table using an updates view, with schem...
Delta Lake, Schema Evolution, MERGE Statement, Data Engineering
- Question #106: Building and Managing Streaming Pipelines
A data team's Structured Streaming job is configured to calculate running aggregates for item sales to update a downstream marketing dashboard. The marketing team has introduced a...
Structured Streaming, Checkpointing, Schema Evolution, Streaming Operations
- Question #107: Delta Lakehouse Data Management and Optimization
Which statement describes Delta Lake optimized writes?
Delta Lake, Optimized Writes, Data Storage Optimization, File Compaction
- Question #108: Data Pipelines and Workflows
A DLT pipeline includes the following streaming tables: Raw_lot ingests raw device measurement data from a heart rate tracking device. Bgm_stats incrementally computes user statisti...
Delta Live Tables, Pipeline Configuration, Data Retention, Table Properties
- Question #109: Data Modeling and Schema Evolution
A data pipeline uses Structured Streaming to ingest data from Kafka to Delta Lake. Data is being stored in a bronze table, and includes the Kafka-generated timestamp, key, and value...
Schema Evolution, Delta Lake, Structured Streaming, Data Ingestion
- Question #110: Data Security and Governance
The data engineering team is configuring environments for development, testing, and production before beginning migration of a new data pipeline. The team requires extensive testing on...
Data Security, Environment Management, Access Control, Data Governance
- Question #111: Data Access and Security
The marketing team is looking to share data in an aggregate table with the sales organization, but the field names used by the teams do not match, and a number of marketing specifi...
SQL Views, Data Access Control, Data Transformation, Data Governance
- Question #112: Designing and Implementing Data Models on Databricks
A junior data engineer is migrating a workload from a relational database system to the Databricks Lakehouse. The source system uses a star schema, leveraging foreign key constrain...
Delta Lake ACID, Data Migration, Foreign Key Constraints, Lakehouse Architecture
- Question #113: Data Lakehouse Design and Architecture
A data architect has heard about Delta Lake's built-in versioning and time travel capabilities. For auditing purposes, they have a requirement to maintain a full record of all valid street addre...
Delta Lake, Time Travel, SCD (Slowly Changing Dimensions), Data Auditing
- Question #115: Data Governance and Security
A table named user_ltv is being used to create a view that will be used by data analysts on various teams. Users in the workspace are configured into groups, which are used for set...
SQL Views, Data Filtering, Access Control
- Question #116: Data Governance and Security
The data governance team is reviewing code for deleting records for compliance with GDPR. The following logic has been implemented to propagate delete requests from the user_looku...
Delta Lake, Data Deletion, Time Travel, VACUUM
- Question #117: Building and Managing Data Pipelines with Delta Live Tables
A data engineer wants to refactor the following DLT code, which includes multiple definitions with very similar code: In an attempt to programmatically create these tables using a...
Delta Live Tables (DLT), Parameterized Pipelines, Configuration Management, Data Engineering Best Practices
- Question #118: Spark Application Optimization and Monitoring
The data engineer is using Spark's MEMORY_ONLY storage level. Which indicators should the data engineer look for in the Spark UI's Storage tab to signal that a cached table is not...
Spark Caching, Spark UI, Spark Storage Levels, Performance Monitoring
- Question #119: Databricks Workspace Management
What is the first line of a Databricks Python notebook when viewed in a text editor?
Databricks notebooks, Notebook format, File structure, Databricks workspace
- Question #120: Orchestrating Data Pipelines
The Databricks CLI is used to trigger a run of an existing job by passing the job_id parameter. The response indicating that the job run request has been submitted successfully includes a file...
Databricks CLI, Job Management, Run IDs, Workflows
- Question #121: Streaming Data Processing Optimization
A Structured Streaming job deployed to production has been resulting in higher than expected cloud storage costs. At present, during normal execution, each microbatch of data is pr...
Structured Streaming, Trigger Configuration, Cost Optimization, Cloud API Costs
- Question #122: Building and Managing Streaming Data Pipelines
A task orchestrator has been configured to run two hourly tasks. First, an outside system writes Parquet data to a directory mounted at /mnt/raw_orders/. After this data is written...
Spark Structured Streaming, Watermarking, Event-time Deduplication, Late Data Handling
- Question #123: Building and Maintaining Data Pipelines
A nightly batch job is configured to ingest all data files from a cloud object storage container where records are stored in a nested directory structure YYYY/MM/DD. The data for e...
Structured Streaming, Delta Lake, Incremental Processing, Cost Optimization
- Question #124: Optimizing Databricks Data Ingestion and Processing
A large company seeks to implement a near real-time solution involving hundreds of pipelines with parallel updates of many tables with extremely high volume and high velocity data....
Databricks Cluster Types, Delta Lake Performance, Real-time Data Engineering, Cloud Storage Optimization
- Question #125: Spark Cluster Management and Fault Tolerance
Each configuration below is identical to the extent that each cluster has 400 GB total of RAM, 160 total cores, and only one Executor per VM. Given an extremely long-running job for...
Spark Architecture, Fault Tolerance, Cluster Configuration, High Availability
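
The arithmetic behind Question #125's setup can be sketched in a few lines. This is a minimal illustration, not the question's answer: the totals (400 GB RAM, 160 cores, one executor per VM) come from the question text, while the VM counts, the even split across VMs, and both helper function names are assumptions made for the example.

```python
# Totals taken from the question text; the VM counts used below are hypothetical.
TOTAL_RAM_GB = 400
TOTAL_CORES = 160


def per_executor_resources(vm_count: int) -> tuple[float, float]:
    """Return (RAM in GB, cores) per executor, assuming one executor per VM
    and an even split of the cluster totals across VMs."""
    if vm_count <= 0:
        raise ValueError("vm_count must be positive")
    return TOTAL_RAM_GB / vm_count, TOTAL_CORES / vm_count


def share_lost_on_vm_failure(vm_count: int) -> float:
    """Fraction of total cluster capacity lost if a single VM fails."""
    if vm_count <= 0:
        raise ValueError("vm_count must be positive")
    return 1 / vm_count


if __name__ == "__main__":
    # Hypothetical VM counts, chosen only to illustrate the trade-off.
    for vms in (1, 2, 4, 8, 16):
        ram, cores = per_executor_resources(vms)
        lost = share_lost_on_vm_failure(vms)
        print(f"{vms:>2} VMs: {ram:g} GB RAM, {cores:g} cores per executor; "
              f"one VM failure removes {lost:.1%} of capacity")
```

With the totals fixed, more VMs means smaller executors but a smaller share of capacity lost when any single VM fails.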