CERTIFIED-MACHINE-LEARNING-PROFESSIONAL Question #6: Real Exam Question with Answer & Explanation
Question
A machine learning engineering team wants to build a continuous pipeline for data preparation of a machine learning application. The team would like the data to be fully processed and made ready for inference in a series of equal-sized batches. Which of the following tools can be used to provide this type of continuous processing?
Options
- A. Spark UDFs
- B. Structured Streaming
- C. MLflow
- D. AutoML
Explanation
The provided answer key appears to be wrong. Option B (Structured Streaming) is the correct answer, not A. Here is why:
Structured Streaming is Apache Spark's engine for continuous, incremental data processing. It ingests data as an unbounded stream and processes it in micro-batches, making it ideal for a continuous pipeline that prepares data in equal-sized batches for inference. It is specifically designed for the "continuous pipeline + batched output" pattern described.
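To make the micro-batch idea concrete, here is a minimal plain-Python sketch of the pattern. It does not use Spark's API: `micro_batches`, `prepare`, and `run_pipeline` are hypothetical names chosen for illustration, and the doubling step is only a stand-in for real feature preparation.

```python
from itertools import count, islice
from typing import Iterable, Iterator, List


def micro_batches(source: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Group a (possibly unbounded) source into consecutive batches.

    Every batch has exactly batch_size records, except possibly the last
    one when a finite source runs out.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(source)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the source is exhausted
        yield batch


def prepare(record: int) -> int:
    # Hypothetical stand-in for a data-preparation step.
    return record * 2


def run_pipeline(source: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Continuously prepare each batch as soon as it is complete."""
    for batch in micro_batches(source, batch_size):
        yield [prepare(record) for record in batch]


if __name__ == "__main__":
    unbounded_stream = count()  # 0, 1, 2, ... never ends
    # Only the first three prepared batches are consumed here;
    # a real pipeline would keep running.
    for prepared in islice(run_pipeline(unbounded_stream, batch_size=3), 3):
        print(prepared)
```

The loop in `run_pipeline` is the part a streaming engine supplies for you. In Structured Streaming itself, how much data lands in each micro-batch depends on the trigger and on source rate-limit options such as `maxFilesPerTrigger` (file sources) or `maxOffsetsPerTrigger` (Kafka), so truly equal-sized batches are something you configure rather than a guarantee.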
Spark UDFs (A) are User-Defined Functions: custom logic you can embed within a Spark job. They extend what a pipeline can compute, but they do not provide any continuous or streaming capability themselves. A UDF alone is not a pipeline.
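The sketch below illustrates that distinction under some assumptions: `normalize_text` and `add_normalized_column` are hypothetical names, and the PySpark imports sit inside the helper only so the pure-Python function can be tried without a Spark installation.

```python
def normalize_text(value):
    """Plain Python logic: trim whitespace and lowercase; pass None through."""
    if value is None:
        return None
    return value.strip().lower()


def add_normalized_column(df, column):
    """Wrap normalize_text as a Spark UDF and apply it to one column.

    `df` is assumed to be a Spark DataFrame. Nothing in this function
    streams anything: it only describes a column transformation.
    """
    # Imported here (an illustration-only choice) so that normalize_text
    # stays usable on machines without PySpark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    normalize_udf = F.udf(normalize_text, StringType())
    return df.withColumn(column + "_norm", normalize_udf(F.col(column)))


if __name__ == "__main__":
    print(normalize_text("  Structured Streaming  "))
```

The same helper could be applied to a static DataFrame or to one created with `spark.readStream`. In the second case the continuous behavior comes from `readStream` and `writeStream`, not from the UDF.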
MLflow (C) is an experiment tracking and model lifecycle management platform. It handles model versioning, metrics logging, and deployment, not data ingestion or transformation pipelines.
AutoML (D) automates model selection and hyperparameter tuning. It operates at training time and has nothing to do with continuous data preparation for inference.
Memory tip: Read the names literally. "Structured Streaming" streams data continuously, while a "Spark UDF" is just a function definition. The question asks for a pipeline engine, not a function: continuous + batched = streaming.
Note: The answer key in your source marks A as correct, but this is almost certainly a typo or error. Structured Streaming (B) is the standard Databricks/Spark answer for this scenario and aligns with official Databricks certification material.