PROFESSIONAL-DATA-ENGINEER Question #243: Real Exam Question with Answer & Explanation
The correct answer is D: Use Cloud Data Fusion to assign each city to a region that is labeled as 1, 2, 3, 4, or 5, and then use that number to represent the city in the model.
Question
You are working on a linear regression model on BigQuery ML to predict a customer's likelihood of purchasing your company's products. Your model uses a city name variable as a key predictive component. In order to train and serve the model, your data must be organized in columns. You want to prepare your data using the least amount of coding while maintaining the predictable variables. What should you do?
Options
- A. Create a new view with BigQuery that does not include a column with city information.
- B. Use SQL in BigQuery to transform the state column using a one-hot encoding method, and make each city a column with binary values.
- C. Use TensorFlow to create a categorical variable with a vocabulary list.
- D. Use Cloud Data Fusion to assign each city to a region that is labeled as 1, 2, 3, 4, or 5, and then use that number to represent the city in the model.
Explanation
Option D satisfies both constraints: the least coding and keeping the predictive variable. Cloud Data Fusion is a low-code, GUI-driven data integration tool, so mapping cities to regional numeric labels (1–5) can be done in a visual pipeline with little or no custom code. The result is a single numeric column, derived from the city field, that a linear regression model can use directly.
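The city-to-region mapping described above can be sketched in a few lines. This is only an illustration of the transformation logic, written in Python; in Cloud Data Fusion the same step would be configured in the visual pipeline rather than coded. The city names, the region assignments, and the `add_region` helper are all hypothetical.

```python
# Hypothetical lookup table: city -> region label (1-5).
# The cities and their region numbers are illustrative, not from the question.
CITY_TO_REGION = {
    "Seattle": 1,
    "Denver": 2,
    "Chicago": 3,
    "Atlanta": 4,
    "Boston": 5,
}


def add_region(rows, lookup=CITY_TO_REGION):
    """Return copies of the rows with a numeric 'region' column derived from 'city'.

    Cities missing from the lookup get None so they can be reviewed
    instead of being silently assigned to a region.
    """
    return [{**row, "region": lookup.get(row["city"])} for row in rows]


customers = [
    {"customer_id": 1, "city": "Seattle"},
    {"customer_id": 2, "city": "Atlanta"},
    {"customer_id": 3, "city": "Unknown"},
]

labeled = add_region(customers)
for row in labeled:
    print(row)
```

The output keeps one row per customer and adds a single numeric column, which is the shape option D describes.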
Why the distractors are wrong:
- A eliminates the city column entirely, which directly violates the requirement to maintain the predictive variable. A city described as a "key predictive component" should not be dropped.
- B is technically valid ML practice (one-hot encoding is standard for categorical variables), but it requires writing the SQL by hand, which means more coding than a point-and-click tool like Data Fusion.
- C introduces TensorFlow, a heavy framework that requires substantial coding, which directly contradicts the "least amount of coding" constraint.
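For comparison, the one-hot encoding that option B refers to can be sketched as follows. The option itself calls for SQL in BigQuery; this Python version only illustrates what the encoding produces. The `one_hot_encode` helper and the sample rows are hypothetical.

```python
def one_hot_encode(rows, column):
    """Expand a categorical column into one binary column per distinct value."""
    # Sort the distinct values so the output column order is deterministic.
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every field except the categorical column being replaced.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


customers = [
    {"customer_id": 1, "city": "Seattle"},
    {"customer_id": 2, "city": "Atlanta"},
    {"customer_id": 3, "city": "Seattle"},
]

encoded = one_hot_encode(customers, "city")
for row in encoded:
    print(row)
```

Each distinct city becomes its own binary column, so the table widens as the number of cities grows, whereas the region label in option D stays a single column.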
Memory tip: Match the tool to the constraint. When the question asks for the least coding, look for the managed, GUI-driven tool (Cloud Data Fusion, Dataprep, etc.). An option that removes a variable is wrong whenever the question describes that variable as predictive. "Low code = Data Fusion" is a useful heuristic for GCP exam questions.