MLS-C01 · Question #379
MLS-C01 Question #379: Real Exam Question with Answer & Explanation
The correct answer is D: Encode the country codes into numeric variables by using one-hot encoding.. To transform categorical country codes for model training with the smallest dimensionality increase and no information loss, one-hot encoding is the appropriate solution.
Question
A data scientist is conducting exploratory data analysis (EDA) on a dataset that contains information about product suppliers. The dataset records the country where each product supplier is located as a two-letter text code. For example, the code for New Zealand is "NZ." The data scientist needs to transform the country codes for model training. The data scientist must choose the solution that will result in the smallest increase in dimensionality. The solution must not result in any information loss. Which solution will meet these requirements?
Options
- AAdd a new column of data that includes the full country name.
- BEncode the country codes into numeric variables by using similarity encoding.
- CMap the country codes to continent names.
- DEncode the country codes into numeric variables by using one-hot encoding.
Explanation
To transform categorical country codes for model training with the smallest dimensionality increase and no information loss, one-hot encoding is the appropriate solution.
Common mistakes.
- A. Adding a new column with full country names merely provides redundant text data and does not convert the categorical codes into a numerical format suitable for many machine learning models.
- B. Similarity encoding (or embedding) can be used to represent categories in a lower-dimensional space, but it may involve some information loss depending on the specific method and desired dimensionality, and it's generally more complex than one-hot encoding for preserving all distinct categorical information.
- C. Mapping country codes to continent names significantly reduces dimensionality but results in substantial information loss, as the distinct identity of individual countries is no longer preserved, directly violating the 'no information loss' requirement.
Concept tested. Categorical feature encoding
Reference. https://docs.aws.amazon.com/sagemaker/latest/dg/feature-processing.html
Topics
Community Discussion
No community discussion yet for this question.