Submitted by layla.eg · Mar 30, 2026 · Operationalizing machine learning models

Question

You work on a regression problem in a natural language processing domain, and you have 100M labeled examples in your dataset. You have randomly shuffled your data and split it into train and test samples (in a 90/10 ratio). After training the neural network and evaluating your model on the test set, you discover that the root-mean-squared error (RMSE) of your model is twice as high on the train set as on the test set. How should you improve the performance of your model?

Options

A. Increase the share of the test sample in the train-test split.
B. Try to collect more data and increase the size of your dataset.
C. Try out regularization techniques (e.g., dropout or batch normalization) to avoid overfitting.
D. Increase the complexity of your model by, e.g., introducing an additional layer or increasing the size of the vocabularies or n-grams used.
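
For readers who want to see the comparison in the question stem concretely, here is a minimal sketch of how RMSE on the two splits might be computed and compared. The helper `rmse` and the toy labels and predictions are hypothetical and chosen only so that the train RMSE comes out at twice the test RMSE; they are not taken from the question's dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical toy values, not real model output.
train_true, train_pred = [0.0, 0.0, 0.0, 0.0], [2.0, -2.0, 2.0, -2.0]
test_true, test_pred = [0.0, 0.0], [1.0, -1.0]

train_rmse = rmse(train_true, train_pred)
test_rmse = rmse(test_true, test_pred)

# A ratio well above 1 means the model fits the training data
# worse than it fits the held-out data.
ratio = train_rmse / test_rmse
print(f"train RMSE={train_rmse:.2f}, test RMSE={test_rmse:.2f}, ratio={ratio:.2f}")
```

High error on the training data itself is usually read as a sign that the model has not fit that data well, which is the pattern the question describes.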

Topics

#model underfitting · #RMSE · #model complexity · #NLP regression