MLS-C01 Question #191: Real Exam Question with Answer & Explanation

The correct answer is D: remove the stopwords from the blog post data by using the CountVectorizer function before the data is fed to the NTM algorithm.

Modeling

Question

A data scientist is using the Amazon SageMaker Neural Topic Model (NTM) algorithm to build a model that recommends tags from blog posts. The raw blog post data is stored in an Amazon S3 bucket in JSON format. During model evaluation, the data scientist discovered that the model recommends certain stopwords such as "a," "an," and "the" as tags to certain blog posts, along with a few rare words that are present only in certain blog entries. After a few iterations of tag review with the content team, the data scientist notices that the rare words are unusual but feasible. The data scientist also must ensure that the tag recommendations of the generated model do not include the stopwords. What should the data scientist do to meet these requirements?

Options

  • A. Use the Amazon Comprehend entity recognition API operations. Remove the detected
  • B. Run the SageMaker built-in principal component analysis (PCA) algorithm with the blog
  • C. Use the SageMaker built-in Object Detection algorithm instead of the NTM algorithm for
  • D. Remove the stopwords from the blog post data by using the CountVectorizer function in

Explanation

To prevent the Amazon SageMaker Neural Topic Model (NTM) from recommending stopwords as tags while retaining feasible rare words, the data scientist must preprocess the blog post data to remove stopwords. A text vectorization function such as CountVectorizer can filter out common stopwords while it builds the vocabulary, before the data is fed to the NTM algorithm. Because stopword filtering drops only the listed words, the rare but feasible words stay in the vocabulary and remain available as tags.
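The preprocessing step described above can be sketched as follows. This is a minimal illustration, not the exam's reference solution: it assumes the CountVectorizer in question is the scikit-learn class, and the `posts` list is hypothetical sample text standing in for the JSON blog data in S3.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical blog post texts (assumption: the real data would be
# parsed from the JSON objects stored in the S3 bucket).
posts = [
    "The quick guide to a serverless architecture",
    "An introduction to the zettelkasten method",
    "A serverless guide to the cloud",
]

# stop_words="english" applies scikit-learn's built-in English stopword list.
# min_df is left at its default of 1, so words that occur in only one post
# (the rare but feasible tags) are kept in the vocabulary.
vectorizer = CountVectorizer(stop_words="english")
bow = vectorizer.fit_transform(posts)  # sparse document-term count matrix

# vocabulary_ maps each retained term to its column index in `bow`.
vocabulary = set(vectorizer.vocabulary_)
print(sorted(vocabulary))
```

Here "the" and "an" are dropped by the stopword list, while "zettelkasten", which appears in a single post, is retained. Raising `min_df` or lowering `max_df` would prune by document frequency instead, which risks discarding the rare words that the content team considered feasible.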

Common mistakes.

  • A. Amazon Comprehend's entity recognition identifies specific types of entities (e.g., people, locations). Stopwords such as "a," "an," and "the" are not entities, so entity recognition is not a direct tool for removing stopwords when preparing data for a topic model.
  • B. Principal Component Analysis (PCA) is a dimensionality reduction technique for numerical data. It cannot remove specific words such as stopwords from the blog posts.
  • C. The SageMaker built-in Object Detection algorithm is designed for computer vision tasks, specifically identifying objects within images, and is completely irrelevant to text classification or topic modeling of blog posts.

Concept tested. Text preprocessing for topic modeling (stopwords)

Reference. https://docs.aws.amazon.com/sagemaker/latest/dg/ntm.html

Topics

#Text Preprocessing  #Stopword Removal  #Neural Topic Model (NTM)  #CountVectorizer