MLA-C01 Question #104: Real Exam Question with Answer & Explanation
The correct answer is A: (2, 16).
Question
A term frequency-inverse document frequency (tf-idf) matrix using both unigrams and bigrams is built from a text corpus consisting of the following two sentences:
1. Please call the number below.
2. Please do not call us.
What are the dimensions of the tf-idf matrix?
Options
- A. (2, 16)
- B. (2, 8)
- C. (2, 10)
- D. (8, 10)
Explanation
Option A is correct because a TF-IDF matrix has one row per document (here, 2 sentences) and one column per unique feature (unigram or bigram) in the corpus vocabulary. Assuming the text is lowercased and punctuation is stripped, the 8 unique unigrams are: please, call, the, number, below, do, not, us. Both "please" and "call" appear in each sentence but count once. The 8 unique bigrams are: (please call), (call the), (the number), (number below), (please do), (do not), (not call), (call us). No bigram appears in both sentences, so none is deduplicated. That gives 8 + 8 = 16 features, so the matrix is (2, 16).
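The count can be checked with a minimal sketch in plain Python. The tokenizer (lowercase, drop periods, split on whitespace) and the helper names `tokenize` and `ngrams` are assumptions made for illustration; a real vectorizer may tokenize differently.

```python
corpus = [
    "Please call the number below.",
    "Please do not call us.",
]

def tokenize(sentence):
    # Assumed tokenizer: lowercase, drop periods, split on whitespace.
    return sentence.lower().replace(".", "").split()

def ngrams(tokens, n):
    # All contiguous runs of n tokens, as tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

unigrams = set()
bigrams = set()
for sentence in corpus:
    tokens = tokenize(sentence)
    unigrams.update(ngrams(tokens, 1))
    bigrams.update(ngrams(tokens, 2))

# One row per document, one column per unique unigram or bigram.
shape = (len(corpus), len(unigrams) + len(bigrams))
print(len(unigrams), len(bigrams), shape)  # 8 8 (2, 16)
```

If scikit-learn is available, `TfidfVectorizer(ngram_range=(1, 2)).fit_transform(corpus).shape` should report the same shape, assuming its default tokenization and no stop-word removal.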
Why the distractors fail:
- B (2, 8) counts only unigrams and ignores bigrams entirely.
- C (2, 10) likely reflects miscounting bigrams or conflating token counts (10 total word tokens across both sentences) with unique features.
- D (8, 10) gets the axes wrong: rows correspond to documents, and there are only 2, not 8.
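The numbers behind distractors B and C can be reproduced with a short sketch. It assumes a simple tokenizer (lowercase, drop periods, split on whitespace); the variable names are illustrative only.

```python
corpus = [
    "Please call the number below.",
    "Please do not call us.",
]

# Assumed tokenizer: lowercase, drop periods, split on whitespace.
tokens_per_doc = [s.lower().replace(".", "").split() for s in corpus]

# Distractor B: unique unigrams only, bigrams ignored.
unique_unigrams = len({tok for toks in tokens_per_doc for tok in toks})

# Distractor C: total word tokens, counting repeats across sentences.
total_tokens = sum(len(toks) for toks in tokens_per_doc)

print(unique_unigrams, total_tokens)  # 8 10
```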
Memory tip: Count your features in two passes, one for unigrams and one for bigrams, deduplicating across documents in each pass, then add the two counts together. The matrix shape is always (# documents) × (# unique features), rows first.