GENERATIVE-AI-ENGINEER-ASSOCIATE · Question #5
The correct answer is D.
Question
A Generative AI Engineer has written scalable PySpark code to ingest unstructured PDF documents and chunk them in preparation for storing in a Databricks Vector Search index. Currently, the two columns of their dataframe include the original filename as a string and an array of text chunks from that document. What set of steps should the Generative AI Engineer perform to store the chunks in a ready-to-ingest manner for Databricks Vector Search?
Options
- A. Use PySpark's autoloader to apply a UDF across all chunks, formatting them in a JSON structure
- B. Flatten the dataframe to one chunk per row, create a unique identifier for each row, and enable
- C. Utilize the original filename as the unique identifier and save the dataframe as is.
- D. Create a unique identifier for each document, flatten the dataframe to one chunk per row and
Explanation
To prepare the data for Databricks Vector Search ingestion, the correct approach is to:
- Create a unique identifier for each document, for example a combination of the filename and an additional field that distinguishes chunks from the same document, so that every chunk can be addressed individually.
- Flatten the dataframe so that each chunk has its own row, which lets the index store and retrieve individual text chunks rather than whole documents.
- Save the resulting dataframe to a Delta table so that the data is structured and ready for ingestion into the vector search index.
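The steps in this explanation can be sketched as follows. This is a minimal illustration, not the exam's reference solution. The column names `filename`, `chunks`, `chunk_index`, `chunk_text`, and `chunk_id`, the function names, and the `table_name` parameter are all assumptions introduced here, and the identifier scheme (filename plus chunk position) assumes filenames are unique across documents. `flatten_rows` models the logic in plain Python so it can be checked without a cluster; `flatten_and_save` shows one possible PySpark equivalent using `posexplode`.

```python
from typing import Dict, Iterable, List, Tuple


def flatten_rows(rows: Iterable[Tuple[str, List[str]]]) -> List[Dict[str, object]]:
    """Plain-Python model of the flattening step: one output row per chunk."""
    flat = []
    for filename, chunks in rows:
        for index, text in enumerate(chunks):
            flat.append({
                # Hypothetical ID scheme: filename plus the chunk's position.
                "chunk_id": f"{filename}_{index}",
                "filename": filename,
                "chunk_index": index,
                "chunk_text": text,
            })
    return flat


def flatten_and_save(df, table_name: str) -> None:
    """PySpark sketch of the same steps.

    Assumes `df` has columns `filename` (string) and `chunks` (array of
    strings); `table_name` is a placeholder for the output Delta table.
    """
    # Imported lazily so flatten_rows above can run without Spark installed.
    from pyspark.sql import functions as F

    flat = (
        # posexplode emits one row per array element, with its position.
        df.select(
            "filename",
            F.posexplode("chunks").alias("chunk_index", "chunk_text"),
        )
        # Build a per-chunk identifier from the filename and position.
        .withColumn(
            "chunk_id",
            F.concat_ws("_", F.col("filename"), F.col("chunk_index").cast("string")),
        )
    )
    # Persist as a Delta table for the vector search index to read from.
    flat.write.format("delta").mode("overwrite").saveAsTable(table_name)
```

For example, `flatten_rows([("a.pdf", ["x", "y"])])` produces two rows with the identifiers `a.pdf_0` and `a.pdf_1`, which shows why the filename alone (option C) cannot serve as a key once the chunks are flattened.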