Databricks
GENERATIVE-AI-ENGINEER-ASSOCIATE · Question #91
GENERATIVE-AI-ENGINEER-ASSOCIATE Question #91: Real Exam Question with Answer & Explanation
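The correct answer is B: Flatten the dataframe to one chunk per row, create a unique identifier for each row, and save to a ... A Vector Search index treats each row as one retrievable item and needs a primary key to identify it, so every chunk must sit in its own row with its own identifier.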
Data Ingestion and Preparation for Vector Search
Question
A Generative AI Engineer has successfully ingested unstructured documents and chunked them by document section. They would like to store the chunks in a Vector Search index. The dataframe currently has two columns: (i) the original document file name and (ii) an array of text chunks for each document. What is the most performant way to store this dataframe?
Options
- A. Split the data into train and test set, create a unique identifier for each document, then save to a
- B. Flatten the dataframe to one chunk per row, create a unique identifier for each row, and save to a
- C. First create a unique identifier for each document, then save to a Delta table
- D. Store each chunk as an independent JSON file in Unity Catalog Volume. For each JSON file, the
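The flattening step named in option B can be sketched in plain Python. This is a minimal illustration, not the exam's reference solution: the column names `file_name`, `chunks`, `chunk_id`, and `chunk_text`, the helper `flatten_chunks`, and the hash-based ID scheme are all assumptions made for the example. On a Spark dataframe the same reshaping is usually done with `pyspark.sql.functions.explode` (or `posexplode` when the chunk position is needed).

```python
import hashlib
from typing import Dict, List


def flatten_chunks(docs: List[Dict]) -> List[Dict]:
    """Turn one row per document into one row per chunk.

    Each input row is assumed to look like
    {"file_name": str, "chunks": [str, ...]} (hypothetical column names).
    """
    rows = []
    for doc in docs:
        for position, text in enumerate(doc["chunks"]):
            # Assumed ID scheme: hash of file name plus chunk position,
            # so the identifier is unique per row and stable across reruns.
            key = f"{doc['file_name']}:{position}"
            chunk_id = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
            rows.append(
                {
                    "chunk_id": chunk_id,
                    "file_name": doc["file_name"],
                    "chunk_text": text,
                }
            )
    return rows


# Two documents with an array of chunks each (made-up sample data).
docs = [
    {"file_name": "a.pdf", "chunks": ["intro", "methods"]},
    {"file_name": "b.pdf", "chunks": ["summary"]},
]

rows = flatten_chunks(docs)
for row in rows:
    print(row["chunk_id"], row["file_name"], row["chunk_text"])
```

Deriving the identifier from the file name and chunk position, rather than from a random value, keeps IDs stable when the pipeline is rerun; that is a design choice for this sketch, and other schemes work as long as each row gets a distinct key.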
Topics
#Data Preprocessing #Vector Search Indexing #RAG Architecture #Databricks Data Engineering