nerdexam
Databricks

GENERATIVE-AI-ENGINEER-ASSOCIATE Question #5: Real Exam Question with Answer & Explanation

The correct answer is D.

Data Preparation for RAG and Vector Search

Question

A Generative AI Engineer has written scalable PySpark code to ingest unstructured PDF documents and chunk them in preparation for storing in a Databricks Vector Search index. Currently, their dataframe has two columns: the original filename as a string and an array of text chunks from that document. What set of steps should the Generative AI Engineer perform to store the chunks in a ready-to-ingest manner for Databricks Vector Search?

Options

  • A. Use PySpark's autoloader to apply a UDF across all chunks, formatting them in a JSON structure
  • B. Flatten the dataframe to one chunk per row, create a unique identifier for each row, and enable
  • C. Utilize the original filename as the unique identifier and save the dataframe as is.
  • D. Create a unique identifier for each document, flatten the dataframe to one chunk per row and

Explanation

To prepare the data for Databricks Vector Search ingestion, the correct approach is to:

  • Create a unique identifier, e.g. the filename combined with an additional field that distinguishes chunks from the same document, so that every row still has its own key after flattening.
  • Flatten the dataframe so that each chunk has its own row, which lets each text chunk be indexed and retrieved individually.
  • Save the resulting dataframe into a Delta table so that the data is structured and ready for ingestion into the vector search index.
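The steps above can be sketched roughly as follows. This is a minimal illustration, not the exam's reference solution. The column names (`filename`, `chunks`, `chunk_id`, `chunk_index`, `chunk_text`), the helper names, and the choice of a SHA-256 hash of the filename plus chunk position as the identifier are all assumptions made for this example. The pure-Python helpers mirror the PySpark logic so that the ID scheme can be checked without a Spark session.

```python
import hashlib


def chunk_id(filename: str, position: int) -> str:
    """Deterministic ID for one chunk: SHA-256 of 'filename::position' (assumed scheme)."""
    return hashlib.sha256(f"{filename}::{position}".encode("utf-8")).hexdigest()


def flatten_rows(rows):
    """Pure-Python reference: (filename, [chunks]) pairs -> one dict per chunk."""
    return [
        {
            "chunk_id": chunk_id(filename, pos),
            "filename": filename,
            "chunk_index": pos,
            "chunk_text": text,
        }
        for filename, chunks in rows
        for pos, text in enumerate(chunks)
    ]


def flatten_chunks(df):
    """PySpark version; expects columns `filename` (string) and `chunks` (array<string>)."""
    # Imported lazily so the helpers above can run without Spark installed.
    from pyspark.sql import functions as F

    return (
        # posexplode emits one row per array element, plus its position in the array.
        df.select("filename", F.posexplode("chunks").alias("chunk_index", "chunk_text"))
        # Hash filename + position so the key is unique per chunk and stable across reruns.
        .withColumn(
            "chunk_id",
            F.sha2(F.concat_ws("::", "filename", F.col("chunk_index").cast("string")), 256),
        )
        .select("chunk_id", "filename", "chunk_index", "chunk_text")
    )


def save_chunks(flat_df, table_name: str) -> None:
    """Write the flattened dataframe to a Delta table (table_name is caller-supplied)."""
    flat_df.write.format("delta").mode("overwrite").saveAsTable(table_name)
```

A hypothetical call would look like `save_chunks(flatten_chunks(df), "main.rag.pdf_chunks")`, where the table name is a placeholder. If the index is created as a Delta Sync index, the source table also needs Change Data Feed enabled, for example with `ALTER TABLE ... SET TBLPROPERTIES (delta.enableChangeDataFeed = true)`.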

Topics

#PySpark Data Transformation  #Vector Search Indexing  #RAG Data Preparation  #Data Modeling
