Data Productivity Cloud Pipeline

Author: Matillion
Date Posted: Oct 24, 2024
Last Modified: Oct 24, 2024

Databricks Mosaic AI Vector Search

Issue RAG queries inside Databricks with Data Productivity Cloud pipelines using Mosaic AI Vector Search.

These pipelines demonstrate how to run Retrieval Augmented Generation (RAG) queries inside Databricks, preparing the prompt by using Mosaic AI Vector Search. The demonstration uses Edgar Allan Poe’s famous poem: The Raven.

To use Mosaic AI Vector Search in Databricks, you will need to:

  • Create a Vector Search Endpoint
  • Create reference data inside a Delta table, with Change Data Feed enabled
  • Use the reference data to create a Vector Search Index
  • Query the Index using VECTOR_SEARCH
  • Build an augmented prompt for AI_QUERY, to perform RAG

Creating a Vector Search Endpoint

Vector Search Endpoints are a type of compute resource, so in your Databricks console go to Compute > Vector Search and press Create.

Create Databricks Vector Search Endpoint

One Vector Search Endpoint can handle many Vector Search Indexes, so you may choose to re-use an existing endpoint if you have one already.

Creating reference data inside a Delta table

Run the Data Productivity Cloud pipeline named "1 Mosaic AI Vector Search Setup" to create the two Delta tables used in this demonstration:

  • stg_index_source - containing the reference data for the vector index
  • stg_questions - containing some example questions to ask via RAG
Delta Table with Change Data Feed - Reference Data

Enabling Change Data Feed

To enable Change Data Feed on a Delta table, use the following SQL:

ALTER TABLE `your-catalog`.`your-schema`.`your-table`
SET TBLPROPERTIES (delta.enableChangeDataFeed = true)

This is done automatically inside the 1 Mosaic AI Vector Search Setup pipeline. Note that, for efficiency, data is loaded into the table first, then Change Data Feed is enabled afterwards.
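
The load-then-enable ordering can be sketched in SQL as follows. This is a minimal sketch under stated assumptions, not the pipeline's actual statements: the id column and the sample rows (two lines from the poem) are illustrative, and document_text is assumed to be the text column because it is the name referenced later in this article. Change Data Feed only records changes made after it is enabled, so the initial load is not captured as change records.

```sql
-- Hypothetical layout for the reference table; column names are assumptions.
CREATE TABLE IF NOT EXISTS `your-catalog`.`your-schema`.`stg_index_source` (
  id            BIGINT NOT NULL,  -- unique key, later usable as the index primary key
  document_text STRING            -- the text that will be embedded
);

-- 1. Load the reference data first.
INSERT INTO `your-catalog`.`your-schema`.`stg_index_source` VALUES
  (1, 'Once upon a midnight dreary, while I pondered, weak and weary,'),
  (2, 'Quoth the Raven "Nevermore."');

-- 2. Enable Change Data Feed once the load has finished.
ALTER TABLE `your-catalog`.`your-schema`.`stg_index_source`
SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

-- 3. Confirm that the property is set.
SHOW TBLPROPERTIES `your-catalog`.`your-schema`.`stg_index_source`
  ('delta.enableChangeDataFeed');
```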

Creating a Vector Search Index

Now that the reference data exists, you can create a Vector Search Index using the Vector Search Endpoint created earlier.

In your Databricks console go to Catalog, locate your source table, then follow Create > Vector Search Index.

Create Databricks Vector Search Index

In the settings, it’s easiest to choose “Compute Embeddings” to have Databricks manage them for you. You must then choose an embedding model. The screenshot below uses databricks-gte-large-en, which is provided by Databricks.

Databricks Vector Search Index Parameters

Querying the Index using VECTOR_SEARCH

Databricks offers the VECTOR_SEARCH function as the SQL interface to search a Vector Index. You supply the name of the index, the text to search, and the number of results. For example:

SELECT * FROM VECTOR_SEARCH(
  index => "your-catalog.default.name_of_new_vector_index",
  query => "Your search text here",
  num_results => 3);

You can see this in action in the 3 Vector Search pipeline. The Vector Search component takes each question from the stg_questions table and lateral joins it to a VECTOR_SEARCH function call, so the index is searched once per question.

Matillion Databricks Vector Search
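
Written by hand, the lateral join might look roughly like the following. This is a sketch rather than the SQL the component actually generates: the question column name is an assumption, and the index name is the placeholder from the earlier example.

```sql
SELECT
  q.question,  -- assumed name of the question column in stg_questions
  s.*          -- columns returned from the index for each match
FROM stg_questions AS q,
  LATERAL (
    SELECT *
    FROM VECTOR_SEARCH(
      index => "your-catalog.default.name_of_new_vector_index",
      query => q.question,
      num_results => 3
    )
  ) AS s;
```

Because the VECTOR_SEARCH call refers to q.question, it is evaluated once per row of stg_questions, and each question is paired with up to three matching rows.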

To flatten these “wide” format results into a usable prompt, the next component in the pipeline concatenates the top three search results into a single string like this:

`search_result[0]`.`document_text` || '\n' ||
`search_result[1]`.`document_text` || '\n' ||
`search_result[2]`.`document_text` || '\n'
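
One thing to watch with the || operator: concatenating a NULL yields NULL, so a single missing document_text would blank out the whole prompt. A null-tolerant variant can be sketched with concat_ws, which skips NULL arguments. The table name search_results_wide and the question column below are assumptions for illustration only.

```sql
SELECT
  question,
  concat_ws('\n',
    search_result[0].document_text,
    search_result[1].document_text,
    search_result[2].document_text
  ) AS context
FROM search_results_wide;  -- hypothetical table holding the wide-format results
```

Unlike the expression above, concat_ws places the separator only between values, so the result has no trailing newline.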

The last component in the pipeline is an AI Query that invokes an LLM (Meta-Llama-3.1-70B-Instruct in this case) for every question in the stg_questions table.

Matillion Databricks AI Query with RAG
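
In SQL terms, this step could be approximated with the ai_query function, which takes the name of a model serving endpoint and a request. The sketch below is not the component's generated SQL: the stg_prompts table, its question and context columns, the prompt wording, and the endpoint name are all assumptions, so substitute the serving endpoint available in your own workspace.

```sql
SELECT
  question,
  ai_query(
    'databricks-meta-llama-3-1-70b-instruct',  -- assumed endpoint name
    'Answer the question using only the context provided.\n'
      || 'Context:\n' || context || '\n'
      || 'Question: ' || question
  ) AS answer
FROM stg_prompts;  -- hypothetical table: one row per question with its retrieved context
```

Putting the retrieved context ahead of the question is what turns a plain LLM call into a RAG query: the model is asked to answer from the supplied text rather than from its general knowledge.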

Even in this simple example, the advantages of RAG are obvious. When you run this yourself, you may see slightly different results, but here’s what I found:

| Question | No RAG | LLM with RAG |
|---|---|---|
| What does the raven say | A reference to Edgar Allan Poe’s classic poem “The Raven”! According to the poem, the raven says: “Quoth the Raven, ‘Nevermore.’” In other words, the raven’s ominous and repetitive response to the narrator’s questions is a single word: “Nevermore.” This word drives home the finality of death and the loss of the narrator’s beloved, Lenore. | The raven says: “Nevermore” |
| What kind of shore is it | It seems like you’re asking about the characteristics of a shore, but without more context, it’s a bit challenging to provide a specific answer. Shores can vary greatly depending on their location, the body of water they border (ocean, lake, river), and the geological and climatic «+400 more words» | The shore is described as “Nightly” |

With the extra context provided by the Vector Index Search, the LLM is able to quickly produce concise and accurate answers.


Downloads

Licensed under: Matillion Free Subscription License

Installation Instructions

How to Install a Data Productivity Cloud Pipeline