Example of RAG Application Foundation
What is RAG?
RAG, short for Retrieval-Augmented Generation, is a technique that combines information retrieval with text generation to improve the accuracy and relevance of text produced by large language models (LLMs). An LLM may be unable to provide up-to-date information because of the limitations of its training data.
For example, when I asked GPT about the latest version of MatrixOne, it didn't give an answer.
In addition, these models can sometimes be misleading and produce factually incorrect content. For example, when I asked GPT about the relationship between Lu Xun and Zhou Shuren (who are in fact the same person), it produced confident nonsense.
To solve these problems, we could retrain the LLM, but that is expensive. The main advantage of RAG is that it avoids retraining the model for specific tasks. Its availability and low barrier to entry make it one of the most popular patterns in LLM systems, and many LLM applications are built on it. The core idea of RAG is that, when generating a response, the model relies not only on what it learned during training but also on external, up-to-date, or proprietary sources of information. Users can therefore optimize the model's output by enriching the input with an external knowledge base suited to their actual situation.
RAG's workflow typically consists of the following steps:
- Retrieve: Find and extract the information most relevant to the current query from a large data set or knowledge base.
- Augment: Combine the retrieved information with the LLM's input to enhance the model's performance and the accuracy of its output.
- Generate: Use the LLM to generate new text or responses based on the retrieved information.
The following is a flow chart for Native RAG:
As you can see, the retrieval step plays a crucial role in the RAG architecture, and MatrixOne's vector retrieval capability provides powerful data retrieval support for building RAG applications.
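To make these steps concrete before diving into the full example, here is a minimal, framework-agnostic sketch of the RAG loop in Python. The retrieve and llm callables are hypothetical placeholders; the rest of this article implements them with MatrixOne and Ollama.

# Minimal sketch of the RAG loop; retrieve and llm are hypothetical callables
# that the MatrixOne + Ollama example in this article implements concretely.
def rag_answer(question, retrieve, llm, top_k=3):
    passages = retrieve(question, top_k)   # Retrieve: find the most relevant context
    prompt = f"Using this data: {passages}. Respond to this prompt: {question}"   # Augment
    return llm(prompt)                     # Generate: answer using the extra context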
Role of MatrixOne in RAG
As a hyper-converged database, MatrixOne comes with built-in vector capabilities, which play an important role in RAG applications in the following ways:
- Efficient information retrieval: MatrixOne has vector data types designed specifically for processing and storing high-dimensional vector data. It uses dedicated data structures and indexing strategies, such as vector indexes for KNN queries, to quickly find the data items most similar to a query vector.
- Support for large-scale data processing: MatrixOne can effectively manage and process large-scale vector data, a core requirement of the retrieval component of a RAG system, enabling the system to quickly retrieve the information most relevant to user queries from vast amounts of data.
- Improved generation quality: Through MatrixOne's vector retrieval capabilities, RAG can introduce information from an external knowledge base to produce more accurate, richer, and better-contextualized text, improving the quality of the generated output.
- Security and privacy protection: MatrixOne can also protect data with security measures such as encrypted storage and access control, which is particularly important for RAG applications that handle sensitive data.
- Simplified development process: Using MatrixOne simplifies the development of RAG applications because it provides an efficient mechanism for storing and retrieving vectorized data, reducing the data-management burden on developers.
Based on Ollama, this article combines Llama2 and mxbai-embed-large to quickly build a Native RAG application using MatrixOne's vector capabilities.
Before you start
Relevant knowledge
Ollama: Ollama is an open-source large language model serving tool that lets users easily deploy and use large pre-trained models on their own hardware. Its core function is to package, deploy, and manage large language models (LLMs) locally, in a Docker-like fashion, so that users can get them running quickly. Ollama simplifies the deployment process: with a simple installation and a single command, users can run open-source large language models locally.
Llama2: Llama2 is an open-source large language model for understanding and generating long text, available for both research and commercial use.
Mxbai-embed-large: mxbai-embed-large is an open-source embedding model designed for text embedding and retrieval tasks. It produces embedding vectors with 1024 dimensions.
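The embedding dimension matters because the rag_tab table created later uses a vecf32(1024) column. Assuming the Ollama service is running and the model has been pulled (see the installation steps below), a quick sanity check looks like this:

import ollama

# Sanity check: mxbai-embed-large should return 1024-dimensional vectors,
# matching the vecf32(1024) column used later in this article.
resp = ollama.embeddings(model="mxbai-embed-large", prompt="hello world")
print(len(resp["embedding"]))   # expected output: 1024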
Software Installation
Before you begin, confirm that you have downloaded and installed the following software:
- Verify that you have completed the standalone deployment of MatrixOne.
- Verify that you have installed Python 3.8 (or later). Check that the installation succeeded by viewing the Python version with the following command:
python3 -V
- Verify that you have completed installing the MySQL client.
- Download and install the pymysql tool using the following command:
pip3 install pymysql
- Verify that you have installed Ollama. Check that the installation succeeded by viewing the Ollama version with the following command:
ollama -v
- Download the LLM model llama2 and the embedding model mxbai-embed-large:
ollama pull llama2
ollama pull mxbai-embed-large
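Optionally, before building the app, you can confirm that MatrixOne is reachable with the connection parameters used throughout this article (adjust them to match your deployment) and make sure the db1 database used below exists:

import pymysql

# Optional check: connect to MatrixOne and prepare the database used in this article.
conn = pymysql.connect(host='127.0.0.1', port=6001, user='root', password="111", autocommit=True)
cursor = conn.cursor()
cursor.execute("select version()")
print(cursor.fetchone())                               # MatrixOne server version
cursor.execute("create database if not exists db1")    # database used by rag_example.py
cursor.close()
conn.close()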
Build your app
Create the table
Connect to MatrixOne and create a table called rag_tab to store the text and the corresponding vectors.
create table rag_tab(content text,embedding vecf32(1024));
Vectorize the text and store it in MatrixOne
Create a Python file named rag_example.py, split and vectorize the text using the mxbai-embed-large embedding model, and save the results to MatrixOne's rag_tab table.
import ollama
import pymysql.cursors

# Connect to MatrixOne
conn = pymysql.connect(
    host='127.0.0.1',
    port=6001,
    user='root',
    password="111",
    db='db1',
    autocommit=True
)
cursor = conn.cursor()

# Documents to be embedded and stored
documents = [
    "MatrixOne is a hyper-converged cloud & edge native distributed database with a structure that separates storage, computation, and transactions to form a consolidated HSTAP data engine. This engine enables a single database system to accommodate diverse business loads such as OLTP, OLAP, and stream computing. It also supports deployment and utilization across public, private, and edge clouds, ensuring compatibility with diverse infrastructures.",
    "MatrixOne touts significant features, including real-time HTAP, multi-tenancy, stream computation, extreme scalability, cost-effectiveness, enterprise-grade availability, and extensive MySQL compatibility. MatrixOne unifies tasks traditionally performed by multiple databases into one system by offering a comprehensive ultra-hybrid data solution. This consolidation simplifies development and operations, minimizes data fragmentation, and boosts development agility.",
    "MatrixOne is optimally suited for scenarios requiring real-time data input, large data scales, frequent load fluctuations, and a mix of procedural and analytical business operations. It caters to use cases such as mobile internet apps, IoT data applications, real-time data warehouses, SaaS platforms, and more.",
    "Matrix is a collection of complex or real numbers arranged in a rectangular array.",
    "The latest version of MatrixOne is v24.1.2.4, released on September 23th, 2024.",
    "We are excited to announce MatrixOne v22.0.8.0 release on 2023/6/30."
]

# Generate an embedding for each document and insert it into rag_tab
for i, d in enumerate(documents):
    response = ollama.embeddings(model="mxbai-embed-large", prompt=d)
    embedding = response["embedding"]
    insert_sql = "insert into rag_tab(content,embedding) values (%s, %s)"
    data_to_insert = (d, str(embedding))
    cursor.execute(insert_sql, data_to_insert)
Check the number of rows in the rag_tab table
mysql> select count(*) from rag_tab;
+----------+
| count(*) |
+----------+
|        6 |
+----------+
1 row in set (0.00 sec)
As you can see, the data was successfully stored into the database.
Indexing (optional)
In large-scale, high-dimensional data retrieval, a full scan requires computing the similarity between the query and every vector in the data set for each query, which incurs significant performance overhead and latency. A vector index effectively solves this problem by establishing efficient data structures and algorithms that optimize the search process, improve retrieval performance, reduce computing and storage costs, and enhance the user experience. Therefore, we build an IVF-FLAT vector index on the vector column:
SET GLOBAL experimental_ivf_index = 1; -- turn on vector index
create index idx_rag using ivfflat on rag_tab(embedding) lists=1 op_type "vector_l2_ops";
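If you prefer to keep everything in rag_example.py, the same two statements can also be issued through the pymysql connection opened earlier; the SQL is identical to the statements above.

# Optional: create the IVF-FLAT index from Python using the existing cursor.
cursor.execute("SET GLOBAL experimental_ivf_index = 1")
cursor.execute('create index idx_rag using ivfflat on rag_tab(embedding) lists=1 op_type "vector_l2_ops"')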
Vector retrieval
Once the data is ready, you can search the database for the content most similar to the question we ask. This step relies on the vector retrieval capabilities of MatrixOne, which supports multiple similarity metrics; here we use l2_distance for retrieval and set the number of returned results to 3.
prompt = "What is the latest version of MatrixOne?"

# Embed the question with the same embedding model used for the documents
response = ollama.embeddings(
    prompt=prompt,
    model="mxbai-embed-large"
)
query_embedding = response["embedding"]

# Retrieve the 3 most similar documents by L2 distance
query_sql = "select content from rag_tab order by l2_distance(embedding, %s) asc limit 3"
data_to_query = str(query_embedding)
cursor.execute(query_sql, (data_to_query,))
data = cursor.fetchall()
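Optionally, print the retrieved passages to see exactly what context will be handed to the model:

# Optional: inspect the retrieved passages before passing them to the LLM
for row in data:
    print(row[0][:80], "...")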
Enhanced generation
We combine what was retrieved in the previous step with the LLM to generate an answer.
# Augment the prompt with the retrieved data and generate the answer
output = ollama.generate(
    model="llama2",
    prompt=f"Using this data: {data}. Respond to this prompt: {prompt}"
)
print(output['response'])
The console outputs the relevant answer:
Based on the provided data, the latest version of MatrixOne is v24.1.2.4, which was released on September 23th, 2024.
After enhancement, the model generates the correct answer.
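Putting the pieces together, the retrieval and generation steps above can be wrapped into a single helper so that further questions can be answered against the same rag_tab table. The function name and sample question are only illustrative; the helper reuses the cursor and models already set up in rag_example.py.

# Convenience wrapper combining the retrieval and generation steps above.
def ask(question, top_k=3):
    emb = ollama.embeddings(model="mxbai-embed-large", prompt=question)["embedding"]
    cursor.execute(
        "select content from rag_tab order by l2_distance(embedding, %s) asc limit %s",
        (str(emb), top_k),
    )
    context = cursor.fetchall()
    output = ollama.generate(
        model="llama2",
        prompt=f"Using this data: {context}. Respond to this prompt: {question}"
    )
    return output['response']

print(ask("What features does MatrixOne support?"))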