📢 Disclaimer
This blog post uses Microsoft Fabric features that are currently in preview. These functionalities may have limited capabilities, are not intended for production use, and are not subject to SLAs. To enable preview features in Fabric, your Fabric administrator must activate them in the Admin Portal. For more details, refer to the official documentation.
Introduction
If you’re reading this, chances are you’ve spent some time experimenting with Large Language Models (LLMs) like GPT, Claude, or DeepSeek. You’ve probably also noticed that while these models excel at generating human-like text, they have limitations when it comes to providing accurate, up-to-date information. This is because they generate text based solely on the knowledge they acquired during training.
This leads to two major issues:
- Outdated information – they cannot access new information unless they are retrained.
- Hallucinations and lack of sources – when a model doesn’t know the answer, it often generates a confident but false response, and without cited sources users cannot verify the information.
What is Retrieval-Augmented Generation (RAG)?
This is why Retrieval-Augmented Generation (RAG) emerged: it combines the capabilities of retrieval-based and generation-based models. A retriever searches a knowledge source and retrieves relevant documents or information, which a generator (LLM) then uses to produce more accurate and contextually relevant responses.
The knowledge source can be a large corpus of documents, articles, or policies, which are converted into embeddings and stored in a vector database – a database that stores, indexes, and retrieves high-dimensional vectors.
📖 Definition
Embeddings represent words (or larger pieces of text) as dense vectors in an N-dimensional space, capturing semantic relationships between them. Texts with similar meanings have embeddings that are closer together in this space. The similarity between embeddings is often measured using a distance metric, with cosine similarity being the most commonly used.
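To make this concrete, here is a minimal example of cosine similarity between toy vectors. The numbers are made up purely for illustration; real embedding models produce hundreds or thousands of dimensions.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 means the vectors point in the same direction, 0.0 means they are orthogonal
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" (made-up values for illustration only)
cat = np.array([0.80, 0.10, 0.30, 0.05])
kitten = np.array([0.75, 0.15, 0.35, 0.10])
invoice = np.array([0.05, 0.90, 0.10, 0.70])

print(cosine_similarity(cat, kitten))   # high similarity: related concepts
print(cosine_similarity(cat, invoice))  # low similarity: unrelated concepts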
Here’s how it works (a minimal code sketch follows the list):
- A user asks a question, i.e., provides a query to the system.
- The retriever searches for and retrieves relevant documents from the vector database based on the user’s query.
- The retrieved documents are combined with the user’s query and passed to the generator (LLM).
- The generator then produces an answer based on both the retrieved information and its own knowledge, with the ability to cite sources.
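The sketch below illustrates this flow end to end. It is deliberately simplified: vector_db and llm are placeholders for whatever retriever and LLM client you use, and similarity_search / invoke are assumed method names; we’ll build concrete versions of both parts over this series.
def answer_with_rag(question: str, vector_db, llm, k: int = 4) -> str:
    # 1. Retrieve the k most relevant chunks for the user's question
    docs = vector_db.similarity_search(question, k=k)

    # 2. Combine the retrieved chunks with the question into a single prompt
    context = "\n\n".join(f"[{d.metadata['source']}] {d.page_content}" for d in docs)
    prompt = (
        "Answer the question using only the context below and cite your sources.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # 3. Let the LLM generate a grounded, source-aware answer
    return llm.invoke(prompt)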
What We’ll Build in This Blog Series
Use Case: Sales Enablement
Through this blog series, we’ll build a scalable RAG application on Microsoft Fabric, focusing on a simple use case: empowering sales teams with quick and efficient access to case studies.
Sales teams rely on case studies to demonstrate the value of their products or services to potential clients, but manually searching for them can be time-consuming.
A RAG application solves this by instantly retrieving relevant success stories, helping salespeople engage clients more effectively.
To demonstrate this, I used publicly available Microsoft Case Studies, web scraped with Python (requests, BeautifulSoup4) and converted to PDFs (pdfkit).
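If you’re curious how that scraping step might look, here’s a hedged sketch; the URL and the content selector are placeholders that depend on the pages you scrape, and pdfkit additionally requires the wkhtmltopdf binary to be installed.
import requests
from bs4 import BeautifulSoup
import pdfkit  # also requires the wkhtmltopdf binary on your machine

# Placeholder URL - point this at the case study page you want to capture
url = "https://example.com/case-study"
html = requests.get(url, timeout=30).text

# Keep only the main article content; the right selector depends on the page layout
soup = BeautifulSoup(html, "html.parser")
article = soup.find("main") or soup

# Render the extracted HTML to a PDF file
pdfkit.from_string(str(article), "case_study.pdf")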
Technical overview
- Part 1: We’ll focus on building and populating the backbone of the retriever component of the RAG: the vector database. We’ll walk through the steps of ingesting, processing, and embedding text data, as well as storing the resulting embeddings in a vector database.
- Part 2: We will integrate the retriever with a large language model to complete the generation part of the pipeline.
Architecture - Part 1
Here’s what we need:
- A Microsoft Fabric capacity and workspace, including:
- A lakehouse for data storage
- A SQL Database (preview) for storing embeddings
- A notebook for running the RAG pipeline
- A collection of PDF files containing the case studies
If you’re fortunate enough to be using an 💰F64💰 (or higher) capacity, you can use pre-built AI models in Fabric (Preview). Otherwise you’ll need to deploy the LLM and embeddings model through Azure AI Foundry.
Now, without further ado, open your notebook, and let’s get started!
Install Required Libraries
Before we begin, run the following command to install the required dependencies:
%pip -q install pypdf langchain-community langchain-sqlserver langchain-openai azure-core pydantic typing_extensions azure-identity==1.17.1
Ingestion
For the purpose of this blog, we’ll focus on a simple manual file upload:
- Prepare Your Data: Ensure you have a folder (here data) containing all your PDF files.
- Go to Lakehouse Explorer: Open the Lakehouse Explorer in Microsoft Fabric.
- Upload the Folder: Click on the three dots next to the Files folder and select Upload Folder.
In a production environment, you would typically automate the ingestion process using tools like Data Factory’s pipelines or Copy Job (Preview), while keeping track of previously processed files in a metadata table. But that’s a topic for another blog post.
Processing
Step 1: Load the File Contents
First, we load the file contents into a Spark DataFrame with the following code snippet:
from pyspark.sql.functions import substring_index

# Load the file contents from the specified directory
# Each row will contain the binary content of the file
file_content_df = spark.read.format("binaryFile") \
    .load("Files/data/*") \
    .withColumn("file_name", substring_index("path", "/", -1)) \
    .withColumnRenamed("path", "file_path") \
    .select("file_path", "file_name", "content")
Step 2: Extract Text from PDF
To process the PDF content for use in LLMs, we’ll extract raw text using the pypdf library.
Since we may need to process a large number of files, I suggest using a PySpark UDF (User-Defined Function) in order to parallelize the extraction process.
import io
import pypdf
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# Define UDF to extract text from PDF binary content
def extract_text_from_pdf(content):
    try:
        # Read the PDF content from bytes
        reader = pypdf.PdfReader(io.BytesIO(content))
        # Extract text from all pages (extract_text can return an empty result for image-only pages)
        text = ""
        for page in reader.pages:
            text += (page.extract_text() or "") + "\n"
        return text
    except Exception as e:
        return f"Error extracting text: {e}"

# Register the UDF
extract_text_udf = udf(extract_text_from_pdf, StringType())

# Apply the conversion UDF to the DataFrame
analyzed_df = file_content_df.withColumn("text", extract_text_udf(col("content"))).drop("content")
Step 3: Split Text into Chunks
Next we will split the extracted text into smaller chunks, using SynapseML’s PageSplitter:
from synapse.ml.featurize.text import PageSplitter
from pyspark.sql.functions import posexplode, col, concat

# Initialize PageSplitter to split the extracted text based on min/max page (chunk) length
ps = (
    PageSplitter()
    .setInputCol("text")
    .setOutputCol("chunks")
    .setMinimumPageLength(3000)
    .setMaximumPageLength(4000)
)

# Apply the PageSplitter transformation to the DataFrame that contains the extracted text
splitted_df = ps.transform(analyzed_df)

# Each "chunks" column contains the chunks for a single document in an array
# The posexplode function will separate each chunk into its own row
exploded_df = splitted_df.select("file_path", "file_name", posexplode(col("chunks")).alias("chunk_index", "chunk"))

# Add a unique identifier for each chunk
exploded_df = exploded_df.withColumn("chunk_unique_id", concat("file_name", "chunk_index")).drop("chunk_index")
💡 Tip
Feel free to adjust the setMaximumPageLength and setMinimumPageLength parameters to suit your needs. Smaller chunks improve precision but risk losing context, while larger chunks enhance recall but may introduce irrelevant information. It’s all about finding the right balance for your use case.
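If you want to sanity-check your settings, one option is to look at how many chunks each file produced and how long they are. The snippet below is an optional check that assumes the exploded_df DataFrame from the previous step.
from pyspark.sql.functions import length

# Number of chunks per source file
exploded_df.groupBy("file_name").count().show()

# Distribution of chunk lengths (in characters)
exploded_df.select(length("chunk").alias("chunk_length")).summary("min", "mean", "max").show()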
At this point, we have successfully extracted the text from the files, split it into smaller chunks, and added unique identifiers to each chunk. These chunks can now be embedded, i.e., converted into vectors using an embedding model.
Embedding
Step 1: Initialize the Vector Store
In this step, we will use the Fabric SQL database at our disposal. Before diving in, ensure you’ve created a service principal with the necessary permissions to create a table in your chosen schema.
To interact with our vector store, we’ll leverage the SQLServer_VectorStore class from the langchain_sqlserver package. This class makes it incredibly easy to store, retrieve, and search for embeddings in a SQL Server database, abstracting away all the complexity so we don’t have to handle it ourselves.
It requires an embedding model, and we will be using Azure OpenAI’s text-embedding-ada-002.
from langchain_sqlserver import SQLServer_VectorStore
from langchain_community.vectorstores.utils import DistanceStrategy
from langchain_core.documents import Document
from langchain_openai import AzureOpenAIEmbeddings

# Initialize the Azure OpenAI embeddings service
embeddings = AzureOpenAIEmbeddings(
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    azure_deployment="text-embedding-ada-002",
    openai_api_version="2023-05-15",
    openai_api_key=AZURE_OPENAI_API_KEY
)

# Build the connection string
connection_string = (
    f"Driver=ODBC Driver 18 for SQL Server;"
    f"Server={SERVER_NAME};"
    f"Database={DB_NAME};"
    f"Uid={SP_APP_ID};"
    f"Pwd={SP_KEY};"
    "Authentication=ActiveDirectoryServicePrincipal;"
    "Encrypt=yes;"
    "TrustServerCertificate=no;"
)

# Initialize the vector store
vector_store = SQLServer_VectorStore(
    connection_string=connection_string,
    distance_strategy=DistanceStrategy.COSINE,
    embedding_function=embeddings,
    embedding_length=1536,  # The length of the embeddings produced by text-embedding-ada-002
    db_schema=EMBEDDINGS_SCHEMA,
    table_name=EMBEDDINGS_TABLE_NAME
)
💡 Tip
I’ve stored all my keys in Azure Key Vault, and they can be easily queried using the built-in package notebookutils. The notebookutils.credentials module provides utilities to get access tokens and manage secrets in an Azure Key Vault.
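For example, retrieving the secrets used in this post could look like the snippet below. The Key Vault URL and secret names are placeholders; replace them with your own.
# notebookutils is built into Fabric notebooks - no installation needed
# Placeholder Key Vault URL and secret names: replace with your own
KEY_VAULT_URL = "https://<your-key-vault>.vault.azure.net/"

AZURE_OPENAI_API_KEY = notebookutils.credentials.getSecret(KEY_VAULT_URL, "azure-openai-api-key")
SP_KEY = notebookutils.credentials.getSecret(KEY_VAULT_URL, "service-principal-secret")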
When the vector store is initialized, a table is automatically created in the specified schema of our SQL database – it’s all handled for us!
Step 2: Store Embeddings in the Vector Database
Now that we have initialized the vector store, the next step is to insert our processed document chunks into the database.
We will iterate over our DataFrame, where each row represents a chunk, and convert them into Document objects with metadata such as the source file name.
from langchain_core.documents import Document

# Gather documents and IDs
documents = []
ids = []

for row in exploded_df.collect():
    documents.append(Document(page_content=row["chunk"], metadata={"source": row["file_name"]}))
    ids.append(row["chunk_unique_id"])

# Add the new document embeddings to the vector store
vector_store.add_documents(documents=documents, ids=ids)
💡 Tip
Adding metadata to the Document object is optional, but it comes in handy when you’re returning search results and want to keep track of where each embedding came from.
Once this step is complete, the vector store will be populated with embeddings, making it ready for similarity search and retrieval 🎉
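As a quick sanity check, you can already run a similarity search against the freshly populated store; the query below is just an example, so use whatever fits your documents.
# Example query - replace it with something relevant to your own case studies
results = vector_store.similarity_search("manufacturing customer that reduced costs with Azure", k=3)

for doc in results:
    print(doc.metadata["source"])
    print(doc.page_content[:200])
    print("---")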
Conclusion
In this blog post, we covered the foundational steps for building a scalable Retrieval-Augmented Generation (RAG) application on Microsoft Fabric. We started by ingesting PDF documents, extracting their text, splitting them into chunks, and embedding those chunks into a vector store powered by Fabric’s SQL database.
In the next part of this series, we will focus on integrating the retrieval step with a large language model, allowing it to generate context-aware answers using the stored embeddings.
Stay tuned for Part 2 🚀
Resources
Here’s a list of resources that were useful in writing this blog post, and which you might find helpful as well:
- Building RAG Application on Microsoft Fabric & Azure Open AI – This workshop was my starting point for the blog post, and I recommend it as it uses slightly different services and approaches.
- Livestream recording: Building a RAG App on Microsoft Fabric – Pamela Fox
- Langchain-SQL-RAG GitHub Repository – Contains samples to help utilize the new Langchain Native Vector Support in Azure SQL Database.