📢 Disclaimer
This blog post uses Microsoft Fabric features that are currently in preview. These functionalities may have limited capabilities, are not intended for production use, and are not subject to SLAs. To enable preview features in Fabric, your Fabric administrator must activate them in the Admin Portal. For more details, refer to the official documentation.

Introduction

If you’re reading this, chances are you’ve spent some time experimenting with Large Language Models (LLMs) like GPT, Claude, or DeepSeek. You’ve probably also noticed that while these models excel at generating human-like text, they have limitations when it comes to providing accurate, up-to-date information. This is because they generate text based solely on the knowledge they acquired during training.

This leads to two major issues:

  • Outdated information – Models cannot access new information unless they are retrained.
  • Hallucination and/or lack of sources – When they don’t know the answer, they often generate confident but false responses without citing sources, which is problematic because users cannot verify the information.

What is Retrieval-Augmented Generation (RAG)?

This is why Retrieval-Augmented Generation (RAG) emerged: it combines the capabilities of retrieval-based and generation-based models. A retriever searches a knowledge source and retrieves relevant documents or information, which a generator (an LLM) then uses to produce more accurate and contextually relevant responses.

The knowledge source can be a large corpus of documents, articles, or policies, which are converted into embeddings and stored in a vector database – a database that stores, indexes, and retrieves high-dimensional vectors.

📖 Definition
Embeddings represent words as dense vectors in an N-dimensional space, capturing semantic relationships between words. Words with similar meanings have embeddings that are closer together in this space. The similarity between embeddings is often measured using a distance metric, with cosine similarity being the most commonly used.
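To make this concrete, here’s a tiny illustration of cosine similarity using made-up vectors (real embedding models produce hundreds or thousands of dimensions):

import numpy as np

# Toy 4-dimensional "embeddings" -- purely illustrative values
king = np.array([0.80, 0.65, 0.10, 0.05])
queen = np.array([0.78, 0.70, 0.12, 0.04])
banana = np.array([0.05, 0.10, 0.90, 0.70])

def cosine_similarity(a, b):
    # 1.0 means the vectors point in the same direction, 0.0 means they are orthogonal
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(king, queen))   # high -> semantically close
print(cosine_similarity(king, banana))  # low  -> semantically distant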

Here’s how the RAG workflow works:

  1. A user asks a question, i.e., provides a query to the system.
  2. The retriever searches for and retrieves relevant documents from the vector database based on the user’s query.
  3. The retrieved documents are combined with the user’s query and passed to the generator (LLM).
  4. The generator then produces an answer based on both the retrieved information and its own knowledge, with the ability to cite its sources.
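Put together, these steps look roughly like the minimal sketch below. The retriever and llm objects are placeholders for the components we’ll build throughout this series, not a specific library API:

def answer_with_rag(query, retriever, llm, k=4):
    # Steps 1-2: retrieve the k chunks most relevant to the user's query
    docs = retriever.similarity_search(query, k=k)

    # Step 3: combine the retrieved chunks with the query into a single prompt
    context = "\n\n".join(f"[{d.metadata['source']}] {d.page_content}" for d in docs)
    prompt = (
        "Answer the question using only the context below, and cite your sources.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

    # Step 4: let the LLM generate a grounded answer
    return llm.invoke(prompt)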

What We’ll Build in This Blog Series

Use Case: Sales Enablement

Through this blog series, we’ll build a scalable RAG application on Microsoft Fabric, focusing on a simple use case: empowering sales teams with quick and efficient access to case studies.
Sales teams rely on case studies to demonstrate the value of their products or services to potential clients, but manually searching for them can be time-consuming.
A RAG application solves this by instantly retrieving relevant success stories, helping salespeople engage clients more effectively.

To demonstrate this, I used publicly available Microsoft Case Studies, web scraped with Python (requests, BeautifulSoup4) and converted to PDFs (pdfkit).
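If you’d like to build a similar dataset, the scraping step can look roughly like the sketch below. The URL and the element selector are illustrative placeholders rather than the actual structure of the Microsoft site, and pdfkit requires the wkhtmltopdf binary to be installed:

import requests
from bs4 import BeautifulSoup
import pdfkit

# Illustrative URL -- replace with the case study page you want to capture
url = "https://www.microsoft.com/en-us/customers/story/example-case-study"

response = requests.get(url, timeout=30)
response.raise_for_status()

# Keep only the main article body (inspect the page to find the right element)
soup = BeautifulSoup(response.text, "html.parser")
article_html = str(soup.find("main") or soup.body)

# Render the extracted HTML to a PDF in the local data folder
pdfkit.from_string(article_html, "data/example-case-study.pdf")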

Technical overview

  • Part 1: We’ll focus on building and populating the backbone of the retriever component of the RAG: The Vector Database. We’ll walk through the steps of ingesting, processing, and embedding text data, as well as storing the resulting embeddings in a vector database.

  • Part 2: We will integrate the retriever with a large language model to complete the generation part of the pipeline.

Architecture - Part 1

[Architecture diagram: RAG vector database on Microsoft Fabric]

Here’s what we need:

  • A Microsoft Fabric capacity and workspace, including:
    • A lakehouse for data storage
    • A SQL Database (preview) for storing embeddings
    • A notebook for running the RAG pipeline
  • A collection of PDF files containing the case studies

If you’re fortunate enough to be using an 💰F64💰 (or higher) capacity, you can use the pre-built AI models in Fabric (Preview). Otherwise, you’ll need to deploy the LLM and embedding model through Azure AI Foundry.

Now, without further ado, open your notebook, and let’s get started!

Install Required Libraries

Before we begin, run the following command to install the required dependencies:

%pip -q install pypdf langchain-community langchain-sqlserver langchain-openai azure-core pydantic typing_extensions azure-identity==1.17.1

Ingestion

For the purpose of this blog, we’ll focus on a simple manual file upload:

  1. Prepare Your Data: Ensure you have a folder (here data) containing all your PDF files.
  2. Go to Lakehouse Explorer: Open the Lakehouse Explorer in Microsoft Fabric.
  3. Upload the Folder: Click on the three dots next to the Files folder and select Upload Folder.

In a production environment, you would typically automate the ingestion process using tools like Data Factory’s pipelines or Copy Job (Preview), while keeping track of previously processed files in a metadata table. But that’s a topic for another blog post.

Processing

Step 1: Load the File Contents

First, we load the file contents into a Spark DataFrame with the following code snippet:

from pyspark.sql.functions import substring_index

# Load the file contents from the specified directory
# Each row will contain the binary content of the file
file_content_df = spark.read.format("binaryFile") \
    .load("Files/data/*") \
    .withColumn("file_name", substring_index("path", "/", -1)) \
    .withColumnRenamed("path", "file_path") \
    .select("file_path", "file_name", "content")

Step 2: Extract Text from PDF

To prepare the PDF content for use with LLMs, we’ll extract the raw text using the pypdf library. Since we may need to process a large number of files, I suggest using a PySpark UDF (User-Defined Function) to parallelize the extraction across the cluster.

import io
import pypdf
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# Define UDF to extract text from PDF binary content
def extract_text_from_pdf(content):
    try:
        # Read the PDF content from bytes
        reader = pypdf.PdfReader(io.BytesIO(content))
        # Extract text from all pages
        text = ""
        for page in reader.pages:
            text += page.extract_text() + "\n"
        return text
    except Exception as e:
        return f"Error extracting text: {e}"

# Register the UDF
extract_text_udf = udf(extract_text_from_pdf, StringType())

# Apply the conversion UDF to the DataFrame
analyzed_df = file_content_df.withColumn("text", extract_text_udf(col("content"))).drop("content")

Step 3: Split Text into Chunks

Next, we will split the extracted text into smaller chunks using SynapseML’s PageSplitter:

from synapse.ml.featurize.text import PageSplitter
from pyspark.sql.functions import posexplode, col, concat

# Initialize PageSplitter to split the extracted text based on min/max page (chunk) length.
ps = (
    PageSplitter()
    .setInputCol("text")
    .setOutputCol("chunks")
    .setMinimumPageLength(3000)
    .setMaximumPageLength(4000)
)

# Apply the PageSplitter transformation to the DataFrame that contains the extracted text
splitted_df = ps.transform(analyzed_df)

# Each "chunks" column contains the chunks for a single document in an array
# The posexplode function will separate each chunk into its own row
exploded_df = splitted_df.select("file_path", "file_name", posexplode(col("chunks")).alias("chunk_index", "chunk"))

# Add a unique identifier for each chunk
exploded_df = exploded_df.withColumn("chunk_unique_id", concat("file_name", "chunk_index")).drop("chunk_index")

💡 Tip
Feel free to adjust the setMaximumPageLength and setMinimumPageLength parameters to suit your needs. Smaller chunks improve precision but risk losing context, while larger chunks enhance recall but may introduce irrelevant information. It’s all about finding the right balance for your use case.
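One way to validate your chosen bounds is to look at the distribution of chunk lengths after splitting, for example:

from pyspark.sql.functions import col, length

# Summarize chunk sizes to check they fall within the expected range
exploded_df.select(length(col("chunk")).alias("chunk_length")) \
    .summary("min", "25%", "50%", "75%", "max") \
    .show()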

At this point, we have successfully extracted the text from the files, split it into smaller chunks, and added unique identifiers to each chunk. These chunks can now be embedded, i.e., converted into vectors using an embedding model.

Embedding

Step 1: Initialize the Vector Store

In this step, we will use the Fabric SQL database at our disposal. Before diving in, ensure you’ve created a service principal with the necessary permissions to create a table in your chosen schema.

To interact with our vector store, we’ll leverage the SQLServer_VectorStore class from the langchain_sqlserver package. This class makes it incredibly easy to store, retrieve, and search for embeddings in a SQL Server database, abstracting away all the complexity so we don’t have to handle it ourselves.

It requires an embedding model, and we will be using Azure OpenAI’s text-embedding-ada-002.

from langchain_sqlserver import SQLServer_VectorStore
from langchain_community.vectorstores.utils import DistanceStrategy
from langchain_core.documents import Document
from langchain_openai import AzureOpenAIEmbeddings

# Initialize the Azure OpenAI embeddings service
embeddings = AzureOpenAIEmbeddings(
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    azure_deployment="text-embedding-ada-002", 
    openai_api_version="2023-05-15",
    openai_api_key=AZURE_OPENAI_API_KEY
)

# Build the connection string
connection_string = (
    f"Driver=ODBC Driver 18 for SQL Server;"
    f"Server={SERVER_NAME};"
    f"Database={DB_NAME};"
    f"Uid={SP_APP_ID};"
    f"Pwd={SP_KEY};"  
    "Authentication=ActiveDirectoryServicePrincipal;"
    "Encrypt=yes;"
    "TrustServerCertificate=no;"
)

# Initialize the vector store
vector_store = SQLServer_VectorStore(
    connection_string=connection_string,
    distance_strategy=DistanceStrategy.COSINE,
    embedding_function=embeddings,
    embedding_length=1536, # The length of the embeddings produced by text-embedding-ada-002
    db_schema=EMBEDDINGS_SCHEMA,
    table_name=EMBEDDINGS_TABLE_NAME
)  

💡 Tip
I’ve stored all my keys in Azure Key Vault, and they can be easily queried using the built-in package notebookutils. The notebookutils.credentials module provides utilities to get access tokens and manage secrets in an Azure Key Vault.
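For example, the keys used above can be fetched at runtime instead of being hard-coded. The Key Vault URL and secret names below are placeholders for your own:

# notebookutils is available out of the box in Fabric notebooks
KEY_VAULT_URL = "https://<your-key-vault-name>.vault.azure.net/"

AZURE_OPENAI_API_KEY = notebookutils.credentials.getSecret(KEY_VAULT_URL, "azure-openai-api-key")
SP_KEY = notebookutils.credentials.getSecret(KEY_VAULT_URL, "service-principal-secret")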

When we initialize the vector store, a table is automatically created in the specified schema of our SQL database. It’s all handled for us!

Step 2: Store Embeddings in the Vector Database

Now that we have initialized the vector store, the next step is to insert our processed document chunks into the database.

We will iterate over our DataFrame, where each row represents a chunk, and convert each row into a Document object with metadata such as the source file name.

from langchain_core.documents import Document

# Gather documents and IDs
documents = []
ids = []
for row in exploded_df.collect():
    documents.append(Document(page_content=row["chunk"], metadata={"source": row["file_name"]}))
    ids.append(row["chunk_unique_id"])

# Add the new document embeddings to the vector store.
vector_store.add_documents(documents=documents, ids=ids)

💡 Tip
Adding metadata to the Document object is optional, but it comes in handy when you’re returning search results and want to keep track of where each embedding came from.

Once this step is complete, the vector store will be populated with embeddings, making it ready for similarity search and retrieval 🎉
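As a quick check, you can already run a similarity search against the freshly populated store. The query below is just an illustrative example; use one that matches your own case studies:

# Retrieve the 3 chunks most similar to a sample query
results = vector_store.similarity_search(
    "How did a retail customer improve demand forecasting with Azure?",
    k=3,
)

for doc in results:
    print(doc.metadata["source"])           # which case study the chunk came from
    print(doc.page_content[:200], "...\n")  # a short preview of the chunk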

Conclusion

In this blog post, we covered the foundational steps for building a scalable Retrieval-Augmented Generation (RAG) application on Microsoft Fabric. We started by ingesting PDF documents, extracting their text, splitting them into chunks, and embedding those chunks into a vector store powered by Fabric’s SQL database.

In the next part of this series, we will focus on integrating the retrieval step with a large language model, allowing it to generate context-aware answers using the stored embeddings.

Stay tuned for Part 2 🚀

Resources

Here’s a list of resources that were useful in writing this blog post, and which you might find helpful as well: