A Medley of Spark Tips and Tricks
After spending countless hours working with Spark, I've compiled some tips and tricks that have helped me improve my productivity and performance. Today, I'll share some of my favorite ones with you.

1. Measuring DataFrame Size in Memory

When working with Spark, knowing how much memory your DataFrame uses is crucial for optimization. This is especially important since compressed formats like Parquet can make memory usage unpredictable: a small file on disk might expand significantly when loaded into memory....
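As a taste of the technique, here is a minimal sketch of one way to measure this: cache the DataFrame, materialize it, and then read Catalyst's size estimate for the in-memory plan. It relies on PySpark's private `_jdf` bridge into the JVM, so treat it as a diagnostic trick rather than a stable API; the Parquet path is a placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-size-check").getOrCreate()

# Placeholder path -- substitute your own dataset.
df = spark.read.parquet("/data/events.parquet")

# Materialize the DataFrame so the cached, in-memory relation exists.
df.cache()
df.count()

# Catalyst's size estimate of the optimized (now in-memory) plan.
# sizeInBytes comes back as a Scala BigInt, hence the toString() round trip.
size_bytes = int(
    df._jdf.queryExecution().optimizedPlan().stats().sizeInBytes().toString()
)
print(f"Estimated size in memory: {size_bytes / (1024 ** 2):.1f} MiB")
```

Comparing this figure with the file size on disk (or with the Storage tab in the Spark UI) makes the decompression effect visible.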
Boosting Spark SQL Performance with Adaptive Query Execution
Adaptive Query Execution (AQE) is a groundbreaking feature introduced in Spark 3.0 that dynamically optimizes query performance at runtime. Using statistics collected during execution, AQE adjusts query plans to the actual data characteristics it encounters, leading to more efficient and faster query processing. In this blog, I will explore the practical applications of AQE, demonstrating its benefits and capabilities. To illustrate these concepts, I will use Microsoft Fabric notebooks running on runtime 1....
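To make the setup concrete, here is a small sketch of the configuration knobs involved. AQE is enabled by default from Spark 3.2 onward, so on recent runtimes these `set` calls mostly make the behavior explicit; the `orders` and `customers` DataFrames are hypothetical stand-ins.

```python
# Enable AQE and its two most visible runtime optimizations:
# coalescing small shuffle partitions and splitting skewed join partitions.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# A shuffle-heavy query gives AQE something to re-plan at runtime.
# `orders` and `customers` are hypothetical DataFrames.
result = (
    orders.join(customers, "customer_id")
          .groupBy("country")
          .count()
)

# The physical plan is wrapped in AdaptiveSparkPlan; isFinalPlan flips to
# true once the query has executed and the plan has been re-optimized.
result.explain()
```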
Mastering chained transformations in Spark
When dealing with complex data transformation logic, the key is to break it down into small, manageable, and testable functional units; this ensures clarity and ease of maintenance throughout your project. The Spark DataFrame API offers a seamless way to manipulate structured data. One particularly handy method within this API is .transform(), which allows for concise chaining of custom transformations, thereby facilitating complex data processing pipelines. In this blog, we'll embark on a journey to understand the bits and pieces of transformation chains using PySpark, starting from simple transformations and gradually delving into more advanced scenarios....
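As a preview of where we're headed, here is a minimal sketch of such a chain. The column names and sample rows are invented for illustration; the pattern itself (plain functions from DataFrame to DataFrame, chained with .transform()) is what the rest of the post builds on.

```python
from pyspark.sql import DataFrame, SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("transform-chain").getOrCreate()

# Each unit is a plain DataFrame -> DataFrame function, testable in isolation.
def with_full_name(df: DataFrame) -> DataFrame:
    return df.withColumn("full_name", F.concat_ws(" ", "first_name", "last_name"))

def adults_only(df: DataFrame) -> DataFrame:
    return df.filter(F.col("age") >= 18)

def with_discount(df: DataFrame, rate: float) -> DataFrame:
    return df.withColumn("discounted_price", F.col("price") * (1 - rate))

# Invented sample data, just to make the chain runnable.
df = spark.createDataFrame(
    [("Ada", "Lovelace", 36, 100.0), ("Tim", "Small", 12, 50.0)],
    ["first_name", "last_name", "age", "price"],
)

# .transform() chains the units without nesting calls or naming intermediates.
result = (
    df.transform(with_full_name)
      .transform(adults_only)
      .transform(lambda d: with_discount(d, rate=0.1))
)
result.show()
```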