Friday, March 6, 2026

From the Field to the Cloud: How I Built SeedOps Savant for Corteva Agriscience on Azure AI Foundry

The Podcast Moment That Started It All

I recently had the privilege of joining Matthew Calder and Charles Maxson on the Microsoft Dev Radio podcast — one of the most exciting conversations I've had about enterprise AI in agriculture. We dug deep into a solution I architected called SeedOps Savant, an Azure AI Foundry–powered platform built for Corteva Agriscience, one of the world's leading agricultural science companies. If you haven't watched it yet, check out the full live stream here and come back — this post gives you the full behind-the-scenes story.


What Is SeedOps Savant?

SeedOps Savant is an enterprise-grade AI solution designed to bring intelligent, conversational access to seed operations data at Corteva. In an industry where seed production decisions can hinge on real-time field intelligence, agronomic research, and supply chain data, having a simple chat interface that synthesizes all of that context into an actionable answer is a game changer.

The name says it all: Seed Operations meet Savant — a system smart enough to serve agronomists, sales reps, and operational teams with fast, precise, grounded answers without digging through endless reports, spreadsheets, or documents.


Why Azure AI Foundry?

Azure AI Foundry is Microsoft's integrated platform for building, orchestrating, and managing enterprise AI solutions — from model selection and fine-tuning to deployment and observability. For SeedOps Savant, it was the natural choice for several reasons:

  • End-to-end AI lifecycle management — I could go from model selection to deployment without stitching together disparate services

  • Enterprise security & governance — Corteva's data required strict access controls and data residency compliance that Azure AI Foundry handles natively

  • Seamless integration with the Azure ecosystem — connecting Azure AI Search for RAG, Azure OpenAI for generation, and Databricks for the data lakehouse was straightforward

  • Observability and monitoring — production-grade telemetry came built-in, critical for an enterprise rollout at scale

Agriculture is increasingly a data-intensive industry, and Corteva is no exception — the company has built AI systems that process millions of data points across seeds, soil, weather, and genetics. SeedOps Savant needed to sit on top of that complexity and make it accessible.


The Architecture: RAG Meets AgriData

At its core, SeedOps Savant is a Retrieval-Augmented Generation (RAG) solution. Here's how the key layers fit together:

1. Data Foundation
Seed operations data — production schedules, agronomic research, product specifications, field trial outcomes — lives across multiple systems. We unified it on the Azure platform, bringing unstructured data together in a form that can be indexed and retrieved with vector search.

2. Intelligent Indexing with Azure AI Search
The heart of any RAG solution is its index. Azure AI Search provides hybrid search (keyword + vector), semantic ranking, and the ability to incorporate Corteva's proprietary data in a secure, governed way. This means that when a user asks a question, the retrieval step pulls back the most relevant context — not just keyword-matched documents.
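Hybrid retrieval is easy to picture with a toy example. Azure AI Search fuses its keyword and vector rankings with Reciprocal Rank Fusion (RRF); the sketch below is not Azure code, just an illustration of why a document that ranks well on either signal surfaces near the top of the fused list (the document IDs are made up):

```python
# Toy Reciprocal Rank Fusion (RRF): each ranking contributes 1/(k + rank)
# per document, so documents favored by either signal rise in the fused list.
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["trial-2024-07", "spec-B12", "yield-memo"]      # BM25-style ranking
vector_hits  = ["yield-memo", "trial-2024-07", "agronomy-faq"]  # embedding ranking
print(rrf([keyword_hits, vector_hits]))
# "trial-2024-07" wins the fusion: it ranks highly in both lists
```

The constant k (60 is the value commonly cited for RRF) damps the influence of any single top rank, which is exactly why hybrid search is robust to one signal being noisy for a given query.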

3. Generative Answers via Azure OpenAI
Once the right context is retrieved, Azure OpenAI generates a grounded, human-readable response. The key here is grounded — SeedOps Savant doesn't hallucinate answers from its training data. Every response is anchored to Corteva's actual operational data.

4. Orchestration via Azure AI Foundry
Azure AI Foundry ties the prompt flow, model routing, and agent logic together, allowing the solution to handle complex multi-step queries — the kind that seed ops teams actually ask in the real world.


The Real-World Impact

The agricultural AI space is exploding. SeedOps Savant brings that momentum to Corteva's seed operations specifically — giving the teams closest to production decisions fast access to enterprise knowledge.

For sales reps and agronomists in the field, having a system that can synthesize research, product data, and operational context into a single conversational interface isn't just a convenience — it's a competitive differentiator.


Lessons Learned: Building Enterprise AI in AgriTech

A few key takeaways from building SeedOps Savant that I shared on the podcast:

  • Data quality is the foundation — no matter how powerful your LLM or search index, garbage in means garbage out. Invest early in data curation and governance.

  • Domain specificity matters — generic AI models need to be grounded in domain-specific data to be genuinely useful to agronomists and seed ops professionals.

  • Security and access control aren't optional — in enterprise agriculture, data is highly proprietary. Azure AI Foundry's built-in governance and role-based access made it possible to deploy with confidence.

  • Start with the user's workflow — the most impactful RAG solutions I've built are designed around how people actually work, not how the technology wants them to work.

  • Hybrid search wins — pure vector search is not enough for enterprise RAG. Combining semantic vector search with keyword search and re-ranking delivers meaningfully better results for domain-specific queries.


Watch the Full Podcast

If you want to hear me walk through the full story — the architecture decisions, the challenges of enterprise-scale RAG, and what's next for AI in agricultural operations — watch the Microsoft Dev Radio episode live below:

🎥 Watch on YouTube →


What's Next?

SeedOps Savant is one chapter in a much larger story about how Azure AI Foundry is enabling enterprise-grade AI solutions across industries that were previously underserved by technology. I'm actively documenting patterns, architectures, and implementation strategies like this in my upcoming book on Enterprise RAG with Azure technologies.

If you're building something similar — whether in agriculture, manufacturing, or any data-intensive enterprise — I'd love to connect. Drop a comment below or reach out directly.


Mehul Bhuva is a Senior Enterprise Architect, Microsoft Azure Developer Influencer, and author of an upcoming book on Enterprise RAG. He writes at sharepointfix.com.


Tuesday, December 16, 2025

Databricks Delta Sharing (D2O) with Open Delta Sharing – A Practical, Step‑by‑Step Guide for Data Engineers

Data products only create value when they can be shared and consumed easily, securely, and at scale. Delta Sharing was designed exactly for that: an open, cross-platform protocol that lets you share live data from your Databricks lakehouse with any downstream tool or platform over HTTPS, without copies or custom integrations.

In this blog post, I walk through Databricks‑to‑Open (D2O) Delta Sharing using Open Delta Sharing in a practical, step‑by‑step way. The focus is on helping data teams move from theory to a concrete implementation pattern that works in real projects.

What the article covers:

  • How Delta Sharing fits into a modern data collaboration strategy and when to choose Open Sharing (D2O) over Databricks‑to‑Databricks (D2D).
  • The core workflow: creating recipients, configuring authentication (bearer token or federated/OIDC), defining shares in Unity Catalog, and granting access to tables and views.
  • How external consumers can connect using open connectors (Python/pandas, Apache Spark, Power BI, Tableau, Excel and others) without needing a Databricks workspace.
  • Security, governance, and operational considerations such as token TTL, auditing activity, and avoiding data duplication by sharing from your existing Delta Lake and Parquet data.

Whether you are building a data‑as‑a‑service offering, exposing governed data products to partners, or just trying to simplify ad‑hoc external access, D2O can significantly reduce friction and integration work.
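To make the consumer side concrete, here is a minimal sketch of reading a D2O share from plain Python with the open-source delta-sharing client, with no Databricks workspace involved. The profile file name and the sales_share.gold.orders table are hypothetical placeholders:

```python
# Minimal D2O consumer sketch using the open `delta-sharing` package
# (pip install delta-sharing). Share/schema/table names are placeholders.

def shared_table_url(profile_path: str, share: str, schema: str, table: str) -> str:
    # Delta Sharing addresses a table as <profile>#<share>.<schema>.<table>
    return f"{profile_path}#{share}.{schema}.{table}"

def load_shared_table(profile_path: str, share: str, schema: str, table: str):
    # Needs the credential file the provider's activation link gives you
    import delta_sharing
    return delta_sharing.load_as_pandas(
        shared_table_url(profile_path, share, schema, table))

print(shared_table_url("config.share", "sales_share", "gold", "orders"))
# df = load_shared_table("config.share", "sales_share", "gold", "orders")
```

The same credential file also works with the Spark, Power BI, and Tableau connectors, which is the point of the open protocol: one governed share, many consumers.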

Here is a step-by-step guide to Databricks Delta Sharing using Open Delta Sharing (D2O).

1. Create Recipient

2. Create Delta Share and assign Recipients


You can configure either OIDC federation authentication or token-based authentication for your recipients.

Note that tables with row-level security (RLS) or column masks cannot be shared using Delta Sharing.



Select the recipient you created earlier.




Thursday, December 11, 2025

Databricks Training Notes - Compute

All-purpose compute - R/W/X - more expensive

Serverless version of all purpose compute

All purpose is also known as Classic Compute.

Classic Compute - VMs, Databricks Consumption DBU/hr.

Job Compute - R/X - Cheaper

Serverless version of Job Compute

You can't run Scala/R on Serverless compute.

Serverless DBU cost is higher because the VM cost is built into the DBU rate.

RDD - Resilient Distributed Dataset

If a worker dies, Spark can recreate the lost data partitions and keep running; RDDs keep extra RAM available for this.

Vector Search - Word embeddings. Array of floats. Specialized engine to build index of those numbers.
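A tiny, self-contained illustration of that idea: embeddings are just arrays of floats, and retrieval boils down to comparing their directions (the numbers below are made up for illustration):

```python
import math

def cosine(a, b):
    # Cosine similarity: how closely two embedding vectors point the same way
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

seed  = [0.9, 0.1, 0.3]   # toy embedding for "seed"
grain = [0.8, 0.2, 0.4]   # toy embedding for "grain"
car   = [0.1, 0.9, 0.2]   # toy embedding for "car"

# A vector index would rank "grain" above "car" for the query "seed"
print(cosine(seed, grain) > cosine(seed, car))  # True
```

A real vector search engine builds a specialized index (e.g. HNSW) over millions of such vectors so it can find the nearest ones without scanning them all.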

Pools - a pool of pre-warmed VMs that you keep paying for to cut cluster startup time (classic compute scenario). With serverless, pools have largely gone away.

Serverless Compute - standard (cheaper) version

Serverless Compute - performance-optimized version - clusters start in about 5 seconds

Cluster - Drivers and Worker Nodes. Single node cluster - driver is the worker. SkLearn, Pandas consume driver memory.

Use Job or Serverless clusters in production. Avoid interactive clusters in prod. Enable Photon for faster and cheaper execution. Reuse clusters to reduce startup time and cost.

Serverless - Photon engine, Predictive IO, Intelligent Workload Management

Pro - Photon, Predictive IO

Classic - Photon engine

Performance considerations: SKEW/SPILL/STORAGE/SHUFFLE/SERIALIZATION 

Adaptive Query Execution helps code optimization

Row Filter:

-- Row filter UDF: admins see every row; everyone else only device_id < 30
CREATE OR REPLACE FUNCTION device_filter(device_id INT)
  RETURN IF(IS_ACCOUNT_GROUP_MEMBER('admin'), true, device_id < 30);

-- Attach the filter to the table
ALTER TABLE silver
SET ROW FILTER device_filter ON (device_id);

-- Queries now return only the rows the current user is allowed to see
SELECT *
FROM silver
ORDER BY device_id DESC;


-- Read raw Parquet files and convert the microsecond epoch timestamp to a date
SELECT
  *,
  cast(from_unixtime(user_first_touch_timestamp/1000000) AS DATE) AS first_touch_date
FROM read_files(
  "/Volumes/dbacademy_ecommerce/v01/raw/users-historical",
  format => 'parquet')
LIMIT 10;

Thursday, April 17, 2025

Setting Up AI Foundry with ChatGPT and RAG-Based Chat: A Comprehensive Guide

Introduction

In the rapidly evolving landscape of artificial intelligence, setting up an efficient and scalable AI system is crucial for businesses looking to leverage the power of AI. This blog post will guide you through the process of setting up AI Foundry using the ChatGPT model and implementing a Retrieval-Augmented Generation (RAG) based chat approach.

What is AI Foundry?

AI Foundry is a comprehensive platform provided by Azure that allows you to design, customize, and manage AI applications at scale. It offers a unified SDK, access to over 200 Azure services, and more than 1,800 models, making it a powerful tool for building AI-driven applications.

Understanding ChatGPT

ChatGPT, developed by OpenAI, is a conversational AI model that interacts in a dialogue format. It can answer follow-up questions, admit mistakes, and reject inappropriate requests. This model is trained using Reinforcement Learning from Human Feedback (RLHF), making it highly effective for generating coherent and contextually relevant responses.

What is RAG-Based Chat?

Retrieval-Augmented Generation (RAG) is an architecture that enhances the capabilities of a Large Language Model (LLM) like ChatGPT by integrating an information retrieval system. This system provides grounding data, ensuring that the AI's responses are accurate and relevant. RAG is particularly useful for enterprise solutions, as it allows the AI to access and utilize proprietary content.

Step-by-Step Guide to Setting Up AI Foundry with ChatGPT and RAG

  1. Prerequisites

    • Azure account with access to AI Foundry.
    • OpenAI API key for ChatGPT.
    • Basic understanding of Python and Azure services.
  2. Setting Up AI Foundry

    • Sign In: Log into your Azure account and navigate to AI Foundry.
    • Create a New Project: Start a new project and select the necessary services and models.
    • Configure SDK: Install the AI Foundry SDK and set up your development environment.
    pip install azure-ai-projects  # the Azure AI Foundry SDK package
     
  3. Integrating ChatGPT

    • API Access: Obtain your OpenAI API key and integrate it into your project.
    • Model Configuration: Configure the ChatGPT model within AI Foundry.
    from openai import OpenAI

    client = OpenAI(api_key="your-api-key")  # current openai>=1.0 client API
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "How can I set up AI Foundry?"}
        ]
    )
    print(response.choices[0].message.content)

  4. Implementing RAG-Based Chat

    • Data Retrieval System: Set up Azure AI Search to index and retrieve relevant data.
    • Integration with ChatGPT: Combine the retrieval system with ChatGPT to enhance response accuracy.

      from openai import OpenAI
      from azure.core.credentials import AzureKeyCredential
      from azure.search.documents import SearchClient

      # SearchClient needs the index name as well as the endpoint and key
      search_client = SearchClient(
          endpoint="your-search-endpoint",
          index_name="your-index-name",
          credential=AzureKeyCredential("your-key"))
      openai_client = OpenAI(api_key="your-api-key")

      def retrieve_data(query):
          # Flatten the top matches into one context string
          # (assumes the index documents have a "content" field)
          results = search_client.search(query, top=5)
          return "\n".join(doc["content"] for doc in results)

      def generate_response(query):
          context = retrieve_data(query)
          response = openai_client.chat.completions.create(
              model="gpt-4",
              messages=[
                  {"role": "system", "content": "You are a helpful assistant. "
                      "Answer using only the provided context.\n\n" + context},
                  {"role": "user", "content": query}
              ]
          )
          return response.choices[0].message.content

      print(generate_response("Tell me about AI Foundry"))

  5. Testing and Deployment

    • Evaluation: Test the system using ground truth data to ensure coherence and relevance.
    • Deployment: Deploy your AI application using Azure's scalable infrastructure.

Conclusion

Setting up AI Foundry with ChatGPT and implementing a RAG-based chat approach can significantly enhance the capabilities of your AI applications. By following this guide, you can create a robust and scalable AI system that leverages the latest advancements in AI technology.

Saturday, January 11, 2025

How I Successfully Passed the DP-203 Azure Data Engineer Associate Certification

The journey to obtaining the DP-203 Azure Data Engineer Associate certification is both challenging and rewarding. I'd like to share my experience and preparation strategy that led me to success.

Understanding the Certification

Before diving into the study material, it's essential to understand the certification itself. The DP-203 certification focuses on implementing and designing data solutions on Microsoft Azure, including Azure Synapse Analytics, Azure Data Lake Gen2, Azure Data Factory, Azure Databricks, Azure Stream Analytics and Azure Event Hubs among others. It's crucial to be familiar with the core services and tools offered by Azure to effectively prepare for the exam.

My Preparation Material

  1. Microsoft Learning: Microsoft Learning offers a range of resources, including free online modules, videos, and documentation that cover the exam's topics comprehensively. I found these resources extremely helpful to build a strong foundation and understand the concepts deeply.

  2. Practice Exams: Taking practice exams was one of the most crucial parts of my preparation. They not only helped me assess my knowledge but also familiarized me with the exam's format and types of questions. It’s a great way to identify areas that need improvement and get comfortable with the timing.

  3. Pluralsight Course: I enrolled in the Pluralsight course "DP-203: Processing Data in Azure Using Batch Solutions" (https://www.pluralsight.com/cloud-guru/courses/dp-203-processing-in-azure-using-batch-solutions). This course provided in-depth coverage of the exam's topics, with practical examples and hands-on labs. The interactive approach and clear explanations made complex concepts easier to understand.

Study Plan and Tips

Creating a study plan and sticking to it is essential. Here's what worked for me:

  • Set Clear Goals: Break down the topics and set weekly goals to cover specific modules or sections. This approach makes the vast syllabus more manageable.

  • Practice Regularly: Consistently work on practice questions and labs to reinforce your learning. Practical application is key to mastering the material.

  • Join Study Groups: Engage with study groups or online forums to discuss challenging topics, share resources, and gain different perspectives. The Azure community is very supportive and can provide valuable insights.

  • Review and Revise: Regularly review the topics you’ve covered to retain the information. Summarizing what you've learned in your own words can be an effective revision strategy.

Exam Day

On the day of the exam, make sure to:

  • Get a good night's sleep before the exam day.

  • Have all necessary identification and materials ready.

  • Stay calm and focused during the exam.

Final Thoughts

Passing the DP-203 Azure Data Engineer Associate certification requires dedication, consistency, and the right resources. By leveraging Microsoft Learning, practice exams, and the Pluralsight course, I was able to build a solid understanding and practical skills that helped me succeed. Remember, it's not just about passing the exam but gaining valuable knowledge that will benefit your career in data engineering.

Good luck to everyone on their certification journey! 🚀