
Ingesting Data into Databricks Unity Catalog via Apache Airflow with Daft

Written by Robin | 06 February 2025

Openness at the core of the Data Lakehouse

The Data Lakehouse is the architectural pattern of the future for large-scale enterprise data management, analytics, and anything “AI”. It combines the SQL-based logical and semantic structures of the classic Data Warehouse with the scalable compute and storage technology of the Data Lake. Openness, originally a byproduct of this paradigm shift seen across all major data platform vendors and products, is now promoted as one major structural advantage: choose your own compute engine, BI tool, machine learning or GenAI application and access your data where it resides, instead of moving around large, unwieldy duplicates.

Behind the scenes, Apache Spark is currently the de-facto standard engine for accessing any Data Lakehouse architecture. This means that to work with data residing in a lakehouse catalog in Databricks, Dremio, Trino, etc., you first need access to a Spark cluster, or at least a single-node engine. Queries written in Python are mapped to the underlying Java-based engine and executed against the data, which introduces overhead and cost for the compute resources involved.

Now, native Python libraries like Daft are emerging that interface directly with Data Lakehouse protocols and truly deliver the promised accessibility and interoperability. In this article, we demonstrate how to leverage Apache Airflow worker nodes to access data in Azure Databricks Unity Catalog.

A Modern, Lightweight Approach to Data Ingestion

Data ingestion is a fundamental part of any data platform, yet traditional methods often introduce unnecessary complexity and overhead. A common approach is to rely on Apache Spark for transformations, even when the feature and performance scope of a Spark cluster is disproportionate to the lightweight processing that is actually needed. This drives up compute costs and demands additional specialized knowledge, slowing down implementation and making it less cost effective.

With the rise of Daft, a new open-source, high-performance data engine, organizations now have a lightweight, Spark-free alternative for data ingestion. Combined with Apache Airflow’s scheduling framework, Daft enables efficient, scalable ingestion into a variety of targets, among them Databricks Unity Catalog, without requiring a full Spark cluster for processing.

This blog post explores how engineering teams can leverage Daft with Airflow to ingest data into Databricks Unity Catalog, maximizing efficiency and minimizing infrastructure costs.

Why Use Daft for Data Ingestion?

Many data engineering teams default to using Spark-based ingestion methods in Databricks, even when transformation needs are minimal. This leads to unnecessary resource consumption and high operational complexity. Daft offers a compelling alternative:

  • Lightweight & Spark-Free: Daft enables data transformations without requiring a Spark cluster, reducing infrastructure overhead.
  • Seamless Airflow Integration: Daft operates natively within Python, making it easy to integrate with Apache Airflow DAGs.
  • Efficient Processing: Daft optimizes I/O and computation, making it ideal for streaming and batch ingestion without the heavyweight processing of Spark.
  • Catalog Integration: Data can be directly ingested into Databricks Unity Catalog, ensuring governance and accessibility.

By using Daft, data teams can reduce costs, streamline operations, and improve the efficiency of data ingestion pipelines.
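
The catalog integration also works in the read direction: tables governed by Unity Catalog can be queried with Daft straight from Python, without a Spark session in between. The following is a minimal sketch only; the endpoint, token, table and column names are placeholders, not values from the tested pipeline.

import daft
from daft.unity_catalog import UnityCatalog

# Connect to the Unity Catalog REST endpoint (placeholder credentials)
unity = UnityCatalog(endpoint="https://adb-1234.azuredatabricks.net", token="...")

# Resolve a table that is registered and access-controlled in Unity Catalog
table = unity.load_table("my_catalog.my_schema.my_table")

# Read the underlying Delta Lake files directly into a Daft DataFrame
df = daft.read_deltalake(table)

# Lazy query: nothing is executed until .show() (or .collect()) is called
df.where(daft.col("amount") > 0).show()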

Implementing Daft-Based Ingestion in Apache Airflow

Using Airflow’s DAG framework, data engineers can orchestrate Daft-based ingestion into Databricks Unity Catalog without relying on Spark for processing. The workflow follows these key steps:

  1. Extract Data: Daft reads raw data from various sources such as S3, Azure Blob Storage, or on-prem databases.
  2. Transform Data (if needed): Lightweight transformations (e.g., filtering, deduplication, schema validation) are performed within Daft (steps 1 and 2 are sketched right after this list).
  3. Load into Unity Catalog: The processed data is written into Databricks Unity Catalog as a Delta Lake table.
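
In the DAG example below, the extract step reads a local CSV file for simplicity. Reading from cloud object storage works the same way once credentials are supplied via Daft’s IOConfig. The following sketch of steps 1 and 2 uses assumed placeholder values (storage account, container, access key, and column names) and is not taken from the tested pipeline.

import daft
from daft.io import AzureConfig, IOConfig

# Credentials for Azure Blob Storage / ADLS Gen2 (placeholders, adapt to your environment)
io_config = IOConfig(
    azure=AzureConfig(
        storage_account="mystorageaccount",
        access_key="...",  # alternatively use a SAS token or service principal
    )
)

# 1. Extract: read raw Parquet files from object storage
df = daft.read_parquet(
    "abfss://raw-data@mystorageaccount.dfs.core.windows.net/events/*.parquet",
    io_config=io_config,
)

# 2. Transform: lightweight filtering, deduplication and type casting within Daft
df = (
    df.where(daft.col("event_type") == "purchase")
    .distinct()
    .with_column("amount", daft.col("amount").cast(daft.DataType.float64()))
)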


A simplified Airflow DAG for Daft-based ingestion using Daft’s DataFrame API might look like this:

from datetime import datetime

import daft
from airflow.decorators import dag, task
from daft.unity_catalog import UnityCatalog

DATABRICKS_URL = "https://adb-1234.azuredatabricks.net"
DATABRICKS_TOKEN = "..."
STORAGE_BASE = "abfss://databricks-metastore@databricksunicat.dfs.core.windows.net/ingest_test"


@dag(
    dag_id="daft_ingestion_pipeline",
    schedule_interval="@daily",
    start_date=datetime(2025, 1, 1),
    catchup=False,
)
def daft_ingestion_pipeline():

    @task
    def ingest_data():
        # Connect to the Databricks Unity Catalog endpoint
        unity = UnityCatalog(endpoint=DATABRICKS_URL, token=DATABRICKS_TOKEN)

        # Load data from a local file and drop rows containing nulls
        df = daft.read_csv("./myfile.csv")
        df = df.drop_nulls()

        # Load (or create) the target table; the storage path currently has to be
        # passed explicitly, see the note below
        delta_table = unity.load_table(
            "nextlytics.nextlytics-demo.daft_test_consumption",
            new_table_storage_path=STORAGE_BASE + "/consumption",
        )

        # Write the DataFrame as a Delta Lake table registered in Unity Catalog
        df.write_deltalake(delta_table, mode="overwrite")

    ingest_data()


daft_ingestion_dag = daft_ingestion_pipeline()

This approach allows ingestion workflows to run efficiently within Airflow, without requiring Databricks clusters to be active throughout the pipeline execution.
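
For production use, the Databricks endpoint and token should not be hardcoded in the DAG file as they are in the simplified example above. One straightforward option is to resolve them from an Airflow connection at task runtime; the sketch below assumes a connection with the id "databricks_unity" that stores the workspace URL as host and a personal access token as password.

from airflow.hooks.base import BaseHook
from daft.unity_catalog import UnityCatalog

def get_unity_catalog() -> UnityCatalog:
    # "databricks_unity" is an assumed connection id, configured in the Airflow UI
    # or via an AIRFLOW_CONN_* environment variable
    conn = BaseHook.get_connection("databricks_unity")
    return UnityCatalog(endpoint=conn.host, token=conn.password)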

One drawback our tests have revealed is that currently, the object storage path for a newly created Delta table must be pre-configured, even when the Unity Catalog schema has a default storage path assigned. We attribute this inconvenience to both Daft and the Unity Catalog Python libraries still being in early development phases. At the rate of improvement we see in this ecosystem, such inconveniences will be ironed out in the near future.

Business Benefits for Engineering Managers

From an engineering leadership perspective, adopting Daft with Airflow for Databricks ingestion unlocks tangible business advantages:

    • Cost Savings: Eliminates unnecessary Spark cluster usage, reducing Databricks compute costs.
    • Scalability & Performance: Lightweight execution allows faster ingestion of large datasets with fewer bottlenecks.
    • Lower operational complexity: Seamless Airflow integration enables centralized scheduling and monitoring, improving data pipeline reliability.
    • Governance & Security: Direct ingestion into Unity Catalog ensures compliance with data lineage, access control, and auditing.
    • Ease of implementation: Because Daft builds on the familiar DataFrame concept popularized by the Pandas library, the actual implementation effort is significantly reduced.

By shifting ingestion workloads to Daft-powered Airflow DAGs, organizations can optimize their data architecture while maintaining Databricks as a powerful analytics platform.

Our Conclusion: Leverage Expert Guidance from NextLytics

Integrating Daft with Airflow for Databricks Unity Catalog ingestion provides a modern, cost-effective alternative to traditional Spark-based ingestion methods. By adopting this approach, engineering teams can reduce infrastructure costs, improve ingestion speed, and simplify pipeline management.

For organizations looking to implement or optimize Databricks and Airflow-based ingestion strategies, NextLytics provides expert guidance in data engineering, workflow orchestration, and scalable cloud architectures. Reach out to our team to explore how we can help streamline your data ingestion pipelines and maximize efficiency in your data platform.