The Data Lakehouse is emerging as the architectural pattern for large-scale enterprise data management, analytics, and anything “AI”. It combines the SQL-based logical and semantic structures of the classic Data Warehouse with the scalable compute and storage technology of the Data Lake. Originally a byproduct of this technological paradigm shift across all major data platform vendors and products, openness is now promoted as a major structural advantage: choose your own compute engine, BI tool, machine learning or GenAI application to access your data where it resides, instead of moving around large, unwieldy duplicates.
Behind the scenes, Apache Spark is currently the de facto standard engine for accessing any Data Lakehouse architecture. In practice, this means that to reach data residing in a lakehouse catalog in Databricks, Dremio, Trino, etc., you first need access to a Spark cluster or at least a single-node engine. Queries written in Python are translated to the underlying Java-based engine and executed against the data, introducing overhead and cost for the compute resources involved.
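For contrast, the conventional Spark-bound route looks roughly like the following sketch. It assumes a Spark session that is already attached to a Unity Catalog metastore (for example on a Databricks cluster or via Spark Connect); the table name is a placeholder:

# Conventional access path: every read is routed through the JVM-based Spark engine.
from pyspark.sql import SparkSession

# Assumes a session pre-configured against the Databricks Unity Catalog metastore
spark = SparkSession.builder.getOrCreate()

# Three-level Unity Catalog name (placeholder): catalog.schema.table
df = spark.table("my_catalog.my_schema.my_table")
df.show()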
We now see native Python libraries like Daft surfacing that interface with Data Lakehouse protocols and truly enable the promised accessibility and interoperability. Here, we demonstrate how to leverage Apache Airflow worker nodes to access data from Azure Databricks Unity Catalog.
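To make this concrete, here is a minimal sketch, based on Daft’s Unity Catalog integration, of reading an existing Unity Catalog table from a plain Python process, exactly the kind of code an Airflow worker can run. Endpoint, token and table name are placeholders:

import daft
from daft.unity_catalog import UnityCatalog

# Placeholder credentials -- in practice these come from a secrets store
unity = UnityCatalog(endpoint="https://<workspace>.azuredatabricks.net", token="<personal-access-token>")

# Resolve the table through Unity Catalog and read the underlying Delta Lake files
table = unity.load_table("my_catalog.my_schema.my_table")
df = daft.read_deltalake(table)
df.show()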
Data ingestion is a fundamental part of any data platform, yet traditional methods often introduce unnecessary complexity and overhead. A common way to tackle these challenges is to rely on Apache Spark for transformation, even when the feature and performance scope of a Spark cluster is disproportionate to the lightweight processing actually needed. The result is higher compute costs and a demand for additional specialized technical knowledge, which slows implementation and makes it less cost-effective.
With the rise of Daft, a new open-source, high-performance data engine, organizations now have a lightweight, Spark-free alternative for data ingestion. Combined with Apache Airflow’s scheduling framework, Daft enables efficient, scalable ingestion into a variety of targets, among them the Databricks Unity Catalog, without requiring a full Spark cluster for processing.
This blog post explores how engineering teams can leverage Daft with Airflow to ingest data into Databricks Unity Catalog, maximizing efficiency and minimizing infrastructure costs.
Many data engineering teams default to Spark-based ingestion methods in Databricks, even when transformation needs are minimal. This leads to unnecessary resource consumption and high operational complexity. Daft offers a compelling alternative: a lightweight, Python-native engine that runs directly on infrastructure you already operate, such as Airflow worker nodes.
By using Daft, data teams can reduce costs, streamline operations, and improve the efficiency of data ingestion pipelines.
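To give a sense of the API, here are a few of the lightweight, lazily evaluated transformations Daft covers out of the box; the column names and the conversion factor are purely illustrative:

import daft

# Read a local CSV lazily -- nothing is executed until a result is requested
df = daft.read_csv("./myfile.csv")

# Filter, project and derive columns with an expression API
df = df.where(daft.col("amount") > 0)
df = df.select("customer_id", "amount")
df = df.with_column("amount_eur", daft.col("amount") * 0.92)

# Trigger execution and print a preview
df.show()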
Using Airflow’s DAG framework, data engineers can orchestrate Daft-based ingestion into Databricks Unity Catalog without relying on Spark for processing. The workflow boils down to three key steps: read the source data with Daft, apply any lightweight transformations in memory, and write the result as a Delta table registered in Unity Catalog.
A simplified Airflow DAG for Daft-based ingestion using its DataFrame library might look like this:
from datetime import datetime

from airflow.decorators import dag, task

import daft
from daft.unity_catalog import UnityCatalog

DATABRICKS_URL = "https://adb-1234.azuredatabricks.net"
DATABRICKS_TOKEN = "..."
STORAGE_BASE = "abfss://databricks-metastore@databricksunicat.dfs.core.windows.net/ingest_test"


@dag(
    dag_id="daft_ingestion_pipeline",
    schedule_interval="@daily",
    start_date=datetime(2025, 1, 1),
    catchup=False,
)
def daft_ingestion_pipeline():

    @task
    def ingest_data():
        # Connect to the Unity Catalog REST endpoint with a personal access token
        unity = UnityCatalog(endpoint=DATABRICKS_URL, token=DATABRICKS_TOKEN)

        # Load data from a local file and apply a lightweight transformation
        df = daft.read_csv("./myfile.csv")
        df = df.drop_nulls()

        # Register the target table; the storage path currently has to be passed explicitly
        delta_table = unity.load_table(
            "nextlytics.nextlytics-demo.daft_test_consumption",
            new_table_storage_path=STORAGE_BASE + "/consumption",
        )

        # Write the DataFrame to the Delta table in Unity Catalog
        df.write_deltalake(delta_table, mode="overwrite")

    ingest_data()


daft_ingestion_dag = daft_ingestion_pipeline()
This approach allows ingestion workflows to run efficiently within Airflow, without requiring Databricks clusters to be active throughout the pipeline execution.
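One practical refinement: connection details should not be hard-coded in the DAG file as they are in the simplified example above. The following sketch pulls them from Airflow’s own configuration store instead, assuming Variables named databricks_url and databricks_token have been created beforehand:

from airflow.decorators import task
from airflow.models import Variable


@task
def ingest_data():
    # Hypothetical Airflow Variables -- create them in the UI, via the CLI, or a secrets backend.
    # Fetching them inside the task avoids a metadata database lookup at every DAG parse.
    databricks_url = Variable.get("databricks_url")
    databricks_token = Variable.get("databricks_token")
    ...  # same Daft logic as above, using these values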
One drawback our tests have revealed is that currently, the object storage path for a newly created Delta table must be pre-configured, even when the Unity Catalog schema has a default storage path assigned. We attribute this inconvenience to both Daft and the Unity Catalog Python libraries still being in early development phases. At the rate of improvement we see in this ecosystem, such inconveniences will be ironed out in the near future.
From an engineering leadership perspective, adopting Daft with Airflow for Databricks ingestion unlocks tangible business advantages: lower infrastructure costs, faster ingestion, and simpler pipeline management.
By shifting ingestion workloads to Daft-powered Airflow DAGs, organizations can optimize their data architecture while maintaining Databricks as a powerful analytics platform.
Integrating Daft with Airflow for Databricks Unity Catalog ingestion provides a modern, cost-effective alternative to traditional Spark-based ingestion methods. By adopting this approach, engineering teams can reduce infrastructure costs, improve ingestion speed, and simplify pipeline management.
For organizations looking to implement or optimize Databricks and Airflow-based ingestion strategies, NextLytics provides expert guidance in data engineering, workflow orchestration, and scalable cloud architectures. Reach out to our team to explore how we can help streamline your data ingestion pipelines and maximize efficiency in your data platform.