
Ingesting Data into Databricks Unity Catalog via Apache Airflow with Daft

Openness at the core of the Data Lakehouse

The Data Lakehouse is the emerging architectural pattern for large-scale enterprise data management, analytics and anything "AI". It combines the SQL-based logical and semantic structures of the classic Data Warehouse with the scalable compute and storage technology of the Data Lake. Openness, originally a byproduct of this technological paradigm shift seen across all major data platform vendors and products, is now promoted as a major structural advantage: choose your own compute engine, BI tool, machine learning or GenAI application to access your data where it resides, instead of moving large, unwieldy duplicates around.

Behind the scenes, Apache Spark is currently the de facto standard engine for accessing any Data Lakehouse architecture. This means that to access data residing in a lakehouse catalog in Databricks, Dremio, Trino, etc., you first need access to a Spark cluster, or at least a single-node engine. Queries written in Python are mapped to the underlying Java-based engine and executed against the data, introducing a certain overhead and cost for the compute resources used.
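For contrast, the conventional Spark-based access path looks roughly like the following sketch. It assumes a Databricks cluster or Databricks Connect session with Unity Catalog access; the catalog, schema, table and column names are placeholders, not taken from the example later in this post.

# A minimal sketch of the conventional Spark-based access path; assumes a
# Databricks cluster or Databricks Connect session. Names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # on Databricks this attaches to the cluster session

df = spark.table("my_catalog.my_schema.my_table")  # read a Unity Catalog table
df.filter(df.amount > 0).show()                    # every operation runs on the Spark cluster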

We now see native Python libraries like Daft surfacing that interface with Data Lakehouse protocols and truly enable the promised accessibility and interoperability. Here, we demonstrate how to leverage Apache Airflow worker nodes to access data from Azure Databricks Unity Catalog.

A Modern, Lightweight Approach to Data Ingestion

Data ingestion is a fundamental part of any data platform, yet traditional methods often introduce unnecessary complexity and overhead. A common approach is to rely on Apache Spark for transformation, even when the feature and performance scope of a Spark cluster is disproportionate to the lightweight processing actually needed. This drives up compute costs and demands additional specialized knowledge, slowing implementation and making it less cost-effective.

With the rise of Daft, a new open-source, high-performance data engine, organizations now have a lightweight, Spark-free alternative for data ingestion. Combined with Apache Airflow's scheduling framework, Daft enables efficient, scalable ingestion into a variety of targets, among them Databricks Unity Catalog, without requiring a full Spark cluster for processing.

[Figure: Spark-based vs. Daft-based ingestion into Databricks Unity Catalog]

This blog post explores how engineering teams can leverage Daft with Airflow to ingest data into Databricks Unity Catalog, maximizing efficiency and minimizing infrastructure costs.

Why Use Daft for Data Ingestion?

Many data engineering teams default to using Spark-based ingestion methods in Databricks, even when transformation needs are minimal. This leads to unnecessary resource consumption and high operational complexity. Daft offers a compelling alternative:

  • Lightweight & Spark-Free: Daft enables data transformations without requiring a Spark cluster, reducing infrastructure overhead.
  • Seamless Airflow Integration: Daft operates natively within Python, making it easy to integrate with Apache Airflow DAGs.
  • Efficient Processing: Daft optimizes I/O and computation, making it ideal for streaming and batch ingestion without the heavyweight processing of Spark.
  • Catalog Integration: Data can be directly ingested into Databricks Unity Catalog, ensuring governance and accessibility.

By using Daft, data teams can reduce costs, streamline operations, and improve the efficiency of data ingestion pipelines.
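
To illustrate the lightweight, Spark-free nature of the engine, here is a minimal sketch of single-machine DataFrame processing with Daft. The file path, column name and filter value are purely illustrative assumptions.

# A minimal sketch of Spark-free DataFrame processing with Daft; the file
# path, column name and filter value are illustrative assumptions.
import daft

df = daft.read_parquet("s3://my-bucket/raw/events/*.parquet")  # lazy scan, no cluster needed
df = df.where(daft.col("status") == "completed")               # lightweight filtering
df = df.distinct()                                             # deduplication
df.collect()                                                   # executes locally on the worker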

Implementing Daft-Based Ingestion in Apache Airflow

Using Airflow’s DAG framework, data engineers can orchestrate Daft-based ingestion into Databricks Unity Catalog without relying on Spark for processing. The workflow follows these key steps:

  1. Extract Data: Daft reads raw data from various sources such as S3, Azure Blob Storage, or on-prem databases.
  2. Transform Data (if needed): Lightweight transformations (e.g., filtering, deduplication, schema validation) are performed within Daft.
  3. Load into Unity Catalog: The processed data is written into Databricks Unity Catalog as a Delta Lake table.

A simplified Airflow DAG for Daft-based ingestion using its DataFrame library might look like this:

from datetime import datetime

from airflow.decorators import dag, task
import daft
from daft.unity_catalog import UnityCatalog


DATABRICKS_URL = "https://adb-1234.azuredatabricks.net"
DATABRICKS_TOKEN = "..."
STORAGE_BASE = "abfss://databricks-metastore@databricksunicat.dfs.core.windows.net/ingest_test"


@dag(
    dag_id="daft_ingestion_pipeline",
    schedule_interval="@daily",
    start_date=datetime(2025, 1, 1),
    catchup=False,
)
def daft_ingestion_pipeline():

    @task
    def ingest_data():
        # Connect to the Databricks Unity Catalog endpoint
        unity = UnityCatalog(endpoint=DATABRICKS_URL, token=DATABRICKS_TOKEN)

        # Load raw data from a local CSV file and drop rows with null values
        df = daft.read_csv("./myfile.csv")
        df = df.drop_nulls()

        # Load the target table from Unity Catalog; the storage path currently
        # has to be passed explicitly for newly created tables
        delta_table = unity.load_table(
            "nextlytics.nextlytics-demo.daft_test_consumption",
            new_table_storage_path=STORAGE_BASE + "/consumption",
        )

        # Write the DataFrame to the Delta Lake table registered in the catalog
        df.write_deltalake(delta_table, mode="overwrite")

    ingest_data()


daft_ingestion_dag = daft_ingestion_pipeline()

This approach allows ingestion workflows to run efficiently within Airflow, without requiring Databricks clusters to be active throughout the pipeline execution.

One drawback our tests have revealed is that currently, the object storage path for a newly created Delta table must be pre-configured, even when the Unity Catalog schema has a default storage path assigned. We attribute this inconvenience to both Daft and the Unity Catalog Python libraries still being in early development phases. At the rate of improvement we see in this ecosystem, such inconveniences will be ironed out in the near future.
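
As a quick sanity check after a run, the ingested table can be read back with Daft itself. The following sketch assumes the same endpoint, token placeholder and table name as in the DAG above.

# A sketch for verifying the ingestion by reading the Delta table back through
# Unity Catalog; endpoint, token and table name mirror the DAG example above.
import daft
from daft.unity_catalog import UnityCatalog

unity = UnityCatalog(
    endpoint="https://adb-1234.azuredatabricks.net",
    token="...",  # personal access token, elided as in the example above
)
table = unity.load_table("nextlytics.nextlytics-demo.daft_test_consumption")

df = daft.read_deltalake(table)  # reads the Delta table registered in Unity Catalog
df.show()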

Business Benefits for Engineering Managers

From an engineering leadership perspective, adopting Daft with Airflow for Databricks ingestion unlocks tangible business advantages:

    • Cost Savings: Eliminates unnecessary Spark cluster usage, reducing Databricks compute costs.
    • Scalability & Performance: Lightweight execution allows faster ingestion of large datasets with fewer bottlenecks.
    • Lower operational complexity: Seamless Airflow integration enables centralized scheduling and monitoring, improving data pipeline reliability.
    • Governance & Security: Direct ingestion into Unity Catalog ensures compliance with data lineage, access control, and auditing.
    • Ease of implementation: Because Daft builds on the familiar Python DataFrame concept popularized by the Pandas library, the actual implementation effort is significantly reduced.

By shifting ingestion workloads to Daft-powered Airflow DAGs, organizations can optimize their data architecture while maintaining Databricks as a powerful analytics platform.

Our Conclusion: Leverage Expert Guidance from NextLytics

Integrating Daft with Airflow for Databricks Unity Catalog ingestion provides a modern, cost-effective alternative to traditional Spark-based ingestion methods. By adopting this approach, engineering teams can reduce infrastructure costs, improve ingestion speed, and simplify pipeline management.

For organizations looking to implement or optimize Databricks and Airflow-based ingestion strategies, NextLytics provides expert guidance in data engineering, workflow orchestration, and scalable cloud architectures. Reach out to our team to explore how we can help streamline your data ingestion pipelines and maximize efficiency in your data platform.

Learn more about Apache Airflow

 

Robin

Robin Brandt is a consultant for Machine Learning and Data Engineering. With many years of experience in software and data engineering, he has expertise in automation, data transformation and database management - especially in the area of open source solutions. He spends his free time making music or creating spicy dishes.
