
Data Platform Orchestration: Apache Airflow vs Databricks Jobs

In the age of data-driven business, keeping information in sync and providing the analyses required for critical decisions are major concerns of data teams. For business intelligence to work effectively, analysts rely on precise scheduling, implemented by the data engineering team or provided by the business intelligence platform. In practice, the leading open-source orchestration and workflow management platform Apache Airflow often plays a significant role in such infrastructure. But as cloud data platforms based on the Databricks ecosystem and Lakehouse architecture become more and more popular, is a dedicated orchestration service still required at all?

Databricks Jobs is a built-in scheduling service that can be used as an alternative to Airflow or in conjunction with it. Today, we give an overview of the two services' strengths and weaknesses and discuss in which situations one may be favored over the other.

A small introduction to Databricks Jobs

Databricks is a cloud-based data analysis platform that provides all the services needed for modern data management, from storage and highly scalable processing to machine learning application development. Built around the Delta table format for efficient storage of and access to huge amounts of data, Databricks has created a mature ecosystem and is established as one of the most refined platform solutions available today.

Databricks Jobs is a component of the Databricks platform that extends its capabilities with a mechanism for scheduling and orchestrating data processing tasks. These tasks can be of any type supported on the platform, such as notebooks, SQL queries, Python scripts, Apache Spark jobs, and even JAR files. Being part of the Databricks platform allows users to orchestrate and run data engineering, machine learning, and analytics workflows on their Databricks clusters. Databricks Jobs offers a user interface to schedule the frequency of job execution, configure job-related parameters, and specify dependencies between tasks. Finally, job execution can be monitored at varying levels of granularity, giving access to job status as well as the actual log messages.
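Besides the user interface, job definitions can also be created programmatically. As a hedged sketch (the workspace URL, token, notebook paths, and cluster settings are hypothetical placeholders), a minimal two-task job with a nightly schedule submitted to the Databricks Jobs REST API (version 2.1) could look roughly like this:

```python
import requests

# Hypothetical workspace URL and access token - replace with your own values.
HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

# Two notebook tasks with a dependency, scheduled nightly at 02:00.
job_definition = {
    "name": "nightly_ingest_and_transform",
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",
        "timezone_id": "Europe/Berlin",
        "pause_status": "UNPAUSED",
    },
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Shared/ingest_raw_data"},
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
        },
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "ingest"}],
            "notebook_task": {"notebook_path": "/Shared/transform_to_silver"},
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
        },
    ],
}

# Create the job; the response contains the new job_id on success.
response = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_definition,
)
print(response.json())
```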

[Figure: Databricks notebook code and job graph]

A small introduction to Airflow

Airflow was first developed by Airbnb as an internal orchestration platform and later contributed as a fully open-source project to the Apache Software Foundation. Apache Airflow provides a Python code-based interface to schedule, manage, and scale workflows, and a user interface to monitor the status of all workflows and tasks. Due to its code-first nature, Airflow is highly customizable and extensible, allowing users to add their own operators and hooks. As a system of microservices, Apache Airflow can be scaled to suit any number of workflows and can improve resource efficiency and workflow optimization through parallel execution of tasks. Though often used to orchestrate data processing pipelines, Airflow is generally agnostic to the kinds of tasks it schedules and can be used for virtually any purpose where digital workloads need scheduling and orchestration.
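As a minimal sketch of this code-first approach (DAG name, callable, and shell command are illustrative placeholders), a simple workflow with two dependent tasks can be defined in a few lines of Python:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def extract_data():
    """Placeholder for an arbitrary extraction routine."""
    print("extracting data ...")


# A daily workflow with two dependent tasks, defined entirely in Python code.
with DAG(
    dag_id="daily_data_sync",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_data)
    cleanup = BashOperator(task_id="cleanup", bash_command="echo 'cleaning up temp files'")

    extract >> cleanup  # cleanup runs only after extract has finished
```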

[Figure: Airflow DAG code and graph view]

Similarities

As we have seen, processing tasks can be scheduled and orchestrated both by Apache Airflow and by the Databricks Jobs component. Both systems offer mechanisms for scaling processing power - in Airflow in various ways depending on workflow design, in Databricks by scaling the Apache Spark resources used for a job. Both also integrate with a plethora of popular third-party systems, interfacing with different databases, storage systems, and external services used for processing. Finally, monitoring and notification capabilities are available in both environments and cover the usual needs of process owners to keep track of process status.

Differences

While Apache Airflow and Databricks serve the same purposes to some degree, understanding the conceptual and technological differences is key to deciding when to prefer one over the other. First on the list of differences is the fact that Airflow covers a much wider spectrum of tasks and workflow types it can schedule and orchestrate. If you need to orchestrate tasks that are not natively supported by Databricks, e.g. triggering system operations routines such as backups or clean-up jobs, Airflow should be your choice.

If your scheduled tasks are implemented in Databricks but require orchestration with external pre- or post-conditions, Airflow again might be the better-suited option as the leading system. Combining both systems is in fact a viable solution for these kinds of scenarios, enabled by the Apache Airflow Databricks provider package. A variety of Airflow operators are available for communicating with Databricks: existing jobs can be triggered directly with the DatabricksRunNowOperator class, while one-off runs can be submitted with the DatabricksSubmitRunOperator.
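As a minimal sketch of such a combined setup (the Airflow connection ID, Databricks job ID, notebook path, and cluster settings below are hypothetical placeholders), a DAG using both operators might look like this:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import (
    DatabricksRunNowOperator,
    DatabricksSubmitRunOperator,
)

with DAG(
    dag_id="databricks_orchestration",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Trigger an existing Databricks Job by its (hypothetical) job ID.
    run_existing_job = DatabricksRunNowOperator(
        task_id="run_existing_job",
        databricks_conn_id="databricks_default",
        job_id=1234,
    )

    # Submit a one-off notebook run on a freshly provisioned job cluster.
    submit_one_off_run = DatabricksSubmitRunOperator(
        task_id="submit_one_off_run",
        databricks_conn_id="databricks_default",
        new_cluster={
            "spark_version": "13.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 2,
        },
        notebook_task={"notebook_path": "/Shared/ingest_raw_data"},
    )

    run_existing_job >> submit_one_off_run
```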

 

[Figure: Airflow DAG code alongside the Databricks job run UI]

Databricks has a much wider scope than Airflow when it comes to housing the interactive development of data processing and data analysis, including AI and machine learning scenarios. When these processes are your main focus and you are looking for ways to transition from manual execution to routine operations, Databricks Jobs is the obvious choice to manage your workflows. By nature, Databricks Jobs are closer to the data they process than what you would achieve with native Apache Airflow components. Implementing and scheduling data pipelines natively in Databricks may be a perfect fit if the bulk of your data is already part of a Databricks-powered Lakehouse environment or if your jobs mainly ingest data into it. The number of tasks that can be scheduled with Databricks is technically limited to roughly 1,000 different jobs though, so very large-scale scheduling needs might require external scheduling or a mixed approach.




In scenarios where you have a multi-cloud infrastructure, Airflow can orchestrate all the existing workflows across the various domains thanks to a wide array of connector modules. Central management and monitoring can be a strong argument to use such an existing service for scheduling rather than Databricks’ own.

Finally, Databricks Jobs offer native support for continuously running jobs and real-time processing pipelines with the Delta Live Tables feature. Similar behavior could be achieved with high frequency triggering through Airflow but would introduce more latency and communication overhead into the process.
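Similar behavior could only be approximated in Airflow by triggering a workflow at a very short interval. A minimal sketch of such a high-frequency schedule is shown below (DAG name, interval, and ingest script are illustrative placeholders); every run still passes through Airflow's scheduler and executor, which is where the extra latency and communication overhead come from:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Pseudo-"continuous" processing via high-frequency scheduling: a new DAG run
# is created every five minutes, each one paying the scheduler/executor
# round-trip that a truly continuous job avoids.
with DAG(
    dag_id="near_realtime_ingest",
    start_date=datetime(2023, 1, 1),
    schedule_interval=timedelta(minutes=5),
    catchup=False,
    max_active_runs=1,
) as dag:
    ingest_batch = BashOperator(
        task_id="ingest_batch",
        bash_command="python /opt/pipelines/ingest_latest_batch.py",
    )
```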

Comparison factors

In the following, we compare Databricks Jobs and Apache Airflow side by side along a number of factors.

Pricing

Databricks Jobs: The cost of Databricks Jobs depends on several factors:

  • the plan you select (Standard, Premium or Enterprise),
  • the cloud provider you choose (AWS, Azure or Google Cloud), and
  • the region of deployment (US East or West, North or West Europe, etc.).

Databricks Jobs are billed based on the active compute resource usage of job executions, at roughly $0.15 per Databricks Unit* (DBU) - around a fifth of the rate for regular, interactively used compute resources on Databricks. Job execution is thus discounted compared to the interactive development that usually precedes it, but it comes on top of the general operating costs of the Databricks platform you need to run in the first place to use its scheduling subsystem.

* A DBU is an abstract, normalized unit of processing power; 1 DBU roughly corresponds to using a single compute node with 4 CPU cores and 16 GB of memory for 1 hour.

Apache Airflow: The operational cost of Apache Airflow depends on a few factors, the most important being infrastructure: it needs servers or cloud resources to host it, as well as expert knowledge and support from a service provider like NextLytics if these cannot be provided internally.
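To get a feeling for the order of magnitude, a rough back-of-the-envelope estimate based on the approximate figures above might look like this (actual DBU rates and node-to-DBU mappings vary by plan, cloud provider, region, and instance type):

```python
# Rough job cost estimate based on the approximate figures quoted above.
# Treat these numbers as illustrative only.
DBU_RATE_JOBS = 0.15       # USD per DBU for job compute (approximate)
DBUS_PER_NODE_HOUR = 1.0   # ~1 DBU per hour for a 4-core / 16 GB node


def estimate_job_cost(num_nodes: int, runtime_hours: float) -> float:
    """Estimate the cost of a single job run in USD."""
    dbus = num_nodes * runtime_hours * DBUS_PER_NODE_HOUR
    return dbus * DBU_RATE_JOBS


# Example: a nightly job on 3 worker nodes running for 2 hours.
print(f"{estimate_job_cost(3, 2):.2f} USD per run")  # -> 0.90 USD per run
```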

Supported programming languages

Databricks Jobs: Databricks supports many languages, primarily those compatible with the underlying Apache Spark distributed compute engine, among them Python, Scala, R, Java, and SQL. When you create a Databricks Job, you can choose the language that best suits the specific task.

Apache Airflow: Apache Airflow's core programming language is Python, which is used to define and execute workflows. Although Python is Airflow's main language, each task within a workflow can run scripts or commands in other languages, so users can write code in SQL, Java, Go, Rust and more. At the task level, Airflow is language-agnostic: it can run any command in any language that is available in the environment where Airflow runs.
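As a small illustration of this language-agnostic task level (the R script and compiled binary referenced below are hypothetical placeholders), a single Airflow DAG can mix tasks that delegate to entirely different languages and tools:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Sketch of a DAG whose tasks delegate to non-Python tooling.
with DAG(
    dag_id="polyglot_tasks",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    run_r_report = BashOperator(
        task_id="run_r_report",
        bash_command="Rscript /opt/reports/weekly_report.R",
    )
    run_go_exporter = BashOperator(
        task_id="run_go_exporter",
        bash_command="/opt/bin/export-metrics --format csv",
    )

    run_r_report >> run_go_exporter
```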

Target Group

Databricks Jobs: Databricks Jobs is designed for programmers working in the data science field and is a perfect fit for developers leveraging Apache Spark. As part of the unified Databricks platform, it suits companies and organizations that deal with big data analytics and machine learning use cases and are looking for a robust workflow orchestration solution.

Apache Airflow: Apache Airflow caters to data engineers, DevOps professionals, and workflow automation teams. It provides a flexible and extensible platform for orchestrating and scheduling complex workflows across various systems and technologies, making it an ideal fit for teams looking for a reliable open-source solution.

Performance

Databricks Jobs: Databricks offers high-performance query execution thanks to its optimized query engine and distributed computing capabilities, making it suitable for processing large-scale data and complex analytics workloads.

Apache Airflow: Apache Airflow performs reliably when orchestrating and managing workflows, supports parallel execution of tasks, and scales well thanks to its distributed architecture.

Community

Databricks Jobs: Databricks has a strong user community with active forums, knowledge sharing, and resources. The free Databricks Community Edition allows users to explore the platform and get support from the community, while paid plans offer additional support and services. The official Databricks documentation is vast and easily accessible, providing technical references alongside practical examples for many use cases.

Apache Airflow: As a popular, widely adopted open-source project, Apache Airflow boasts a thriving community of users and contributors. Its dedicated documentation site offers comprehensive instructions and valuable information for both beginners and experienced users, and the fast-growing community is easily reached on platforms such as GitHub, Slack, and Stack Overflow, which foster collaboration, provide support, and facilitate knowledge exchange among users.

 

Data Platform Orchestration - Our Conclusion

Apache Airflow and Databricks Jobs both offer robust solutions for orchestrating data workflows, but the unique strengths of each tool lie in different areas. On the one hand, Databricks Jobs is an excellent solution for companies that have already invested in Databricks, since it integrates seamlessly with the platform and provides an easily accessible scheduling mechanism. On the other hand, Apache Airflow's open-source nature and its extensive library of operators make it an all-round choice for orchestrating diverse tasks across various platforms. Ultimately, the choice between Databricks Jobs and Apache Airflow depends on your specific needs and preconditions, and a combination of both may well provide the best possible solution for a specific case.

If you are not sure which option is right for you, we’re happy to discuss and help you find the perfect fit. Simply get in touch with us - we look forward to exchanging ideas with you!

Learn more about Apache Airflow


Georgios

Georgios joined NextLytics in 2021 and works as a data engineer. He mainly focuses on technologies like Docker, Apache Airflow, REST APIs, and Python programming. As a former national swimming champion, he likes to stay active in his free time, doing sports and hiking as well as having barbecues with friends.

Got a question about this blog?
Ask Georgios
