
Apache Airflow - Next Level Cron Alternative for ETL Workflows

The business world communicates, thrives and operates on data. Data is the lifeblood that connects tomorrow with today, and it must be kept reliably in motion. This is where modern workflow management lends a helping hand: digital processes are executed, heterogeneous systems are orchestrated, and data processing is automated. In this article, we show you how all of this can be done comfortably with the open-source workflow management platform Apache Airflow. You will find the key functionality, the components, and the most important terms explained for a trouble-free start.

Why implement digital workflow management?

Executing workflows manually, or merely launching them with cron jobs, is no longer state of the art, which is why many companies are looking for a cron alternative. As soon as digital tasks - or entire processes - are to be executed repeatedly and reliably, an automated solution is needed. Beyond the pure execution of work steps, other aspects matter:

  • Troubleshooting
    The true strength of a workflow management platform becomes apparent when unforeseen errors occur. Besides notification and precise localization of errors in the process, automatic documentation is part of it. Ideally, a retry is initiated automatically after a given time window, so that short-lived system availability problems resolve themselves (retries appear in the sketch following this list). Task-specific system logs should be available to the user for quick troubleshooting.

  • Flexibility in workflow design
    The modern challenges of workflow management go beyond hard-coded workflows. To let a workflow adapt dynamically to the current execution interval, for example, the execution context should be accessible through variables at runtime. Concepts such as conditional execution are also increasingly valuable when designing flexible workflows.

  • Monitoring execution times
    A workflow management system is a central point that tracks not only the status but also the execution times of workflows. Execution times can be monitored automatically by means of service level agreements (SLAs). Unexpectedly long runtimes, caused for instance by an unusually large amount of data, are detected and can optionally trigger a notification.
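How do these requirements look in code? The following minimal sketch, assuming Apache Airflow 2.x, sets automatic retries and an SLA via default_args and uses the built-in Jinja variable {{ ds }} to pull the logical execution date into a command at runtime; the DAG name and all values are purely illustrative.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Illustrative defaults: retry twice after a 5-minute wait and record an
# SLA miss if a task runs longer than one hour.
default_args = {
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    "sla": timedelta(hours=1),
}

with DAG(
    dag_id="robust_example",            # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
) as dag:
    # {{ ds }} is rendered at runtime to the logical execution date,
    # so the command adapts automatically to the execution interval.
    extract = BashOperator(
        task_id="extract",
        bash_command="echo 'processing data for {{ ds }}'",
    )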

Out of these challenges, Airflow was developed in 2014 as Airbnb's internal workflow management platform to successfully manage its numerous, complex workflows. Apache Airflow was open source from the beginning and is now available to users free of charge under the Apache License.


Apache Airflow Features

Since Airflow became a top-level project of the Apache Software Foundation in 2019, the contributor community has received a gigantic growth boost. As a result, the feature set has grown considerably over time, with regular releases addressing the current needs of users.

Rich web interface

Compared to other workflow management platforms, the rich web interface is particularly impressive. The status of workflow runs, the resulting runtimes and, of course, the log files are directly accessible via the elegantly designed web interface. Important management functions, such as starting, pausing and deleting a workflow, are available right from the start page without any detours. This makes the platform intuitively usable even without programming knowledge. The interface is best accessed from a desktop, but also works on mobile devices with some loss of comfort.

Command line interface and API

Apache Airflow is not only operated by clicking. For technical users there is a command line interface (CLI) that covers the main functions as well. Through the redesigned, stable REST API, other systems can also access Airflow with secure authentication. This enables a number of new use cases and system integrations.
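As a rough sketch of such an integration - assuming Airflow 2.x with the basic-auth API backend enabled, and with host, credentials and DAG id as placeholders - a DAG run can be triggered from any Python process via the stable REST API (the CLI equivalent is airflow dags trigger):

import requests

AIRFLOW_HOST = "http://localhost:8080"   # placeholder host
DAG_ID = "robust_example"                # placeholder DAG id

# Trigger a new DAG run via the stable REST API of Airflow 2.x.
response = requests.post(
    f"{AIRFLOW_HOST}/api/v1/dags/{DAG_ID}/dagRuns",
    auth=("admin", "admin"),                 # basic-auth backend assumed
    json={"conf": {"triggered_by": "api"}},  # optional run configuration
)
response.raise_for_status()
print(response.json()["dag_run_id"])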

Realization of complex workflows with internal and external dependencies

In Apache Airflow, workflows are defined in Python code. The order of tasks is easy to customize: predecessors, successors and parallel tasks can all be declared. In addition to these internal dependencies, external dependencies can also be implemented. For example, a workflow can be paused until a file appears on a cloud storage or an SQL statement returns a valid result. Advanced features, such as the reuse of workflow parts (TaskGroups, sketched below) and conditional branching, delight even demanding users.
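A minimal sketch of such a dependency structure, again with illustrative names and Airflow 2.x assumed: the >> operator declares predecessors and successors, a Python list creates parallel tasks, and a TaskGroup bundles reusable workflow parts.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.task_group import TaskGroup

with DAG(
    dag_id="dependency_demo",            # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,              # triggered manually
) as dag:
    start = BashOperator(task_id="start", bash_command="echo start")
    end = BashOperator(task_id="end", bash_command="echo end")

    # A TaskGroup bundles related tasks into a reusable unit that is
    # also collapsed into a single node in the web interface.
    with TaskGroup(group_id="transforms") as transforms:
        BashOperator(task_id="transform_a", bash_command="echo A")
        BashOperator(task_id="transform_b", bash_command="echo B")

    # start precedes the group, the two transforms run in parallel,
    # and end waits for the whole group to finish.
    start >> transforms >> end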

Scalability and containerization

Apache Airflow can initially be deployed on a single server and then scale horizontally as the number of tasks grows. Deployment on distributed systems is mature, and different architecture variants (Kubernetes, Celery, Dask) are supported.

Customizability with plug-ins and macros

Many integrations, for example with Apache Hive, the Hadoop Distributed File System (HDFS) and Amazon S3, ship with the default installation. Others can be added through custom task classes. Thanks to its open-source nature, even the core of the application is customizable, and the community provides well-documented plug-ins for most requirements.


Optimize your workflow management
with Apache Airflow 

NextLytics Whitepaper Apache Airflow


Components in Apache Airflow

Airflow's many functions are enabled by the interplay of its components. The architecture can vary depending on the application, making it possible to scale flexibly from a single machine to an entire cluster. The graphic below shows a multi-node architecture with several machines.

Figure: a multi-node architecture of Airflow. Compared to a single-node architecture, the workers are placed on their own nodes.


  • A scheduler, together with the attached executor, takes care of tracking and triggering the stored workflows. While the scheduler keeps track of which task can be executed next, the executor selects a worker and handles the subsequent communication. Since Apache Airflow 2.0, multiple schedulers can be run in parallel, which reduces latency for particularly large numbers of tasks.

  • As soon as a workflow is started, a worker takes over the execution of the stored commands. For special requirements regarding RAM, GPU, etc., workers with specific environments can be selected.

  • The web server allows easy user interaction via a graphical interface. This component runs separately and can be omitted if required, although its monitoring functions are very popular in everyday business.

  • The metadata database securely stores, among other things, statistics about workflow runs and connection data for external databases.

With this setup, Airflow is able to execute its data processes reliably. In combination with the Python programming language, it is easy to define what should run in a workflow and how. Before creating your first workflows, you should be familiar with a few terms.

Important terminology in Apache Airflow

The term DAG (Directed Acyclic Graph) comes up constantly in connection with Apache Airflow. It is the internal representation of a workflow, and the term is used synonymously with workflow - it is probably the most central concept in Airflow. Accordingly, a DAG run denotes a single workflow run, and the workflow files are collected in the DAG bag. The following graphic shows such a DAG, schematically describing a simple Extract-Transform-Load (ETL) workflow.

Figure: a simple ETL workflow modeled as a DAG.

In Python, associated tasks are combined into a DAG. Programmatically, the DAG serves as a container that keeps the tasks, their order, and information about the execution (interval, start time, retries in case of errors, ...) together. By defining the relations (predecessor, successor, parallel), even complex workflows can be modeled. There can be several start and end items; only cycles are not allowed. Even conditional branching is possible. A minimal sketch of such a container follows.
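The sketch below mirrors the ETL workflow from the figure, assuming Airflow 2.x; the three callables are stand-ins for real extract, transform and load logic, and all names are illustrative.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Stand-ins for real extraction, transformation and loading logic.
def extract():
    print("extract")

def transform():
    print("transform")

def load():
    print("load")

# The DAG object bundles the tasks with their execution information:
# schedule interval, start time and retry behaviour.
with DAG(
    dag_id="simple_etl",                    # hypothetical name
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 3 * * *",          # daily at 03:00, cron syntax
    default_args={"retries": 1,
                  "retry_delay": timedelta(minutes=10)},
    catchup=False,                          # do not backfill missed intervals
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_load      # linear ETL chain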

Within a DAG, tasks are formulated either as operators or as sensors. While operators execute the actual commands, a sensor interrupts execution until a certain event occurs. Both basic types have been specialized for specific applications in numerous community contributions. Plug-and-play operators are essential for easy integration with Amazon Web Services, Google Cloud Platform and Microsoft Azure, among many others. The specialization ranges from the simple BashOperator for executing Bash commands to the GoogleCloudStorageToBigQueryOperator. The long list of available operators can be found in the GitHub repository. The contrast between the two task types is sketched below.
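A short sketch of this contrast, with illustrative names and paths and Airflow 2.x assumed: a FileSensor pauses the run until a file appears, after which a regular operator carries out the actual work.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.filesystem import FileSensor

with DAG(
    dag_id="sensor_demo",                # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
) as dag:
    # Sensor: interrupts execution until the event occurs,
    # here the appearance of a file on the local filesystem.
    wait = FileSensor(
        task_id="wait_for_input",
        filepath="/data/input.csv",      # illustrative path
        poke_interval=60,                # re-check once per minute
    )
    # Operator: executes the actual command once the sensor succeeds.
    process = BashOperator(
        task_id="process",
        bash_command="echo 'file arrived, processing'",
    )

    wait >> process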

In the web interface, DAGs are represented graphically. In the graph view (figure above), the tasks and their relationships are clearly visible, and the status colors of the edges show the state of each task in the selected workflow run. In the tree view (following graphic), past runs are displayed as well. Here, too, the intuitive color scheme points to possible errors directly at the associated task, and the log files are just two clicks away. Monitoring and troubleshooting are definitely among Airflow's strengths.

Figure: a complex DAG example in the web interface.

Whether for a machine learning workflow or an ETL process, a look at Airflow is always worthwhile. Feel free to contact us if you need support with a custom-fit configuration or want to upgrade an existing installation. We are also happy to share our knowledge in hands-on workshops.


Luise Wiesalla joined NextLytics AG in 2019 as a working student / student consultant in the field of data analytics and machine learning. She has experience with full-stack data science projects and with the open-source workflow management solution Apache Airflow. In her free time, she likes to explore her surroundings and stay on the move.
