
Apache Airflow: Machine Learning Workflows in Production

Before a machine learning model can deliver continuous business value, it must first clear the hurdle of productive implementation. Beyond the prediction itself, the other parts of the machine learning workflow, from data preparation to deployment of the trained model, should also be automated so that the model always stays up to date. From a technical perspective, many options are available for workflow management, i.e. the administration, scheduling and execution of the tasks in a workflow. Alongside the widespread cron jobs, the workflow management platform Apache Airflow is very popular. In this article, we explain how Airflow meets the challenges of machine learning workflows particularly well and present an architecture variant for small machine learning (ML) teams.

Apache Airflow

In Airflow, workflows are defined, scheduled and executed as Python scripts. Dependencies between tasks, and thus complex workflows, can be mapped quickly and efficiently. The feature-rich web interface provides a good overview of the status of workflow runs and speeds up troubleshooting immensely. The framework is also open source and free to use.

Apache Airflow web interface. The status of the workflow runs is visible on the left (Runs). The status of the tasks of the most recent workflow runs is visible on the right (Recent Tasks).
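To give a first impression of how such a workflow definition looks, here is a minimal sketch of a DAG with two dependent tasks. The DAG id, schedule and task logic are illustrative assumptions of ours, not code from this project:

```python
# Minimal Airflow 2.x DAG sketch: two dependent Python tasks.
# dag_id, schedule and task logic are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("extracting data ...")


def transform():
    print("transforming data ...")


with DAG(
    dag_id="minimal_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",  # run once per day
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    extract_task >> transform_task  # transform runs only after extract succeeds
```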

 

Airflow performs the following tasks for machine learning workflows:

  • Define, execute, and monitor workflows
  • Orchestrate third-party systems to execute tasks 
  • Provide a web interface for excellent visibility and comprehensive management capabilities

If you would like to learn more about the underlying concepts and components of Apache Airflow, we recommend our whitepaper "Effective Workflow Management with Apache Airflow 2.0". There, the basic ideas are explained in detail, and you will find practical application ideas for the new features of the major release.




Why is Airflow particularly suitable for Machine Learning?

Machine learning workflows are usually more complex than ETL workflows because of the dependencies between the individual steps and the large number of data sources involved. In addition, different models often have different hardware requirements (CPU vs. GPU). Simply starting the workflows with cron jobs therefore quickly reaches its limits and is error-prone, since cron cannot model the dependencies between individual tasks. Thanks to its ease of use and implementation, Apache Airflow is becoming widely accepted for the workflow management of machine learning applications.

Machine Learning Workflows in Airflow

A machine learning workflow comprises various task packages, which can be divided into data preparation, model training and evaluation, and model deployment. Airflow is responsible for executing the individual workflow runs. Depending on your needs, you can create one workflow with integrated steps (see figure) or define separate workflows for the subtasks if individual components (e.g. data preparation) should follow their own schedule.

Simple machine learning workflow with integrated steps

 

  • Data preparation
    In the course of preparation for training, various ETL processes take place. Within the workflow, important information about the data volume and the statistical properties of the features can be generated, which helps to uncover potential sources of error.

  • Model training and evaluation
    For the model to adapt to a changing data situation, regular training on current data is recommended. Training can be started on a schedule or via the REST API, for example when the productive model exceeds a certain error threshold.

  • Deployment of the model
    The deployment differs depending on whether the model is used as a service within an application or as a generator of forecast tables. If the model is to provide forecasts on a regular basis, even the forecast step can be scheduled from Airflow. A sketch of such an integrated workflow follows this list.
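As a minimal sketch of such an integrated workflow, the following DAG chains the three task groups; all task logic, names and the weekly schedule are our own placeholder assumptions:

```python
# Sketch of an integrated ML workflow: prepare -> train/evaluate -> deploy.
# Task logic, names and schedule are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def prepare_data():
    # ETL steps; log data volume and feature statistics to spot errors early
    print("loading and cleaning training data ...")


def train_and_evaluate():
    # retrain on current data and compute validation metrics
    print("training model, evaluating on hold-out set ...")


def deploy_model():
    # e.g. publish the serialized model where the application can load it
    print("publishing model artifact ...")


with DAG(
    dag_id="ml_workflow_integrated",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@weekly",  # regular retraining on current data
    catchup=False,
) as dag:
    prepare = PythonOperator(task_id="prepare_data", python_callable=prepare_data)
    train = PythonOperator(task_id="train_and_evaluate", python_callable=train_and_evaluate)
    deploy = PythonOperator(task_id="deploy_model", python_callable=deploy_model)

    prepare >> train >> deploy
```

Besides the schedule, a run can also be triggered on demand via Airflow's stable REST API (in Airflow 2.x, a POST to /api/v1/dags/{dag_id}/dagRuns), for example from a monitoring job that watches the productive model's error rate.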

An entry-level machine learning architecture

The Airflow installation is flexibly configurable and can be scaled according to requirements, primarily by selecting the appropriate executor. An entry-level architecture is explained below; it meets the requirements of small teams that want to run their workflows on a single server. In addition to Airflow, Git (version control) and Anaconda (environment isolation) play a role. For a fast, secure connection to SAP HANA, for example, the NextLytics Software Development Kit (NLY-SDK) can be used, which we would be happy to present to you in more detail on request.

ML architecture with Airflow (scheduling), Git (version control), Anaconda (isolation) and an Airflow DAG for deployment and connection to databases and file systems

  • Single-node Airflow architecture
    The individual tasks of the workflow are executed locally on the system on which Airflow is installed, with parallel task execution provided by the LocalExecutor. The web server runs several worker instances, and since Airflow 2.0 it is also possible to run multiple scheduler instances. A configuration sketch follows below the figure.

In the single-node architecture, the task workers, the web server and the scheduler all run on the same server.
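A minimal configuration sketch for this setup could look as follows; the connection string is a placeholder, and note that the LocalExecutor requires a database backend other than the default SQLite:

```
# airflow.cfg (sketch; values are placeholders)
[core]
executor = LocalExecutor
# the LocalExecutor needs a real database backend instead of SQLite:
sql_alchemy_conn = postgresql+psycopg2://airflow:***@localhost:5432/airflow

[webserver]
workers = 4  # number of web server worker instances
```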

 

  • Anaconda Environments
    If the machine learning workflows are executed locally on a server, the architecture must provide a way to meet the differing software requirements of the models. This is a particularly important point in machine learning applications, since specific versions of the individual program libraries are often required. Virtual environments, frequently Conda environments, are suitable for this purpose: the scripts are simply executed via the interpreter of the desired environment, as sketched below.
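A minimal sketch of this pattern, assuming a hypothetical environment ml_env and script path: a BashOperator calls the interpreter of the desired Conda environment instead of the system Python.

```python
# Sketch: run a training script with the interpreter of a dedicated
# Conda environment. Environment name, paths and script are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="train_in_conda_env",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,  # triggered manually or by another workflow
) as dag:
    train = BashOperator(
        task_id="train_model",
        # the environment's own interpreter resolves its specific libraries
        bash_command=(
            "/opt/anaconda3/envs/ml_env/bin/python "
            "/opt/ml/scripts/train_model.py"
        ),
    )
```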

  • Data exchange
    Airflow is conceptually not designed for "real" data pipelines: the data generated must be persisted between steps, for example on the server, in a database or in the cloud. For small payloads, Airflow offers XCom as a communication mechanism between tasks, which can be defined conveniently via the TaskFlow API (see the sketch below).
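A minimal TaskFlow sketch with hypothetical task names: the return value of one task is handed to the next via XCom automatically. Since XCom values are stored in the metadata database, it is advisable to pass only small payloads such as file paths or parameters, not the data itself.

```python
# TaskFlow API sketch: prepare()'s return value travels to train()
# via XCom behind the scenes. All names are placeholders.
from datetime import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2021, 1, 1), schedule_interval="@daily", catchup=False)
def xcom_example():
    @task
    def prepare() -> str:
        # pass only small metadata via XCom, e.g. the path of the prepared data
        return "/data/prepared/latest.parquet"

    @task
    def train(data_path: str):
        print(f"training on {data_path} ...")

    train(prepare())


example_dag = xcom_example()
```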

  • Deployment of the workflow files
    The workflows are designed in the form of Python files. When a workflow is ready for production, it is stored in a Git repository and transferred from there to the server via a periodic pull, which in this setup is controlled by a separate Airflow DAG (see the sketch below). Continuous deployment can be improved further, e.g. with upstream automated tests, using Jenkins, Drone or GitHub Actions.
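Such a deployment workflow can be as simple as the following sketch; the repository checkout location and the 15-minute interval are assumptions on our part:

```python
# Sketch: periodically pull the latest workflow files from Git into the
# DAG folder. Path and schedule are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="deploy_dags_from_git",
    start_date=datetime(2021, 1, 1),
    schedule_interval="*/15 * * * *",  # check for new workflow files every 15 min
    catchup=False,
) as dag:
    pull = BashOperator(
        task_id="git_pull",
        # the DAG folder is assumed to be a checkout of the workflow repository
        bash_command="cd /opt/airflow/dags && git pull",
    )
```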

Of course, the architecture presented is not suitable for every application scenario. If the number of workflows grows quickly, a single server can no longer handle the load. Fortunately, Airflow can be scaled efficiently with Kubernetes, Mesos or Dask, although operation on distributed systems should be carefully planned.

We are happy to support you as a competent project partner in your end-to-end machine learning projects or to evaluate your status quo, and we help you realize robust and scalable machine learning workflows on Apache Airflow. Contact us at any time!

Learn more about Apache Airflow


Markus

Markus has been a Senior Consultant for Machine Learning and Data Engineering at NextLytics AG since 2022. With significant experience as a system architect and team leader in data engineering, he is an expert in microservices, databases and workflow orchestration, especially in the field of open-source solutions. In his spare time he tries to optimize the complex system of growing vegetables in his own garden.

