Before a machine learning model can deliver continuous business value, it must first clear the hurdle of practical implementation. Beyond the prediction itself, the other parts of the machine learning workflow, from data preparation to deployment of the trained model, should also be automated to keep the model up to date.

From a technical perspective, many options are available for workflow management, i.e., the administration, scheduling and execution of the tasks in a workflow. Alongside the widespread cron jobs, the workflow management platform Apache Airflow is very popular. In this article, we explain how Airflow meets the challenges of machine learning workflows and present an architecture variant for small machine learning (ML) teams.
In Airflow, workflows are defined, scheduled and executed as Python scripts. Dependencies between tasks, and thus complex workflows, can be modeled quickly and efficiently. The feature-rich web interface provides a good overview of the status of workflow runs and speeds up troubleshooting immensely. The framework is also open source and free to use.
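To give a concrete impression of how such a workflow definition looks, here is a minimal sketch of a DAG with two tasks. The DAG id, schedule and shell commands are illustrative placeholders, not taken from a real project:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Minimal illustrative DAG: two shell tasks with one dependency.
with DAG(
    dag_id="hello_airflow",           # placeholder name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",       # run once per day
    catchup=False,                    # do not backfill past runs
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo 'extract data'")
    report = BashOperator(task_id="report", bash_command="echo 'build report'")

    # The >> operator declares that report runs only after extract succeeds.
    extract >> report
```

Dropping such a script into the configured dags/ folder is all it takes for the scheduler to pick it up and display it in the web interface.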
Apache Airflow web interface. The status of the workflow runs is visible on the left (Runs); the status of the tasks of the most recent runs is visible on the right (Recent Tasks).
For machine learning workflows, Airflow takes over the administration, scheduling and execution of all individual tasks.
If you would like to learn more about the underlying concepts and components of Apache Airflow, we recommend reading our whitepaper "Effective Workflow Management with Apache Airflow 2.0". There, the basic concepts are explained in detail, and you will find practical ideas for applying the new features of the major release.
Machine learning workflows are usually more complex than plain ETL workflows because of the dependencies between the individual steps and the large number of data sources involved. In addition, different models often have different hardware requirements (CPU vs. GPU). Simply triggering such workflows with cron jobs therefore quickly reaches its limits and is error-prone, since cron cannot express dependencies between individual tasks. Thanks to its ease of use and straightforward implementation, Apache Airflow is becoming widely accepted for the workflow management of machine learning applications.
A machine learning workflow comprises various task packages covering data preparation, model training and evaluation, and model deployment. Airflow is responsible for executing the individual workflow runs. Depending on your needs, you can create a single workflow with all steps integrated (see figure and the sketch below) or define separate workflows for the subtasks if individual components (e.g. data preparation) should follow their own time interval.
Simple machine learning workflow with integrated steps
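As a rough sketch of the integrated variant shown in the figure, the following DAG chains the task packages into one linear workflow. The callables are hypothetical placeholders; in practice, each would contain your project's actual logic:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder implementations -- substitute real project code here.
def prepare_data():
    print("loading and cleaning training data ...")

def train_and_evaluate():
    print("training the model and computing metrics ...")

def deploy_model():
    print("publishing the trained model ...")

with DAG(
    dag_id="ml_workflow_integrated",   # placeholder name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@weekly",       # e.g. retrain once a week
    catchup=False,
) as dag:
    prepare = PythonOperator(task_id="prepare_data", python_callable=prepare_data)
    train = PythonOperator(task_id="train_and_evaluate", python_callable=train_and_evaluate)
    deploy = PythonOperator(task_id="deploy_model", python_callable=deploy_model)

    # Each step starts only after the previous one has succeeded.
    prepare >> train >> deploy
```

If data preparation should instead run on its own, shorter interval, it can be split off into a separate DAG with its own schedule_interval, as described above.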
The Airflow installation is flexibly configurable and can be scaled according to requirements, primarily by selecting the appropriate executor. Below, we explain an entry-level architecture that meets the requirements of small teams wanting to run their workflows on a single server. In addition to Airflow, Git for version control and Anaconda for environment isolation play a role. For a fast, secure connection to SAP HANA, for example, the NextLytics Software Development Kit (NLY-SDK) can be used, which we would be happy to present to you in more detail upon request.
ML architecture with Airflow (scheduling), Git (version control), Anaconda (isolation) and an Airflow DAG for deployment and connection to databases and file systems
In the single-node architecture, the workers, the web server and the scheduler all run on the same server.
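For such a single-node setup, the LocalExecutor is a common choice, as it runs tasks as parallel processes on one machine without requiring an external queue. An illustrative airflow.cfg excerpt (for Airflow 2.0, with placeholder database credentials) might look like this:

```ini
# airflow.cfg -- illustrative excerpt for a single-node setup
[core]
# LocalExecutor runs tasks as parallel local processes
executor = LocalExecutor
# LocalExecutor needs a database that supports concurrent access,
# e.g. PostgreSQL instead of the default SQLite (credentials are placeholders)
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost/airflow
```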
Of course, the architecture presented is not suitable for every application scenario. If the number of workflows grows rapidly, a single server can no longer keep up with the requirements. Fortunately, Airflow can be scaled out efficiently with Kubernetes, Mesos or Dask, although operating it on distributed systems should be carefully planned.
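Should that point be reached, the switch is again primarily a matter of the executor setting, for example (illustrative excerpt):

```ini
[core]
# KubernetesExecutor starts each task in its own pod on the cluster
executor = KubernetesExecutor
```

A working setup naturally requires additional cluster configuration beyond this one line, which is exactly why the move to distributed operation should be planned carefully.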
We are happy to support you as a competent project partner for your end-to-end machine learning projects or to evaluate your status quo. We help you realize robust, scalable machine learning workflows on Apache Airflow. Contact us at any time!