
Track Apache Airflow Metrics and Statistics with Superset

Apache Airflow is the leading open source workflow orchestration platform and drives day-to-day digital business all over the world. If your organization uses Airflow, the volume of daily logs and metadata quickly becomes an issue. Purging the Airflow metadata database and logs is a common routine to minimize storage costs. But you still want to show long-term statistics of your operations: How many workflows (“DAGs” in Airflow terms, for directed acyclic graphs) is your system running per month, quarter, or year? How often do certain DAGs fail in production? How long does it take to recover operations?

To reconcile these two opposing interests, we propose archiving specific Airflow DAG run metadata to a dedicated archive database and using the open source business intelligence toolkit Apache Superset to create long-term monitoring dashboards.

This article consists of two parts: first, we introduce our approach to archiving useful data from these Airflow database tables; subsequently, we create Superset charts and dashboards based on the archived tables.

Open Source Dashboards with Apache Superset

Apache Superset is a staple when it comes to data visualization, with a plethora of charts available out-of-the-box and the ability to create sophisticated dashboards tailored to the user's needs. An interesting use case for Superset is the monitoring and visualization of logs and metadata generated by Apache Airflow ETL pipeline runs. DAG run failure rates, execution times, and aggregations by operator type are only a handful of the statistics that can be calculated and visualized with Superset, which data engineering teams may find useful for drawing conclusions about the state of their infrastructure.

Archiving Airflow DAG Run Metadata

Every Airflow system uses an operational database that retains information about the inner workings of the system. Two of the most important tables are “dag_run” and “task_instance”, which contain details regarding the execution of DAGs and their individual tasks. These details include start and finish timestamps, states of the DAGs, information about the different operators used for each task, the corresponding configuration, and more. The information found in these tables is more than enough to construct informative plots and diagrams with Apache Superset. Since the volume of collected data grows quickly when Airflow is running many DAGs on frequent schedules, purging older information to minimize storage consumption is a common maintenance routine.

To preserve long-term statistics, we implement a simple archiving DAG that collects the records from the “dag_run” and “task_instance” tables, which are available in every Airflow database, and stores them in a dedicated archive database. With each DAG run, only the most recent entries are selected for insertion, avoiding any duplicates.
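A minimal sketch of such an archiving DAG is shown below, assuming PostgreSQL on both ends and archive tables that mirror the source schema; the connection IDs “airflow_db” and “archive_db” are placeholders for your own setup:

from datetime import datetime

from airflow.decorators import dag, task
from airflow.providers.postgres.hooks.postgres import PostgresHook


@dag(schedule="@daily", start_date=datetime(2023, 1, 1), catchup=False)
def archive_airflow_metadata():

    @task
    def archive_table(table: str, ts_column: str) -> int:
        source = PostgresHook(postgres_conn_id="airflow_db")   # Airflow metadata DB
        archive = PostgresHook(postgres_conn_id="archive_db")  # dedicated archive DB

        # Only fetch rows newer than the latest timestamp already archived,
        # so repeated runs never insert duplicates.
        last = archive.get_first(f"SELECT MAX({ts_column}) FROM {table}")[0]
        where = f"WHERE {ts_column} > '{last}'" if last else ""
        df = source.get_pandas_df(f"SELECT * FROM {table} {where}")

        if not df.empty:
            archive.insert_rows(
                table=table,
                rows=df.itertuples(index=False, name=None),
                target_fields=list(df.columns),
            )
        return len(df)

    archive_table.override(task_id="archive_dag_run")("dag_run", "start_date")
    archive_table.override(task_id="archive_task_instance")("task_instance", "start_date")


archive_airflow_metadata()

A production version would additionally handle still-running entries (rows whose end timestamps are not yet set), but the incremental-by-timestamp pattern above is the core of the approach.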

[Image: the archiving DAG in Airflow]

Superset Charts & Dashboards

Apache Superset enables us to create plots and charts and organize them into dashboards. Each chart depends on a dataset or an SQL query that is, in turn, connected to a data source. Superset offers an abundance of connectors out-of-the-box for every major DBMS, and any other data source can be connected via an SQLAlchemy URI. The first step is to connect the database containing the archived “dag_run” and “task_instance” tables (populated by the Airflow DAG) to Superset.
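For a PostgreSQL archive, the SQLAlchemy URI entered in Superset's database connection form follows the standard pattern; user, host, port, and database name below are placeholders:

postgresql+psycopg2://superset_user:&lt;password&gt;@archive-db-host:5432/airflow_archive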

[Screenshot: adding the database connection in Superset]

We provide the credentials and the connection is established. Superset uses the notion of “datasets” as sources for its charts, and there are multiple options to create one. We can register a database table from the source we just connected as a dataset through the UI. Another approach is to write an SQL query against the source and save its results as a dataset. While the former is easier to use, the latter gives more flexibility and allows for more complex data aggregations, such as joining multiple tables. Essentially, every valid “SELECT” query can be used as a dataset. To this end, we define a query through SQL Lab that calculates the execution time of each DAG.
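A minimal version of this query, assuming a PostgreSQL archive with the standard Airflow schema, could look like this:

SELECT
    dag_id,
    start_date,
    EXTRACT(EPOCH FROM (end_date - start_date)) AS execution_time_seconds
FROM dag_run
WHERE end_date IS NOT NULL;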

[Screenshot: execution time query in SQL Lab]

Other necessary SQL queries can be defined in a similar fashion. Now we can start creating charts based on the datasets and queries just defined. Some of the most interesting include:

  • DAG run states.
  • Number of task operators.
  • DAG execution time (distribution & averages).
  • Task execution time (distribution & averages).



DAG run states

This chart is straightforward to create: the “dag_run” table already includes a “state” column, so no additional calculations are needed.

[Screenshot: DAG Run State chart]

This is a simple bar chart counting the occurrences of each possible state a DAG run can be in, essentially equivalent to executing “SELECT state, COUNT(*) FROM dag_run GROUP BY state;” for the states “success”, “failed”, and “running”. Most of the DAGs execute successfully, and a handful are currently still running.

Number of task operators

A similar chart counts the total number of tasks per operator type.

[Screenshot: Task Operators chart]

Again, this is equivalent to executing “SELECT operator, COUNT(*) FROM task_instance GROUP BY operator;”, which counts the occurrences of each task operator. A large portion of the operators are Python decorator operators, owing to heavy use of the TaskFlow API.

DAG execution time

A useful metric to visualize is the distribution of DAG execution times. To this end, we use the results of the SQL query we defined earlier in SQL Lab.

[Screenshot: DAG Execution Time distribution]

It is evident that the majority of DAG executions complete within 0 to 5 seconds. We can also aggregate the execution time by DAG id to see the average execution time per DAG, as in the sketch below.
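A possible aggregation query for this chart, again assuming the standard Airflow schema in the archive:

SELECT
    dag_id,
    AVG(EXTRACT(EPOCH FROM (end_date - start_date))) AS avg_execution_time_seconds
FROM dag_run
WHERE end_date IS NOT NULL
GROUP BY dag_id
ORDER BY avg_execution_time_seconds DESC;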

[Screenshot: Average DAG Execution Time chart]

The most time-consuming DAG appears to be “Test_SFTPOperator”, with almost twice the average execution time of the runner-up.

Task execution time

The charts visualizing task execution time are similar to the previous ones but offer more granular insight, as they indicate potential bottlenecks in the execution of specific tasks that may require attention.

[Screenshot: Task Execution Time distribution]

Starting with the distribution of task execution times, it is clear that almost every task completes within 5 seconds. Aggregating by task id, we can see how individual tasks behave; see the query sketch below.
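The “task_instance” table already stores a “duration” column (in seconds) in the standard Airflow schema, so the aggregation can be as simple as:

SELECT
    task_id,
    AVG(duration) AS avg_task_duration_seconds
FROM task_instance
WHERE duration IS NOT NULL
GROUP BY task_id
ORDER BY avg_task_duration_seconds DESC;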

[Screenshot: Average Task Execution Time chart]

The “create_testfile” task is by far the most time-consuming, with an average execution time of over 7 seconds. Superset also offers alert functionality: an email can be sent if a monitored value exceeds a certain threshold. For example, if a task runs for an unusual amount of time, an email notification can be sent to the team responsible for Airflow.
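The alert condition is itself a SQL query whose single-value result Superset compares against the configured threshold; a sketch of such a query (the one-hour window is illustrative) might be:

SELECT MAX(duration)
FROM task_instance
WHERE start_date > NOW() - INTERVAL '1 hour';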

Creating a dashboard

Individual charts are fine, but it’s hard to draw conclusions by looking at them one at a time. We can compose them into a dashboard to get a broader view of the ETL pipeline infrastructure.

[Screenshot: Airflow Statistics dashboard]

The charts can be zoomed, rearranged, and manipulated at the analyst’s discretion. The Airflow DAG responsible for loading the tables into the archive database can be scheduled to run frequently to keep the dashboard up to date.

We described some potential use cases for Apache Superset and how it can be leveraged to create business insights, and more specifically, how to monitor an Apache Airflow instance. If you would like to set up your own instance of Superset or want expert feedback on your use case, get in touch with us today.

Apostolos

Apostolos has been a Data Engineering Consultant at NextLytics AG since 2022. He has experience in research projects on deep learning methodologies and their applications in fintech, as well as a background in backend development. In his spare time he enjoys playing the guitar and staying up to date with the latest news on technology and economics.
