
Data engineering trends at the PyCon and PyData Conference 2024

PyCon DE and PyData Berlin are an established fixture in the Python community's annual calendar. This year, developers, experts and enthusiasts from a wide range of backgrounds came together for three days at the joint conference in Berlin. A total of 1,500 people attended the event, taking part in talks and workshops across no fewer than seven mostly parallel tracks and exchanging knowledge.

We were there for you and would like to share our impressions of the conference: Today we are looking at innovations and trends in the field of data engineering and its areas of application. We have already summarized our impressions and highlights on the topic of machine learning in an earlier article.


Pragmatic Data Engineering

What is data engineering anyway, and why is it necessary? In his presentation on "pragmatic" data engineering with Python, Robson Junior provides an entertaining and beginner-friendly overview: getting data from source systems to the right recipient in a timely manner and in the right format requires logically or technologically sophisticated processes. Robson presents a series of components that he considers particularly valuable for a quick and scalable start in data engineering. Apache Arrow is highlighted as an (in-memory) data format that allows both transformation steps and analyses to be carried out efficiently on tabular data. He also provides brief insights into distributed computing with Apache Spark and the orchestration of all processes and process chains with Apache Airflow. Special attention should be paid to testing and monitoring data integrity and quality; the Python modules Great Expectations and Pydantic are mentioned here as pragmatic tools to create a good starting point.

Scene from Robson Junior's presentation, illustrating a typical data pipeline with Python: Apache Arrow is used as a data format for transporting and analyzing data sets, Apache Spark as an engine for distributed computation and transformations, Apache Airflow as an orchestrator. (Source: PyConDE/PyData Berlin 2024)
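
To make the "pragmatic tools" mentioned above a little more tangible, here is a minimal sketch of how record-level validation with Pydantic could look inside a pipeline step. The model, field names and example records are purely hypothetical and not taken from the talk.

```python
from pydantic import BaseModel, ValidationError, field_validator

class Order(BaseModel):
    """Expected schema for a single record arriving from the source system."""
    order_id: int
    customer_id: int
    amount: float

    @field_validator("amount")
    @classmethod
    def amount_must_be_positive(cls, value: float) -> float:
        if value <= 0:
            raise ValueError("amount must be positive")
        return value

def validate_batch(raw_records: list[dict]) -> list[Order]:
    """Validate incoming records and fail fast on bad data."""
    valid, errors = [], []
    for record in raw_records:
        try:
            valid.append(Order(**record))
        except ValidationError as exc:
            errors.append((record, exc))
    if errors:
        # In a real pipeline, bad records could be routed to a quarantine table instead
        raise RuntimeError(f"{len(errors)} invalid records found")
    return valid

validate_batch([{"order_id": 1, "customer_id": 42, "amount": 19.99}])
```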

The presentation is a good introduction to the topic of data engineering because it contains some valuable statements that we can directly endorse based on our experience from many projects: Good data engineering is the foundation of functioning and efficient data platforms, be it for business intelligence, analytics or machine learning and AI applications. Robust, reliable processes require automation, structured testing and monitoring. From our point of view, it is particularly important that the tool fits the use case. In many contexts, "classic" approaches to data processing and medium-sized systems are completely sufficient to produce high-performance results. This is pragmatic and saves costs for infrastructure and operation.

The Q&A session accompanying the presentation showed that Apache Airflow is still an important building block for many projects and workflows. At the same time, operating, customizing and testing Airflow can be challenging. In addition to our blog posts on these topics, we also actively support companies and teams with the introduction, operation and further development of Airflow.
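
One common, lightweight example of such testing is a DAG integrity check that simply loads all DAG definitions and fails on import errors. A minimal sketch with pytest could look like the following; the folder path and the owner convention are assumptions for illustration, not a prescription from the conference.

```python
import pytest
from airflow.models import DagBag

@pytest.fixture(scope="session")
def dag_bag() -> DagBag:
    # Parse all DAG files from the project folder, excluding Airflow's bundled examples
    return DagBag(dag_folder="dags/", include_examples=False)

def test_no_import_errors(dag_bag):
    # Any syntax error or missing dependency in a DAG file shows up here
    assert dag_bag.import_errors == {}

def test_every_dag_has_an_owner(dag_bag):
    # Hypothetical team convention: every DAG declares a responsible owner
    for dag_id, dag in dag_bag.dags.items():
        assert dag.default_args.get("owner"), f"{dag_id} has no owner set"
```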


Optimize your workflow management with Apache Airflow - Download the Whitepaper here!



Testing, testing, testing

Good data engineering primarily takes lessons from quality management and software engineering and applies them to data processing and analysis processes. A very important building block: automated, regular testing of components and their interaction. When it comes to data pipelines, there is often no simple and perfect solution, as Tobias Lampert explains in his presentation on unit testing of SQL statements. How can data warehouse processes be routinely tested automatically during the development process and as part of CI/CD pipelines without copying huge amounts of data from the production environment to a test environment? His team has written a Python framework that injects test data into (almost) any SQL query and thus enables tests without actual test artifacts in a database. An exciting approach that can also be implemented with open-source frameworks such as SQLGlot and SQL Mock.

Tobias Lampert compares the advantages and disadvantages of manual tests and automated unit tests of SQL statements. (Source: PyConDE/PyData Berlin 2024)
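
The framework presented in the talk inlines test data directly into the statements; as a simplified illustration of the underlying idea, SQLGlot can be used to rewrite table references programmatically before a query is executed against small test fixtures. The query, table names and fixture mapping below are invented for the example.

```python
import sqlglot
from sqlglot import exp

# Production query under test (hypothetical)
QUERY = """
SELECT customer_id, SUM(amount) AS total_amount
FROM analytics.orders
GROUP BY customer_id
"""

# Map production tables to small fixture tables that only exist for the test run
FIXTURES = {"analytics.orders": "test_fixtures.orders_sample"}

def redirect_tables(node: exp.Expression) -> exp.Expression:
    """Rewrite table references so the statement runs against test data."""
    if isinstance(node, exp.Table):
        key = ".".join(part for part in (node.db, node.name) if part)
        if key in FIXTURES:
            return exp.to_table(FIXTURES[key])
    return node

test_query = sqlglot.parse_one(QUERY).transform(redirect_tables).sql(pretty=True)
print(test_query)  # ... FROM test_fixtures.orders_sample ...
```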

An inconspicuous yet very important factor for an efficient development process is the local technical setup of everyone working on a system. For code-based processes (which we strongly recommend to all our customers!), this includes the development environment (IDE) used as well as the software runtime environment set up on the computer for the project. Which Python version is used? Which version of Apache Airflow? Keeping your own local setup in sync with the various system environments on the target infrastructure for development, testing and productive operation - possibly across several parallel projects - quickly becomes a challenge and a source of errors. The "Development Container" specification can provide a remedy, as Thomas Fraunholz presented at the conference. With a single configuration file in the project directory, the entire execution environment can be encapsulated in a Docker container - and thus reproduced identically on any computer. This approach, originally developed for Visual Studio Code, is gradually being supported by other IDEs (and is not limited to Python as a programming language). A logical evolution of virtual Python environments that will further simplify developers' everyday work and free up time for the essential tasks.
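
As a rough illustration: a single devcontainer.json in the project repository describes the container image, tooling and setup steps that every IDE supporting the spec can reproduce identically. The concrete image, Python version and commands below are just an assumed example.

```json
{
  "name": "airflow-data-pipelines",
  "image": "mcr.microsoft.com/devcontainers/python:3.11",
  "postCreateCommand": "pip install -r requirements.txt",
  "customizations": {
    "vscode": {
      "extensions": ["ms-python.python"]
    }
  }
}
```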

Big data, small footprint

The hype surrounding the buzzword "big data" dates back several years. However, the underlying challenge remains: meaningful analyses require the processing of data sets that far exceed the capacities of individual computers and servers. When this point is reached, scalable, distributed compute clusters and the corresponding control software are required. Thanks to dynamic task mapping, Apache Airflow has been usable as a framework for this for some time. Apache Spark is the prominent representative on the big stage. Other frameworks have also established themselves, such as Polars as a fast alternative to the popular Pandas library and the open-source Python project Dask. In Berlin, Patrick Hoefler presented brand-new performance optimizations for Dask, the so-called "Dask DataFrame 2.0".

The main attraction of this presentation was the announced performance comparison with other analysis frameworks, namely Spark, Polars and DuckDB. The TPC-H benchmarks shown were run on data sets of 100 gigabytes as well as 1 and 10 terabytes and delivered a mixed bag of results. Dask itself performs well and shows a significant improvement over previous versions. In addition, Dask promises easier handling and less friction compared to the top dog Spark, as the framework itself is written in Python. From our point of view, Dask can be a sensible alternative, especially for data sets in the medium size range, and a stepping stone to working with parallel and distributed compute resources.

Patrick Hoefler opens his benchmark presentation with an overview of the tested frameworks: Apache Spark, Dask, DuckDB and Polars are compared in terms of their performance. (Source: PyConDE/PyData Berlin 2024)
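
To illustrate why the pandas-like API lowers the entry barrier, here is a minimal Dask DataFrame sketch; the dataset path and column names are made up for the example.

```python
import dask.dataframe as dd

# Lazily read a (hypothetical) partitioned Parquet dataset that would not fit into memory
orders = dd.read_parquet("s3://example-bucket/orders/*.parquet")

# Familiar pandas-style API, but evaluated lazily and in parallel across partitions
revenue_per_customer = orders.groupby("customer_id")["amount"].sum()

# Only .compute() triggers the actual (distributed) computation
print(revenue_per_customer.compute().head())
```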

Designing data engineering in such a way that a lightweight overall system emerges despite large volumes of data and a large number of sources and consumers is a steep challenge. The makers of the still-young "data load tool" (dlt) have dedicated themselves to exactly this. The Berlin-based startup dlthub invited attendees of the PyData conference to exchange ideas at their home base. Even though the announced rooftop party had to move to the basement due to (inaccurate) weather forecasts, we gladly accepted the invitation and enjoyed a constructive exchange of ideas. dlt promises to harmonize the loading of data from various sources with a lean and extensible Python library: no infrastructure needs to be set up, no platform needs to be operated. dlt seems to us to be a perfect candidate for small to medium-sized data warehouses that are operated with Apache Airflow according to the ELT paradigm. At the conference itself, Anuun Chinbat and Hiba Jamal presented the project with plenty of humor.
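
A minimal sketch of what a dlt pipeline can look like in practice; the resource name, sample records and the DuckDB destination are chosen purely for illustration.

```python
import dlt

@dlt.resource(name="orders", write_disposition="append")
def orders():
    # In a real pipeline this would yield records fetched from an API or database
    yield [
        {"order_id": 1, "customer_id": 42, "amount": 19.99},
        {"order_id": 2, "customer_id": 7, "amount": 5.50},
    ]

pipeline = dlt.pipeline(
    pipeline_name="demo_orders",
    destination="duckdb",
    dataset_name="raw",
)

# dlt infers the schema, normalizes the records and loads them into the destination
load_info = pipeline.run(orders())
print(load_info)
```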

See you next year!

The PyCon and PyData Conference 2024 offered a colorful, highly interesting program with many highlights. As with our review focusing on machine learning, we could only experience a small selection of the data engineering content live and recap it here. Overall, it is clear that organizational and technological challenges still need to be overcome in the engine room of an economy that is increasingly driven by data. At first glance, new tools, conceptual approaches or platforms promise solutions and better management of data pipelines and transformations. However, making smart, targeted and sustainable decisions when putting together components for your own use case requires a broad knowledge base and some intuition.

We are happy to support you in the design and operation of data and analysis platforms - whether cloud or on-prem, provider-independent, especially in the Python ecosystem. If you like, contact us for an exchange or find us next year at PyCon/PyData in Berlin!

Learn more about Apache Airflow



Markus

Markus has been a Senior Consultant for Machine Learning and Data Engineering at NextLytics AG since 2022. With significant experience as a system architect and team leader in data engineering, he is an expert in microservices, databases and workflow orchestration - especially in the field of open-source solutions. In his spare time he tries to optimize the complex system of growing vegetables in his own garden.

