PyCon DE and PyData Berlin are an established fixture in the Python community's annual calendar. This year, developers, experts and enthusiasts from a wide range of backgrounds came together in Berlin for three days of the joint conference. A total of 1,500 people attended, exchanging knowledge in talks and workshops spread across no fewer than seven mostly parallel tracks.
We were there for you and would like to share our impressions of the conference: Today we are looking at innovations and trends in the field of data engineering and its areas of application. We have already summarized our impressions and highlights on the topic of machine learning in an earlier article.
Pragmatic Data Engineering
What is data engineering anyway, and why is it necessary? In his presentation on "pragmatic" data engineering with Python, Robson Junior gives an entertaining, beginner-friendly overview: getting data from source systems to the right recipient on time and in the right format requires logically or technologically sophisticated processes. Robson presents a series of components that he considers particularly valuable for a quick and scalable start in data engineering. Apache Arrow is highlighted as an (in-memory) data format that allows both transformation steps and analyses to be carried out efficiently on tabular data. He also gives brief insights into distributed computing with Apache Spark and the orchestration of all processes and process chains with Apache Airflow. Particular attention, he argues, should go to testing and monitoring data integrity and quality; the Python modules Great Expectations and Pydantic are mentioned as pragmatic tools for getting off to a good start here.
Scene from Robson Junior's presentation, illustrating a typical data pipeline with Python: Apache Arrow serves as the data format for transporting and analyzing data sets, Apache Spark as the engine for distributed computation and transformations, and Apache Airflow as the orchestrator. (Source: PyConDE/PyData Berlin 2024)
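To give a flavour of the pragmatic data quality checks Robson mentions, here is a minimal sketch using Pydantic (v2); the record schema, field names and validation rules are our own illustrative assumptions, not taken from the talk.

```python
from datetime import date

from pydantic import BaseModel, ValidationError, field_validator


class OrderRecord(BaseModel):
    """Expected shape of a single record arriving from a source system."""

    order_id: int
    customer_id: int
    amount: float
    order_date: date

    @field_validator("amount")
    @classmethod
    def amount_must_be_positive(cls, value: float) -> float:
        if value <= 0:
            raise ValueError("amount must be positive")
        return value


raw_rows = [
    {"order_id": 1, "customer_id": 42, "amount": 9.99, "order_date": "2024-04-22"},
    {"order_id": 2, "customer_id": 42, "amount": -5.0, "order_date": "2024-04-23"},
]

valid, rejected = [], []
for row in raw_rows:
    try:
        valid.append(OrderRecord(**row))
    except ValidationError as err:
        rejected.append((row, err))  # e.g. route to a quarantine table or raise an alert

print(f"{len(valid)} valid, {len(rejected)} rejected")
```

Great Expectations sits at a similar point in the pipeline, but works on whole tables with declarative expectation suites rather than per-record models.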
The presentation is a good introduction to the topic of data engineering because it contains some valuable statements that we can directly endorse based on our experience from many projects: Good data engineering is the foundation of functioning and efficient data platforms, be it for business intelligence, analytics or machine learning and AI applications. Robust, reliable processes require automation, structured testing and monitoring. From our point of view, it is particularly important that the tool fits the use case. In many contexts, "classic" approaches to data processing and medium-sized systems are completely sufficient to produce high-performance results. This is pragmatic and saves costs for infrastructure and operation.
The Q&A session accompanying the presentation showed that Apache Airflow is still an important building block for many projects and workflows. At the same time, operating, customizing and testing Airflow can be challenging. In addition to our blog posts on these topics, we also actively support companies and teams with the introduction, operation and further development of Airflow.
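A small but effective practice in this context is to let the CI pipeline parse all DAGs and fail fast on import errors. A minimal pytest sketch, assuming the DAG files live in the project's configured Airflow DAG folder; the owner convention in the second test is our own example, not a requirement:

```python
import pytest
from airflow.models import DagBag


@pytest.fixture(scope="session")
def dagbag() -> DagBag:
    # Parses every DAG file exactly as the scheduler would.
    return DagBag(include_examples=False)


def test_dags_import_without_errors(dagbag: DagBag):
    assert dagbag.import_errors == {}, f"DAG import errors: {dagbag.import_errors}"


def test_every_dag_defines_an_owner(dagbag: DagBag):
    # A simple team convention check; adjust or drop as needed.
    for dag_id, dag in dagbag.dags.items():
        assert dag.default_args.get("owner"), f"DAG '{dag_id}' has no owner set"
```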
Testing, testing, testing
Good data engineering primarily takes lessons from quality management and software engineering and applies them to data processing and analysis pipelines. A very important building block: automated, regular testing of components and their interaction. For data pipelines there is often no simple, perfect solution, as Tobias Lampert explains in his presentation on unit testing SQL statements: how can data warehouse processes be tested automatically and routinely during development and as part of CI/CD pipelines, without copying huge amounts of data from the production environment to a test environment? His team has written a Python framework that injects test data into (almost) any SQL query and thus enables tests without actual test artifacts in a database. An exciting approach that can also be implemented with open source frameworks such as SQLGlot and SQL Mock.
Tobias Lampert compares the advantages and disadvantages of manual tests and automated unit tests of SQL statements. (Source: PyConDE/PyData Berlin 2024)
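The framework shown in the talk is tailored to his team's setup, but the underlying idea can be sketched in a few lines: shadow the production tables with inline test data so that only the query logic is exercised, without any database fixtures. The following illustration builds a VALUES CTE and executes the query in-process with DuckDB; table, column and test names are our own assumptions.

```python
import duckdb

# Query under test: in production this would run against a real warehouse table.
QUERY = """
SELECT customer_id, SUM(amount) AS total_paid
FROM orders
WHERE status = 'paid'
GROUP BY customer_id
"""


def sql_literal(value) -> str:
    """Render a Python value as a SQL literal (strings get single quotes)."""
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return str(value)


def with_fixture(query: str, table: str, columns: list[str], rows: list[tuple]) -> str:
    """Prepend a CTE that shadows `table` with inline test rows."""
    values = ", ".join("(" + ", ".join(sql_literal(v) for v in row) + ")" for row in rows)
    return f"WITH {table} ({', '.join(columns)}) AS (VALUES {values})\n{query}"


def test_only_paid_orders_are_summed():
    sql = with_fixture(
        QUERY,
        table="orders",
        columns=["customer_id", "amount", "status"],
        rows=[(1, 10.0, "paid"), (1, 5.0, "paid"), (2, 7.0, "open")],
    )
    assert duckdb.sql(sql).fetchall() == [(1, 15.0)]
```

Queries that already contain a WITH clause need a real SQL rewriter rather than simple string prepending, which is exactly where libraries such as SQLGlot come in.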
An inconspicuous yet very important factor for an efficient development process is the local technical setup of everyone working on a system. For code-based processes (which we strongly recommend to all our customers!), this includes the development environment (IDE) used as well as the software runtime environment set up on each computer for the project. Which Python version is used? Which version of Apache Airflow? Keeping one's own local setup in sync with the various system environments on the target infrastructure for development, testing and production, possibly across several parallel projects, quickly becomes a challenge and a source of errors. The "Development Containers" specification can provide a remedy, as Thomas Fraunholz showed at the conference. With a single configuration file in the project directory, the entire execution environment is encapsulated in a Docker container and can thus be reproduced identically on any machine. Originally developed for Visual Studio Code, this approach is gradually being supported by other IDEs (and is not limited to Python as a programming language). A logical evolution of virtual Python environments that will further simplify developers' everyday work and free up time for the essential tasks.
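To give a rough idea of what such a configuration looks like, here is a minimal devcontainer.json; the image tag, install command and extension list are placeholder choices for a typical Python project, not taken from the talk.

```json
{
  "name": "data-pipeline-dev",
  "image": "mcr.microsoft.com/devcontainers/python:3.11",
  "postCreateCommand": "pip install -r requirements.txt",
  "customizations": {
    "vscode": {
      "extensions": ["ms-python.python"]
    }
  }
}
```

Checked into the repository next to the code, this one file pins the interpreter and project dependencies for everyone who opens the project, regardless of what happens to be installed on their machine.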
Big data, small footprint
The hype surrounding the buzzword "big data" dates back several years, but the underlying challenge remains: meaningful analyses require processing data sets that far exceed the capacity of individual computers and servers. Once that point is reached, scalable, distributed compute clusters and the corresponding control software are needed. Thanks to dynamic task mapping, Apache Airflow has been usable as a framework for this for some time; Apache Spark remains the most prominent representative on the big stage. Other frameworks have also established themselves, such as Polars, a fast alternative to the popular Pandas library, and the open source Python project Dask, which brings the Pandas API to distributed data. In Berlin, Patrick Hoefler presented brand-new performance optimizations to Dask, the so-called "Dask DataFrame 2.0".
The main attraction of this presentation was the announced performance comparison with other analysis frameworks, namely Spark, Polars and DuckDB. The TPC-H benchmarks shown were run on data sets of 100 gigabytes as well as 1 and 10 terabytes and delivered a wild mix of results. Dask itself performs well and shows a significant improvement over previous versions. In addition, Dask promises easier handling and less friction than the top dog Spark, as the framework itself is written in Python. From our point of view, Dask can be a sensible alternative, especially for medium-sized data sets, and a stepping stone towards working with parallel and distributed compute resources.
Patrick Hoefler opens the presentation of his benchmark results with an overview of the tested frameworks: Apache Spark, Dask, DuckDB and Polars are compared in terms of their performance. (Source: PyConDE/PyData Berlin 2024)
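Part of Dask's appeal for Python teams is that distributed computation looks almost exactly like Pandas code. A minimal sketch; the file path and column names are purely illustrative:

```python
import dask.dataframe as dd

# Read a directory of Parquet files that would not fit into memory at once;
# Dask splits the data into partitions and builds a lazy task graph.
events = dd.read_parquet("data/events/*.parquet")

# Familiar Pandas-style operations, evaluated per partition and combined at the end.
daily_revenue = (
    events[events["status"] == "completed"]
    .groupby("day")["amount"]
    .sum()
)

# Nothing has been computed yet; .compute() runs the graph on the local machine
# or, with a distributed scheduler, on a whole cluster.
print(daily_revenue.compute())
```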
Designing data engineering so that a lightweight overall system emerges despite large volumes of data and a large number of sources and consumers is a tall order. The makers of the still young "data load tool" (dlt) have dedicated themselves to exactly this. The Berlin-based startup dlthub invited attendees of the PyData conference to exchange ideas at their home base. Even though the announced rooftop party had to move to the basement due to (inaccurate) weather forecasts, we gladly accepted the invitation and enjoyed a constructive exchange of ideas. dlt promises to harmonize the loading of data from a wide range of sources with a lean, extensible Python library: no infrastructure needs to be set up, no platform needs to be operated. To us, dlt looks like a perfect candidate for small to medium-sized data warehouses operated with Apache Airflow according to the ELT paradigm. At the conference itself, Anuun Chinbat and Hiba Jamal presented the project with plenty of humor.
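For a sense of how lean this is, here is a minimal sketch of a dlt pipeline; the resource, destination and dataset names are our own illustrative choices:

```python
import dlt


@dlt.resource(name="users", write_disposition="replace")
def users():
    # In a real pipeline this would page through an API or read from a source system.
    yield from [
        {"id": 1, "name": "Ada"},
        {"id": 2, "name": "Grace"},
    ]


pipeline = dlt.pipeline(
    pipeline_name="demo_pipeline",
    destination="duckdb",  # other destinations such as BigQuery, Snowflake or Postgres work the same way
    dataset_name="raw",
)

load_info = pipeline.run(users())
print(load_info)  # schema inference, normalization and loading all happen inside run()
```

Wrapped in an Airflow task, an extract-and-load step like this stays declarative while the transformations remain in the warehouse, which fits the ELT setup described above.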
See you next year!
The PyCon and PyData conference 2024 offered a colorful, highly interesting program with many highlights. As with our review focusing on machine learning, we were only able to experience a small slice of the data engineering content live and recap it here. Overall, it is clear that organizational and technological challenges still need to be overcome in the engine room of an economy that is increasingly driven by data. At first glance, new tools, conceptual approaches or platforms promise solutions and better control of data pipelines and transformations. However, making smart, targeted and sustainable decisions when putting together the components for your own use case requires a broad knowledge base and a degree of intuition.
We are happy to support you in the design and operation of data and analytics platforms, whether in the cloud or on-premises, vendor-independent and especially in the Python ecosystem. If you like, get in touch for an exchange of ideas or find us next year at PyCon/PyData in Berlin!