PyCon DE and PyData Berlin are an established beacon in the Python community's annual calendar. This year, developers, experts and enthusiasts from a wide range of backgrounds came together for three days at the joint conference in Berlin. A total of 1,500 people attended the event, taking part in talks and workshops across no fewer than seven mostly parallel tracks and exchanging knowledge.
We were there for you and would like to share our impressions of the conference: Today we look at innovations and trends in the field of machine learning and its areas of application. Impressions and highlights from a data engineering perspective will follow in a later article.
The big buzzword of recent times is once again AI, artificial intelligence. The programme of the PyData conference 2024 therefore included many contributions on implementing "AI" with Python, on its application in various scenarios, and on the technical optimization of so-called generative AI (GenAI) and large language models (LLMs). From our perspective, Prof. Ricardo Baeza-Yates made the most important contribution in his keynote: which questions to ask before jumping on the AI bandwagon. Legal, ecological and engineering aspects should all be weighed when deciding whether using highly complex and extremely energy-intensive statistical models for a specific application makes sense at all.
In the wake of the AI hype, machine learning (ML) methods are getting better and better and are becoming increasingly accessible for everyday business use. In a very entertaining talk, John Sandall built a lightweight tool for fully automated audio transcription and content summarization of voice recordings practically live on his own laptop. A great example of how state-of-the-art technologies can be used to streamline everyday work without expensive cloud services that raise data protection concerns.
John Sandall demonstrates an open source audio transcription and summarization tool written in Python.
While he talks, the application actually transcribes and summarizes his words with little delay. (Source: PyConDE/PyData Berlin)
More and more database and data warehouse systems now integrate typical machine learning methods directly or offer interfaces for defining and plugging in your own methods and models. The trend has long been known: it is easier to bring the algorithm to the data than vice versa. Gregor Bauer presented this concept using the NoSQL database Couchbase as an example: arbitrary Python code for an ML model can be hooked into the database via an interface and registered in the engine as a user-defined function (UDF). The function is then available in all SQL queries against the database contents. Sales forecasts and planning figures can thus be generated live from the database without lengthy transformation and calculation routines.
Many SQL and NoSQL database systems support plug-in machine learning inference.
Instead of operating complex infrastructure such as feature stores and model registries, an ML model is trained, packaged and uploaded directly into the database engine. Inference can then be queried like a native database function.
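The Couchbase specifics were particular to the talk, but the underlying pattern of registering Python code as a database function can be sketched with nothing more than the standard library's sqlite3 module. The "model" below is a deliberately trivial stand-in for a real trained estimator, and all table and function names are illustrative:

```python
import sqlite3

# Deliberately trivial stand-in for a trained ML model: forecast next
# month's sales as the mean of the last two months.
def predict_sales(last_month, prev_month):
    return 0.5 * last_month + 0.5 * prev_month

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, last_month REAL, prev_month REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("widget", 120.0, 100.0), ("gadget", 80.0, 90.0)],
)

# Register the Python function with the engine; from here on it can be
# called inside any SQL query, right next to the data.
conn.create_function("predict_sales", 2, predict_sales)

rows = conn.execute(
    "SELECT product, predict_sales(last_month, prev_month) FROM sales"
).fetchall()
print(rows)  # [('widget', 110.0), ('gadget', 85.0)]
```

The forecast runs where the data lives, so no rows have to be exported, transformed and re-imported before a prediction is available.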
ML models can be integrated directly into more than just modern database systems. With the MicroPython framework, simple machine learning applications can even run on microcontrollers, i.e. the smallest processors used in the Internet of Things. Among other things, Jon Nordby presented their use in industrial sensors for vibration measurement. Running ML screening in Python directly on a microcontroller attached to a turbine, sensor data is examined for patterns that could indicate a malfunction. The volume of data that has to be transmitted to a central monitoring system can thus be reduced drastically.
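The on-device screening idea can be illustrated with a dependency-free sketch that would also run under MicroPython: compute a simple energy feature per vibration window and forward only suspicious windows. The threshold, window size and signal below are invented for illustration; a real deployment would use a properly trained and calibrated model rather than a fixed cutoff.

```python
import math

# Invented calibration values; in practice these would be derived from
# baseline recordings of the healthy machine.
RMS_THRESHOLD = 0.8
WINDOW = 4

def rms(window):
    """Root-mean-square energy of one vibration window."""
    return math.sqrt(sum(x * x for x in window) / len(window))

def screen(samples, window=WINDOW, threshold=RMS_THRESHOLD):
    """Yield (offset, energy) only for windows that look anomalous,
    so only those need to be sent to the central monitoring system."""
    for i in range(0, len(samples) - window + 1, window):
        energy = rms(samples[i:i + window])
        if energy > threshold:
            yield (i, energy)

# Mostly quiet accelerometer signal with one high-energy burst.
signal = [0.1, -0.1, 0.2, 0.0, 2.0, -1.8, 1.9, -2.1, 0.1, 0.0, -0.2, 0.1]
alerts = list(screen(signal))
print(alerts)  # a single alert, for the burst starting at sample 4
```

Only the flagged windows leave the device, which is exactly the bandwidth saving described above.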
ML applications are everywhere these days. They are directly available in cloud platforms, BI tools and database systems. At the same time, the old adage applies: the devil is in the model. Or something like that. At PyData Berlin 2024 we also saw a number of contributions that offered solutions to specific challenges while underlining how difficult these problems are to generalize.
Miguel de Benito Delgado and Kristof Schröder, for example, presented current approaches from the still active research field of "data valuation": in the context of machine learning models, data valuation asks how much individual data points or subsets of the data contribute to (or detract from) the prediction quality of a model. In practice, various mathematical approaches can be used here to identify possible errors or impurities in a training data set or to optimize feature engineering for an ML model.
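As a minimal illustration of the idea (a sketch of the simplest scheme, not the speakers' actual methods or tooling), leave-one-out valuation scores each training point by how much validation accuracy drops when that point is removed. The toy 1-D, 1-nearest-neighbour setup and all data below are made up:

```python
# Leave-one-out (LOO) data valuation: a training point's value is the
# drop in validation accuracy when it is removed from the training set.

def predict(train, x):
    # 1-NN: return the label of the closest training point
    return min(train, key=lambda p: abs(p[0] - x))[1]

def accuracy(train, val):
    return sum(predict(train, x) == y for x, y in val) / len(val)

def loo_values(train, val):
    full = accuracy(train, val)
    return [full - accuracy(train[:i] + train[i + 1:], val)
            for i in range(len(train))]

train = [(0.0, "a"), (1.0, "a"), (4.0, "b"), (10.0, "b")]
val = [(0.5, "a"), (3.0, "a"), (9.0, "b")]

values = loo_values(train, val)
worst = min(range(len(train)), key=lambda i: values[i])
print(values)        # the third point gets a negative value
print(train[worst])  # (4.0, 'b') is flagged as harmful, likely noisy
```

A negative value marks a point whose removal improves the model, which is precisely how such methods surface impurities in a training set.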
Daria Mokrytska introduces the classical time-series forecasting challenge known as the cold start problem at the PyData conference:
No training data is available for a specific object of interest. (Source: PyConDE/PyData Berlin)
The cold start problem in time series forecasting describes the fact that real-world use cases, for example sales or revenue forecasts, regularly have to make statements about products for which little or no historical data is available. In principle, models cannot say anything about the future behavior of previously unknown objects. Daria Mokrytska and Alexander Meier presented several approaches for transferring knowledge about known objects to previously unknown ones. There are no perfect solutions here, but good approximations can be achieved with relatively simple heuristics.
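One such heuristic (a sketch of the general idea, not the speakers' exact approach) borrows the average, volume-normalized sales "shape" of comparable established products and scales it to the first observation of the new product. All product names and figures below are invented:

```python
# Cold-start heuristic: transfer the seasonal shape of similar known
# products to a brand-new product with a single observed data point.

known_histories = {
    "shirt_red":  [100, 120, 80, 100],
    "shirt_blue": [50, 60, 40, 50],
}

def shape(series):
    # normalise a history to mean 1 so differently sized products mix
    mean = sum(series) / len(series)
    return [x / mean for x in series]

def cold_start_forecast(first_period_sales, neighbour_histories):
    shapes = [shape(s) for s in neighbour_histories]
    avg_shape = [sum(col) / len(col) for col in zip(*shapes)]
    # scale the borrowed shape so period 1 matches the one observation
    level = first_period_sales / avg_shape[0]
    return [level * a for a in avg_shape]

forecast = cold_start_forecast(30, list(known_histories.values()))
print([round(f, 1) for f in forecast])  # [30.0, 36.0, 24.0, 30.0]
```

The choice of "comparable" neighbours (by category, price band, etc.) is where domain knowledge enters; the arithmetic itself stays deliberately simple.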
In addition to these purely technical challenges, Katherine Jarmul drew attention in her presentation to the confidentiality risks of deep learning models. In particular, training data that deviates significantly from the norm or the average is sometimes memorized verbatim by the model and can therefore be reproduced with targeted queries. Models trained on sensitive personal data or business secrets thus represent a potential security vulnerability. Technical mitigations are generally complicated. Dedicated AI models that are developed, trained and used within specific autonomous domains could represent a cultural alternative to the large, global models of the current generation.
Our visit to the PyCon / PyData Berlin conference this year gave us many exciting impressions. With a focus on machine learning in a business context, our takeaway is that the field has reached a high level of maturity, even if challenges and a need for optimization remain. Even in 2024, machine learning and AI applications are not a sure-fire success: to use them profitably, you need specific expert knowledge about the methods involved and their pitfalls. Feel free to talk to us about your projects or current challenges in ML and AI.
A second part of our review will follow shortly, focusing on data engineering topics. However, there is far more to see and experience at the conference than can be captured in these short reports. Perhaps we will meet you in person at PyCon/PyData next year?