Garbage in, garbage out. This well-known guiding principle from IT also applies to data analysis. Even the most extensive analysis creates no added value if it is based on flawed, inconsistent data. When advanced analytics and forecasts are built with machine learning (ML), this principle applies more than ever. A machine learning model makes its decisions based on the data it has seen; during model training, the relevant relationships are extracted from that data. Besides data quality, the preparation of the influencing factors in the form of features is crucial for model quality. The associated ML process for creating these influencing factors is called feature engineering.
In this blog article, we will show you the importance of feature engineering for successful machine learning models. You will also learn about the possibilities and limitations of automated feature engineering.
In feature engineering, concrete influencing factors (features) for the later machine learning model are derived from the available data. These features are extensions, simplifications, and transformations of the original data and are subsequently used in the machine learning project to train the model and generate predictions. The goal is to improve model performance through meaningful features.
The possibilities for generating new features depend on the dataset. Even though there are no limits to creativity, there are some standard procedures, for example:
Conversion of city names, as categorical data, into their geo-coordinates (see the sketch below)
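As an illustration of this conversion, here is a minimal sketch in Python. The lookup table, column names, and cities are illustrative assumptions, not part of the original article; in practice, a geocoding service or a reference dataset would supply the coordinates.

```python
# Minimal sketch: replacing a categorical city name with two numeric
# geo-coordinate features. The lookup table below is purely illustrative.
import pandas as pd

# Hypothetical reference table with (latitude, longitude) per city
city_coords = {
    "Berlin":  (52.520, 13.405),
    "Hamburg": (53.551, 9.993),
    "Munich":  (48.137, 11.575),
}

df = pd.DataFrame({"city": ["Berlin", "Munich", "Hamburg"]})

# Derive two numeric features from the categorical column
df["lat"] = df["city"].map(lambda c: city_coords[c][0])
df["lon"] = df["city"].map(lambda c: city_coords[c][1])
```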
Now that you know the possibilities of feature engineering, the next section shows which interactions between the features and the model are important.
The goal of feature engineering is to improve model performance. To this end, the quality and, above all, the relevance of the features must be increased. A few highly relevant influencing factors lead to a better model result than many semi-relevant ones. Quantity is therefore not the measure of success.
Relevance can be evaluated via feature selection with respect to a specific model. Experienced data scientists use their knowledge of how the model works up front to create an optimal representation of the relevant influencing factors.
Example: Classic linear regression forms a simple linear model. If data exploration and visualization reveal in advance that an influencing factor has a logarithmic effect, a logarithmized version of that factor can fit the linear model better.
Figure: Logarithmization of the influencing factor to improve the linear relationship
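The effect can be reproduced in a few lines. The following sketch uses synthetic data with a logarithmic relationship as an assumption for illustration; with real data, the shape of the relationship would come from data exploration.

```python
# Sketch: a logarithmic relationship fitted once on the raw feature and
# once on its logarithm. The synthetic data is purely illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(1, 100, size=500)
y = 3.0 * np.log(x) + rng.normal(0, 0.3, size=500)  # log-shaped target

raw = LinearRegression().fit(x.reshape(-1, 1), y)
log = LinearRegression().fit(np.log(x).reshape(-1, 1), y)

print("R² raw feature:", raw.score(x.reshape(-1, 1), y))
print("R² log feature:", log.score(np.log(x).reshape(-1, 1), y))
# The logarithmized feature fits the linear model far better.
```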
After modeling, the importance of the influencing factors can be determined with the help of special feature-importance libraries. If a feature has only a small influence, a model without this feature can, under certain circumstances, outperform the full model.
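One widely used option is permutation importance, available in scikit-learn. The following sketch is one possible way to do this; the random data and the forest model are illustrative assumptions.

```python
# Sketch: measuring feature importance with scikit-learn's permutation
# importance. Feature 2 is pure noise and should score near zero.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = 2 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(0, 0.1, 300)

model = RandomForestRegressor(random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

for i, imp in enumerate(result.importances_mean):
    print(f"feature {i}: importance {imp:.3f}")
# A feature with near-zero importance is a candidate for removal.
```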
It is important at all times that the created influencing factors can also be generated from new data during live operation of the model. Otherwise, the model is not applicable and can only be used descriptively. A logarithmization is transparent and easy to reproduce. For standardization, on the other hand, the mean and variance of the underlying feature in the training data must be stored. For this purpose, common frameworks offer special ML pipelines that store the corresponding transformation values.
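In scikit-learn, for example, this looks as follows. The tiny dataset and the Ridge model are illustrative assumptions; the point is that the pipeline persists the scaler's mean and variance from training and reuses them for new data.

```python
# Sketch: a Pipeline stores the transformation parameters (here the
# scaler's mean and variance) learned on the training data, so the
# identical transformation can be applied to new data in production.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([2.0, 4.0, 6.0, 8.0])

pipe = Pipeline([("scale", StandardScaler()), ("model", Ridge())])
pipe.fit(X_train, y_train)  # the scaler learns mean_ and var_ here

X_new = np.array([[2.5]])
print(pipe.predict(X_new))  # new data is scaled with the stored values
```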
The greatest danger of feature engineering is uncovered, at the latest, in productive environments. In many ML projects, data leakage leads to astonishing results during modeling that cannot be reproduced on new data. With data leakage, the created features reveal indirect information about the target value. In a sales forecast, for example, a "sales class" feature indicates the rough range of the sales value. Without the sales value, however, no sales class can be generated in production: the sales value is what the model is supposed to predict and is therefore not available. An experienced data scientist therefore checks the created features with a critical eye.
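The sales example can be made concrete. In the following sketch, the synthetic data, column names, and binning are illustrative assumptions; the leaky feature is derived directly from the target and produces a suspiciously good validation score.

```python
# Sketch of the leakage described above: a "sales class" feature binned
# directly from the target. Data and thresholds are purely illustrative.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
df = pd.DataFrame({"ad_budget": rng.uniform(0, 10, 500)})
df["sales"] = 5 * df["ad_budget"] + rng.normal(0, 5, 500)

# Leaky feature: derived from the target itself
df["sales_class"] = pd.cut(df["sales"], bins=5, labels=False)

honest = cross_val_score(LinearRegression(), df[["ad_budget"]], df["sales"])
leaky = cross_val_score(
    LinearRegression(), df[["ad_budget", "sales_class"]], df["sales"]
)
print("without leakage:", honest.mean())  # realistic score
print("with leakage:   ", leaky.mean())   # suspiciously good score
# In production, "sales_class" cannot be computed before "sales" is known.
```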
In general, the features have a greater influence on model performance than optimizing the training process via hyperparameters. However, hyperparameter optimization can be automated very well. In the last step, we therefore consider to what extent feature engineering, as a lever for project success, can also be automated.
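To illustrate how routinely hyperparameter optimization can be automated, here is a minimal grid-search sketch; the parameter grid and the generated dataset are illustrative assumptions.

```python
# Sketch: hyperparameter optimization automated with a grid search.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=5, random_state=0)

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=3,
)
search.fit(X, y)          # tries every parameter combination
print(search.best_params_)  # the best combination is selected automatically
```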
In automated machine learning (AutoML), model-specific parameters are optimized during training and the best model is selected from several model types. Ideally, feature engineering would also run as a process step without manual intervention. In some subareas, such as the automatic generation of feature transformations and the automatic selection of relevant features, this is already possible.
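Both subareas can be sketched with standard scikit-learn tools. In this hedged example, polynomial feature generation stands in for automated transformation and a univariate filter for automated selection; the dataset and parameters are illustrative assumptions.

```python
# Sketch: automated feature generation and selection in one pipeline.
# PolynomialFeatures generates transformed and interaction features
# automatically; SelectKBest then keeps only the most relevant ones.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_regression(n_samples=200, n_features=4, random_state=0)

pipe = Pipeline([
    ("generate", PolynomialFeatures(degree=2, include_bias=False)),
    ("select", SelectKBest(score_func=f_regression, k=5)),
])
X_new = pipe.fit_transform(X, y)
print(X_new.shape)  # only the 5 most relevant generated features remain
```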
The limits of what can be automated arise from the use of domain knowledge. While transformations are easy to implement, mapping the known history in the form of indicator features is demanding, both conceptually and in terms of code. Their design depends heavily on the use case and draws on domain knowledge. If, for example, customer-specific features are already available in a database, selecting from these features is again easy to automate. In this case, it can be worthwhile to maintain general features in good quality across the company and to provide them centrally for machine learning projects.
In summary, feature engineering is a targeted method for improving model success. Influencing factors can be mapped in a model- and domain-oriented manner using methods that range from simple to complex. For simple use cases, the automated creation and selection of features saves additional manual effort.
Do you have further questions about the process of feature engineering in ML projects? Are you trying to build up the necessary know-how in your department or do you need support with a specific question?
We are happy to help you. Request a non-binding consulting offer today!