Machine learning, or ML, is becoming increasingly important for enterprises looking to use their data to improve their customer experience, develop better products, and more. But before a company can make good use of machine learning technology, it needs to make sure it has good data to use in artificial intelligence and ML models.
What is data preparation?
Data preparation involves cleaning, transforming, and structuring data to prepare it for further processing and analysis. Data usually does not reach companies in a standardized format and thus must be prepared for business use.
Before data scientists can use machine learning models to gain insights, they must first transform the data (reformatting or perhaps correcting it) so that it is in a consistent format that meets their needs. By some estimates, as much as 80% of a data scientist’s time is spent on data preparation. Given how costly it can be to recruit and retain data science talent, this is an indication of how important data preparation is to data science.
Why is data preparation important for machine learning?
ML models will always require specific data formats to function properly. Data preparation can repair missing or incomplete information so that the models are applied to good data.
Some of the data a company collects in its data lake or elsewhere is structured, such as customer names, addresses, and product preferences, while most of it is almost certainly unstructured, such as geospatial data, product reviews, mobile activity, and tweet data. Either way, this raw data is basically useless to the company’s data science team until it’s formatted in standardized, consistent ways.
Talend, a company that provides tools to help businesses manage data integrity, has suggested some key benefits of data preparation, including the ability to fix errors quickly by “catch[ing] errors before processing” and a reduction in the data management costs that can accrue when you try to apply bad data to otherwise good ML models.
Best practices for data preparation in machine learning
For a broad overview, you can review the top five tips for data preparation; those more general tips usually apply to ML data preparation as well. However, there are some specific nuances to preparing ML data that are worth investigating.
Prepare your data according to a plan
You probably know in advance what you want your ML model to predict, so it pays to prepare for it. When you have a good idea of the outcome you hope to achieve, you can better define what types of data you want to collect and how you want to clean it.
This also allows you to better respond to missing or incomplete data. A common approach to missing data is null value replacement. For example, if you are an airline with passenger data, you may choose to put a null value in the field that tracks meal preferences.
But depending on your application, null value replacement can be a terrible approach. To continue the example, the airline should not put a null value in place of missing passenger nationality data, as this could cause serious problems with the passenger’s travel experience. Knowing what data is critical and how to handle incomplete records is essential.
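The airline example can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the `FIELD_POLICY` table, the rule names and the `apply_missing_policy` helper are all hypothetical, invented here to show the idea of deciding per field whether a null is acceptable.

```python
# Hypothetical per-field policy: which fields may be empty and which are critical.
FIELD_POLICY = {
    "meal_preference": "allow_null",  # optional: a null value is harmless
    "nationality": "require",         # critical: never silently fill in
}


def apply_missing_policy(records, policy):
    """Split records into usable rows and rows that need manual follow-up."""
    usable, needs_review = [], []
    for record in records:
        # Collect any critical fields that are absent or empty.
        missing_critical = [
            field
            for field, rule in policy.items()
            if rule == "require" and record.get(field) in (None, "")
        ]
        if missing_critical:
            needs_review.append((record, missing_critical))
            continue
        cleaned = dict(record)
        for field, rule in policy.items():
            if rule == "allow_null" and cleaned.get(field) in (None, ""):
                cleaned[field] = None  # normalize empty values to an explicit null
        usable.append(cleaned)
    return usable, needs_review


# Made-up passenger records for illustration only.
passengers = [
    {"name": "A. Rivera", "meal_preference": "", "nationality": "ES"},
    {"name": "B. Chen", "meal_preference": "vegetarian", "nationality": None},
]
usable, needs_review = apply_missing_policy(passengers, FIELD_POLICY)
```

Here the first record is kept with an explicit null meal preference, while the second is set aside for review because its nationality is missing.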
Think of the people involved in data collection
Although you should consider investing in robotic process automation to handle simple, repetitive tasks so that your employees don’t become bored, people will continue to be both your greatest asset and your greatest barrier to good data preparation for ML. Even within the same department, companies are often overrun by data silos.
For example, a news organization can understand the interests of a reader on the web, but cannot personalize a mobile app run by another team with different underlying storage systems.
Helping employees become collectively data-driven means working on collecting and using data, as well as sharing that data in useful ways across departments and roles. Collective data collection and use processes are critical to ensure better data for ML models.
Avoid target leakage
Google, a leader in data science and ML, offers some smart advice on target leakage in ML training data: “Target leakage occurs when your training data contains predictive information that is not available when you ask for a prediction.”
Google’s experts further explained that target leakage can cause ML models to show excellent evaluation metrics yet perform poorly on real data. The important job here is to make sure that the historical data you train on contains everything you need to make accurate predictions, and only information that will actually be available at prediction time.
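One way to act on this advice is to record, for every column, whether its value is known at the moment a prediction is requested, and to drop the rest before training. The sketch below assumes a hand-maintained catalog; the `FEATURE_AVAILABILITY` table, its labels and the churn-style column names are hypothetical and are not taken from Google’s guidance.

```python
# Hypothetical feature catalog: when each column's value becomes known,
# relative to the moment a prediction is requested.
FEATURE_AVAILABILITY = {
    "account_age_days": "before_prediction",
    "support_tickets_last_30d": "before_prediction",
    "cancellation_reason": "after_outcome",  # only known once the outcome has happened
}


def leakage_free_features(availability):
    """Return the feature names that are known at prediction time."""
    return sorted(
        name for name, when in availability.items() if when == "before_prediction"
    )


def drop_leaky_columns(rows, availability):
    """Keep only the columns that will exist when the model is asked to predict."""
    keep = set(leakage_free_features(availability))
    return [{k: v for k, v in row.items() if k in keep} for row in rows]


# Made-up training rows for illustration only.
training_rows = [
    {"account_age_days": 400, "support_tickets_last_30d": 3, "cancellation_reason": "price"},
    {"account_age_days": 90, "support_tickets_last_30d": 0, "cancellation_reason": None},
]
safe_rows = drop_leaky_columns(training_rows, FEATURE_AVAILABILITY)
```

A catalog like this will not catch every leak, but it forces the question of availability to be answered for each column before the model sees it.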
Split your data
Deepchecks, a company offering an open-source Python library for ML, suggests that companies should split their data into training, validation, and test sets for better results.
By “develop[ing] insights from the training data and then apply[ing] processing to all datasets,” you get a good idea of how your model will perform against real-world data. Usually it makes sense to have 80% of your data in the training set and 20% in the test set.
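The sketch below illustrates that pattern with a plain 80/20 split: a fill value for missing numbers is learned from the training set alone and then applied to both sets. It is a minimal example under stated assumptions, not the Deepchecks library’s API; the helper names (`train_test_split`, `fit_imputer`, `impute`), the `spend` field and the sample rows are all hypothetical.

```python
import random
from statistics import mean


def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows and split them into training and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def fit_imputer(train_rows, field):
    """Learn a fill value (the mean) from the training set only."""
    return mean(r[field] for r in train_rows if r[field] is not None)


def impute(rows, field, fill_value):
    """Apply a previously learned fill value to any dataset."""
    return [
        {**r, field: fill_value if r[field] is None else r[field]}
        for r in rows
    ]


# Made-up data: ten rows, with every fifth value missing.
rows = [{"id": i, "spend": None if i % 5 == 0 else float(i)} for i in range(10)]

train, test = train_test_split(rows)
fill = fit_imputer(train, "spend")    # insight developed from the training data
train = impute(train, "spend", fill)  # processing applied to every dataset
test = impute(test, "spend", fill)
```

Computing the fill value from the training rows only keeps information from the test set out of the preparation step, so the test score stays an honest estimate.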
Beware of bias
While we might assume that machines always make unbiased, correct decisions, sometimes these machines are simply more efficient at conveying our own prejudices. Because bias can sneak into ML models, it’s essential to scrutinize the data sources you use to train them.
Machine learning models are only as smart as the data that feeds them, and that data is limited by the people who collect it. In turn, people are affected by the data coming from the machines and can get further and further away from raw data. As a whole, we are less and less able to provide good data to our models because we have come to trust them wholeheartedly.
A strong dose of humility and prudence is critical in preparing data for ML so that biases do not spread across generations of data and models. To ensure that your data team is not only tech-savvy, but also understands where problems might arise in preparing machine learning data, consider signing them up for a comprehensive machine learning course.
Make time for data exploration
It can be tempting to jump straight into modeling without first building a strong foundation through data exploration. Data exploration is an important first step because it allows you to explore the data distributions of individual variables or the relationships between variables. You can also check for things like collinearity, which can indicate variables moving together. Data exploration is a great way to get a good sense of where your data may be incomplete or where further transformation can help.
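As a rough illustration of what a first exploration pass might look at, the sketch below summarizes one variable’s distribution, counts its missing values, and computes a Pearson correlation between two variables as a simple collinearity check. The `summarize` and `pearson` helpers and the two price columns are hypothetical; in practice a data analysis library would usually do this work.

```python
from math import sqrt
from statistics import mean, median


def summarize(values):
    """Basic distribution summary for one variable, ignoring missing values."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "min": min(present),
        "median": median(present),
        "mean": mean(present),
        "max": max(present),
    }


def pearson(xs, ys):
    """Pearson correlation over the pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    mx = mean(x for x, _ in pairs)
    my = mean(y for _, y in pairs)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    var_x = sum((x - mx) ** 2 for x, _ in pairs)
    var_y = sum((y - my) ** 2 for _, y in pairs)
    return cov / sqrt(var_x * var_y)


# Made-up columns: the same price in two currencies, one value missing.
price_usd = [10.0, 20.0, None, 40.0, 50.0]
price_eur = [9.0, 18.5, 27.0, 37.0, 46.0]

usd_summary = summarize(price_usd)
r = pearson(price_usd, price_eur)  # close to 1.0: the columns move together
```

A correlation near 1 or -1 suggests that two variables carry largely the same information, which is worth knowing before both are fed to a model.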
Disclosure: I work for MongoDB, but the views expressed herein are mine.