Data is the lifeblood of machine learning models. But what happens when access to this coveted resource is limited? As many projects and companies are starting to show, synthetic data can be a viable, if not superior, alternative.
What is synthetic data?
Synthetic data can be defined as information that is artificially generated rather than obtained by direct measurement. At its core, the idea of "fake" data is not a new or revolutionary concept; it is simply another name for the long-standing practice of generating test or training data for models when the real information they need is unavailable.
In the past, a lack of data often led to the convenient shortcut of using a randomly generated set of data points. While that may be sufficient for educational and testing purposes, random data is not something you would want to train a prediction model on. This is where synthetic data differs: it is designed to be reliable.
Synthetic data is, at its heart, the idea that we can be deliberate about the way we produce randomized data. That deliberateness is what allows the approach to be applied to far more advanced use cases than simple tests.
How is synthetic data produced?
Although synthetic data is not created by a fundamentally different mechanism than random data, only driven by more complex inputs, it serves a different purpose and therefore comes with its own requirements.
The synthetic approach is based on, and constrained by, criteria that are provided in advance as input, so in practice it is not arbitrary at all. It starts from a sample set of real data whose distributions and characteristics guide the possible range, distribution and frequency of the generated data points. Essentially, the goal is to replicate the real data in order to populate a much larger data set, one comprehensive enough to train machine learning models.
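To make that concrete, here is a minimal sketch of the idea: fit simple distributions to a small sample of real, tabular data and then draw a much larger synthetic set from those fits. The column names, the distribution choices and the clipping step are illustrative assumptions, not a prescribed recipe.

```python
# A minimal sketch of distribution-based synthetic data generation.
# Assumptions: a small tabular sample of numeric features, each column
# modeled independently with a simple parametric distribution. Column
# names and distribution choices are illustrative, not prescriptive.
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)

# A small "real" sample (in practice, collected data).
real = pd.DataFrame({
    "age": rng.normal(38, 9, size=200).clip(18, 70),
    "income": rng.lognormal(mean=10.5, sigma=0.4, size=200),
})

def generate_synthetic(sample: pd.DataFrame, n_rows: int) -> pd.DataFrame:
    """Fit a simple distribution per column, then sample new rows."""
    synthetic = {}
    # Age looks roughly normal: fit mean and standard deviation.
    mu, sigma = stats.norm.fit(sample["age"])
    synthetic["age"] = stats.norm.rvs(mu, sigma, size=n_rows, random_state=rng)
    # Income is skewed: a log-normal fit preserves that shape.
    shape, loc, scale = stats.lognorm.fit(sample["income"], floc=0)
    synthetic["income"] = stats.lognorm.rvs(shape, loc=loc, scale=scale,
                                            size=n_rows, random_state=rng)
    out = pd.DataFrame(synthetic)
    # Keep generated values inside the observed range of the real sample.
    return out.clip(lower=sample.min(), upper=sample.max(), axis=1)

synthetic = generate_synthetic(real, n_rows=10_000)
print(synthetic.describe())
```

The key point is that the generated rows follow the ranges and distributions observed in the sample rather than being uniformly random noise.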
SEE: Ethical Policy for Artificial Intelligence (TechRepublic Premium)
This method becomes especially interesting when deep learning is used to refine the synthetic data, most notably in generative adversarial setups, where algorithms are pitted against each other: one tries to produce convincing synthetic data while the other tries to tell it apart from the real thing. Essentially, the goal is to create an artificial arms race that pushes the generated data towards hyper-realism.
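For readers who want to see what that arms race looks like in code, below is a minimal generative-adversarial sketch in PyTorch. It is not any particular production system: the toy two-dimensional "real" distribution, the network sizes and the hyperparameters are all illustrative assumptions.

```python
# A minimal GAN-style sketch: a generator learns to produce fake 2-D
# samples while a discriminator learns to tell them from real ones.
# The toy "real" distribution and all hyperparameters are illustrative.
import torch
import torch.nn as nn

torch.manual_seed(0)

def real_batch(n):
    # Stand-in for a batch of real data: a correlated 2-D Gaussian.
    base = torch.randn(n, 1)
    return torch.cat([base, 0.5 * base + 0.1 * torch.randn(n, 1)], dim=1)

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))  # noise -> fake sample
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))  # sample -> realness logit

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(2000):
    # Train the discriminator to separate real from generated samples.
    real = real_batch(64)
    fake = G(torch.randn(64, 8)).detach()
    d_loss = loss_fn(D(real), torch.ones(64, 1)) + \
             loss_fn(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Train the generator to fool the discriminator.
    fake = G(torch.randn(64, 8))
    g_loss = loss_fn(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

# After training, the generator can emit synthetic samples on demand.
synthetic = G(torch.randn(1000, 8)).detach()
print(synthetic.mean(dim=0), synthetic.std(dim=0))
```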
Why is synthetic data necessary in the first place?
If we can’t gather the valuable resources we need to advance our civilization, which applies to everything from growing food to generating fuel, we’ll find a way to create it. The same principle now applies in the field of data for machine learning and AI.
It is critical to have a very large sample of data when training algorithms; otherwise the patterns the algorithm identifies risk being too simplistic for real-world applications. This is quite logical: just as human intelligence tends to choose the easiest path to a solution, machine learning and AI do the same during training.
For example, consider an object recognition algorithm meant to accurately pick out a dog from a selection of cat images. With too little data, the AI risks latching onto patterns that are not fundamental features of the objects it is trying to identify. It may still appear to work, but as soon as it encounters data that does not follow the pattern it initially learned, it falls flat.
How is synthetic data used to train AI?
So what is the solution? We show the network lots of animals that are slightly different, forcing it to learn the underlying structure of the image rather than the placement of particular pixels. But instead of hand-drawing a million dogs, it is better to build a system designed solely to draw dogs that can then be used to train the classification algorithm, which is essentially what we do when we provide synthetic data to train machine learning algorithms.
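In practice, one inexpensive way to get "lots of animals that are slightly different" from a handful of real photos is random augmentation. The sketch below uses torchvision; the file path and transform parameters are illustrative assumptions rather than a recommended configuration.

```python
# A minimal augmentation sketch with torchvision: generate many slightly
# different variants of one image, a cheap way to approximate "drawing
# lots of dogs." The file path and transform parameters are illustrative.
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),   # vary framing
    transforms.RandomHorizontalFlip(),                      # vary orientation
    transforms.ColorJitter(brightness=0.3, contrast=0.3),   # vary lighting
    transforms.RandomRotation(15),                           # vary pose slightly
])

dog = Image.open("dog.jpg").convert("RGB")  # hypothetical source image

# Each call applies a fresh random combination of the transforms above,
# so the same photo yields many distinct training examples.
variants = [augment(dog) for _ in range(100)]
variants[0].save("dog_variant_0.jpg")
```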
However, there are obvious pitfalls in this method. Simply generating data from scratch will not be representative of the real world, and the resulting algorithm will most likely fail when it encounters real data. The solution is to collect a subset of real data, analyze it to identify its trends and ranges, and then use those findings to generate a large set of data that plausibly represents what we would have seen had we collected it all ourselves.
This is where the value of synthetic data really lies. We no longer have to tirelessly run around collecting data that must then be cleaned and processed before use.
How does synthetic data solve the growing focus on data privacy?
The world is currently experiencing a strong shift, especially in the EU, towards better protection of the privacy of the data we generate through our online presence. In the field of machine learning and AI, tightening data protection keeps proving to be a hurdle, because the data being restricted is often exactly what is needed to make training algorithms perform and deliver value to end users, especially in B2C solutions.
In general, the privacy problem is overcome when an individual decides to use a solution and consents to the use of their data. The difficulty is that it is very hard to persuade users to hand over their private data before you have a solution that offers enough value to justify doing so. As a result, providers often end up caught in a chicken-and-egg dilemma.
SEE: How to choose the right data privacy software for your business (TechRepublic)
The solution can and will be the synthetic approach, where a company obtains a subset of data through early adopters and then uses it as the basis for generating enough data to train its machine learning and AI. This approach can dramatically reduce the time-consuming and expensive reliance on private data while the company continues developing algorithms for its actual users.
For industries embroiled in data red tape, such as healthcare, banking and legal, synthetic data offers an easier route to volumes of data that were previously out of reach, a limitation that has often held back newer and more sophisticated algorithms.
Can synthetic data replace real data?
The problem with real data is that it is not generated with the intent of training machine learning and AI algorithms; it is simply a byproduct of the events happening around us. As mentioned, this limits its availability and ease of collection, but it also means we do not control the parameters of the data, leaving room for errors and outliers that can distort results. This is why synthetic data, which can be modified and controlled, is more efficient for training models.
Inevitably, despite its advantages for training, synthetic data will always rely on at least a small subset of real data for its own creation. So no, synthetic data will never replace the original data it must be based on. More realistically, it will significantly reduce the amount of real data needed to train algorithms, a process that consumes far more data than testing does: typically 80% of the data goes to training, while the remaining 20% goes to testing.
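As a rough illustration of that split, the sketch below keeps 20% of the real data aside for testing, pads the training portion with synthetic data and evaluates only on the held-out real data. The arrays, the model and the exact proportions are illustrative assumptions.

```python
# A minimal sketch of the 80/20 split, combining synthetic data with a
# small real sample for training while keeping held-out real data for
# evaluation. Arrays, model choice and proportions are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-ins for the data: a small real set and a large synthetic set.
X_real, y_real = rng.normal(size=(500, 4)), rng.integers(0, 2, 500)
X_syn, y_syn = rng.normal(size=(10_000, 4)), rng.integers(0, 2, 10_000)

# Roughly 80% of the real data goes to training, 20% to testing.
X_train_real, X_test, y_train_real, y_test = train_test_split(
    X_real, y_real, test_size=0.2, random_state=0)

# Augment the real training portion with synthetic data; evaluation
# stays on real data only, so the score reflects real-world behavior.
X_train = np.vstack([X_train_real, X_syn])
y_train = np.concatenate([y_train_real, y_syn])

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy on held-out real data:", model.score(X_test, y_test))
```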
Ultimately, when properly approached, synthetic data provides a faster and more efficient way to get the data we need, at a lower cost than gathering it from the real world and with far less risk of stirring up data privacy concerns.

Christian Lawaetz Halvorsen is the chief technology officer and co-founder of Valuer, the AI-powered platform that is revolutionizing the way companies obtain information critical to their strategies and decision-making. With an MSc in Engineering, Product Development and Innovation from the University of Southern Denmark, Christian continues to refine Valuer's technical infrastructure using the optimal combination of human and machine intelligence.