Data scientists must decide which data belongs in their data repositories. These tips can make that decision-making process easier and help you stay in control of your data funnel.
As of 2022, roughly 2.5 trillion bytes of new data are created worldwide every day. While some of this data is useful for analysis, it can be time-consuming and difficult to search. Creating an effective data funnel makes it easier to filter down to the data you need.
What is a data funnel?
A data funnel refers to the practice of limiting the amount of data you allow into your master data repository.
A good way to think about a data funnel is to compare it to the hiring funnel a human resources department applies when it uses software to screen applicant resumes. HR inputs the requirements for a job opening into analytics software, which screens incoming resumes to create a smaller inbound funnel of applicants for a particular position. This lets HR and hiring managers focus on more important tasks instead of manually screening resumes.
Funneling also works on data. In one case, a life sciences company studying a particular molecule for its disease-fighting potential eliminated all incoming research sources that didn’t name the molecule. The goal was to save on storage and processing and to gain insights faster. While filtering out all that outside data worked for this company, running a data funnel is a balancing act between how much data you need and how much data you can afford to store and process.
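As a minimal sketch of how such a funnel might work, the snippet below filters incoming documents on a keyword at ingestion time. The molecule name, field names and sample documents are hypothetical, and a real pipeline would also match synonyms and chemical identifiers.

```python
# Minimal sketch of a keyword-based ingestion funnel. The molecule name,
# field names and sample documents are hypothetical; a real pipeline would
# also match synonyms and chemical identifiers.
MOLECULE = "psilocybin"  # hypothetical molecule of interest

def passes_funnel(document: dict) -> bool:
    """Keep only research sources that mention the target molecule."""
    text = f"{document.get('title', '')} {document.get('abstract', '')}".lower()
    return MOLECULE in text

incoming = [
    {"title": "Psilocybin and treatment-resistant depression", "abstract": "..."},
    {"title": "Unrelated metabolomics survey", "abstract": "..."},
]

# Only documents that pass the funnel reach the master repository.
repository = [doc for doc in incoming if passes_funnel(doc)]
print(f"Stored {len(repository)} of {len(incoming)} incoming documents")
```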
How do you determine which data is important?
The sheer cost of storage and processing, whether in-house or in the cloud, is forcing companies to evaluate how much data they need for business analytics.
In some cases, it’s easy to decide what data to discard. You probably don’t want noise from network and machine handshakes in your data, but deciding which topic-related data to exclude is more difficult. There is also the risk that analytics teams will miss an important insight because of excluded data.
For example, had it relied only on the data it would normally collect, a UK retailer would not have discovered that housewives made the bulk of their online purchases at home while their husbands were away at football matches.
Unexpected and valuable insights like this are why IT and end-business groups should exercise caution when deciding how far to narrow the funnel for incoming data.
3 best practices for managing a data funnel
Outline the use cases your analytics will support and the data you think you need
This should be a collaborative exercise between IT/data science and end users. Want to include social media product complaints when analyzing your sales and revenue data? And if you study disease rates in your New York medical service area, are you concerned about what’s happening in California?
Decide how accurate your analytics need to be
The gold standard for analytics accuracy is that an analysis should agree at least 95% of the time with what human subject matter experts would conclude, but do you always need 95%?
You may need 95% accuracy if you are assessing the likelihood of a medical diagnosis based on certain patient health conditions, but 70% accuracy may be sufficient if you are predicting what climate conditions could be like 20 years from now.
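As a rough illustration of matching accuracy targets to use cases, the sketch below scores an analysis against expert conclusions. The labels and the per-use-case targets are hypothetical examples, not fixed industry values.

```python
# Rough illustration: score analytics output against expert conclusions.
# The labels and the per-use-case accuracy targets are hypothetical.
expert_labels = ["positive", "negative", "positive", "positive", "negative"]
model_labels  = ["positive", "negative", "negative", "positive", "negative"]

matches = sum(m == e for m, e in zip(model_labels, expert_labels))
accuracy = matches / len(expert_labels)  # 4 of 5 match -> 80%

# Hypothetical accuracy targets for different use cases.
TARGETS = {"medical_diagnosis": 0.95, "long_term_climate_trend": 0.70}

use_case = "long_term_climate_trend"
target = TARGETS[use_case]
print(f"Accuracy {accuracy:.0%} vs. target {target:.0%} for {use_case}")
print("Meets target" if accuracy >= target else "Below target")
```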
Accuracy requirements affect the data funnel, and you may be able to exclude more data and narrow your funnel if you’re just looking for general, longer-term trends.
Regularly test the accuracy of your analytics
If your analysis shows 95% accuracy when it’s first implemented, but drops to 80% over time, it makes sense to recheck the data you’re using and recalibrate the data funnel.
Perhaps new data sources that were not available originally can now be used. Adding them will widen the data funnel, but if they improve accuracy, the cost of the wider funnel is worth it.
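As a minimal sketch of that kind of recurring check, assuming you keep an expert-labeled holdout sample, the snippet below flags when accuracy drifts below a hypothetical recalibration threshold.

```python
# Minimal sketch of a recurring accuracy check. The thresholds and the
# holdout sample are hypothetical; wire this into your own pipeline.
BASELINE_ACCURACY = 0.95   # accuracy measured at initial deployment
RECALIBRATE_BELOW = 0.85   # hypothetical trigger for revisiting the funnel

def accuracy_vs_experts(predictions, expert_labels) -> float:
    """Fraction of predictions that match expert-reviewed labels."""
    matches = sum(p == e for p, e in zip(predictions, expert_labels))
    return matches / len(expert_labels)

def accuracy_check(predictions, expert_labels) -> None:
    accuracy = accuracy_vs_experts(predictions, expert_labels)
    print(f"Current accuracy: {accuracy:.0%} (baseline {BASELINE_ACCURACY:.0%})")
    if accuracy < RECALIBRATE_BELOW:
        print("Accuracy has drifted: recheck sources and recalibrate the funnel.")

# Example run on a hypothetical monthly holdout sample:
accuracy_check(["a", "b", "b", "a", "b"], ["a", "b", "a", "a", "a"])
```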