What is training data?
In machine learning, training data is the data you use to train a machine learning algorithm or model. Preparing training data requires some human involvement to analyze or process it for machine learning use. How people are involved depends on the type of machine learning algorithms you are using and the type of problem they are intended to solve.
- With supervised learning, people are involved in choosing the data features to be used for the model. Training data must be labeled - that is, enriched or annotated - to teach the machine how to recognize the outcomes your model is designed to detect.
- Unsupervised learning uses unlabeled data to find patterns in the data, such as inferences or clustering of data points. There are hybrid machine learning models that allow you to use a combination of supervised and unsupervised learning.
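The difference between the two approaches can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the numbers, the "small" and "large" labels, and the function names are invented for the example, and the nearest-centroid classifier and two-cluster k-means loop stand in for whatever algorithms a real project would use.

```python
from statistics import mean

# Supervised: every example carries a label that a person supplied.
# (Values and labels are made up for illustration.)
labeled = [(1.0, "small"), (1.2, "small"), (0.8, "small"),
           (5.0, "large"), (5.3, "large"), (4.9, "large")]

def fit_centroids(examples):
    """Learn one average value (centroid) per label from labeled examples."""
    by_label = {}
    for value, label in examples:
        by_label.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_label.items()}

def predict(centroids, value):
    """Return the label whose centroid is closest to the value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))

# Unsupervised: the same kind of values, but with no labels at all.
def two_means(values, iterations=10):
    """Split 1-D values into two clusters with a basic k-means loop."""
    low, high = min(values), max(values)
    if low == high:
        raise ValueError("need at least two distinct values to form two clusters")
    for _ in range(iterations):
        groups = ([], [])
        for v in values:
            # Assign each value to the nearer of the two cluster centers.
            index = 1 if abs(v - high) < abs(v - low) else 0
            groups[index].append(v)
        low, high = mean(groups[0]), mean(groups[1])
    return groups
```

The supervised functions can name a new value ("small" or "large") because people provided those names up front. The unsupervised function only reports that two groups exist; deciding what the groups mean is still left to a person.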
Training data comes in many forms, reflecting the myriad potential applications of machine learning algorithms. Training datasets can include text (words and numbers), images, video, or audio, and they can arrive in many formats, such as a spreadsheet, PDF, HTML, or JSON. When labeled appropriately, your data can serve as ground truth for developing and iteratively improving a performant machine learning model.
What is labeled data?
Labeled data is annotated to show the target, which is the outcome you want your machine learning model to predict. Data labeling is sometimes called data tagging, annotation, moderation, transcription, or processing. The process of data labeling involves marking a dataset with key features that will help train your algorithm. Labeled data explicitly calls out features that you have selected to identify in the data, and that pattern trains the algorithm to discern the same pattern in unlabeled data.
Suppose, for example, that you are using supervised learning to train a machine learning model to review incoming customer emails and route them to the appropriate department for resolution. One outcome for your model could involve sentiment analysis, that is, identifying language that could indicate a customer has a complaint. You could decide to label every instance of the words “problem” or “issue” within each email in your dataset.
That, along with other data features you identify in the process of data labeling and model testing, could help you train the machine to accurately predict which emails to escalate to a service recovery team.
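As a rough sketch of the email example, the snippet below marks every occurrence of “problem” or “issue” in a message and attaches a coarse label. The function name, the output fields, and the "escalate" and "routine" label names are assumptions made for illustration; in practice, human labelers would review and extend what a keyword pass like this produces.

```python
import re

# Keyword list taken from the example in the text; a real project would
# refine it during data labeling and model testing.
COMPLAINT_TERMS = ("problem", "issue")

def label_email(text, terms=COMPLAINT_TERMS):
    """Return the character spans of each complaint term plus a coarse label."""
    # Whole-word, case-insensitive match; plurals such as "issues" are not
    # caught by this simple pattern.
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, terms)) + r")\b", re.IGNORECASE
    )
    spans = [(m.start(), m.end(), m.group().lower()) for m in pattern.finditer(text)]
    label = "escalate" if spans else "routine"  # hypothetical label names
    return {"text": text, "spans": spans, "label": label}
```

Calling `label_email("There is a problem with my order.")` yields one span covering the word “problem” and the label "escalate", while a message with neither keyword is labeled "routine".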
The way data labelers score each label, or assign weight to it, and how they manage edge cases also affect the accuracy of your model. You may need to find labelers with domain expertise relevant to your use case. The quality of the data labeling for your training data can determine the performance of your machine learning model.
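One common way to handle disagreement on edge cases is to collect several labelers' votes per item and keep a label only when enough of them agree. The sketch below is not a method described in this article; the function name and the 0.6 agreement threshold are assumptions chosen for illustration.

```python
from collections import Counter

def consolidate(votes, min_agreement=0.6):
    """Combine several labelers' votes for one item.

    Returns (label, agreement) when the most common label reaches
    min_agreement, otherwise (None, agreement) so the item can be
    routed to someone with domain expertise.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    label, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    return (label if agreement >= min_agreement else None, agreement)
```

With votes of `["complaint", "complaint", "routine"]`, two of three labelers agree, so the item keeps the label "complaint". With a one-to-one split, the function returns `None` for the label, which flags the item as an edge case for review.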
Enter the human in the loop.
What is human in the loop?
“Human in the loop” applies the judgment of people who work with the data that is used with a machine learning model. When it comes to data labeling, the humans in the loop are the people who gather the data and prepare it for use in machine learning.
Gathering the data includes getting access to the raw data and choosing the important attributes of the data that would be good indicators of the outcome you want your machine learning model to predict.
This is an important step because the quality and quantity of data that you gather will determine how good your predictive model could be. Preparing the data means loading it into a suitable place and getting it ready to be used in machine learning training.
Consider datasets that include point-cloud data from lidar-derived images that must be labeled to train machine learning models that operate autonomous vehicle (AV) systems. People use advanced digital tools, such as 3-D cuboid annotation software, to annotate features within that data, such as the occurrence, location, and size of every stop sign in a single image.
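To make the idea of a cuboid annotation concrete, here is a minimal sketch of how one labeled object might be stored. The class name, field names, and units are hypothetical and do not reflect any particular annotation tool's format. Real cuboids usually also carry a rotation, which this axis-aligned version leaves out for brevity.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CuboidAnnotation:
    """One labeled object in a point cloud (field names are illustrative)."""
    label: str     # object class, e.g. "stop_sign"
    center: tuple  # (x, y, z) position, assumed to be in meters
    size: tuple    # (length, width, height), assumed to be in meters

    def contains(self, point):
        """True if the (x, y, z) point lies inside this axis-aligned cuboid."""
        return all(
            abs(p - c) <= s / 2
            for p, c, s in zip(point, self.center, self.size)
        )

def count_points_inside(annotation, points):
    """Count how many lidar points fall within the annotated cuboid."""
    return sum(annotation.contains(p) for p in points)
```

A record like `CuboidAnnotation("stop_sign", (10.0, 2.0, 2.5), (0.1, 0.75, 0.75))` captures the occurrence, location, and size that the labeler marked, and `count_points_inside` gives a quick sanity check that the box actually encloses some points.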
This is not a one-and-done approach, because with every test, you will uncover new opportunities to improve your model. The people who work with your data play a critical role in the quality of your training data. Every incorrect label can have an effect on your model’s performance.
How is training data used in machine learning?
Unlike other kinds of algorithms, which are governed by pre-established parameters that provide a sort of “recipe,” machine learning algorithms improve through exposure to pertinent examples in your training data.
The features in your training data and the quality of labeled training data will determine how accurately the machine learns to identify the outcome, or the answer you want your machine learning model to predict.
For example, you could train an algorithm intended to identify suspicious credit card charges with cardholder transaction data that is accurately labeled for the data features, or attributes, you decide are key indicators for fraud.
The quality and quantity of your training data determine the accuracy and performance of your machine learning model. If you trained your model using training data from 100 transactions, its performance likely would pale in comparison to that of a model trained on data from 10,000 transactions. When it comes to the diversity and volume of training data, more is usually better – provided the data is properly labeled.
“As data scientists, our time is best spent fitting models. So we appreciate it when the data is well structured, labeled with high quality, and ready to be analyzed,” says Lander Analytics Founder and Chief Data Scientist Jared P. Lander. His full-service consulting firm helps organizations leverage data science to solve real-world challenges.
Training data is used not only to train but to retrain your model throughout the AI development lifecycle. Training data is not static: as real-world conditions evolve, your initial training dataset may be less accurate in its representation of ground truth as time goes on, requiring you to update your training data to reflect those changes and retrain your model.
What is the difference between training data and testing data?
It’s important to differentiate between training and testing data, though both are integral to improving and validating machine learning models. Whereas training data “teaches” an algorithm to recognize patterns in a dataset, testing data is used to assess the model’s accuracy.
More specifically, training data is the dataset you use to train your algorithm or model so it can accurately predict your outcome. Validation data is used to assess and inform your choice of algorithm and parameters of the model you are building. Test data is used to measure the accuracy and efficiency of the trained model, to see how well it can predict answers for examples it did not see during training.
Take, for example, a machine learning model intended to determine whether or not a human being is pictured in an image. In this case, training data would include images tagged to indicate the presence or absence of a person. After feeding your model this training data, you would then run it on test data it has not seen, including images with and without people, with the labels held back so you can compare them with the model’s predictions. The algorithm’s performance on test data would then confirm your training approach or indicate a need for more or different training data.
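A minimal sketch of the three-way split might look like the following. The 70/15/15 proportions, the function names, and the fixed random seed are assumptions for illustration, not a recommendation from this article. The test labels are kept out of the model's view and used only for scoring.

```python
import random

def split_dataset(examples, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle labeled examples and split them into train, validation, and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, validation, test

def accuracy(model, labeled_examples):
    """Fraction of examples where the prediction matches the held-back label."""
    correct = sum(model(features) == label for features, label in labeled_examples)
    return correct / len(labeled_examples)
```

Here `model` is any callable that maps an example's features to a predicted label. You would fit it on `train`, compare candidate algorithms and parameters on `validation`, and call `accuracy(model, test)` only once you have settled on a final model.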
How can I get training data?
You can use your own data and label it yourself, whether you use an in-house team, crowdsourcing, or a data labeling service to do the work for you. You also can purchase training data that is labeled for the data features you determine are relevant to the machine learning model you are developing.
Auto-labeling features in commercial tools can help speed up your team, but they are not consistently accurate enough to handle production data pipelines without human review. Dataloop, Hivemind, and V7 Labs have auto-labeling features in their enrichment tools.
Your machine learning use case and goals will dictate the kind of data you need and where you can get it. If you are using natural language processing (NLP) to teach a machine to read, understand, and derive meaning from language, you will need a significant amount of text or audio data to train your algorithm.
You would need a different kind of training data if you are working on a computer vision project to teach a machine to recognize or gain understanding of objects that can be seen with the human eye. In this case, you would need labeled images or videos to train your machine learning model to “see” for itself.
There are many sources that provide open datasets, such as Google, Kaggle and Data.gov. Many of these open datasets are maintained by enterprise companies, government agencies, or academic institutions.
How much training data do I need?
There’s no clear answer and no magical mathematical equation for this question, but more data is usually better. The amount of training data you need to create a machine learning model depends on the complexity of both the problem you seek to solve and the algorithm you develop to do it. One way to discover how much training data you will need is to build your model with the data you have and see how it performs.
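One way to carry out that experiment is a learning curve: train on progressively larger slices of the data you have and score each model on the same held-out examples. The sketch below assumes a `fit` function that returns a callable model and a `score` function that evaluates it. Both names, and the toy majority-label "model", are placeholders for illustration.

```python
from collections import Counter

def learning_curve(train_examples, test_examples, fit, score, sizes):
    """Train on growing subsets and record the held-out score for each size."""
    results = []
    for size in sizes:
        if size > len(train_examples):
            break  # not enough data for this size or any larger one
        model = fit(train_examples[:size])
        results.append((size, score(model, test_examples)))
    return results

def fit_majority(examples):
    """Toy 'model': always predict the most common label seen in training."""
    top_label = Counter(label for _, label in examples).most_common(1)[0][0]
    return lambda features: top_label

def accuracy(model, examples):
    """Fraction of examples the model labels correctly."""
    return sum(model(x) == y for x, y in examples) / len(examples)
```

If the score is still climbing at the largest size you can try, more labeled data is likely to help. If it has flattened out, extra data of the same kind may add little, and better labels or different features may matter more.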