The Ultimate Guide to Data Labeling for Machine Learning

Everything you need to know before engaging a data labeling service. Act strategically, build high-quality datasets, and reclaim valuable time to focus on innovation.

If you have massive amounts of data you want to use for machine learning or deep learning, you'll need tools and people to enrich it so you can train, validate, and tune your model. If your team is like most, you’re doing most of the work in-house and you’re looking for a way to reclaim your internal team’s time to focus on more strategic initiatives. Are you ready to hire a data labeling service?

This guide will take you through the essential elements of successfully outsourcing this vital but time-consuming work. From the technology available and the terminology used to best practices and the questions you should ask a prospective data labeling service provider, it's all here.


Read the full guide below, or download a PDF version of the guide you can reference later.

  1. Introduction
  2. Get Started
  3. Data Quality
  4. Scaling
  5. Pricing
  6. Security
  7. Tools
  8. Next Steps
  9. Contact
  10. FAQs

Introduction: Will this guide be helpful to me?

Let’s get a handle on why you’re here. This guide will be most helpful to you if you have data you can label for machine learning and you are dealing with one or more of the challenges below.

You have a lot of unlabeled data. Most data is not in labeled form, and that’s a challenge for most AI project teams. Fully 80% of AI project time is spent on gathering, organizing, and labeling data, according to analyst firm Cognilytica. That is time teams can’t afford to spend, because they are in a race to usable data: data that is structured and labeled properly in order to train and deploy models.

Percentage of Time Allocated to Machine Learning Project Tasks

Your data labels are low quality. There are many reasons your labels may be low quality, but usually the root causes can be found in the people, processes, or technology used in the data labeling workflow.

You want to scale your data labeling operations because your volume is growing and you need to expand your capacity. If you’re labeling data in-house, it can be very difficult and expensive to scale.

Your data labeling process is inefficient or costly. If you’re paying your data scientists to wrangle data, it’s a smart move to look for another approach. Salaries for data scientists can reach $190,000/year. It’s expensive to have some of your highest-paid resources spending time on basic, repetitive work.

You need to add quality assurance to your data labeling process or make improvements to the QA process already underway. This is an often-overlooked area of data labeling that can provide significant value, particularly during the iterative machine learning model testing and validation stages.

Let’s Get Started: Labeled data and ground truth

What is labeled data?

In machine learning, if you have labeled data, that means your data is marked up, or annotated, to show the target, which is the answer you want your machine learning model to predict. In general, data labeling can refer to tasks that include data tagging, annotation, classification, moderation, transcription, or processing.

What is data annotation?

Data annotation generally refers to the process of labeling data. Data annotation and data labeling are often used interchangeably, although they can be used differently based on the industry or use case.

Labeled data highlights data features - or properties, characteristics, or classifications - that can be analyzed for patterns that help predict the target. For example, in computer vision for autonomous vehicles, a data labeler can use frame-by-frame video labeling tools to indicate the location of street signs, pedestrians, or other vehicles.

What is ‘Human-in-the-Loop’ (HITL)?

HITL leverages both human and machine intelligence to create machine learning models. In a human-in-the-loop configuration, people are involved in a virtuous circle of improvement where human judgement is used to train, tune, and test a particular data model.
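
One common way to put people into that loop is confidence-based routing: the model keeps the labels it is sure about and sends the rest to human labelers, whose corrections can feed the next training round. The sketch below is illustrative only. The `Prediction` type, the `route` function, the item names, and the 0.9 threshold are all assumptions made for this example, not part of any specific tool.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    item_id: str
    label: str
    confidence: float  # model's estimated probability for its label, 0.0 to 1.0


def route(predictions, threshold=0.9):
    """Split predictions into auto-accepted labels and items for human review.

    The threshold is a hypothetical starting point; tune it per project.
    """
    auto_accepted, needs_human = [], []
    for p in predictions:
        if p.confidence >= threshold:
            auto_accepted.append(p)
        else:
            needs_human.append(p)
    return auto_accepted, needs_human


# Hypothetical model output for three images.
preds = [
    Prediction("img-001", "stop_sign", 0.98),
    Prediction("img-002", "pedestrian", 0.61),
    Prediction("img-003", "vehicle", 0.90),
]
accepted, review_queue = route(preds)
print([p.item_id for p in accepted])      # kept as machine labels
print([p.item_id for p in review_queue])  # sent to human labelers
```

Lowering the threshold sends fewer items to people and raises the risk of accepting wrong machine labels; raising it does the opposite.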

What are the labels in machine learning?

Labels are what the human-in-the-loop uses to identify and call out features that are present in the data. It’s critical to choose informative, discriminating, and independent features to label if you want to develop high-performing algorithms in pattern recognition, classification, and regression. Accurately labeled data can provide ground truth for testing and iterating your models. 

What is a “ground truth” in machine learning?

In machine learning, “ground truth” means checking the results of ML algorithms for accuracy against the real world. In essence, it’s a reality check for the accuracy of algorithms. The term is borrowed from meteorology, where "ground truth" refers to information obtained on the ground where a weather event is actually occurring. That data is then compared to forecast models to determine their accuracy.

What is “training data” in machine learning?

Training data is the enriched data you use to train a machine learning algorithm or model.

How are companies labeling their data today?

Organizations use a combination of software, processes, and people to clean, structure, or label data. In general, you have four options for your data labeling workforce:

  • Employees - They are on your payroll, either full-time or part-time. Their job description may not include data labeling. 
  • Managed teams - You use vetted, trained, and actively managed data labelers (e.g., CloudFactory).
  • Contractors - They are temporary or freelance workers.
  • Crowdsourcing - You use a third-party platform to access large numbers of workers at once.


Data labeling includes a wide array of tasks:

  • Using a tool to enrich data
  • Quality assurance for data labeling
  • Process iteration, such as changes in data feature selection, task progression, or QA
  • Management of data labelers
  • Training of new team members
  • Project planning, process operationalization, and measurement of success

We’ve been labeling data for a decade. Over that time, we’ve learned how to combine people, process, and technology to optimize data labeling quality. Here are five essential elements you’ll want to consider when you need to label data for machine learning:

Essential 1: Data Quality and Accuracy - What affects quality and accuracy in data labeling?

While the terms are often used interchangeably, we’ve learned that accuracy and quality are two different things.

  1. Accuracy in data labeling measures how close the labeling is to ground truth, or how well the labeled features in the data are consistent with real-world conditions. This is true whether you’re building computer vision models (e.g., putting bounding boxes around objects on street scenes) or natural language processing (NLP) models (e.g., classifying text for social sentiment).

  2. Quality in data labeling is about accuracy across the overall dataset. Does the work of all of your labelers look the same? Is labeling consistently accurate across your datasets? This is relevant whether you have 29, 89, or 999 data labelers working at the same time.
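
The distinction can be made concrete with a small calculation. The sketch below scores each labeler against a set of ground-truth labels (accuracy) and then looks at the gap between the best and worst labeler as a rough consistency signal (one aspect of quality). The labeler names, the labels, and the choice of range as the spread measure are all hypothetical.

```python
def accuracy(labels, ground_truth):
    """Fraction of items where the labeler's answer matches ground truth."""
    if len(labels) != len(ground_truth):
        raise ValueError("labels and ground_truth must be the same length")
    correct = sum(1 for got, want in zip(labels, ground_truth) if got == want)
    return correct / len(ground_truth)


# Hypothetical example data: four items, three labelers.
ground_truth = ["fish", "music", "music", "fish"]
work = {
    "labeler_a": ["fish", "music", "music", "fish"],
    "labeler_b": ["fish", "music", "fish", "fish"],
    "labeler_c": ["music", "music", "fish", "fish"],
}

per_labeler = {name: accuracy(labels, ground_truth) for name, labels in work.items()}
overall = sum(per_labeler.values()) / len(per_labeler)
spread = max(per_labeler.values()) - min(per_labeler.values())

print(per_labeler)  # accuracy of each labeler
print(overall)      # mean accuracy across the team
print(spread)       # a large spread suggests inconsistent work across labelers
```

A team can have an acceptable mean accuracy while the spread reveals that some labelers need more training or context.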

Low-quality data can actually backfire twice: first during model training and again when your model consumes the labeled data to inform future decisions. To create, validate, and maintain high-performing machine learning models in production, you have to train and validate them using trusted, reliable data.

4 Workforce Traits that Affect Quality in Data Labeling

In our decade of experience providing managed data labeling teams for startup to enterprise companies, we’ve learned four workforce traits affect data labeling quality for machine learning projects: knowledge and context, agility, relationship, and communication.

What affects data quality in labeling?

1. Knowledge and context

In data labeling, basic domain knowledge and contextual understanding are essential for your workforce to create high-quality, structured datasets for machine learning. We’ve learned workers label data with far higher quality when they have context, or know about the setting or relevance of the data they are labeling. For example, people labeling your text data should understand when certain words may be used in multiple ways, depending on the meaning of the text. To tag the word “bass” accurately, they will need to know if the text relates to fish or music. They might need to understand how words may be substituted for others, such as “Kleenex” for “tissue.”

For highest quality data, labelers should know key details about the industry you serve and how their work relates to the problem you are solving. It’s even better when a member of your labeling team has domain knowledge, or a foundational understanding of the industry your data serves, so they can manage the team and train new members on rules related to context, what your business or product does, and edge cases. For example, the vocabulary, format, and style of text related to healthcare can vary significantly from that for the legal industry.

2. Agility

Machine learning is an iterative process. Data labeling evolves as you test and validate your models and learn from their outcomes, so you’ll need to prepare new datasets and enrich existing datasets to improve your algorithm’s results.

Your data labeling team should have the flexibility to incorporate changes that adjust to your end users’ needs, changes in your product, or the addition of new products. A flexible data labeling team can react to changes in data volume, task complexity, and task duration. The more adaptive your labeling team is, the more machine learning projects you can work through.

As you develop algorithms and train your models, data labelers can provide valuable insights about data features - that is, the properties, characteristics, or classifications - that will be analyzed for patterns that help predict the target, or the answer you want your model to predict.

3. Relationship

In machine learning, your workflow changes constantly. You need data labelers who can respond quickly and make changes in your workflow, based on what you’re learning in the model testing and validation phase.

To do that kind of agile work, you need flexibility in your process, people who care about your data and the success of your project, and a direct connection to a leader on your data labeling team so you can iterate data features, attributes, and workflow based on what you’re learning in the testing and validation phases of machine learning.

4. Communication

You’ll need direct communication with your labeling team. A closed feedback loop is an excellent way to establish reliable communication and collaboration between your project team and data labelers. Labelers should be able to share what they’re learning as they label the data, so you can use their insights to adjust your approach.

To learn more about quality and context, check out our Lessons Learned: 3 Essentials for Your NLP Data Workforce.

How is quality measured in data labeling?

There are four ways we measure data labeling quality from a workforce perspective:

  1. Gold standard - There’s a correct answer for the task. Measure quality based on correct and incorrect tasks.
  2. Sample review - Select a random sample of completed tasks. A more experienced worker, such as a team lead or project manager, reviews the sample for accuracy.
  3. Consensus - Assign several people to do the same task, and the correct answer is the one that comes back from the majority of labelers.
  4. Intersection over union (IoU) - This is an overlap measure often used in object detection within images. It combines people and automation to compare the bounding boxes of your hand-labeled, ground truth images with the predicted bounding boxes from your model, dividing the area where two boxes overlap by the total area they cover together.

You will want the freedom to choose from these quality assurance methods instead of being locked into a single model to measure quality. At CloudFactory, we use one or more of these methods on each project to measure the work quality of our own data labeling teams.

To learn more about measuring quality in data labeling, check out Scaling Quality Training Data: The Hidden Costs of the Crowd.
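
Two of the methods above, consensus and IoU, are simple enough to sketch in a few lines. This is a minimal illustration, not the implementation used by any particular service: the function names are made up for the example, boxes are assumed to be axis-aligned tuples of `(x_min, y_min, x_max, y_max)`, and "majority" is assumed to mean more than half of the votes.

```python
from collections import Counter


def consensus_label(votes):
    """Return the majority label, or None when there is no strict majority."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


print(consensus_label(["positive", "positive", "negative"]))  # majority exists
print(consensus_label(["positive", "negative"]))              # tie, no majority
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))                    # half-overlapping boxes
```

An IoU of 1.0 means the two boxes match exactly and 0.0 means they do not overlap at all. The cutoff that counts as "good enough" is a project decision.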

Critical Questions to Ask Your Data Labeling Service About Data Quality

  • How will our team communicate with your data labeling team?
  • Will we work with the same data labelers over time? If workers change, who trains new team members? Describe how you transfer context and domain expertise as team members transition on/off the data labeling team. 
  • Is your data labeling process flexible? How will you manage changes or iterations from our team that impact data features for labeling?
  • What standard do you use to measure quality? How do you share quality metrics with our team? What happens when quality measures aren’t met?

Essential 2: Scale - What happens when my data labeling volume increases?

The second essential for data labeling for machine learning is scale. What you want is elastic capacity to scale your workforce up or down, according to your project and business needs, without compromising data quality.

Data labeling is a time-consuming process, and it’s even more so in machine learning, which requires you to iterate and evolve data features as you train and tune your models to improve data quality and model performance. As the complexity and volume of your data increase, so will your need for labeling. Video annotation is especially labor-intensive: each hour of video data collected takes about 800 human hours to annotate. A 10-minute video contains somewhere between 18,000 and 36,000 frames at 30 to 60 frames per second.
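
The frame counts above follow directly from clip length and frame rate, and the same arithmetic gives a rough effort estimate. The sketch below is a back-of-the-envelope helper. The function names are made up for the example, and it assumes the 800-hours-per-video-hour ratio applies evenly to clips of any length.

```python
def frame_count(minutes, fps):
    """Number of frames in a clip of the given length and frame rate."""
    return int(minutes * 60 * fps)


def annotation_hours(video_hours, human_hours_per_video_hour=800):
    """Estimated labeling effort, using the 800:1 ratio quoted above."""
    return video_hours * human_hours_per_video_hour


print(frame_count(10, 30))        # 18000
print(frame_count(10, 60))        # 36000
print(annotation_hours(10 / 60))  # estimated effort for a 10-minute clip
```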

How do I know when it’s time to scale and hire a data labeling service?

If your most expensive resources, like data scientists or engineers, are spending significant time wrangling data for machine learning or data analysis, you’re ready to consider scaling with a data labeling service. Increases in data labeling volume, whether they happen over weeks or months, will become increasingly difficult to manage in-house.

They also drain the time and focus of some of your most expensive human resources: data scientists and machine learning engineers. If your data scientist is labeling or wrangling data, you’re paying up to $90 an hour. It’s better to free up such a high-value resource for more strategic and analytical work that will extract business value from your data.
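
The hourly figure is roughly what you get by spreading the $190,000 annual salary cited earlier in this guide over a full-time year. The sketch below shows the arithmetic. The 2,080-hour work year (40 hours for 52 weeks) and the 10-hour example are assumptions for illustration, and benefits and overhead are ignored.

```python
WORK_HOURS_PER_YEAR = 40 * 52  # assumption: full-time schedule, 2,080 hours


def hourly_cost(annual_salary, hours_per_year=WORK_HOURS_PER_YEAR):
    """Convert an annual salary to an approximate hourly rate."""
    return annual_salary / hours_per_year


def labeling_cost(annual_salary, hours_spent_labeling):
    """Salary cost of time spent on labeling instead of modeling work."""
    return hourly_cost(annual_salary) * hours_spent_labeling


rate = hourly_cost(190_000)
print(round(rate, 2))                        # 91.35
print(round(labeling_cost(190_000, 10), 2))  # cost of a hypothetical 10 hours of labeling
```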

5 Steps to Scale Data Labeling

1. Design for workforce capacity.

A data labeling service can provide access to a large pool of workers. Crowdsourcing can too, but research by data science tech developer Hivemind found anonymous workers delivered lower quality data than managed teams on identical data labeling tasks. 

Your best bet is working with the same team of labelers, because as their familiarity with your business rules, context, and edge cases increases, data quality improves over time. They also can train new people as they join the team. This is especially helpful with data labeling for machine learning projects, where quality and flexibility to iterate are essential.

2. Look for elasticity

Look for elasticity to scale labeling up or down. You may have to label data in real time, based on the volume of incoming data generated. Perhaps your business has seasonal spikes in purchase volume over certain weeks of the year, as some companies do in advance of gift-giving holidays. We have also found that product launches can generate spikes in data labeling volume. You will want a workforce that can adjust scale based on your needs. 

CloudFactory took on a huge project to assist a client with a product launch in early 2019. Completing the related data labeling tasks required 1,200 hours over 5 weeks. We completed that intense burst of work and continue to label incoming data for that product. Unfettered by data labeling burdens, our client has time to innovate post-processing workflows.
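
To get a feel for what a burst like that means in headcount, you can convert total hours into full-time-equivalent labelers. This is a rough planning sketch: the function name is made up, and the 40-hour week per labeler is an assumption, not a figure from the project described above.

```python
import math


def labelers_needed(total_hours, weeks, hours_per_labeler_per_week=40):
    """Full-time-equivalent labelers required to finish within the given window."""
    weekly_hours = total_hours / weeks
    return math.ceil(weekly_hours / hours_per_labeler_per_week)


print(labelers_needed(1200, 5))  # 6
print(labelers_needed(1200, 2))  # same workload compressed into 2 weeks
```

Shortening the window raises the headcount quickly, which is why elastic capacity matters for launches and seasonal spikes.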

3. Choose smart tooling.

Whether you buy it or build it yourself, the data enrichment tool you choose will significantly influence your ability to scale data labeling. Keep in mind, it’s a progressive process: your data labeling tasks today may look different in a few months, so you will want to avoid decisions that lock you into a single direction that may not fit your needs in the near future.

Whether you’re growing or operating at scale, you’ll need a tool that gives you the flexibility to make changes to your data features, labeling process, and data labeling service. Commercially available tools give you more control over workflow, features, security, and integration than tools built in-house. They also give you the flexibility to make changes.

4. Measure worker productivity.

Productivity can be measured in a variety of ways, but in our experience three measures in particular provide a helpful view into worker productivity: 1) the volume of completed work, 2) the quality of the work (accuracy plus consistency), and 3) worker engagement.

On the worker side, strong processes lead to greater productivity. Combining technology, workers, and coaching shortens labeling time, increases throughput, and minimizes downtime. We have found data quality is higher when we place data labelers in small teams, train them on your tasks and business rules, and show them what quality work looks like.

Team leaders encourage collaboration, peer learning, support, and community building. Workers’ skills and strengths are known and valued by their team leads, who provide opportunities for workers to grow professionally. We've found that this small-team approach, combined with a smart tooling environment, results in high-quality data labeling.

5. Streamline communication between your project and data labeling teams.

Organized, accessible communication with your data labeling team makes it easier to scale the process. Based on our experience, we recommend a tightly closed feedback loop for communication with your labeling team so you can make impactful changes fast, such as changing your labeling workflow or iterating data features.

When data labeling directly powers your product features or customer experience, labelers’ response time needs to be fast, and communication is key. Data labeling service providers should be able to work across time zones and optimize your communication for the time zone that affects the end user of your machine learning project.

To learn more about scale, download our Scaling Quality Training Data report.

Critical Questions to Ask Your Data Labeling Service About Scale

  • Describe the scalability of your workforce. How many workers can we access at any one time? Can we scale data labeling volume up or down, based on our needs? How often can we do that?
  • How do you measure worker productivity? How long does it take a team of your data labelers to reach full throughput? Is task throughput impacted as your data labeling team scales? Do increases in throughput as the team scales impact data quality?
  • How do you handle iterations in our data labeling features and operations as we scale?
  • Tell us about the client support we can expect once we engage with your team. How often will we meet? How much time should my team plan to spend managing the project?

Essential 3: Pricing - Should I pay by the hour or by task?

The third essential for data labeling for machine learning is pricing. The model a data labeling service uses to calculate pricing can have implications for your overall cost and for your data quality.

What does a data labeling service cost?

Typically, data labeling services charge by the task or by the hour, and the model you choose can create different incentives for labelers. If you pay data labelers per task, it could incentivize them to rush through as many tasks as they can, resulting in poor quality data that will delay deployments and waste crucial time.

By contrast, managed workers are paid for their time, and are incentivized to get tasks right, especially tasks that are more complex and require higher-level subjectivity. This difference has important implications for data quality, and in the next section we’ll present evidence from a recent study that highlights some key differences between the two models.

A Study on Data Labeling Quality and Cost

Data science tech developer Hivemind conducted a study on data labeling quality and cost. They enlisted a managed workforce, paid by the hour, and a leading crowdsourcing platform’s anonymous workers, paid by the task, to complete a series of identical tasks. Hivemind’s goal for the study was to understand these dynamics in greater detail - to see which team delivered the highest-quality data and at what relative cost.

Same Tasks, Two Data Labeling Workforces

Tasks were text-based and ranged from basic to more complicated. Hivemind sent tasks to the crowdsourced workforce at two different rates of compensation, with one group receiving more, to determine how cost might affect data quality.

Task A: Easy Transcription

Crowdsourced workers transcribed at least one of the numbers incorrectly in 7% of cases. When they were paid double, the error rate fell to just under 5%, which is a significant improvement. The managed workers only made a mistake in 0.4% of cases, an important difference given its implication for data quality. Overall, on this task, the crowdsourced workers had an error rate of more than 10x the managed workforce.
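
The "more than 10x" comparison can be checked directly from the error rates quoted above. The sketch below does that division. It treats "just under 5%" as 5%, which is a rounding assumption, and the variable names are made up for the example.

```python
def error_ratio(error_rate, baseline_error_rate):
    """How many times larger one error rate is than a baseline."""
    return error_rate / baseline_error_rate


managed = 0.004          # 0.4% of cases
crowd_standard = 0.07    # 7% of cases
crowd_double_pay = 0.05  # "just under 5%", rounded up here

print(round(error_ratio(crowd_standard, managed), 1))    # 17.5
print(round(error_ratio(crowd_double_pay, managed), 1))  # 12.5
```

At either pay rate, the crowdsourced error rate stays above ten times the managed team's rate.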


Task B: Sentiment Analysis

Workers received text of a company review from a review website and were to rate the sentiment of the review from one to five. Actual ratings, or ground truth, were removed. Managed workers had consistent accuracy, getting the rating correct in about 50% of cases. Crowdsourced workers had a problem, particularly with poor reviews. For 1- and 2-star reviews, their accuracy was almost 20%, essentially the same as guessing among five possible ratings. For 4- and 5-star reviews, there was little difference between the workforce types.

Task C: Extracting Information from Unstructured Text

Workers used a title and description of a product recall to classify the recall by hazard type, choosing one of 11 options, including “other” and “not enough information provided.” The crowdsourced workers’ accuracy was 50% to 60%, regardless of word count. Managed workers achieved higher accuracy, 75% to 85%, about 25 percentage points higher than that of the crowdsourced team.

To learn more about data labeling quality and cost, download the study results: Crowd vs. Managed Team: A Study on Quality Data Processing at Scale.

Data Labeling Pricing: 3 Critical Considerations

Look for a data labeling service with realistic, flexible terms and conditions. Specifically, you’re looking for:

  1. Predictable cost structure, so you know what data labeling will cost as you scale and throughput increases
  2. Pricing that fits your purpose, where you pay only for what you need to get high-quality datasets
  3. Flexibility to make changes as your data features and labeling requirements change. Avoid contracts that lock you into several months of service, platform fees, or other restrictive terms.

To learn more about pricing for your data labeling workforce, check out The 3 Hidden Costs of Crowdsourcing for Data Labeling.

Critical Questions to Ask Your Data Labeling Service About Pricing

  • Will we pay by the hour or per task? Why did you structure your pricing model that way? Will our work become more cost-effective as we scale (increase volume or throughput)?
  • Are we required to sign a multi-month contract for data labeling services?
  • What is the cost of your solution compared to our doing the work in-house?
  • Do you incentivize workers to label data with high quality or greater volume? How?

Essential 4: Security - How will my data be protected?

The fourth essential for data labeling for machine learning is security. A data labeling service should comply with regulatory or other requirements, based on the level of security your data requires.

What are the security risks of outsourcing data labeling?

Your data labeling service can compromise security when their workers:

  1. Access your data from an insecure network or using a device without malware protection
  2. Download or save some of your data (e.g., screen captures, flash drive)
  3. Label your data as they sit in a public place
  4. Don’t have training, context, or accountability related to security rules for your work
  5. Work in a physical or digital environment that is not certified to comply with data regulations your business must observe (e.g., HIPAA, SOC 2).

Security and Your Data Labeling Workforce

If data security is a factor in your machine learning process, your data labeling service must have a facility where the work can be done securely, with the right training, policies, and processes in place. They should also have the certifications to show their process has been reviewed.

Most importantly, your data labeling service must respect data the way you and your organization do. They also should have a documented data security approach in all three of these areas:

  • People and Workforce: This could include background checks for workers and may require labelers to sign a non-disclosure agreement (NDA) or similar document outlining your data security requirements. The workforce could be managed or measured for compliance. It may include worker training on security protocols related to your data.
  • Technology and Network: Workers may be required to turn in devices they bring into the workplace, such as a mobile phone or tablet. Download or storage features may be disabled on devices workers use to label data. There’s likely to be significantly enhanced network security.
  • Facilities and Workspace: Workers may sit in a space that blocks others from viewing their work. They may work in a secure location, with badged access that allows only authorized personnel to enter the building or room where data is being labeled. Video monitoring may be used to enhance physical security for the building and the room where work is done.

Security concerns shouldn’t stop you from using a data labeling service that will free up you and your team to focus on the most innovative and strategic part of machine learning: model training, tuning, and algorithm development.

To learn more about CloudFactory’s secure data labeling service, download our Security Datasheet.

Critical Questions to Ask Your Data Labeling Service About Security

  • Will you use my labeled datasets to create or augment datasets and make them available to third parties?
  • Do you have secure facilities? How do you screen and approve workers to work in those facilities? What kind of data security training do you provide to workers? What happens when new people join the team?
  • What measures will you take to secure the facilities where our work is done? Do you use video monitoring for projects that require higher levels of security?
  • How do you protect data that’s subject to regulatory requirements, such as HIPAA or GDPR? What about personally identifiable information (PII)?

Essential 5: Tools - Do I need a tooling platform for data labeling?

The fifth essential for data labeling in machine learning is tooling, which you will need whether you choose to build it yourself or to buy it from a third party. Why? Because labeling production-grade training data for machine learning requires smart software tools and skilled humans in the loop. A data labeling service should be able to provide recommendations and best practices in choosing and working with data labeling tools. Ideally, they will have partnerships with a wide variety of tooling providers to give you choices and to make your experience virtually seamless. They will also provide the expertise needed to assign people tasks that require context, creativity, and adaptability while giving machines the tasks that require speed, measurement, and consistency.

Task Progression

Tasking people and machines with assignments is easier to do with user-friendly tools that break down data labeling work into atomic, or smaller, tasks. By transforming complex tasks into a series of atomic components, you can assign machines the tasks that tools already perform with high quality and involve people in the tasks that today’s tools haven’t mastered.

Breaking work into atomic components also makes it easier to measure, quantify, and maximize quality for each task. Each kind of task may have its own quality assurance (QA) layer, and that process can be broken into atomic tasks as well.
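
As a rough illustration of that idea, the sketch below models a complex job as a list of atomic tasks, each with its own assignee and QA layer. Everything in it is hypothetical: the receipt example, the task names, the `AtomicTask` type, and the split between machine and human work are invented for the example and would differ for your data.

```python
from dataclasses import dataclass


@dataclass
class AtomicTask:
    name: str
    assignee: str   # "machine" or "human"
    qa_method: str  # e.g. "gold standard", "sample review", "consensus"


def decompose_receipt_task():
    """Hypothetical breakdown of 'extract data from a receipt image'."""
    return [
        AtomicTask("detect text regions", "machine", "sample review"),
        AtomicTask("transcribe each region", "machine", "gold standard"),
        AtomicTask("resolve unreadable text", "human", "consensus"),
        AtomicTask("categorize the purchase", "human", "sample review"),
    ]


def by_assignee(tasks):
    """Group task names by who performs them, preserving task order."""
    groups = {}
    for task in tasks:
        groups.setdefault(task.assignee, []).append(task.name)
    return groups


pipeline = decompose_receipt_task()
for who, names in by_assignee(pipeline).items():
    print(who, names)
```

Because each step is small and has its own QA method, quality can be measured per step rather than only for the finished output.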

Task Progression

Every machine learning modeling task is different, so you may move through several iterations simply to come up with good test definitions and a set of instructions, even before you start collecting your data. If you can efficiently transform domain knowledge about your model into labeled data, you've solved one of the hardest problems in machine learning.

After a decade of providing teams for data labeling, we know it’s a progressive process. The labeling tasks you start with are likely to be different in a few months. Along the way, you and your data labeling team can adapt your process to label for high quality and model performance.

Choosing a Data Labeling Tool: 5 Steps

We’ve learned these five steps are essential in choosing your data labeling tool to maximize data quality and optimize your workforce investment:

1. Narrow tooling based on your use case.

Your data type will determine the tools available to use. Tools vary in data enrichment features, quality (QA) capabilities, supported file types, data security certifications, storage options, and much more. Features for labeling may include bounding boxes, polygon, 2-D and 3-D point, semantic segmentation, and more.

2. Compare the benefits of build vs. buy.

Building your own tool can offer valuable benefits, including more control over the labeling process, software changes, and data security. You also can more easily address and mitigate unintended bias in your labeling. However, buying a commercially available tool is often less costly in the long run because your team can focus on their core mission rather than supporting and extending software capabilities, freeing up valuable capital for other aspects of your machine learning project. When you buy, you can configure the tool for the features you need, and user support is provided.

There is more than one commercially available tool for any data labeling workload, and teams are developing new tools and advanced features all the time. When you buy, you’re essentially leasing access to the tools, which means:

  1. There are funded entities that are vested in the success of that tool;
  2. You have the flexibility to use more than one tool, based on your needs; and
  3. Your tool provider supports the product, so you don’t have to spend valuable engineering resources on tooling.

3. Consider your organization’s size and growth stage.

We’ve found company stage to be an important factor in choosing your tool.

  1. Getting started: There are several ways to get started on the path to choosing the right tool. This is where the critical question of build or buy comes into play. You’ll want to assess the commercially available options, including open source, and determine the right balance of features and cost to get your process started. While some crowdsourcing vendors offer tooling platforms, they often fall behind in the feature maturity curve as compared to commercial providers who are focused purely on best-in-class data labeling tools as their core capability. Also, keep in mind that crowdsourced data labelers will be anonymous, so context and quality are likely to be pain points.
  2. Scaling the process: If you are in the growth stage, commercially viable tools are likely your best choice. You can lightly customize, configure, and deploy features with little to no development resources. If you prefer, open source tools can give you more control over security, integration, and flexibility to make changes. Remember, building a tool is a big commitment: you’ll invest in maintaining that platform over time, and that can be costly.
  3. Sustaining scale: If you are operating at scale and want to sustain that growth over time, you can get commercially viable tools that are fully customized and require few development resources. If you go the open source route, be sure to create long-term processes and stack integrations that let you take full advantage of the security or agility benefits you are after.

4. Don’t let your workforce choice lock you into a tool.

For the most flexibility and control over your process, don’t tie your workforce to your tool. Your workforce choice can make or break data quality, which is at the heart of your model’s performance, so it’s important to keep your tooling options open. The best data labeling teams can adopt any tool quickly and help you adapt it to better meet your labeling needs. 

5. Factor in your data quality requirements.

Quality assurance features are built into some tools, and you can use them to automate a portion of your QA process. However, these QA features will likely be insufficient on their own, so look to managed workforce providers that supply trained workers with extensive labeling experience, which leads to higher quality training data.
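To make the idea of automating a portion of QA more concrete, here is a minimal sketch of one common check: collecting labels from several workers for the same item and flagging items where they disagree. The function name, the sample item IDs, and the 0.6 agreement threshold are all illustrative assumptions, not features of any particular tool.

```python
from collections import Counter


def consensus(labels_by_item, min_agreement=0.6):
    """Split items into auto-accepted labels and items needing human review.

    labels_by_item maps an item ID to the list of labels that different
    workers assigned to it. The threshold is a hypothetical default.
    """
    accepted = {}
    needs_review = []
    for item_id, labels in labels_by_item.items():
        if not labels:
            # Nothing to vote on, so send the item to a reviewer.
            needs_review.append(item_id)
            continue
        top_label, votes = Counter(labels).most_common(1)[0]
        if votes / len(labels) >= min_agreement:
            accepted[item_id] = top_label
        else:
            needs_review.append(item_id)
    return accepted, needs_review


# Hypothetical sample: three workers labeled each image.
sample = {
    "img_001": ["car", "car", "car"],
    "img_002": ["car", "truck", "bus"],
    "img_003": ["truck", "truck", "car"],
}
accepted, needs_review = consensus(sample)
```

A check like this only catches disagreement. It cannot tell you whether workers who agree are all following your guidelines correctly, which is one reason tool-based QA tends to be insufficient on its own.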

Beware of contract lock-in: Some data labeling service providers require you to sign a multi-year contract for their workforce or their tools. If your data labeling service provider isn’t meeting your quality requirements, you will want the flexibility to test or select another provider without penalty. This is yet another reason a smart tooling strategy is so critical as you scale your data labeling process.

Critical Questions to Ask Your Data Labeling Service About Tools

  • Do you provide a data labeling tool? Can I access your workforce without using the tool?
  • What labeling tools, use cases, and data features does your team have experience with?
  • How would you handle data labeling tool changes as our data enrichment needs change? Would that have an adverse impact on your data labeling team?
  • Describe how you handle quality assurance and how you layer it into the data labeling task progression. How involved in QA will my team need to be?

To learn more about choosing or building your data labeling tool, read 5 Strategic Steps for Choosing Your Data Labeling Tool.

Next Steps

Now that we’ve covered the essential elements of data labeling for machine learning, you should know more about the technology available, best practices, and questions you should ask your prospective data labeling service provider. Here’s a quick recap of what we’ve covered, with reminders about what to look for when you’re hiring a data labeling service.

1) Data quality and accuracy: The quality of your data determines model performance. Consider how important quality is for your tasks today and how that could evolve over time. Revisit the four workforce traits that affect data labeling quality for machine learning projects: knowledge and context, agility, relationship, and communication. Think about how you should measure quality, and be sure you can communicate with data labelers so your team can quickly incorporate changes or iterations to data features being labeled.

Keep in mind, teams that are vetted, trained, and actively managed deliver higher skill levels, engagement, accountability, and quality. When you choose a managed team, the more they work with your data, the more context they establish and the better they understand your model. This continuity leads to more productive workflows and higher quality training data.

2) Scale: Design your workforce model for elasticity, so you can scale the work up or down according to your project and business needs without compromising data quality. In general, you will want to assign people tasks that require domain subjectivity, context, and adaptability. Give machines tasks that are better done with repetition, measurement, and consistency.

If you use a data labeling service, find out how many workers you can access at a time and how the service measures worker productivity. Make sure your workforce provider can provide the agility you need to iterate your process and data features as you learn more about your model’s performance. Be sure to ask about client support and how much time your team will have to spend managing the project.

3) Pricing: The model your data labeling service uses to calculate pricing can have implications for your overall cost and data quality. Consider whether you want to pay for data labeling by the hour or by the task, and whether it’s more cost effective to do the work in-house.

Look for pricing that fits your purpose and provides a predictable cost structure. Find out if the work becomes more cost-effective as you increase data labeling volume. Be sure to ask your data labeling service if they incentivize workers to label data with high quality or greater volume, and how they do it.
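As a rough illustration of the hourly versus per-task comparison, the sketch below computes both costs for the same job and the throughput at which they break even. Every number and function name here is a hypothetical assumption for the sake of the arithmetic, not a quoted price from any provider.

```python
def hourly_cost(num_tasks, tasks_per_hour, hourly_rate):
    """Total cost when paying for labeler time."""
    hours = num_tasks / tasks_per_hour
    return hours * hourly_rate


def per_task_cost(num_tasks, price_per_task):
    """Total cost when paying a fixed price for each labeled item."""
    return num_tasks * price_per_task


def break_even_throughput(hourly_rate, price_per_task):
    """Tasks per hour at which both pricing models cost the same."""
    return hourly_rate / price_per_task


# Hypothetical job: 10,000 images, 50 images labeled per hour,
# 8.00 per hour versus 0.20 per image.
by_hour = hourly_cost(10_000, tasks_per_hour=50, hourly_rate=8.00)
by_task = per_task_cost(10_000, price_per_task=0.20)
threshold = break_even_throughput(hourly_rate=8.00, price_per_task=0.20)
```

Under these assumed figures, hourly pricing costs less whenever workers label more than `threshold` items per hour. The calculation deliberately ignores rework: if one pricing model leads to lower-quality labels, the cost of fixing them belongs in the comparison too.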

4) Security: A data labeling service should comply with regulatory or other requirements, based on the level of security your data requires. If you use a data labeling service, they should have a documented data security approach for their workforce, technology, network, and workspaces.

Be sure to find out if your data labeling service will use your labeled data to create or augment datasets they make available to third parties. Dig in and find out how they secure their facilities and screen workers. Through the process, you’ll learn if they respect data the way your company does.

5) Tools: Choosing your data labeling tool is an important strategic decision that will have a profound impact on your labeling process and data quality. Will you build or buy your data labeling tool? If you outsource your data labeling, look for a service that can provide best practices in choosing and working with data labeling tools. It’s even better if they have partnerships with tooling providers and can make recommendations based on your use case.

Are you ready to talk about your data labeling operation?

Contact

Frequently Asked Questions

When creating training datasets for natural language based applications, it is especially important to evaluate labeler experience level, language proficiency, and quality assurance processes of different data labeling solutions. CloudFactory’s workers combine business context with their task experience to accurately parse and tag text according to clients’ unique specifications.

Crowdsourcing solutions, like CrowdFlower, can be a good option for simple tasks that have a low likelihood of error, but if you want high-quality data outputs for tasks that require any level of training or experience, you will need a vetted, managed workforce. CloudFactory provides an extension to your team that gets your data work right the first time, delivering the highest-quality data work that impacts your most important business goals.

Crowdsourcing is just one way to get your data labeled, but it is often not the best solution for tasks that require any level of training or experience, due to inefficient processes, lack of management, and the risk of inexperienced labelers. Alternatively, CloudFactory provides a team of vetted and managed data labelers that can deliver the highest-quality data work to support your key business goals.

The best outcomes will come from working with a partner that can provide a vetted and managed workforce to help you complete your data entry tasks. CloudFactory provides flexible workforce solutions to accurately process high-volume, routine tasks and training datasets that power core business and bring AI to life through computer vision, NLP, and predictive analytics applications.

Labeling images to train machine learning models is a critical step in supervised learning. You can use different approaches, but the people who label the data must be extremely attentive and knowledgeable about specific business rules, because each mistake or inaccuracy will negatively affect dataset quality and the overall performance of your predictive model. To achieve a high level of accuracy without distracting internal team members from more important tasks, you should leverage a trusted partner that can provide vetted and experienced data labelers trained on your specific business requirements and invested in your desired outcomes.

The training dataset you use for your machine learning model will directly impact the quality of your predictive model, so it is extremely important that you use a dataset applicable to your AI initiative and labeled with your specific business requirements in mind. While you could leverage one of the many open source datasets available, your results will be biased towards the requirements used to label that data and the quality of the people labeling it. To get the best results, you should gather a dataset aligned with your business needs and work with a trusted partner that can provide a vetted and scalable team trained on your specific business requirements.

Data labeling requires a collection of data points, such as images, text, or audio, and a qualified team of people to tag or label each of the input points with meaningful information that will be used to train a machine learning model. There are different techniques for labeling data, and the one you use depends on the specific business application. Examples include bounding box, semantic segmentation, redaction, polygonal, keypoint, cuboidal, and more. Engaging with an experienced data labeling partner can ensure that your dataset is being labeled properly based on your requirements and industry best practices.
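To show what one of these techniques produces, here is a minimal sketch of a bounding box label as a data record, along with intersection over union (IoU), a standard measure of how closely two boxes overlap. The class and field names are illustrative assumptions; real labeling tools each define their own export formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """One rectangular label on an image, in pixel coordinates."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def area(self) -> float:
        width = max(0.0, self.x_max - self.x_min)
        height = max(0.0, self.y_max - self.y_min)
        return width * height


def iou(a: BoundingBox, b: BoundingBox) -> float:
    """Intersection over union: 1.0 for identical boxes, 0.0 for no overlap."""
    inter_w = min(a.x_max, b.x_max) - max(a.x_min, b.x_min)
    inter_h = min(a.y_max, b.y_max) - max(a.y_min, b.y_min)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    intersection = inter_w * inter_h
    union = a.area() + b.area() - intersection
    return intersection / union


# Two hypothetical labelers boxed the same car slightly differently.
box_a = BoundingBox("car", 0, 0, 10, 10)
box_b = BoundingBox("car", 5, 5, 15, 15)
overlap = iou(box_a, box_b)
```

Comparing two labelers' boxes for the same object this way is one simple method of putting a number on labeling consistency.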

Autonomous driving systems require massive amounts of high-quality labeled image, video, 3-D point cloud, and/or sensor fusion data. Companies developing these systems compete in the marketplace based on the proprietary algorithms that operate the systems, so they collect their own data using dashboard cameras and lidar sensors. Depending on the system they are designing and the location where it will be used, they may gather data on multiple street scene types, in one or more cities, across different weather conditions and times of day.

Teams of hundreds, sometimes thousands, of people use advanced software to transform the raw data into video sequences and break them down for labeling, sometimes frame by frame. Then, they label data features as prescribed by the business rules set by the project team designing the autonomous driving system. That data is used to train the system how to drive. Quality training data is crucial in designing high-performing autonomous vehicle systems, so many of the companies that develop these systems work with one or more data labeling services and have particularly high standards for measuring and maintaining data quality.

There are many image annotation tools on the market. Some examples are: Labelbox, Dataloop, Deepen, Foresight, Supervisely, OnePanel, Annotell, Superb.ai, and Graphotate.

Many tools could help develop excellent object detection. Quality object detection is dependent on optimal model performance within a well-designed software/hardware system. High-quality models need high-quality training data, which requires people (workforce), process (the annotation guidelines and workflow), and technology (labeling tool). Therefore, the image labeling tool is merely a means to an end.

The ingredients for high quality training data are people (workforce), process (annotation guidelines and workflow, quality control) and technology (input data, labeling tool). An easy way to get images labeled is to partner with a managed workforce provider that can provide a vetted team that is trained to work in your tool and within your annotation parameters.

You can use automated image tagging via API (such as Clarifai) or manual tagging via crowdsourcing or managed workforce solutions. API tagging maximizes response speed but is not tailored to each dataset or use case, which reduces overall dataset quality. It is possible to get usable results from crowdsourcing in some instances, but a managed workforce solution will provide the highest quality tagging outcomes and allow for the greatest customization and adaptation over time.