What’s data annotation?
In machine learning, data annotation is the process of labeling data to show the outcome you want your machine learning model to predict. You are marking - labeling, tagging, transcribing, or processing - a dataset with the features you want your machine learning system to learn to recognize. Once your model is deployed, you want it to recognize those features on its own and make a decision or take some action as a result.
Annotated data reveals features that will train your algorithms to identify the same features in data that has not been annotated. Data annotation is used in supervised learning and hybrid, or semi-supervised, machine learning models that involve supervised learning.
What’s a data annotation tool?
A data annotation tool is a cloud-based, on-premise, or containerized software solution that can be used to annotate production-grade training data for machine learning. While some organizations take a do-it-yourself approach and build their own tools, there are many data annotation tools available via open source or freeware.
They are also offered commercially, for lease and purchase. Data annotation tools are generally designed to be used with specific types of data, such as image, video, text, audio, spreadsheet, or sensor data. They also offer different deployment models, including on-premise, container, SaaS (cloud), and Kubernetes.
6 Important Data Annotation Tool Features
1) Dataset management
Annotation begins and ends with a comprehensive way of managing the dataset you plan to annotate. As a critical part of your workflow, you need to ensure that the tool you are considering will actually import and support the high volume of data and file types you need to label. This includes searching, filtering, sorting, cloning, and merging of datasets.
Different tools can save the output of annotations in different ways, so you’ll need to make sure the tool will meet your team’s output requirements. Finally, your annotated data must be stored somewhere. Most tools will support local and network storage, but cloud storage - especially your preferred cloud vendor - can be hit or miss, so confirm support-file storage targets.
2) Annotation methods
This is obviously the core feature of data annotation tools - the methods and capabilities to apply labels to your data. But not all tools are created equal in this regard. Many tools are narrowly optimized to focus on specific types of labeling, while others offer a broad mix of tools to enable various types of use cases.
Nearly all offer some type of data or document classification to guide how you identify and sort your data. Depending on your current and anticipated future needs, you may wish to focus on specialists or go with a more general platform. The common types of annotation capabilities provided by data annotation tools include building and managing ontologies or guidelines, such as label maps, classes, attributes, and specific annotation types.
Here are just a few examples:
- Image or video: Bounding boxes, polygons, polylines, classification, 2-D and 3-D points, or segmentation (semantic or instance), tracking, transcription, interpolation, or transcription.
- Text: Transcription, sentiment analysis, net entity relationships (NER), parts of speech (POS), dependency resolution, or coreference resolution.
- Audio: Audio labeling, audio to text, tagging, time labeling
An emerging feature in many data annotation tools is automation, or auto-labeling. Using AI, many tools will assist your human labelers to improve their annotations (e.g. automatically convert a four-point bounding box to a polygon), or even automatically annotate your data without a human touch. Additionally, some tools can learn from the actions taken by your human annotators, to improve auto-labeling accuracy.
Some annotation tasks are ripe for automation. For example, if you use pre-annotation to tag images, a team of data labelers can determine whether to resize or delete a bounding box. This can shave time off the process for a team that needs images annotated at pixel-level segmentation. Still, there will always be exceptions, edge cases, and errors with automated annotations, so it is critical to include a human-in-the-loop approach for both quality control and exception handling.
Automation also can refer to the availability of developer interfaces to run the automations. That is, an application programming interface (API) and software development kit (SDK) that allow access to and interaction with the data.
3) Data quality control
The performance of your machine learning and AI models will only be as good as your data. Data annotation tools can help manage the quality control (QC) and verification process. Ideally, the tool will have embedded QC within the annotation process itself.
For example, real-time feedback and initiating issue tracking during annotation is important. Additionally, workflow processes such as labeling consensus, may be supported. Many tools will provide a quality dashboard to help managers view and track quality issues, and assign QC tasks back out to the core annotation team or to a specialized QC team.
4) Workforce management
Every data annotation tool is meant to be used by a human workforce - even those tools that may lead with an AI-based automation feature. You still need humans to handle exceptions and quality assurance as noted before. As such, leading tools will offer workforce management capabilities, such as task assignment and productivity analytics measuring time spent on each task or sub-task.
Your data labeling workforce provider may bring their own technology to analyze data that is associated with quality work. They may use technology, such as webcams, screenshots, inactivity timers, and clickstream data to identify how they can support workers in delivering quality data annotation.
Most importantly, your workforce must be able to work with and learn the tool you plan to use. Further, your workforce provider should be able to monitor worker performance and work quality and accuracy. It’s even better when they offer you direct visibility, such as a dashboard view, into the productivity of your outsourced workforce and the quality of the work performed.
5) Security
Whether annotating sensitive protected personal information (PPI) or your own valuable intellectual property (IP), you want to make sure that your data remains secure. Tools should limit an annotator’s viewing rights to data not assigned to her, and prevent data downloads. Depending on how the tool is deployed, via cloud or on-premise, a data annotation tool may offer secure file access (e.g., VPN).
For use cases that fall under regulatory compliance requirements, many tools will also log a record of annotation details, such as date, time, and the annotation author. However, if you are subject to HIPAA, SOC 1, SOC 2, PCI DSS, or SSAE 16 regulations, it is important to carefully evaluate whether your data annotation tool partner can help you maintain compliance.
6) Integrated labeling services
As mentioned earlier, every tool requires a human workforce to annotate data, and the people and technology elements of data annotation are equally important. As such, many data annotation tool providers offer a workforce network to provide annotation as a service. The tool provider either recruits the workers or provides access to them via partnerships with workforce providers.
While this feature makes for convenience, any workforce skill and capability should be evaluated separately from the tool capability itself. The key here is that any data annotation tool should offer the flexibility to use the tool vendor’s workforce or the workforce of your choice, such as a group of employees or a skilled, professionally managed data annotation team.
Download the PDF version here