Data Labeling, the key to useful and efficient AI

May 24, 2024

José Luis Marina

You have data and you have AI models to train, but without labels you have nothing.

For example, you may have files with customer requests to your support service, but you don’t have:

Labels for the topics discussed in the calls to know which department to direct them to.
Labels for customer emotions to know if they are happy or angry.
Labels for whether the problem was resolved or not.

An AI that truly helps the business is often not viable due to the time and cost required to manually label data, whether by internal experts or through crowdsourcing. The fastest way to solve this problem is by creating data labeling functions that can capture expert knowledge and apply it automatically at scale.

However complex AI and Machine Learning algorithms may be, the real protagonist is high-quality data labeling.

When our clients ask us “what AI do I need”, we first work on the data and only then decide which AI or algorithm is most suitable.

What data labeling consists of

It is the process by which we identify and classify sample data so that ML algorithms can learn from it. In other words, we give AI models context so they can learn and make predictions about data they haven’t seen before.

Depending on the type of data we work with, we can be talking about:

Text labeling: Text classification, entity extraction, sentiment analysis.
For example, if we are working with emails, we can label them according to topic, urgency, tone, etc.
For example, identifying whether a news headline manipulates us or not.
Image labeling: Image classification, object detection, image segmentation.
For example, if we are working with satellite images, we can label them according to terrain type, presence of vegetation, presence of water.
For example, given an image of a construction project, identify whether deadlines are being met according to its progress percentage.
Audio labeling: Audio transcription, sentiment analysis, audio classification.
For example, if we are working with customer call recordings, we can label them according to tone of voice, duration, reason for the call.
Video labeling: Video classification, object detection, video segmentation.
For example, if we are working with security camera videos, we can label them according to the presence of people, presence of vehicles, presence of suspicious objects.
For example, labeling people appearing in a video, and the same with objects, texts, brands.
Time series labeling: Time series prediction, time series classification.
For example, if we are working with sales time series, we can label them according to trend, seasonality, presence of outliers.
For example, if we have movement data from a wristband on Parkinson’s patients, we can label them according to the intensity of their movements, duration of episodes, frequency of episodes, relationship with medication intake.

If you reread the list above and think about doing it manually on thousands or millions of data points, you’ll see it’s a gigantic and tedious job.

At taniwa we have done at least one project in each of the cases mentioned, and in all of them the basis of success has been data labeling.

Data labeling process

Manual: Performed by human experts who review and classify data one by one.

Pros: High precision / You detect and classify “rare” cases well
Cons: Slow / Expensive / Not scalable / Finding experts can be complicated

Semi-automatic: Performed by human experts who review and correct labels generated by AI algorithms.

Pros: Faster / Cheaper / More scalable / Humans can correct algorithm errors.
Cons: Less precise / Humans can still be a scalability problem.

Automatic: Performed by AI algorithms that label data automatically.

Pros: Fast / Cheap / Scalable / You can label millions of data points in a short time.
Cons: Less precise / Algorithms may not be able to label “rare” cases well.

Labeling type	Precision	Speed	Cost	Scalability	Rare cases
Manual	High	Slow	High	Low	OK
Semi-automatic	Medium	Fast	Medium	Medium	OK
Automatic	Low	Fast	Low	High	X
What we want	High	Fast	Low	High	OK

Our approach to data labeling

Normally, at taniwa, we work with a semi-automatic approach to data labeling, and we use what is called “weak supervision” to train the AI algorithms that automatically label data.

Of course, we are talking about massive data labeling problems, where a good approach can cut labeling time from months to days.

The general steps are:

Data study: Understand the data and labeling needs from the business perspective.
Define labels: What we want to label and how we are going to do it.
Create labeling functions: Functions that automatically label data.
Train labeling functions: Use manually labeled data to train labeling functions.
Label data: Automatically label data with labeling functions.
Review data: Review the automatically labeled data, correct errors, and return to step 3 to refine the labeling functions.
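As a rough illustration of steps 5 and 6, the sketch below combines the votes of several labeling functions by simple majority and sends anything undecided to human review. It is not taniwa's actual pipeline: the function names, the "billing" and "technical" labels, and the keywords are hypothetical, and majority voting is only one of several ways to combine weak labels.

```python
from collections import Counter

ABSTAIN = None  # a labeling function returns this when it has no opinion

# Hypothetical labeling functions for routing support requests.
def lf_refund(text):
    return "billing" if "refund" in text.lower() else ABSTAIN

def lf_invoice(text):
    return "billing" if "invoice" in text.lower() else ABSTAIN

def lf_password(text):
    return "technical" if "password" in text.lower() else ABSTAIN

LABELING_FUNCTIONS = [lf_refund, lf_invoice, lf_password]

def aggregate(text, lfs=LABELING_FUNCTIONS):
    """Majority vote over the functions that did not abstain.

    Returns None when nobody voted or the top two labels are tied,
    which means the item should go to human review.
    """
    votes = [v for v in (lf(text) for lf in lfs) if v is not ABSTAIN]
    if not votes:
        return None
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]

def label_corpus(texts):
    """Split texts into automatically labeled items and a review queue."""
    labeled, review = [], []
    for text in texts:
        label = aggregate(text)
        if label is None:
            review.append(text)
        else:
            labeled.append((text, label))
    return labeled, review
```

The items that end up in the review queue are the ones worth an expert's time, and what the expert learns from them feeds the next version of the labeling functions.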

Labeling functions can be as simple as a rule that labels emails according to the presence of certain keywords, or as complex as an AI model that labels images according to their content. In the project of labeling news headlines as clickbait or not, we used:

Keyword rules: If “surprising” or “incredible” appears, it’s clickbait.
Word patterns: If “verb + number + noun” appears, it’s clickbait.
AI models: If the AI model says it’s clickbait, it’s clickbait.

We also tracked down data sources with already classified headlines, or downloaded news from clearly manipulative sites.
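The three kinds of labeling function listed above could look roughly like this. It is a minimal sketch under several assumptions: the verb list is a hypothetical stand-in for real part-of-speech tagging, any word after the number is treated as a noun, the model is passed in as a plain scoring function (score_fn, a hypothetical name), and every function abstains instead of voting "not clickbait" when its rule does not fire.

```python
import re

CLICKBAIT = 1
ABSTAIN = -1

KEYWORDS = ("surprising", "incredible")

# Crude approximation of "verb + number + noun": a verb from a small
# hypothetical list, then digits, then any word standing in for the noun.
VERBS = ("discover", "see", "try", "meet", "learn")
PATTERN = re.compile(r"\b(?:%s)\s+\d+\s+[a-z]+" % "|".join(VERBS), re.IGNORECASE)

def lf_keywords(headline):
    """Keyword rule: fire when a trigger word appears in the headline."""
    text = headline.lower()
    return CLICKBAIT if any(word in text for word in KEYWORDS) else ABSTAIN

def lf_verb_number_noun(headline):
    """Word pattern rule: fire on 'verb + number + noun'."""
    return CLICKBAIT if PATTERN.search(headline) else ABSTAIN

def make_lf_model(score_fn, threshold=0.5):
    """Wrap any scoring function (headline -> probability) as a labeling function."""
    def lf_model(headline):
        return CLICKBAIT if score_fn(headline) >= threshold else ABSTAIN
    return lf_model

# Toy stand-in for a trained classifier, only to make the example runnable.
def toy_score(headline):
    return 0.9 if headline.endswith("!") else 0.1

lf_model = make_lf_model(toy_score)
```

Keeping each rule as a small function with the same signature makes it easy to add, remove, or tune one rule without touching the others.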

The iterative process of reviewing and improving labeling functions is key to obtaining high-quality labels and saving time and money.
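One way to decide which labeling function to improve next is to score each one against a small hand-labeled sample. The sketch below measures coverage (how often the function gives a label) and accuracy (how often it is right when it does). The evaluate helper, the sample headlines, and their gold labels are invented for illustration.

```python
ABSTAIN = -1

def lf_keywords(headline):
    """Example labeling function: 1 (clickbait) on trigger words, else abstain."""
    text = headline.lower()
    return 1 if ("surprising" in text or "incredible" in text) else ABSTAIN

def evaluate(lf, samples):
    """Score a labeling function on (text, gold_label) pairs.

    coverage: fraction of samples where the function did not abstain.
    accuracy: fraction of non-abstained predictions that match the gold
              label, or None when the function never fired.
    """
    predictions = [(lf(text), gold) for text, gold in samples]
    fired = [(pred, gold) for pred, gold in predictions if pred != ABSTAIN]
    coverage = len(fired) / len(samples) if samples else 0.0
    accuracy = sum(pred == gold for pred, gold in fired) / len(fired) if fired else None
    return {"coverage": coverage, "accuracy": accuracy}

# Invented hand-labeled sample: 1 = clickbait, 0 = not clickbait.
GOLD = [
    ("An incredible way to cook rice", 1),
    ("Surprising results in local election", 0),
    ("Parliament approves budget", 0),
    ("Discover 10 tricks to save money", 1),
]

report = evaluate(lf_keywords, GOLD)
```

A function with high coverage but low accuracy needs a tighter rule, while one with high accuracy but low coverage is a candidate for more keywords or patterns.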

Conclusions

Data labeling is the key to useful and efficient AI. It is an iterative process in which you improve the quality of your corpus.

In the end, what is worth its weight in gold is the labeled dataset that allows you to train your AI models.

Photo by Brett Jordan
