Taniwa | Product factory

How We Built an AI Headline Classification Model (in One Week)

Apr 10, 2024 | José Luis Marina

Introduction

In keeping with our learning-by-doing philosophy, a week ago we set out to build an AI model that classifies news headlines as clickbait or not.

You can try it here: News Headline Clickbait Prediction

In this post, we’re going to share how we did it and what we learned in the process.

Phase 1: Collecting Data

Without a doubt, this is the most important part, along with labeling and cleaning the corpus that we later use to train the model. To get the data, we searched the web and dug into the following sources:

Kaggle:

A classic source for datasets of all kinds. We used this one: Clickbait News Detection Competition

This dataset has 24,000 news items, each with a headline, text, and label. The labels were assigned by "someone" or "something", so a headline you consider clickbait may not be labeled as such in the dataset, and vice versa.

In any case, we loaded the items, cleaned them up to keep only the headlines and labels, translated them, and saved the result to a CSV file.
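The cleaning step can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code we actually ran: the column names (`title`, `label`), the label values, and the sample rows are made up for the example, and the translation step is left out.

```python
import csv
import io


def clean_headlines(rows, headline_col="title", label_col="label"):
    """Keep only (headline, label) pairs, dropping empty and duplicate headlines.

    The column names are assumptions; adjust them to the dataset at hand.
    """
    seen = set()
    cleaned = []
    for row in rows:
        # Collapse runs of whitespace and trim both ends.
        headline = " ".join((row.get(headline_col) or "").split())
        label = (row.get(label_col) or "").strip().lower()
        if not headline or not label:
            continue  # skip rows missing either field
        key = headline.casefold()
        if key in seen:
            continue  # skip case-insensitive duplicates
        seen.add(key)
        cleaned.append({"headline": headline, "label": label})
    return cleaned


def to_csv(records):
    """Serialize the cleaned records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["headline", "label"])
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Tiny made-up sample standing in for the downloaded file.
raw = io.StringIO(
    "title,text,label\n"
    "  You won't   believe this trick ,Some body text,clickbait\n"
    "Parliament approves the budget,More body text,news\n"
    "you won't believe this trick,Duplicate row,clickbait\n"
    ",Row without a headline,news\n"
)
records = clean_headlines(csv.DictReader(raw))
csv_text = to_csv(records)
print(csv_text)
```

With a real file, the `io.StringIO` sample would be replaced by `open(path, newline="", encoding="utf-8")`, and the text returned by `to_csv` would be written to disk.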

Hugging Face:

Another source of datasets and AI models. Specifically, we found a dataset of headlines.

As with the Kaggle data, we loaded these 32,000 records, cleaned and translated them, and saved them to another CSV file.

GitHub:

Searching around, we found a dataset of Spanish news headlines in Praveen's clickbait headline generator project.

SerpApi:

SerpApi is an API for extracting data from Google search results. We used it to pull news headlines from different media outlets, which we then labeled as clickbait or not. Specifically, we extracted news for Spain from Google News.
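A request of this kind might look roughly like the sketch below. It is an illustration under assumptions rather than our actual script: the endpoint, the `google_news` engine name, the `gl`/`hl` parameters, and the `news_results` response field follow our reading of SerpApi's public documentation and should be checked against it, and the helper names and sample payload are hypothetical. The labeling step is not shown.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint; verify against the SerpApi documentation.
SERPAPI_ENDPOINT = "https://serpapi.com/search.json"


def build_search_url(api_key, query, country="es", language="es"):
    """Build a Google News search URL targeting Spain by default."""
    params = {
        "engine": "google_news",  # assumed engine name for Google News
        "q": query,
        "gl": country,            # country of the search
        "hl": language,           # interface language
        "api_key": api_key,
    }
    return SERPAPI_ENDPOINT + "?" + urllib.parse.urlencode(params)


def extract_headlines(payload):
    """Pull non-empty titles (and links) out of a decoded JSON response."""
    headlines = []
    for item in payload.get("news_results", []):  # assumed field name
        title = (item.get("title") or "").strip()
        if title:
            headlines.append({"headline": title, "link": item.get("link", "")})
    return headlines


def fetch_headlines(api_key, query):
    """Call the API and return the extracted headlines (needs network access)."""
    url = build_search_url(api_key, query)
    with urllib.request.urlopen(url, timeout=30) as response:
        payload = json.load(response)
    return extract_headlines(payload)


# Made-up payload showing the shape extract_headlines expects.
sample_payload = {
    "news_results": [
        {"title": " Titular de ejemplo ", "link": "https://example.com/a"},
        {"title": "", "link": "https://example.com/b"},
        {"title": "Otro titular de ejemplo"},
    ]
}
sample_headlines = extract_headlines(sample_payload)
print(sample_headlines)
```

Keeping the URL building and the response parsing separate from the network call makes it possible to test both without spending API credits.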

  • AI
  • BERT
  • BETO
  • Classification
  • NLP
  • Clickbait