Introduction
As part of our learning-by-doing philosophy, a week ago we set out to build an AI model that classifies news headlines as clickbait or not.
You can try it here: News Headline Clickbait Prediction
In this post, we’re going to share how we did it and what we learned in the process.
Phase 1: Collecting Data
Without a doubt, this is the most important part, along with labeling and cleaning the corpus that we’ll later use to train the model. To get the data, we searched the web, specifically these sources:
Kaggle:
A classic source for datasets of all kinds. We used this one: Clickbait News Detection Competition
This dataset has 24,000 news items, each with a headline, text, and label. The labels were assigned by “someone” or “something”, so what counts as clickbait for you may not count for the dataset, and vice versa.
In any case, we imported the records, cleaned them to keep only the headlines and labels, translated them, and saved the result to a CSV file.
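The cleaning step for a dataset like this could look roughly like the sketch below. It uses only Python's standard csv module, and it assumes the raw file has columns named title and label; those names, along with the functions keep_headline_and_label and clean_csv, are hypothetical and chosen for illustration. The translation step is not covered here.

```python
import csv
import io


def keep_headline_and_label(rows, headline_col="title", label_col="label"):
    """Yield (headline, label) pairs, skipping rows where either is empty.

    The column names are assumptions; adjust them to the real dataset.
    """
    for row in rows:
        headline = (row.get(headline_col) or "").strip()
        label = (row.get(label_col) or "").strip()
        if headline and label:
            yield headline, label


def clean_csv(source, target, headline_col="title", label_col="label"):
    """Copy only headline and label from `source` to `target`.

    Both arguments are open text file objects. Returns the number of rows kept.
    """
    reader = csv.DictReader(source)
    writer = csv.writer(target)
    writer.writerow(["headline", "label"])
    kept = 0
    for headline, label in keep_headline_and_label(reader, headline_col, label_col):
        writer.writerow([headline, label])
        kept += 1
    return kept


if __name__ == "__main__":
    # Tiny in-memory stand-in for the raw file (made-up rows, not real data).
    raw = io.StringIO(
        "title,text,label\n"
        "You won't believe this,body,clickbait\n"
        ",body,news\n"
        "Budget approved,body,news\n"
    )
    cleaned = io.StringIO()
    print(clean_csv(raw, cleaned), "rows kept")
    print(cleaned.getvalue())
```

With real files, the same function can be called with two handles opened with newline="" and encoding="utf-8", as the csv module documentation recommends.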
Hugging Face:
Another source of datasets and AI models. Here we found a dataset of headlines.
As in the previous case, we imported its 32,000 records, cleaned and translated them, and saved them to another CSV file.
GitHub:
Searching around, we found a dataset of Spanish news headlines in the project clickbait headline generator by Praveen.
SerpApi:
SerpApi is an API for extracting data from Google search results. We used it to pull news headlines from different media outlets, which we then labeled as clickbait or not. Specifically, we extracted news for Spain from Google News.
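As a rough illustration of that kind of extraction, the sketch below builds a SerpApi request for Google News in Spain and pulls the headlines out of the JSON response. It is not the code we used. It assumes the search.json endpoint, the google_news engine, the gl and hl parameters, and a news_results list whose items carry a title field; check these against the SerpApi documentation before relying on them. The function names and the YOUR_API_KEY placeholder are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint; verify against the SerpApi documentation.
SERPAPI_ENDPOINT = "https://serpapi.com/search.json"


def build_news_url(api_key, query=None, country="es", language="es"):
    """Build a request URL for Google News results (Spain by default)."""
    params = {
        "engine": "google_news",  # assumed engine name
        "gl": country,            # country of the results
        "hl": language,           # interface language
        "api_key": api_key,
    }
    if query:
        params["q"] = query
    return f"{SERPAPI_ENDPOINT}?{urlencode(params)}"


def extract_headlines(payload):
    """Return the non-empty titles found under the assumed 'news_results' key."""
    results = payload.get("news_results", [])
    return [item["title"].strip() for item in results if item.get("title")]


def fetch_headlines(api_key, query=None):
    """Perform the HTTP request and return the headlines (needs a valid key)."""
    with urlopen(build_news_url(api_key, query)) as response:
        return extract_headlines(json.load(response))


if __name__ == "__main__":
    # Prints the URL only; no network call is made here.
    print(build_news_url("YOUR_API_KEY"))
```

The headlines come back unlabeled, so deciding which ones are clickbait remains a separate step.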

