URLs Can Facilitate Machine Learning Classification of News Stories Across Languages and Contexts

Ernesto de León; Susan Vermeer; Damian Trilling

doi:10.5117/CCR2023.2.4.DELE

E-ISSN: 2665-9085

Authors: Ernesto de León¹, Susan Vermeer² & Damian Trilling³

Affiliations: ¹ PhD Student ² Amsterdam School of Communication Research (ASCoR), University of Amsterdam, the Netherlands ³ Amsterdam School of Communication Research (ASCoR), University of Amsterdam, the Netherlands
Publisher: Amsterdam University Press
Source: Computational Communication Research, Volume 5, Issue 2, Jan 2023, p. 1
DOI: https://doi.org/10.5117/CCR2023.2.4.DELE
Language: English

Abstract

Comparative scholars studying political news content at scale face the challenge of addressing multiple languages. While many train individual supervised machine learning classifiers for each language, this is a costly and time-consuming process. We propose that instead of relying on thematic labels generated by manual coding, researchers can use ‘distant’ labels created by cues in article URLs. Sections reflected in URLs (e.g., nytimes.com/politics/) can therefore help create training material for supervised machine learning classifiers. Using cues provided by news media organizations, such an approach allows for efficient political news identification at scale while facilitating implementation across languages. Using a dataset of approximately 870,000 URLs of news-related content from four countries (Italy, Germany, the Netherlands, and Poland), we test this method by providing a comparison to ‘classical’ supervised machine learning and a multilingual BERT model, across four news topics. Our results suggest that the use of URL section cues to distantly annotate texts provides a cheap and easy-to-implement way of classifying large volumes of news texts that can save researchers many valuable resources without having to sacrifice quality.
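As an illustration of the idea described in the abstract, the following minimal Python sketch derives a ‘distant’ topic label from the section segment of an article URL. It is not the authors' implementation: the function name `distant_label`, the `SECTION_CUES` dictionary, and the cue words it contains are assumptions chosen for this example, and the example URLs other than nytimes.com/politics/ are placeholders.

```python
from urllib.parse import urlsplit

# Hypothetical cue lists mapping a topic label to section names that might
# appear in URL paths. Illustrative assumptions only, not the article's cues.
SECTION_CUES = {
    "politics": {"politics", "politik", "politica", "politiek", "polityka"},
    "sports": {"sport", "sports"},
}


def distant_label(url, cues=SECTION_CUES):
    """Return a topic label if a path segment of the URL matches a cue, else None."""
    # urlsplit only recognises the host when a scheme or leading "//" is present.
    if "://" not in url:
        url = "//" + url
    path = urlsplit(url).path.lower()
    segments = [segment for segment in path.split("/") if segment]
    for segment in segments:
        for label, names in cues.items():
            if segment in names:
                return label
    return None


# Build (url, label) pairs that could serve as distantly labelled training
# material; URLs without a recognised section cue are left out.
urls = [
    "nytimes.com/politics/",
    "https://example.org/sport/2020/match.html",
    "https://example.org/weather/today",
]
training_pairs = [(u, distant_label(u)) for u in urls if distant_label(u) is not None]
```

In a sketch like this, the labelled pairs would be joined with the corresponding article texts and passed to a supervised classifier in place of manually coded labels; extending the approach to another language only requires adding that language's section names to the cue lists.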


Article Type: Research Article

Keyword(s): distant classification; machine learning; multilingual data; political news; text classification

