Download dataset. (3 November 2016)

Online news editors repeatedly ask themselves the same question: what is this news article missing before it can go online? This question is not easy to answer with computational linguistic methods. With this dataset, we address the question and characterise the constituents of news article editorial quality. More specifically, we identify 14 aspects related to the content of news articles.


The dataset comprises 500 news articles, fully annotated with 14 aspects defining a linguistic benchmark for assessing the quality of online news articles.

Please cite our paper if you use the dataset:

I. Arapakis, F. Peleja, B. Berkant and J. Magalhaes, Linguistic Benchmarks of Online News Article Quality, ACL 2016.

Novaemötions dataset

Download dataset

This dataset contains facial expression images captured with the novaemötions game. It contains over 40,000 images, each labeled with the expression the player was challenged to perform and the expression recognized by the game algorithm, and augmented with labels obtained through crowdsourcing.

BBC cross-media dataset

This dataset is intended for cross-media data analysis. It contains a set of news articles, each with its text and the corresponding image illustrations. The dataset was used in a Web news classification task.

If you are interested in obtaining the dataset, contact

If you use this dataset, please cite this article:

Web news categorization using a cross-media document graph
José Iria, Fabio Ciravegna, João Magalhães
Proceedings of the ACM international conference on Image and Video Retrieval (ACM CIVR 2009).