“Bend the truth”: Benchmark dataset for fake news detection in Urdu language and its evaluation

Amjad, Maaz; Sidorov, Grigori; Zhila, Alisa; Gómez-Adorno, Helena; Voronkov, Ilia; Gelbukh, Alexander

doi:10.3233/JIFS-179905

Issue title: Special section: Selected papers of LKE 2019

Guest editors: David Pinto, Vivek Singh and Fernando Perez

Article type: Research Article

Authors: Amjad, Maaz^a | Sidorov, Grigori^{a; *} | Zhila, Alisa^a | Gómez-Adorno, Helena^b | Voronkov, Ilia^c | Gelbukh, Alexander^a

Affiliations: [a] Centro de Investigación en Computación (CIC), Instituto Politécnico Nacional, Mexico | [b] Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas (IIMAS), Universidad Nacional Autónoma de México, Mexico | [c] Moscow Institute of Physics and Technology, Russia

Correspondence: [*] Corresponding author. Grigori Sidorov, Mexico City, Mexico. E-mail: sidorov@cic.ipn.mx.

Abstract: The paper presents a new corpus for fake news detection in the Urdu language, together with a baseline classification and its evaluation. With Internet use escalating worldwide and the impact of readily available ambiguous information increasing substantially, the challenge of quickly identifying fake news in digital media in various languages becomes more acute. We provide a manually assembled and verified dataset of 900 news articles, 500 annotated as real and 400 as fake, which allows the investigation of automated fake news detection approaches in Urdu. The news articles in the truthful subset come from legitimate news sources, and their validity has been manually verified. For the fake subset, we addressed the known difficulty of finding fake news by hiring professional journalists, all native speakers of Urdu, who were instructed to intentionally write deceptive news articles. The dataset covers five topics: (i) Business, (ii) Health, (iii) Showbiz, (iv) Sports, and (v) Technology. To establish our Urdu dataset as a benchmark, we performed baseline classification. We crafted a variety of text representation feature sets, including word n-grams, character n-grams, functional word n-grams, and their combinations. After applying a variety of feature weighting schemes, we ran a series of classifiers on the train-test split. The results show sizable performance gains with the AdaBoost classifier, which reached an F1 of 0.87 on the fake class (F1Fake) and 0.90 on the real class (F1Real). We report results under several metrics so that future research can be compared conveniently. The dataset is publicly available for research purposes.
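As a rough illustration of two ingredients named in the abstract, the sketch below counts character and word n-grams and computes a per-class F1 score of the kind reported as F1Fake and F1Real. It is a minimal sketch, not the authors' pipeline: the function names (`char_ngrams`, `word_ngrams`, `f1_for_class`), the whitespace tokenization, and the toy labels are all assumptions made for illustration, and no feature weighting or classifier is included.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count overlapping character n-grams in a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text, n):
    """Count overlapping word n-grams (assumes whitespace tokenization)."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1_for_class(gold, predicted, label):
    """F1 score with `label` treated as the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    if tp == 0:
        # No true positives: precision or recall is zero, so F1 is zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels invented for this example; they are not taken from the corpus.
gold = ["fake", "fake", "real", "real", "real"]
pred = ["fake", "real", "real", "real", "fake"]

f1_fake = f1_for_class(gold, pred, "fake")
f1_real = f1_for_class(gold, pred, "real")
```

Reporting F1 separately for each class, as the abstract does, keeps the score on the smaller fake class visible instead of letting it be averaged into a single number.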

Keywords: Fake news detection, Urdu corpus, language resources, benchmark dataset, classification, machine learning

Journal: Journal of Intelligent & Fuzzy Systems, vol. 39, no. 2, pp. 2457-2469, 2020

Published: 31 August 2020
