Integration of ICT survey data and Internet data from enterprises websites at the Italian National Institute of Statistics

Barcaroli, Giulio; Scannapieco, Monica

doi:10.3233/SJI-190553

Article type: Research Article

Authors: Barcaroli, Giulio^{a; *} | Scannapieco, Monica^b

Affiliations: [a] Via Monte Delle Gioie 29, Roma 00199, Italy | [b] Italian National Institute of Statistics, Roma 00184, Italy

Correspondence: [*] Corresponding author: Giulio Barcaroli, Via Monte Delle Gioie 29, Roma 00199, Italy. E-mail: gbarcaroli@gmail.com.

Abstract: Since 2013, the Italian National Institute of Statistics (Istat) has been investigating the potential of Big Data sources for Official Statistics. Among these sources, Internet data originating from website content has been considered one of the most important for producing information about enterprises. In 2018, Istat started producing experimental statistics on the activities that enterprises carry out through their websites (web ordering, job vacancy advertisements, links to social media, etc.). These statistics are a subset of those currently produced by the “Survey on ICT usage and e-Commerce in Enterprises” and are computed from the content of enterprise websites, acquired with web scraping tools and processed with text mining techniques. A machine learning approach is adopted: models are estimated on the subset of enterprises for which both the survey and the web sources are available, with survey data serving as the training set. The fitted model is then applied to the content scraped from successfully reached websites to predict the target values. The experimental statistics are obtained using two different estimators: (i) a fully model-based estimator; (ii) an estimator that combines model-based and survey-based estimates. Across the various domains for which they have been calculated, the three sets of estimates (survey, model and combined) are in most cases not significantly different (i.e. the model and combined estimates lie within the confidence intervals of the survey estimates). Simulations have shown that the Mean Square Errors of these new estimates are competitive with those of the estimates produced in the traditional way.
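
The workflow summarised in the abstract (train on enterprises present in both sources, predict from scraped text, then aggregate) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the toy data, the enterprise identifiers, the naive Bayes classifier, the helper names (`tokenize`, `fit_naive_bayes`, `predict`) and the unweighted means are not taken from the article, whose actual models and estimators (including any sampling weights or calibration) are not specified in this abstract.

```python
import math
from collections import Counter

def tokenize(text):
    # Hypothetical preprocessing: lower-case, whitespace-split, token presence only.
    return set(text.lower().split())

def fit_naive_bayes(texts, labels):
    # Bernoulli-style naive Bayes on token presence (an assumed stand-in model).
    n = {0: 0, 1: 0}
    counts = {0: Counter(), 1: Counter()}
    for text, y in zip(texts, labels):
        n[y] += 1
        counts[y].update(tokenize(text))
    vocab = set(counts[0]) | set(counts[1])
    return {"n": n, "counts": counts, "vocab": vocab}

def predict(model, text):
    # Return the class (0 or 1) with the higher Laplace-smoothed log-likelihood.
    tokens = tokenize(text)
    total = model["n"][0] + model["n"][1]
    scores = {}
    for y in (0, 1):
        score = math.log((model["n"][y] + 1) / (total + 2))
        for tok in model["vocab"]:
            p = (model["counts"][y][tok] + 1) / (model["n"][y] + 2)
            score += math.log(p if tok in tokens else 1 - p)
        scores[y] = score
    return 1 if scores[1] > scores[0] else 0

# Toy data (invented): survey answers to "offers web ordering?" and scraped text.
survey = {"e1": 1, "e2": 0, "e3": 1, "e4": 0}
scraped = {
    "e1": "add to cart checkout order online",
    "e2": "about us contact history",
    "e3": "shop cart checkout payment",
    "e4": "company history contact team",
    "e5": "online shop cart order",
    "e6": "contact us about team",
}

# Step 1: fit on the enterprises available in both sources.
overlap = sorted(set(survey) & set(scraped))
model = fit_naive_bayes([scraped[e] for e in overlap], [survey[e] for e in overlap])

# Step 2: predict the target for every successfully scraped website.
predictions = {e: predict(model, text) for e, text in scraped.items()}

# Step 3a: fully model-based estimate (unweighted share of predicted positives).
model_based = sum(predictions.values()) / len(predictions)

# Step 3b: one possible combination (an assumption): keep the survey answer
# where it exists and fall back to the model prediction elsewhere.
combined_values = [survey.get(e, predictions[e]) for e in scraped]
combined = sum(combined_values) / len(combined_values)

print(f"model-based: {model_based:.2f}, combined: {combined:.2f}")
```

In practice the aggregation step would use sampling weights and domain breakdowns rather than plain means. The sketch only shows how survey answers can serve as training labels and how the two kinds of estimate differ in which values they aggregate.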

Keywords: Big Data, Internet data, web scraping, text mining, machine learning, experimental statistics

Journal: Statistical Journal of the IAOS, vol. 35, no. 4, pp. 643-656, 2019

Published: 10 December 2019
