A joint model of extended LDA and IBTM over streaming Chinese short texts

Zhu, Longxia; Xu, Hua; Xu, Yunfeng; Xiao, Yi; Li, Jia; Deng, Junhui; Sun, Xiaomin; Bai, Xiaoli

doi:10.3233/IDA-183836

Article type: Research Article

Authors: Zhu, Longxia^{a; b} | Xu, Hua^{a; *} | Xu, Yunfeng^{a; b} | Xiao, Yi^c | Li, Jia^a | Deng, Junhui^a | Sun, Xiaomin^a | Bai, Xiaoli^d

Affiliations: [a] State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China | [b] School of Information Science and Engineering, Hebei University of Science and Technology, Shijiazhuang, Hebei 050018, China | [c] Jiangxi Samton Technology Development Co. LTD, Jiangxi 330013, China | [d] Shijiazhuang Preschool Teachers College, Shijiazhuang, Hebei 050228, China

Correspondence: [*] Corresponding author: Hua Xu, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China. Tel.: +86 1062796450; Fax: +86 1062771792; E-mail: xuhua@tsinghua.edu.cn.

Abstract: With the prevalence of short texts, discovering the topics within them has become an important task. The Biterm Topic Model (BTM) is better suited to discovering topics in short texts than traditional topic models. However, challenges remain: when applied to short texts, BTM ignores document-topic semantic information and therefore fails to capture the true intentions of users. In addition, it is a static method and cannot handle streaming short texts, where each new text must be processed as soon as it arrives. To preserve document-topic information and obtain the topic distribution of a new short text immediately, we propose a joint model based on the online algorithms of Latent Dirichlet Allocation (LDA) and BTM, which combines the merits of both models. It not only alleviates sparsity in short texts by using the online algorithm of BTM, namely the Incremental Biterm Topic Model (IBTM), but also preserves document-topic information through an extended LDA. Considering the differences between written English and Chinese, we use combined words in short texts as keywords to extend the length of the short texts and preserve the true intentions of users. Experimental results on two real-world datasets show that our method outperforms the baseline methods. Finally, we describe an application of our method to the task of discovering user interest tags.

Keywords: Streaming Chinese short text, topic discovery, topic models, online algorithms

Journal: Intelligent Data Analysis, vol. 23, no. 3, pp. 681-699, 2019

Published: 28 April 2019
