Affiliations: Institute of Informatics and Telecommunications, NCSR “Demokritos”, Patriarhou Grigoriou and Neapoleos St., 15310, Athens, Greece. E-mail: {izavits,paliourg,petridis}@iit.demokritos.gr | Department of Information and Communication Systems Engineering, University of the Aegean, AI-Lab, 83200 Karlovassi, Samos, Greece. E-mail: georgev@aegean.gr
Note: Corresponding author.
Abstract: This paper proposes a method for learning ontologies from a corpus of text documents. The method identifies concepts in the documents and organizes them into a subsumption hierarchy, without presupposing the existence of a seed ontology. It uncovers the latent topics that are taken to generate the document text, and the discovered topics form the concepts of the new ontology. Concept discovery is performed in a language-neutral way, using probabilistic space reduction techniques over the original term space of the corpus. The method then constructs a subsumption hierarchy over the concepts by performing conditional independence tests between pairs of latent topics, given a third one. The paper reports experimental results on the Genia and Lonely Planet corpora, from the domains of molecular biology and tourism, respectively.
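To make the notion of a conditional independence test between two topics given a third more concrete, the following is a minimal sketch, not the procedure used in the paper. It assumes each document has been reduced to a triple of binary indicators (a, b, c) marking whether topics A, B and C are present, and it uses empirical conditional mutual information with a fixed threshold as the independence criterion. The function names, the binary encoding and the threshold value are all hypothetical choices made for illustration.

```python
from collections import Counter
from math import log


def conditional_mutual_information(samples):
    """Estimate I(A; B | C) in nats from a list of (a, b, c) observations.

    Assumption: each observation is one document, with a, b, c indicating
    the presence (1) or absence (0) of three latent topics.
    """
    n = len(samples)
    n_abc = Counter(samples)
    n_ac = Counter((a, c) for a, _, c in samples)
    n_bc = Counter((b, c) for _, b, c in samples)
    n_c = Counter(c for _, _, c in samples)

    cmi = 0.0
    for (a, b, c), count in n_abc.items():
        # p(a,b,c) * log[ p(a,b,c) p(c) / (p(a,c) p(b,c)) ];
        # the 1/n factors cancel inside the logarithm.
        cmi += (count / n) * log(count * n_c[c] / (n_ac[(a, c)] * n_bc[(b, c)]))
    return cmi


def conditionally_independent(samples, threshold=1e-3):
    """Hypothetical decision rule: treat A and B as independent given C
    when the estimated conditional mutual information is near zero."""
    return conditional_mutual_information(samples) <= threshold


# A and B are both copies of C: once C is known, A tells nothing about B.
explained_by_c = [(0, 0, 0)] * 50 + [(1, 1, 1)] * 50
# A and B always agree, regardless of C: knowing C does not explain the link.
not_explained_by_c = [(0, 0, 0), (1, 1, 0), (0, 0, 1), (1, 1, 1)] * 25

print(conditionally_independent(explained_by_c))      # True
print(conditionally_independent(not_explained_by_c))  # False
```

In practice a statistical test with a significance level would usually replace the fixed threshold, since the empirical estimate is rarely exactly zero on finite data.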