Bilingual Corpus-based Hybrid POS Tagger for Low Resource Tamil Language: A Statistical approach

Senthamizh Selvi, S.; Anitha, R.

doi:10.3233/JIFS-221278

Article type: Research Article

Authors: Senthamizh Selvi, S.^{a; *} | Anitha, R.^b

Affiliations: [a] Department of Computer Science and Engineering, Sri Venkateswara College of Engineering, Tamil Nadu, India | [b] Department of Computer Science and Engineering, Sri Venkateswara College of Engineering, Tamil Nadu, India

Correspondence: [*] Corresponding author. S. Senthamizh Selvi, Research Scholar, Anna University, Department of Computer Science and Engineering, Sri Venkateswara College of Engineering, Tamil Nadu, India. E-mail: senthamizhselvi@svce.ac.in.

Abstract: In India, most of the available science and technology resources are in English. Developing an automatic language translation engine from English (source language) to Tamil (target language) is essential for people who need technical resources in their native language. The challenges in designing such engines with Natural Language Processing (NLP) tools include lexical, structural, and syntax-level ambiguity. Addressing these challenges requires a Part-Of-Speech (POS) tagger. Verb-framed languages such as Tamil, Japanese, and many languages in the Romance, Semitic, and Mayan families are morphologically rich but lack either large annotated corpora or manually constructed linguistic resources for building POS taggers. Moreover, Tamil is a low-resource language with high word-sense ambiguity and free word order, which makes designing Tamil POS taggers challenging. In this paper, we propose a hybrid POS tagger algorithm for Tamil using cross-lingual transformation learning techniques. It is a novel mining-based algorithm (MT) that finds the English equivalents of Tamil words in a small English-Tamil bilingual unannotated parallel corpus. To enhance the performance of MT, we developed Tamil-specific auxiliary algorithms such as a keyword-based tagging algorithm (KT) and a verb pattern-based tagging algorithm (VT). We also developed a unique pair occurrence-tagging algorithm (UT) to find Tamil-English word pairs that occur only once. Our experiments show that, after improving the context-based bilingual corpus into a bilingual parallel corpus and leaving out one-time occurrence words, the proposed hybrid POS tagger predicts tags for 81.15% of words, with 73.51% accuracy and 90.50% precision. The evaluations indicate that our algorithms can generate language resources that can improve the performance of NLP tasks in Tamil.
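The general idea of mining word equivalents from an unannotated parallel corpus and projecting POS tags across languages can be illustrated with a minimal sketch. This is not the MT algorithm described in the paper, whose details the abstract does not give; it substitutes a standard Dice co-occurrence score. The function names (`mine_equivalents`, `project_tags`), the `min_pair_count` threshold, the romanized toy sentences, and the English tag lexicon are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def mine_equivalents(pairs, min_pair_count=2):
    """Map each target word to the source word with the highest Dice score.

    Assumption: sentence-level co-occurrence with a Dice coefficient stands in
    for the paper's mining step. Pairs seen fewer than `min_pair_count` times
    are skipped, so one-time occurrences stay unresolved here.
    """
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    for src_sent, tgt_sent in pairs:
        src_words, tgt_words = set(src_sent.split()), set(tgt_sent.split())
        src_freq.update(src_words)
        tgt_freq.update(tgt_words)
        pair_freq.update(product(src_words, tgt_words))

    best = {}  # target word -> (source word, dice score)
    for (s, t), n in pair_freq.items():
        if n < min_pair_count:
            continue
        dice = 2 * n / (src_freq[s] + tgt_freq[t])
        if t not in best or dice > best[t][1]:
            best[t] = (s, dice)
    return {t: s for t, (s, _) in best.items()}


def project_tags(equivalents, src_tags):
    """Copy the source-language tag onto each aligned target word."""
    return {t: src_tags[s] for t, s in equivalents.items() if s in src_tags}


# Toy English / romanized Tamil sentence pairs (simplified, illustrative only).
pairs = [
    ("i read a book", "naan puththagam padikkiren"),
    ("i buy the book", "naan puththagam vaangugiren"),
    ("i read the letter", "naan kaditham padikkiren"),
]
# Hypothetical English tag lexicon; a real setup would use an English tagger.
src_tags = {"i": "PRON", "read": "VERB", "buy": "VERB",
            "book": "NOUN", "letter": "NOUN", "a": "DET", "the": "DET"}

equivalents = mine_equivalents(pairs)
projected = project_tags(equivalents, src_tags)

# Share of the target vocabulary that received a tag.
target_vocab = {w for _, tgt in pairs for w in tgt.split()}
coverage = len(projected) / len(target_vocab)
print(projected, coverage)
```

In this toy run the two Tamil words that appear in only one sentence pair receive no tag, which is why the share of tagged words is reported separately from the accuracy on the words that were tagged.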

Keywords: Natural language processing, part-of-speech tagger, sandhi, bilingual parallel corpus, cross-lingual transformation learning

Journal: Journal of Intelligent & Fuzzy Systems, vol. 43, no. 6, pp. 8329-8348, 2022

Published: 11 November 2022
