Automated transcription of conversational Call Center speech – with respect to non-verbal acoustic events

Sárosi, Gellért; Tarján, Balázs; Fegyó, Tibor; Mihajlik, Péter

doi:10.3233/IDT-140195

Issue title: Communicative social signals: Computational and behavioural aspects of human-human and human-machine interaction

Guest editors: Klára Vicsi^x and Anna Esposito^y

Article type: Research Article

Authors: Sárosi, Gellért^{a; *} | Tarján, Balázs^a | Fegyó, Tibor^{a; b} | Mihajlik, Péter^{a; c}

Affiliations: [a] Department of Telecommunication and Media Informatics, Budapest University of Technology and Economics, Hungary | [b] Aitia International Inc., Hungary | [c] THINKTech Research Center Nonprofit LLC, Hungary | [x] Laboratory of Speech Acoustics, Department of Telecommunications and Media-Informatics, Budapest University of Technology and Economics, Budapest, Hungary | [y] Dipartimento di Psicologia and IIASS, Seconda Università di Napoli, Vietri Sul Mare, Salerno, Italy

Correspondence: [*] Corresponding author: Gellért Sárosi, Department of Telecommunication and Media Informatics, Budapest University of Technology and Economics, Hungary. E-mail: sarosi@mit.bme.hu

Abstract: This paper summarizes our recent efforts to transcribe real-life Call Center conversations automatically, taking non-verbal acoustic events into account as well. Future Call Centers, as cognitive infocom systems, must respond automatically not only to well-formed utterances but also to spontaneous and non-word speaker manifestations, and must be robust against sudden noises. Conversational telephony speech transcription is itself a major challenge; we address it primarily on real-life (Bank and Insurance) tasks. In addition, we introduce several non-word acoustic modeling approaches and their integration into LVCSR (Large Vocabulary Continuous Speech Recognition). In the experiments, one- and two-channel transcription results (client and agent speech merged into one audio stream or left in two separate streams), cross-task results, and the handling of insufficient transcription data are investigated in parallel with non-verbal acoustic event modeling. On the agent side, a word error rate below 15% could be achieved, and the best relative error rate reduction is 20%, due to the inclusion of various written corpora and to acoustic event handling.
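The two figures quoted in the abstract, word error rate and relative error rate reduction, follow standard definitions. The sketch below is a minimal illustration of those definitions only, not the authors' scoring setup: the function names, the whitespace tokenization, and the example sentences are assumptions made for this example, and none of the numbers come from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length.

    Computed as the word-level Levenshtein distance divided by the number
    of reference words. Tokenization by whitespace is an assumption.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative error rate reduction: (baseline - improved) / baseline."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Hypothetical transcripts, invented for illustration.
    print(word_error_rate("a b c d", "a x c"))
    # Hypothetical error rates, invented for illustration.
    print(relative_reduction(0.25, 0.20))
```

In the first call the four-word reference differs from the hypothesis by one substitution and one deletion, so the word error rate is 2/4. The second call shows why a relative figure needs care when reading results: a drop from 25% to 20% is a 5-point absolute reduction but a 20% relative reduction.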

Keywords: Speech, call centers, LVCSR, transcription

Journal: Intelligent Decision Technologies, vol. 8, no. 4, pp. 265-275, 2014

Published: 27 June 2014
