Recent Advance of Thai Open-Vocabulary Automatic Speech Recognition

Chai Wutiwiwatchai; Vataya Chunwijitra; Sila Chunwijitra; Phuttapong Sertsi; Sawit Kasuriya; Patcharika Chootrakool; Kwanchiva Thangthai; Chanchai Junlouchai; Kamthorn Krairaksa

doi:10.14456/jiist.2017.1

Authors

Chai Wutiwiwatchai, National Electronics and Computer Technology Center, National Science and Technology Development Agency, Thailand
Vataya Chunwijitra, National Electronics and Computer Technology Center, National Science and Technology Development Agency, Thailand
Sila Chunwijitra, National Electronics and Computer Technology Center, National Science and Technology Development Agency, Thailand
Phuttapong Sertsi, National Electronics and Computer Technology Center, National Science and Technology Development Agency, Thailand
Sawit Kasuriya, National Electronics and Computer Technology Center, National Science and Technology Development Agency, Thailand
Patcharika Chootrakool, National Electronics and Computer Technology Center, National Science and Technology Development Agency, Thailand
Kwanchiva Thangthai, National Electronics and Computer Technology Center, National Science and Technology Development Agency, Thailand
Chanchai Junlouchai, National Electronics and Computer Technology Center, National Science and Technology Development Agency, Thailand
Kamthorn Krairaksa, Department of Computer Engineering, Faculty of Engineering, Mahidol University, Thailand

DOI:

https://doi.org/10.14456/jiist.2017.1

Keywords:

open-vocabulary, speech recognition, Thai language

Abstract

We describe recent developments in the NECTEC Thai open-vocabulary automatic speech recognition system. Several techniques proved beneficial over the baseline system: hybrid word-subword language modeling to extend vocabulary coverage under constrained resources; multi-conditioned noisy acoustic modeling to improve system robustness, together with spoken-style language model interpolation using a newly developed large social media speech database; recent state-of-the-art speech features; and online decoding, speech compression, and Docker-based distributed computing to reduce processing and data transmission time. Together, these techniques yield a 29.0% word error rate on open-vocabulary noisy speech test sets, a 42.5% relative reduction from the baseline system. The overall system operates at nearly 1.2xRT, which is promising for real applications.
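The two headline figures in the abstract can be related with some simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, the baseline word error rate is not stated in the abstract and is merely back-computed from the reported 29.0% and 42.5% figures, and the real-time factor is assumed to follow the usual definition (processing time divided by audio duration).

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction, as a fraction of the baseline WER."""
    return (baseline_wer - new_wer) / baseline_wer


def implied_baseline(new_wer: float, rel_reduction: float) -> float:
    """Baseline WER implied by a new WER and a relative reduction.

    Assumption: "42.5% relatively lower" means
    new_wer = baseline_wer * (1 - 0.425).
    """
    return new_wer / (1.0 - rel_reduction)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor (xRT): processing time per second of audio."""
    return processing_seconds / audio_seconds


# Figures reported in the abstract.
reported_wer = 29.0        # percent
reported_relative = 0.425  # 42.5% relative reduction

# Back-computed value; NOT a number reported in the abstract.
baseline = implied_baseline(reported_wer, reported_relative)
print(f"implied baseline WER: {baseline:.1f}%")  # about 50.4%

# Hypothetical timing: 12 s of processing for a 10 s utterance gives 1.2xRT.
print(f"real-time factor: {real_time_factor(12.0, 10.0):.1f}xRT")
```

Under these assumptions, a system at 1.2xRT would need roughly 12 seconds to process a 10-second utterance, and the implied baseline word error rate would be close to 50%.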

