Recent Advance of Thai Open-Vocabulary Automatic Speech Recognition

Authors

  • Chai Wutiwiwatchai, National Electronics and Computer Technology Center, National Science and Technology Development Agency, Thailand
  • Vataya Chunwijitra, National Electronics and Computer Technology Center, National Science and Technology Development Agency, Thailand
  • Sila Chunwijitra, National Electronics and Computer Technology Center, National Science and Technology Development Agency, Thailand
  • Phuttapong Sertsi, National Electronics and Computer Technology Center, National Science and Technology Development Agency, Thailand
  • Sawit Kasuriya, National Electronics and Computer Technology Center, National Science and Technology Development Agency, Thailand
  • Patcharika Chootrakool, National Electronics and Computer Technology Center, National Science and Technology Development Agency, Thailand
  • Kwanchiva Thangthai, National Electronics and Computer Technology Center, National Science and Technology Development Agency, Thailand
  • Chanchai Junlouchai, National Electronics and Computer Technology Center, National Science and Technology Development Agency, Thailand
  • Kamthorn Krairaksa, Department of Computer Engineering, Faculty of Engineering, Mahidol University, Thailand

Keywords:

open-vocabulary, speech recognition, Thai language

Abstract

We describe the recent development of the NECTEC Thai open-vocabulary automatic speech recognition system. Several techniques were found beneficial over the baseline system: hybrid word-subword language modeling to extend vocabulary coverage under constrained resources; multi-condition noisy acoustic modeling to improve robustness, together with spoken-style language model interpolation using a newly developed large social media speech database; recent state-of-the-art speech features; and online decoding, speech compression, and Docker-based distributed computing to reduce processing and data transmission time. Together, these techniques yield a 29.0% word error rate on open-vocabulary noisy speech test sets, a 42.5% relative reduction from the baseline system. The overall system operates at nearly 1.2xRT, that is, about 1.2 times real time, which is promising for real applications.
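The two headline figures in the abstract can be unpacked with a little arithmetic. The sketch below assumes the usual definitions, which the abstract does not spell out: relative reduction = (baseline WER - new WER) / baseline WER, and real-time factor = processing time / audio duration. The function names are illustrative only, and the baseline figure it prints is implied by those definitions rather than taken from the abstract.

```python
# Minimal sketch, assuming the conventional definitions of relative WER
# reduction and real-time factor (RTF). All names here are hypothetical.


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction, in percent, of new_wer with respect to baseline_wer."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


def implied_baseline_wer(new_wer: float, reduction_pct: float) -> float:
    """Baseline WER implied by a new WER and a relative reduction (both in percent)."""
    return new_wer / (1.0 - reduction_pct / 100.0)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF: values above 1.0 mean processing takes longer than the audio lasts."""
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    # 29.0% WER and 42.5% relative reduction are the figures from the abstract.
    baseline = implied_baseline_wer(29.0, 42.5)
    print(f"Implied baseline WER: {baseline:.1f}%")
    print(f"Check, relative reduction: {relative_reduction(baseline, 29.0):.1f}%")
    # Hypothetical timing: 12 s of processing for a 10 s utterance.
    print(f"RTF: {real_time_factor(12.0, 10.0):.1f}xRT")
```

Under these assumptions the implied baseline is roughly 50% WER, and an RTF of 1.2 corresponds to about 12 seconds of processing for a 10-second utterance.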

Published

2017-04-12

How to Cite

1. Wutiwiwatchai C, Chunwijitra V, Chunwijitra S, Sertsi P, Kasuriya S, Chootrakool P, Thangthai K, Junlouchai C, Krairaksa K. Recent Advance of Thai Open-Vocabulary Automatic Speech Recognition. j.intell.inform. [Internet]. 2017 Apr. 12 [cited 2024 Dec. 23];1(April). Available from: https://ph05.tci-thaijo.org/index.php/JIIST/article/view/123