Recent Advance of Thai Open-Vocabulary Automatic Speech Recognition
Keywords:
open-vocabulary, speech recognition, Thai language
Abstract
We describe recent developments in the NECTEC Thai open-vocabulary automatic speech recognition system. The techniques found beneficial over the baseline system are: hybrid word-subword language modeling, to improve vocabulary coverage under constrained resources; multi-conditioned noisy acoustic modeling, to improve system robustness, together with spoken-style language model interpolation using a newly developed large social media speech database; recent state-of-the-art speech features; and, lastly, online decoding, speech compression, and Docker-based distributed computing, to reduce processing and data transmission time. Together, these techniques yield a 29.0% word error rate on open-vocabulary noisy speech test sets, a 42.5% relative reduction from the baseline system. The overall system operates at nearly 1.2 times real time (1.2xRT), which is promising for real applications.
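The two headline figures can be unpacked with a little arithmetic. The sketch below is illustrative only: it assumes that "relative" means (baseline WER - new WER) / baseline WER and that the real-time factor is processing time divided by audio duration. The function names are hypothetical, and the baseline WER it prints is implied by the reported numbers rather than stated in the abstract.

```python
def implied_baseline_wer(new_wer: float, relative_reduction: float) -> float:
    """Baseline WER implied by a new WER and its relative reduction.

    Assumes relative_reduction = (baseline - new_wer) / baseline.
    """
    if not 0.0 <= relative_reduction < 1.0:
        raise ValueError("relative_reduction must be in [0, 1)")
    return new_wer / (1.0 - relative_reduction)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor (xRT): processing time per second of audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    # 29.0% WER after a 42.5% relative reduction (figures from the abstract).
    baseline = implied_baseline_wer(29.0, 0.425)
    print(f"Implied baseline WER: {baseline:.1f}%")

    # Hypothetical timing: 12 s of processing for a 10 s utterance is 1.2xRT.
    print(f"Real-time factor: {real_time_factor(12.0, 10.0):.1f}xRT")
```

Under these assumptions, the reported figures correspond to a baseline word error rate of roughly 50%, and a 1.2xRT system needs about 12 seconds to process a 10-second utterance.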