mySentence: Sentence Segmentation for Myanmar Language using Neural Machine Translation Approach

Authors

  • Thura Aung, King Mongkut’s Institute of Technology Ladkrabang
  • Ye Kyaw Thu, National Electronic & Computer Technology Center (NECTEC)
  • Zar Zar Hlaing, King Mongkut’s Institute of Technology Ladkrabang

Keywords:

Sentence segmentation, Neural machine translation, Sequence tagging

Abstract

A sentence is an independent unit of text: a string of complete words that carries meaningful information. Informal Myanmar language, the variety that most NLP applications such as Automatic Speech Recognition (ASR) deal with, has no predefined rule for marking the end of a sentence. In this paper, we contribute the first corpus for Myanmar sentence segmentation and present the first systematic study of the task, using machine-learning-based sequence tagging as a baseline and a Neural Machine Translation (NMT) approach. Before running the experiments, we prepared two types of data: one containing only sentences, and the other containing both sentences and paragraphs. We trained each model on both types of data and evaluated it on both types of test data. Accuracy was measured with Bilingual Evaluation Understudy (BLEU) and character n-gram F-score (CHRF++) scores, and Word Error Rate (WER) was used for detailed error analysis. The experimental results show that the sequence-to-sequence NMT model trained on both sentence-level and paragraph-level data achieved the best BLEU score (99.78) and improved CHRF++ scores by +18.4 and +16.7 over the best results of the machine learning baselines on the two test sets.
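
As an illustration of the setup described in the abstract, the following minimal Python sketch shows one plausible way to cast sentence segmentation as sequence tagging, to turn the same data into a source/target pair for a translation-style model, and to compute WER over tag sequences. The tag set (B, I, E), the function names, and the placeholder tokens are assumptions made for this sketch; the abstract does not specify the corpus format or the tools used.

```python
from typing import List, Tuple


def tag_sentences(sentences: List[List[str]]) -> List[Tuple[str, str]]:
    """Label every word with a hypothetical boundary tag.

    B = first word of a sentence, E = last word, I = any word in between.
    A one-word sentence is tagged E. This tag set is an assumption of the
    sketch, not the scheme used in the paper.
    """
    tagged = []
    for words in sentences:
        for i, word in enumerate(words):
            if i == len(words) - 1:
                tag = "E"
            elif i == 0:
                tag = "B"
            else:
                tag = "I"
            tagged.append((word, tag))
    return tagged


def to_parallel_pair(tagged: List[Tuple[str, str]]) -> Tuple[str, str]:
    """Build a (source, target) pair: words on one side, tags on the other."""
    source = " ".join(word for word, _ in tagged)
    target = " ".join(tag for _, tag in tagged)
    return source, target


def word_error_rate(reference: List[str], hypothesis: List[str]) -> float:
    """Token-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis tokens.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, 1):
        curr = [i]
        for j, hyp_tok in enumerate(hypothesis, 1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)


if __name__ == "__main__":
    # Placeholder tokens stand in for segmented Myanmar words.
    paragraph = [["w1", "w2", "w3"], ["w4", "w5"]]
    source, target = to_parallel_pair(tag_sentences(paragraph))
    print(source)  # w1 w2 w3 w4 w5
    print(target)  # B I E B E
    # One wrong tag out of five gives a WER of 0.2.
    print(word_error_rate(target.split(), "B I I B E".split()))
```

Under this framing, a tagger predicts one label per word, while a translation-style model maps the word sequence to the tag sequence, so both outputs can be scored against the same reference tags.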

Author Biographies

Thura Aung, King Mongkut’s Institute of Technology Ladkrabang

Thura Aung is a member of the Language Understanding Lab., Myanmar. He is currently pursuing a B.Eng. in Software Engineering at the Faculty of Computer Engineering, School of Engineering, King Mongkut’s Institute of Technology Ladkrabang (KMITL), Bangkok, Thailand. His research interests include Artificial Intelligence (AI), Natural Language Processing (NLP), and Software Engineering.

Ye Kyaw Thu, National Electronic & Computer Technology Center (NECTEC)

Ye Kyaw Thu is a Visiting Professor with the Language & Semantic Technology Research Team (LST), Artificial Intelligence Research Unit (AINRU), National Electronic & Computer Technology Center (NECTEC), Thailand, and an Affiliate Professor at the Cambodia Academy of Digital Technology (CADT), Cambodia. He is also a founder of the Language Understanding Lab., Myanmar. His research lies in the fields of artificial intelligence (AI), natural language processing (NLP) and human-computer interaction (HCI). He actively supervises and co-supervises undergraduate, master’s and doctoral students at several universities, including Assumption University (AU), Kasetsart University (KU), King Mongkut’s Institute of Technology Ladkrabang (KMITL) and Sirindhorn International Institute of Technology (SIIT).

Zar Zar Hlaing, King Mongkut’s Institute of Technology Ladkrabang

Zar Zar Hlaing is a member of the Language Understanding Lab in Myanmar. She is currently working as a Machine Learning and NLP Engineer. She earned her Ph.D. in Information Technology from the School of Information Technology at King Mongkut’s Institute of Technology Ladkrabang (KMITL) in Bangkok, Thailand. She holds a B.C.Sc. and a B.C.Sc. (Hons) in computer science from the University of Computer Studies in Monywa, as well as an M.C.Sc. in computer science from the University of Computer Studies in Mandalay. Her research interests include Artificial Intelligence (AI), Natural Language Processing (NLP), Language Acquisition, and Text Analysis.

Published

2023-11-17

How to Cite

Aung T, Kyaw Thu Y, Hlaing ZZ. mySentence: Sentence Segmentation for Myanmar Language using Neural Machine Translation Approach. j.intell.inform. [Internet]. 2023 Nov. 17 [cited 2024 Jul. 4];9(October):e001. Available from: https://ph05.tci-thaijo.org/index.php/JIIST/article/view/87