mySentence: Sentence Segmentation for Myanmar Language using Neural Machine Translation Approach
Keywords:
Sentence segmentation, Neural machine translation, Sequence tagging
Abstract
A sentence is an independent unit: a string of complete words that carries meaningful information in a text. Informal Myanmar language, the register that most NLP applications such as Automatic Speech Recognition (ASR) have to handle, has no predefined rule for marking the end of a sentence. In this paper, we contribute the first corpus for Myanmar sentence segmentation and present the first systematic study of the task, using machine learning based sequence tagging as the baseline and comparing it with a Neural Machine Translation (NMT) approach. Before conducting the experiments, we prepared two types of data: one containing only sentences and the other containing both sentences and paragraphs. We trained each model on both types of data and evaluated it on both types of test data. Performance was measured with Bilingual Evaluation Understudy (BLEU) and character n-gram F-score (CHRF++) scores, and Word Error Rate (WER) was used for detailed error analysis. The experimental results show that the sequence-to-sequence NMT model trained on both sentence-level and paragraph-level data, which obtained the best BLEU score (99.78), achieved higher CHRF++ scores (+18.4 and +16.7) than the best results of the machine learning based sequence tagging models on the two types of test data.
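As a rough illustration of how sentence segmentation can be cast as sequence tagging and scored with WER, the sketch below labels each word by its position in a sentence and compares a predicted tag sequence against the reference. The tag names (B, O, E), the placeholder words, and the function names are assumptions made for this example and are not taken from the paper. WER is computed in its standard form: the token-level edit distance divided by the reference length.

```python
from typing import List, Sequence


def tag_sentence(words: Sequence[str]) -> List[str]:
    """Tag one sentence: B = first word, E = last word, O = any word between.

    Hypothetical tag set for illustration; a one-word sentence is tagged E
    so that its sentence boundary is still marked.
    """
    if not words:
        return []
    if len(words) == 1:
        return ["E"]
    return ["B"] + ["O"] * (len(words) - 2) + ["E"]


def tag_paragraph(sentences: Sequence[Sequence[str]]) -> List[str]:
    """Concatenate the per-sentence tags of a paragraph into one sequence."""
    tags: List[str] = []
    for sentence in sentences:
        tags.extend(tag_sentence(sentence))
    return tags


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Token-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """WER = edit distance / number of reference tokens."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Placeholder words stand in for segmented Myanmar words.
paragraph = [["w1", "w2", "w3"], ["w4", "w5"]]
reference_tags = tag_paragraph(paragraph)

# A hypothetical model output that misses the first sentence boundary.
predicted_tags = ["B", "O", "O", "B", "E"]
wer = word_error_rate(reference_tags, predicted_tags)
print(reference_tags, f"WER = {wer:.2f}")
```

In this toy case a single missed sentence end counts as one substitution among five reference tags. In a translation-style setup, the word sequence would act as the source side and the tag sequence as the target side.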
JOURNAL OF INTELLIGENT INFORMATICS AND SMART TECHNOLOGY, VOL. 9, OCTOBER 2023