Statistical Machine Translation of Myanmar Dialects
Keywords:
Statistical Machine Translation, Parallel Corpus Developing, Myanmar (Burmese), Rakhine (Arakanese), Dawei (Tavoyan), Myeik (Beik)
Abstract
The goal of this work is to contribute the first evaluation of machine translation quality between Standard Myanmar and other Myanmar dialects. Myanmar dialects present many challenges for machine translation, chief among them the lack of data resources. To address this, we developed three Myanmar dialect corpora based on the Myanmar portion of the ASEAN MT corpus: the Myanmar-Rakhine (18K), Myanmar-Myeik (10K), and Myanmar-Dawei (9K) parallel corpora. We carried out 10-fold cross-validation experiments using three statistical machine translation approaches: phrase-based, hierarchical phrase-based, and the operation sequence model (OSM). In addition, we studied two segmentation units: word and syllable. The results show that all three approaches give high and comparable BLEU and RIBES scores between Myanmar and the three dialects (Rakhine, Dawei, and Myeik) in both directions. The OSM approach achieved the highest BLEU and RIBES scores of the three approaches for both word and syllable segmentation. Moreover, we found that syllable segmentation is more suitable than word-level segmentation in terms of translation quality.
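As an illustration of the 10-fold cross-validation setup mentioned in the abstract, the following is a minimal sketch of how a parallel corpus could be partitioned into folds. It is not the authors' procedure: the function name `k_fold_splits`, the round-robin assignment of sentence pairs to folds, and the toy data are assumptions made for this example, and the abstract does not say whether the data were shuffled or whether a separate development set was held out for tuning.

```python
from typing import List, Sequence, Tuple

# A parallel sentence pair: (source sentence, target sentence).
Pair = Tuple[str, str]


def k_fold_splits(
    pairs: Sequence[Pair], k: int = 10
) -> List[Tuple[List[Pair], List[Pair]]]:
    """Return k (train, test) splits of a parallel corpus.

    Hypothetical helper: pairs are assigned to folds round-robin by index,
    so every pair appears in exactly one test set.
    """
    if k < 2 or k > len(pairs):
        raise ValueError("k must be between 2 and the number of sentence pairs")
    splits = []
    for fold in range(k):
        test = [p for i, p in enumerate(pairs) if i % k == fold]
        train = [p for i, p in enumerate(pairs) if i % k != fold]
        splits.append((train, test))
    return splits


# Toy stand-in for a Myanmar-dialect parallel corpus (placeholder strings).
toy_corpus = [(f"source sentence {i}", f"target sentence {i}") for i in range(25)]
splits = k_fold_splits(toy_corpus, k=10)
```

Each of the resulting splits would be used to train and evaluate one system, with the per-fold BLEU and RIBES scores then averaged across folds.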