Deep Siamese Neural Network Vs Random Forest for Myanmar Language Paraphrase Classification

Authors

  • Myint Myint Htay, University of Technology (Yatanarpon Cyber City), Pyin Oo Lwin, Myanmar
  • Ye Kyaw Thu, CADT
  • Hnin Aye Thant, University of Technology (Yatanarpon Cyber City), Pyin Oo Lwin, Myanmar
  • Thepchai Supnithi, National Electronic & Computer Technology Center (NECTEC)

Keywords:

Semantic Text Similarity, Burmese (Myanmar Language), Deep Siamese Neural Network, Random Forest Modeling, Manhattan LSTM (MaLSTM), Harry Tool

Abstract

Generally, paraphrase detection or semantic similarity requires understanding a sentence as a whole, not just finding synonyms of its words. It is an important research area in natural language processing that plays a significant role in many applications such as question answering, summarization, information retrieval, and information extraction. To the best of our knowledge, no studies have been conducted on paraphrase detection and classification for Burmese (Myanmar language). In this research paper, we compare the results of Burmese paraphrase classification using a Deep Siamese Neural Network with MaLSTM (Manhattan LSTM) and a Random Forest classifier with 21 features. More specifically, the contribution of this paper is the development of a human-annotated corpus of Burmese paraphrase and non-paraphrase sentence pairs containing 40,461 pairs, together with open-test data of 1,000 sentence pairs. According to our comparison, the Random Forest classifier is more accurate and useful for Burmese paraphrase classification than the Deep Siamese Neural Network, even with limited data.
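As background (this sketch is not part of the paper), the MaLSTM model named above scores a sentence pair by encoding each sentence with shared-weight LSTMs and taking the exponential of the negative Manhattan (L1) distance between the two final hidden states, following Mueller and Thyagarajan (2016). Below is a minimal, hypothetical PyTorch sketch of such a scorer; the vocabulary size, embedding size, and hidden size are placeholder assumptions, and the authors' actual implementation may differ.

```python
# Hypothetical Siamese MaLSTM paraphrase scorer (PyTorch); illustrative only.
import torch
import torch.nn as nn

class SiameseMaLSTM(nn.Module):
    def __init__(self, vocab_size=20000, embed_dim=300, hidden_dim=50):
        super().__init__()
        # Both branches share the same embedding and LSTM weights.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def encode(self, token_ids):
        # token_ids: (batch, seq_len) -> final hidden state (batch, hidden_dim)
        _, (h_n, _) = self.lstm(self.embedding(token_ids))
        return h_n.squeeze(0)

    def forward(self, left_ids, right_ids):
        h_left = self.encode(left_ids)
        h_right = self.encode(right_ids)
        # MaLSTM similarity: exp(-||h_left - h_right||_1), a score in (0, 1].
        manhattan = torch.sum(torch.abs(h_left - h_right), dim=1)
        return torch.exp(-manhattan)

# Toy usage with random padded token-id sequences for two sentence batches.
model = SiameseMaLSTM()
left = torch.randint(1, 20000, (4, 12))
right = torch.randint(1, 20000, (4, 12))
scores = model(left, right)  # values near 1.0 suggest "paraphrase"
print(scores.shape)          # torch.Size([4])
```

For the Random Forest side, the 21 features are not enumerated in this abstract; a comparable baseline would feed per-pair features (for example, string-similarity scores such as those produced by the Harry tool listed in the keywords) into an off-the-shelf classifier such as scikit-learn's RandomForestClassifier.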

Author Biographies

Myint Myint Htay, University of Technology (Yatanarpon Cyber City), Pyin Oo Lwin, Myanmar

Myint Myint Htay is a Ph.D. candidate at the University of Technology (Yatanarpon Cyber City), Pyin Oo Lwin, and a member of the Faculty of Computer Science, UCS (Monywa), Myanmar. Her current doctoral thesis research focuses on machine translation of Burmese paraphrases. She is interested in the research areas of natural language processing (NLP), big data analysis, and deep learning.

Ye Kyaw Thu, CADT

Ye Kyaw Thu is a Visiting Professor with the Language & Semantic Technology Research Team (LST), Artificial Intelligence Research Unit (AINRU), National Electronic & Computer Technology Center (NECTEC), Thailand, and an Affiliate Professor at the Cambodia Academy of Digital Technology (CADT), Cambodia. He is also the founder of the Language Understanding Lab., Myanmar. His research lies in the fields of artificial intelligence (AI), natural language processing (NLP), and human-computer interaction (HCI). He actively supervises and co-supervises undergraduate, master’s, and doctoral students at several universities, including Assumption University (AU), Kasetsart University (KU), King Mongkut’s Institute of Technology Ladkrabang (KMITL), and the Sirindhorn International Institute of Technology (SIIT).

Hnin Aye Thant, The University of Technology (Yatanarpon Cyber City), Pyin Oo Lwin, Myanmar

Hnin Aye Thant is currently working as a Professor and Head of the Department of Information Science at the University of Technology (Yatanarpon Cyber City), Pyin Oo Lwin Township, Mandalay Division, Myanmar. She received her Ph.D. (IT) degree from the University of Computer Studies, Yangon, Myanmar in 2005. Her current responsibilities are managing professional teachers, instructional design for e-learning content development, and teaching. She has 14 years of teaching experience in Information Technology, specializing in programming languages (C, C++, Java and Assembly), data structures, design and analysis of algorithms/parallel algorithms, database management systems, web application development, operating systems, data mining, and natural language processing. She is a member of the research group on “Neural Network Machine Translation between Myanmar Sign Language to Myanmar Written Text” and of the Myanmar NLP Lab at UTYCC. She is also a Master Instructor and Coaching Expert of the USAID COMET Mekong Learning Center, and has trained 190 instructors from ten Technological Universities, twelve Computer Universities, and UTYCC in a professional development course on transforming the teacher-centered approach into a learner-centered approach. This model aims to reduce the skills gap between universities and industry and to develop students’ work-readiness skills.

Thepchai Supnithi, National Electronic & Computer Technology Center (NECTEC)

Thepchai Supnithi received the B.S. degree in Mathematics from Chulalongkorn University in 1992, and the M.S. and Ph.D. degrees in Engineering from Osaka University in 1997 and 2001, respectively. He is currently head of the Language and Semantic Technology Research Team, Artificial Intelligence Research Unit, NECTEC, Thailand.



Published

2024-02-08

How to Cite

Htay MM, Kyaw Thu Y, Aye Thant H, Supnithi T. Deep Siamese Neural Network Vs Random Forest for Myanmar Language Paraphrase Classification. j.intell.inform. [Internet]. 2024 Feb. 8 [cited 2024 Dec. 23];8(Oct). Available from: https://ph05.tci-thaijo.org/index.php/JIIST/article/view/98