Deep Siamese Neural Network Vs Random Forest for Myanmar Language Paraphrase Classification
Keywords:
Semantic Text Similarity, Burmese (Myanmar Language), Deep Siamese Neural Network, Random Forest Modeling, Manhattan LSTM (MaLSTM), Harry Tool

Abstract
Paraphrase detection, or semantic similarity measurement, generally requires understanding a sentence as a whole rather than merely finding synonyms of its words. It is an important research area in natural language processing that plays a significant role in many applications such as question answering, summarization, information retrieval, and information extraction. To the best of our knowledge, no prior studies have addressed paraphrase detection and classification for Burmese (Myanmar language). In this paper, we compare the results of Burmese paraphrase classification with a Deep Siamese Neural Network using MaLSTM (Manhattan LSTM) and with a Random Forest classifier using 21 features. A further contribution of this paper is the development of a human-annotated corpus of Burmese paraphrase and non-paraphrase sentence pairs, containing 40,461 pairs, together with open-test data of 1,000 sentence pairs. According to our experiments, the Random Forest classifier is more accurate and more useful for Burmese paraphrase classification than the Deep Siamese Neural Network, even with limited data.
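The MaLSTM similarity at the heart of the Siamese model can be sketched as follows. This is a minimal illustration, not the paper's implementation: the shared LSTM encoder is omitted, and the example vectors `h_a` and `h_b` stand in for hypothetical sentence encodings; only the exp of the negative Manhattan (L1) distance is the MaLSTM formulation itself.

```python
import numpy as np

def malstm_similarity(h_a, h_b):
    """MaLSTM similarity: exp(-||h_a - h_b||_1), a score in (0, 1].

    h_a and h_b are the fixed-size encodings that a shared LSTM
    would produce for the two sentences of a pair (encoder omitted
    in this sketch).
    """
    h_a = np.asarray(h_a, dtype=float)
    h_b = np.asarray(h_b, dtype=float)
    return float(np.exp(-np.sum(np.abs(h_a - h_b))))

# Identical encodings give similarity exactly 1.0; the larger the
# Manhattan distance between encodings, the closer the score is to 0.
print(malstm_similarity([0.2, -0.5, 1.0], [0.2, -0.5, 1.0]))
print(malstm_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))
```

Thresholding this score (or feeding it to a classifier) yields a paraphrase/non-paraphrase decision, which is what the comparison against the 21-feature Random Forest evaluates.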
References
V. Rus, M. C. Lintean, R. Banjade, N. B. Niraula, and D. Stefanescu, "SEMILAR: The semantic similarity toolkit", in ACL, 2013.
G. Majumder, P. Pakray, A. F. Gelbukh, and D. Pinto, "Semantic textual similarity methods, tools, and applications: A survey", Computación y Sistemas, 2016.
R. Gupta, H. Bechara, and C. Orasan, "Intelligent translation memory matching and retrieval metric exploiting linguistic technology", Proceedings of Translating and the Computer, 36, pp. 86–89.
H. Béchara, C. Orasan, H. Costa, S. Taslimipoor, R. Gupta, G. C. Pastor, and R. Mitkov, "MiniExperts: An SVM approach for measuring semantic textual similarity", in SemEval@NAACL-HLT, 2015.
J. V. A. Souza, L. E. S. E. Oliveira, Y. B. Gumiel, D. R. Carvalho, and C. M. C. Moro, "Exploiting Siamese Neural Networks on Short Text Similarity Tasks for Multiple Domains and Languages", in P. Quaresma et al. (Eds.): PROPOR 2020, LNAI 12037, pp. 357–367, 2020.
J. Bromley, Y. LeCun, I. Guyon, E. Säckinger, and R. Shah, "Signature verification using a Siamese time delay neural network", IJPRAI, 7, pp. 669–688, 1993.
G. Koch, R. Zemel, and R. Salakhutdinov, "Siamese neural networks for one-shot image recognition", in ICML Deep Learning Workshop, volume 2, 2015.
P. Neculoiu, M. Versteegh, and M. Rotaru, "Learning text similarity with Siamese recurrent networks", in ACL, 2016.
H. He, K. Gimpel, and J. Lin, "Multi-perspective sentence similarity modeling with convolutional neural networks", in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17–21, 2015, pp. 1576–1586.
J. Mueller and A. Thyagarajan, "Siamese recurrent architectures for learning sentence similarity", in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, pp. 2786–2792, AAAI Press, 2016.
B. Rychalska, K. Pakulska, K. Chodorowska, W. Walczak, and P. Andruszkiewicz, "Samsung Poland NLP Team at SemEval-2016 Task 1: Necessity for diversity; combining recursive autoencoders, WordNet and ensemble methods to measure semantic similarity", in SemEval@NAACL-HLT, 2016.
Y. Kim, "Convolutional neural networks for sentence classification", in Proceedings of EMNLP 2014, October 25–29, 2014, Doha, Qatar, pp. 1746–1751.
R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, "Natural language processing (almost) from scratch", Journal of Machine Learning Research, 12, pp. 2493–2537, 2011.
Ye Kyaw Thu, "myWord: Syllable, Word and Phrase Segmenter for Burmese", Sept 2021. GitHub: https://github.com/ye-kyaw-thu/myWord
A. Y. Ichida, F. Meneguzzi, and D. D. Ruiz, "Measuring Semantic Similarity Between Sentences Using a Siamese Neural Network".
Z. Chen, H. Zhang, X. Zhang, and L. Zhao, “Quora Question Pairs”, pp. 1–7, 2017.
E. L. Pontes, S. Huet, A. C. Linhares, and J.-M. Torres-Moreno, "Predicting the Semantic Textual Similarity with Siamese CNN and LSTM".
T. Mikolov, M. Karafiát, L. Burget, and S. Khudanpur, "Recurrent neural network based language model", in Proceedings of the 11th Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 1045–1048, Sept 2010.
https://github.com/strohne/Facepager/releases/
https://my.wiktionary.org/wiki/
M. M. Htay, Y. K. Thu, H. A. Thant, and T. Supnithi, "Statistical Machine Translation for Myanmar Language Paraphrase Generation", Proceedings of the 2020 15th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), pp. 255–260, 2020.
N. Othman, R. Faiz, and K. Smaïli, "Manhattan Siamese LSTM for Question Retrieval in Community Question Answering", conference paper, 2019.
Z. Tang and J. Li, “Jointly Considering Siamese Network and MatchPyramid Network for Text Semantic Matching”. SAMSE 2018.
W. Bao, J. Du, Y. Yang, and X. Zhao, "Attentive Siamese LSTM Network for Semantic Textual Similarity Measure", 2018 International Conference on Asian Language Processing (IALP), 2018.
K. Rieck and C. Wressnegger, "Harry: A Tool for Measuring String Similarity", Journal of Machine Learning Research, 17, pp. 1–5, 2016.
C. Bakshi, https://levelup.gitconnected.com/random-forest-regression-209c0f354c84
https://wiki.pathmind.com/word2vec
https://blogs.sap.com/2019/07/03/glove-and-fasttext-two-popular-word-vector-models-in-nlp/