Please use this identifier to cite or link to this item: https://www.um.edu.mt/library/oar/handle/123456789/137283
Title: Cross-lingual transfer learning with Persian
Authors: Mollanorozy, Sepideh
Keywords: Persian language
Transfer learning (Machine learning)
Sentiment analysis
Issue Date: 2022
Citation: Mollanorozy, S. (2022). Cross-lingual transfer learning with Persian (Master's dissertation).
Abstract: Cross-lingual transfer learning (CLTL) is a technique that facilitates training a model for low-resource languages. Recently, this approach has been carried out with multilingual pre-trained models such as XLM-RoBERTa. Because of data availability, English is usually chosen as the source language; for POS tagging, however, data is available for many languages thanks to the Universal Dependencies dataset. Recent studies on CLTL for POS tagging with the XLM-RoBERTa model show that performance at test time improves significantly when the low-resource language is similar to a pre-training or fine-tuning language of the model. In this study, we focus on the Persian language and analyze the results of previous studies to investigate whether language similarity is influential in this case as well. Our analysis suggests that it is not. We also attempt to find the languages that are a good match with Persian for POS tagging, and identify Kurmanji and Tagalog as low-resource languages that benefit from Persian as the source language. We use the WALS linguistic features dataset to find the features Persian has in common with other languages; our results show that Persian and Hindi share the most, around half of their features. We also conduct POS tagging experiments using the ParsBERT model, with Persian as the source language, and find that ParsBERT does not outperform XLM-RoBERTa. To investigate whether CLTL with Persian is task-dependent, we perform sentiment analysis in addition to our POS tagging analysis, gathering data for 30 target languages. We fine-tune the XLM-RoBERTa model on Persian data and test it on the target languages. Only Polish and Bulgarian appear among the top 10 highest-scoring languages for both POS tagging and sentiment analysis; based on our study, CLTL with Persian is therefore task-dependent. However, further work with more languages and more tasks is encouraged.
Description: M.Sc. (HLST)(Melit.)
URI: https://www.um.edu.mt/library/oar/handle/123456789/137283
Appears in Collections: Dissertations - FacICT - 2022
Dissertations - FacICTAI - 2022

Files in This Item:
File: 2318ICTCSA531005075346_1.PDF (Restricted Access)
Size: 2.71 MB
Format: Adobe PDF


Items in OAR@UM are protected by copyright, with all rights reserved, unless otherwise indicated.