Please use this identifier to cite or link to this item:
https://www.um.edu.mt/library/oar/handle/123456789/137283
| Title: | Cross-lingual transfer learning with Persian |
| Authors: | Mollanorozy, Sepideh (2022) |
| Keywords: | Persian language; Transfer learning (Machine learning); Sentiment analysis |
| Issue Date: | 2022 |
| Citation: | Mollanorozy, S. (2022). Cross-lingual transfer learning with Persian (Master's dissertation). |
| Abstract: | Cross-lingual transfer learning (CLTL) is a technique that facilitates training a model for low-resource languages. Recently, this approach has been performed using multilingual pre-trained models such as XLM-RoBERTa. Owing to data availability, English is usually chosen as the source language. For POS tagging, however, data is available for many languages thanks to the Universal Dependencies dataset. Recent studies on CLTL for POS tagging with the XLM-RoBERTa model show that performance at test time improves significantly when the low-resource language is similar to a pre-training or fine-tuning language of the model. In this study, we focus on the Persian language and analyse the results of previous studies to investigate whether language similarity is influential in this case as well. Our analysis suggests that it is not. We also attempt to find the languages that are a good match with Persian for POS tagging, and identify Kurmanji and Tagalog as low-resource languages that benefit from Persian as the source language. We use the WALS linguistic features dataset to find the features that Persian shares with other languages; our results show that Persian and Hindi share the most, around half of their features. We also conduct POS tagging experiments using the ParsBERT model, with Persian as the source language; ParsBERT does not outperform XLM-RoBERTa. Finally, we investigate whether CLTL with Persian is task-dependent by performing sentiment analysis in addition to our POS tagging analysis. To this end, we gather data for 30 target languages for the sentiment analysis task, fine-tune the XLM-RoBERTa model with Persian data, and test it on the target languages. Only Polish and Bulgarian appear among the top 10 highest-scoring languages for both POS tagging and sentiment analysis. Therefore, based on our study, CLTL with Persian is task-dependent. However, further work with more languages and more tasks is encouraged. |
| Description: | M.Sc. (HLST)(Melit.) |
| URI: | https://www.um.edu.mt/library/oar/handle/123456789/137283 |
| Appears in Collections: | Dissertations - FacICT - 2022; Dissertations - FacICTAI - 2022 |
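
The Persian–Hindi figure in the abstract comes from counting shared WALS features. As an illustration only (not the dissertation's actual code), the sketch below computes, for every language in a WALS export, the fraction of commonly coded features whose values match Persian's. The file name and column layout are assumptions; the real dataset is distributed at https://wals.info.

```python
import csv

# Hypothetical export: one row per language, a "Name" column, and one
# column per WALS feature (an empty cell means the feature is not coded).
WALS_CSV = "wals_languages.csv"  # placeholder filename

def load_features(path):
    """Map language name -> {feature: value} over non-empty cells."""
    features = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            name = row.pop("Name")
            features[name] = {k: v for k, v in row.items() if v}
    return features

def shared_fraction(a, b):
    """Fraction of features coded in both languages whose values agree."""
    common = set(a) & set(b)
    return sum(a[k] == b[k] for k in common) / len(common) if common else 0.0

feats = load_features(WALS_CSV)
persian = feats["Persian"]
ranking = sorted(
    ((lang, shared_fraction(persian, other))
     for lang, other in feats.items() if lang != "Persian"),
    key=lambda x: x[1], reverse=True)
for lang, score in ranking[:10]:
    print(f"{lang}: {score:.2f}")  # the abstract reports Hindi highest, ~0.5
```

Other overlap definitions are possible (e.g. counting any commonly coded feature rather than matching values); the abstract does not specify which was used.
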
Files in This Item:
| File | Description | Size | Format |
|---|---|---|---|
| 2318ICTCSA531005075346_1.PDF | Restricted Access | 2.71 MB | Adobe PDF |
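
The transfer setup described in the abstract (fine-tune XLM-RoBERTa on Persian data, then test zero-shot on target languages) can be sketched with the Hugging Face transformers API. This is a minimal sketch, not the dissertation's code: the checkpoint name is a placeholder for a model already fine-tuned on Persian Universal Dependencies POS data, and sub-token alignment follows the common first-sub-token convention.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Placeholder name: assumes an XLM-RoBERTa checkpoint already fine-tuned
# for POS tagging on Persian Universal Dependencies data.
CHECKPOINT = "xlm-roberta-base-finetuned-fa-pos"

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForTokenClassification.from_pretrained(CHECKPOINT)
model.eval()

def tag(sentence):
    """Predict one POS label per word (zero-shot for non-Persian input)."""
    words = sentence.split()
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits[0]
    preds = logits.argmax(-1).tolist()
    # Keep the prediction of each word's first sub-token only.
    labels, seen = [], set()
    for idx, word_id in enumerate(enc.word_ids()):
        if word_id is not None and word_id not in seen:
            seen.add(word_id)
            labels.append(model.config.id2label[preds[idx]])
    return list(zip(words, labels))

# Zero-shot transfer: the model never saw Kurmanji during fine-tuning.
print(tag("Ez diçim malê"))
```

The sentiment analysis experiments follow the same pattern, with AutoModelForSequenceClassification and one label per sentence instead of one per token.
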
Items in OAR@UM are protected by copyright, with all rights reserved, unless otherwise indicated.
