OAR@UM Collection: /library/oar/handle/123456789/107554 (retrieved 2025-11-06T07:43:13Z)

Handle: /library/oar/handle/123456789/137285 (2025-07-16T09:25:08Z)
Title: Highlight detection in live streams using audience reactions with transformer language models
Abstract: Livestreaming of e-sports events has become very popular in recent years, with millions of people watching livestreams of competitions and commenting synchronously in chat rooms. The rise of e-sports has created a demand for match highlight videos. These videos, which consist of the most exciting moments of a match, help followers of the sport stay up to date with or relive past games. Since creating them manually is time-intensive, automatic and semi-automatic approaches to highlight detection in live streams have been devised. In this work, we propose a novel transformer-based approach to highlight detection. We employ the audience reactions found in live-stream chat to find gripping segments of livestreams. To this end, we suggest an approach that combines contextual transformer embeddings with additional temporal features of the chat. We pre-train a language model for the domain of live-stream chat for the game League of Legends and apply it to this task. To train this transformer language model, we collect a corpus from a popular livestreaming platform containing audience reactions to competitive League of Legends matches. With our new model, we achieve an improvement of 0.01 F-score over the state of the art. We provide a new corpus for the domain and make available our pre-trained language model, which we call TwitchLeagueBert.
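The abstract mentions temporal features of the chat as a signal for exciting segments. The thesis does not specify these features here, but a common, minimal proxy is a spike in chat-message rate. The sketch below, with a hypothetical `message_rate_spikes` helper and illustrative window/threshold parameters, shows the idea: bucket message timestamps into fixed windows and flag windows whose message count far exceeds the mean.

```python
from collections import Counter

def message_rate_spikes(timestamps, window=10.0, factor=2.0):
    """Flag window start times (in seconds) whose chat-message count
    exceeds `factor` times the mean per-window count, as a crude
    proxy for audience excitement. Purely illustrative; the thesis
    combines such temporal features with transformer embeddings."""
    if not timestamps:
        return []
    # Bucket messages into fixed-width time windows.
    buckets = Counter(int(t // window) for t in timestamps)
    mean_count = sum(buckets.values()) / len(buckets)
    # Keep windows that are unusually busy relative to the mean.
    return sorted(b * window for b, c in buckets.items() if c > factor * mean_count)
```

In a full system these flagged windows would only be one input feature; the model described above fuses them with contextual embeddings of the chat messages themselves.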
Description: M.Sc. (HLST)(Melit.), 2022

Handle: /library/oar/handle/123456789/137284 (2025-07-16T09:22:21Z)
Title: On the cusp of comprehensibility : can language models distinguish between metaphors and nonsense?
Abstract: Utterly creative texts can sometimes be difficult to understand, balancing on the edge of comprehensibility. However, good language skills and common sense allow advanced language users both to interpret creative texts and to reject some linguistic input as nonsense. The goal of this thesis is to evaluate whether current language models are also able to make the distinction between creative language use, namely (unconventional) metaphors, and nonsense. To test this, the mean rank and pseudo-log-likelihood (PLL) score of metaphorical and nonsensical sentences were computed, and several pre-trained models (BERT, RoBERTa) were fine-tuned for binary classification between the two categories. There was a significant difference between the categories in both mean rank and PLL score, and the classifiers reached around 70.0%–85.5% accuracy, close to the 87% accuracy of the human baseline. This satisfactory performance suggests that it is already possible to train current language models to distinguish between metaphors and nonsense. It also raises further questions about the characteristics of metaphorical and nonsensical sentences that make the successful classification possible.
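The PLL score used above is typically computed by masking each token position in turn and summing the masked language model's log-probability of the true token. A minimal sketch of that loop, assuming a hypothetical `masked_logprob(tokens, i)` callable that wraps a masked LM such as BERT (the wiring to an actual model is omitted here):

```python
def pseudo_log_likelihood(tokens, masked_logprob):
    """Pseudo-log-likelihood of a tokenized sentence: the sum over
    positions i of the log-probability of tokens[i] when position i
    is masked. `masked_logprob` is assumed to query a masked language
    model; any callable with that signature works for illustration."""
    return sum(masked_logprob(tokens, i) for i in range(len(tokens)))
```

Higher (less negative) PLL means the model finds the sentence more plausible, which is what lets the score separate metaphorical sentences from nonsensical ones.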
Description: M.Sc. (HLST)(Melit.), 2022

Handle: /library/oar/handle/123456789/137283 (2025-07-16T09:19:29Z)
Title: Cross-lingual transfer learning with Persian
Abstract: Cross-lingual transfer learning (CLTL) is a technique that facilitates training a model for low-resource languages. Recently, this approach has been performed using multilingual pre-trained models such as XLM-RoBERTa. Owing to data availability, English is usually the source language. However, thanks to the Universal Dependencies dataset, POS-tagging data is available for many languages. Recent studies on CLTL for POS tagging with XLM-RoBERTa show that performance at test time improves significantly when the low-resource language is similar to a pre-training or fine-tuning language of the model. In this study, we focus on Persian and analyse the results of previous studies to investigate whether language similarity is influential in this case as well. Our analysis suggests that it is not. We also attempt to find the languages that are a good match with Persian for POS tagging, and identify Kurmanji and Tagalog as low-resource languages that benefit from Persian as the source language. We use the WALS linguistic features dataset to find the features Persian shares with other languages; our results show that Persian and Hindi share the most, around half of their features. We also conduct POS-tagging experiments using the ParsBERT model, with Persian as the source language, and find that ParsBERT does not outperform XLM-RoBERTa. Finally, we investigate whether CLTL with Persian is task-dependent by performing sentiment analysis in addition to our POS-tagging analysis. To this end, we gather data for 30 target languages for the sentiment analysis task, fine-tune XLM-RoBERTa with Persian data, and test it on the target languages. Only Polish and Bulgarian appear among the top 10 highest-scoring languages for both POS tagging and sentiment analysis. Based on our study, therefore, CLTL with Persian is task-dependent. However, further work with more languages and more tasks is encouraged.
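The WALS comparison described above amounts to counting, over the features recorded for both languages, how often their values agree. A minimal sketch, with languages represented as feature-to-value dictionaries (the feature names and values below are illustrative toys, not actual WALS entries):

```python
def shared_wals_features(lang_a, lang_b):
    """Fraction of typological features on which two languages agree,
    considering only features with a recorded value for both.
    Inputs are {feature_name: value} dictionaries."""
    common = set(lang_a) & set(lang_b)
    if not common:
        return 0.0
    agree = sum(1 for f in common if lang_a[f] == lang_b[f])
    return agree / len(common)
```

Applied over all WALS languages, the pair with the highest fraction would be the "most similar" match; the study reports Persian and Hindi sharing around half of their features under such a comparison.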
Description: M.Sc. (HLST)(Melit.), 2022

Handle: /library/oar/handle/123456789/137282 (2025-07-16T09:15:56Z)
Title: How does machine translation affect language? : analyzing the effect of machine translation on translated texts
Abstract: This Master's thesis analyses the effect of neural machine translation on the language of the translation in terms of lexical, morphological, and syntactic diversity or richness. Four neural machine translation models are trained. Two corpora of similar length and domain, one of which was created in this work, are used to train and evaluate the models, as well as to translate text. Two language pairs were used in both directions: English–Spanish and English–Croatian. Regarding lexical richness, the majority of our results indicate a degree of lexical loss in the translations; one metric shows a gain of lexical diversity in one of the translations. For morphological richness the results are less clear, with most metrics showing slight to no loss, or even a gain of richness in two of the translations. Part-of-speech distribution analysis, as well as parse distribution analysis, both seem to confirm earlier claims that neural machine translation systems increase the frequency of the most frequent items and decrease the frequency of the least frequent ones.
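The abstract does not name its lexical-richness metrics, but the simplest of the family is the type-token ratio (TTR), often computed in a moving window (MATTR) to reduce its sensitivity to text length. A sketch of both, assuming pre-tokenized input:

```python
def type_token_ratio(tokens):
    """Distinct word types divided by total tokens; a basic
    lexical-diversity measure (higher = richer vocabulary)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def mattr(tokens, window=50):
    """Moving-average TTR: mean TTR over all windows of fixed size,
    which makes the score comparable across texts of different lengths."""
    if len(tokens) <= window:
        return type_token_ratio(tokens)
    ratios = [type_token_ratio(tokens[i:i + window])
              for i in range(len(tokens) - window + 1)]
    return sum(ratios) / len(ratios)
```

Comparing such scores between source texts and their machine translations is one way to quantify the "lexical loss" the thesis reports.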
Description: M.Sc. (HLST)(Melit.), 2022