OAR@UM Community:

Sparse dynamic principal components analysis in the frequency domain

2025-10-22T09:22:01Z

Title: Sparse dynamic principal components analysis in the frequency domain Authors: Attard, Matt; Suda, David Paul; Sammut, Fiona Abstract: The main focus of this paper is the sparsity treatment of dynamic principal components analysis (DPCA), which is an extension of principal components analysis (PCA) to the time series setting. Several sparse extensions of PCA for the high-dimensional data setting have been introduced in the past two decades. However, peer-reviewed literature addressing high-dimensionality in the DPCA setting remains scarce. This study addresses the high-dimensionality problem for the frequency-domain variant of DPCA, which replicates the classical dynamic approach on cross-spectra, the frequency-domain analogue of the variance-covariance matrix. Taking a cue from the literature on sparse PCA, this research seeks to extend these methods to frequency-domain DPCA via the cross-spectrum. The proposed method is based on sparse eigenvector extraction from cross-spectral matrices with the ℓ1-penalty. Some preliminary results based on simulated data are presented, and future research considerations are set out.
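The abstract does not say which algorithm is used, so the following is only a rough illustration of what sparse eigenvector extraction from a Hermitian matrix can look like, assuming the penalty referred to is the usual ℓ1 (lasso-type) penalty. It uses a soft-thresholded power iteration, a generic heuristic from the sparse PCA literature that is swapped in here for illustration and is not the authors' method. The matrix `S`, the threshold `lam`, and all function names are hypothetical.

```python
def soft_threshold(z: complex, lam: float) -> complex:
    """Shrink the modulus of z by lam while keeping its phase."""
    r = abs(z)
    return 0j if r <= lam else z * (1 - lam / r)


def normalise(v):
    """Scale a complex vector to unit Euclidean norm."""
    norm = sum(abs(x) ** 2 for x in v) ** 0.5
    return [x / norm for x in v]


def sparse_leading_eigenvector(S, lam, iters=200):
    """Power iteration on a Hermitian matrix S with soft-thresholding
    applied to the loadings after every multiplication."""
    n = len(S)
    v = normalise([1 + 0j] * n)
    for _ in range(iters):
        w = [sum(S[i][j] * v[j] for j in range(n)) for i in range(n)]
        w = [soft_threshold(x, lam) for x in w]
        if all(x == 0 for x in w):
            raise ValueError("lam too large: every loading was shrunk to zero")
        v = normalise(w)
    return v


# Hypothetical 4x4 Hermitian matrix standing in for a cross-spectral
# matrix at one frequency: series 0 and 1 are strongly coherent,
# series 2 and 3 are only weakly linked to them.
S = [
    [2, 1 + 1j, 0.05, 0],
    [1 - 1j, 2, 0, 0.05j],
    [0.05, 0, 0.5, 0],
    [0, -0.05j, 0, 0.5],
]

loadings = sparse_leading_eigenvector(S, lam=0.2)
for i, x in enumerate(loadings):
    print(i, round(abs(x), 4))
```

In this toy example the loadings on the two weakly linked series are shrunk to exactly zero, which is the kind of sparsity the penalty is meant to induce. In practice the procedure would be repeated at each frequency of interest.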

Poor performance of large language models based on the diabetes and endocrinology specialty certificate examination of the United Kingdom

2025-10-08T13:31:47Z

Title: Poor performance of large language models based on the diabetes and endocrinology specialty certificate examination of the United Kingdom Authors: Fan, Ka Siu; Gan, Jeffrey; Zou, Isabelle X.; Kaladjiska, Maja; Borg Inguanez, Monique; Garden, Gillian L. Abstract: Introduction: The medical knowledge of large language models (LLMs) has been tested using several postgraduate medical examinations. However, it has rarely been examined in diabetes and endocrinology. This study aimed to evaluate the performance of LLMs in answering multiple-choice questions using the Diabetes and Endocrinology Specialty Certificate Examination (SCE) of the United Kingdom. Methods: The official diabetes and endocrinology SCE sample questions were used to assess seven freely accessible and subscription-based commercial LLMs: ChatGPT-o1 Preview (OpenAI, USA), ChatGPT-4o (OpenAI, USA), Gemini (Google, USA), Claude-3.5 Sonnet (Anthropic, USA), Copilot (Microsoft, USA), Perplexity AI (Perplexity, USA), and Meta AI (Meta, USA). The accuracy of the LLMs was calculated by comparing outputs against sample answers. Readability metrics, including Flesch Reading Ease (FRES) and Flesch-Kincaid Grade Level (FKGL), were calculated for each response. Eighty-three questions, three of which included photographs, were entered into the LLMs without employing any prompt engineering techniques. Results: A total of 581 responses were generated and captured between August and October 2024. Performance differed significantly between models, with ChatGPT-o1 Preview achieving the highest accuracy (73%). None of the other LLMs achieved the historical pass mark of 65%, with Gemini achieving the lowest accuracy of 33%. Readability metrics also differed significantly between LLMs (p=0.004). LLMs performed better for questions without reference ranges (p<0.001). Conclusions: The performance of LLMs was generally inadequate in the diabetes and endocrinology examination. Of those tested, ChatGPT-o1 Preview achieved the highest score and is likely the most useful model to aid medical education. This may be due to it being an advanced reasoning model with a greater ability to solve complex problems. Nonetheless, continued research is needed to keep pace with the advances in LLMs.
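For readers unfamiliar with the two readability metrics named in the abstract, the sketch below uses the standard published Flesch formulas. The word, sentence, and syllable counts are made-up inputs for illustration only; the abstract does not describe how the authors counted them or which tool they used.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts for a single LLM response (not taken from the study).
words, sentences, syllables = 100, 5, 150
print(flesch_reading_ease(words, sentences, syllables))
print(flesch_kincaid_grade(words, sentences, syllables))

# Consistency check on the figures reported in the abstract:
# 83 questions put to each of 7 models.
questions, models = 83, 7
total_responses = questions * models
print(total_responses)
```

The last lines simply confirm that 83 questions across 7 models account for the 581 responses reported.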

A climate suitability index for species distribution modelling applied to terrestrial arthropods in the Mediterranean region

2025-09-03T11:11:08Z

Title: A climate suitability index for species distribution modelling applied to terrestrial arthropods in the Mediterranean region Authors: Ciarlo, James M.; Borg Inguanez, Monique; Coppola, Erika; Micallef, Aaron; Mifsud, David Abstract: Climate change poses significant threats to global biodiversity, particularly impacting arthropods due to their sensitivity to shifts in temperature and precipitation, as well as other environmental conditions. These changes impact the suitability of their habitats, alter ecological interactions, and consequently affect the distribution and survival of species. Understanding how climate variability influences the ecological niches of arthropods is crucial for predicting future biodiversity patterns and implementing effective conservation strategies. This study introduces a simple index designed to model species' distribution on the basis of their climatic niche, with a specific focus on terrestrial Mediterranean arthropods. This approach leverages regional climate model data to construct a climatology of a species' preferred habitat, based on historically observed locations. This index offers a straightforward and rapid means to assess the resilience and vulnerability of arthropod populations and could be applied in future studies aiming to shed light on how climate change could affect the fundamental niches of terrestrial arthropods. The analysis revealed that the method is most reliable for species with more than 1000 observation points and for high-resolution climate datasets (although the latter had a smaller influence on the results). This study offers a proof of concept for the proposed index, demonstrating its potential utility in guiding conservation strategies and mitigating the adverse effects of climate change on arthropod habitats.

Comparison of tree-based learning methods for fraud detection in motor insurance

2025-07-18T08:13:30Z

Title: Comparison of tree-based learning methods for fraud detection in motor insurance Authors: Suda, David; Caruana, Mark Anthony; Grima, Lorin Abstract: Fraud detection in motor insurance is investigated through the implementation and comparison of various tree-based learning methods subject to different data-balancing approaches. A dataset obtained from the insurance industry is used. The focus is on decision trees, random forests, gradient boosting machines, light gradient boosting machines and XGBoost. Because the dataset is highly imbalanced, synthetic minority oversampling and cost-sensitive learning are used to address this issue. A study aimed at comparing the two data-balancing approaches is novel in the literature, and this study concludes that cost-sensitive learning is overall superior for this application. The light gradient boosting machine using cost-sensitive learning is the most effective method, achieving a balanced accuracy of 81% and successfully identifying 83% of fraudulent cases. For the most successful approach, the primary insights into the most important features are provided. The findings of this study provide a useful evaluation of the suitability of tree-based learners in the field of insurance fraud detection, and also contribute to the current development of useful tools for correct classification and the important features to be addressed.
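As a reading aid for the reported metrics, the sketch below assumes the usual binary definition of balanced accuracy (the mean of sensitivity and specificity) and reads "identifying 83% of fraudulent cases" as the sensitivity. Under those assumptions, the two reported figures would imply a specificity of about 79%. This is a derived illustration, not a figure stated in the abstract, and the confusion-matrix counts below are hypothetical.

```python
def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of sensitivity (fraud recall) and specificity (non-fraud recall)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def implied_specificity(balanced_acc: float, sensitivity: float) -> float:
    """Solve balanced_acc = (sensitivity + specificity) / 2 for specificity."""
    return 2 * balanced_acc - sensitivity


# Hypothetical counts chosen only to reproduce the reported percentages:
# 100 fraudulent claims and 1000 legitimate claims.
print(balanced_accuracy(tp=83, fn=17, tn=790, fp=210))

# Specificity implied by the figures in the abstract, under the stated assumptions.
print(round(implied_specificity(0.81, 0.83), 2))
```

Balanced accuracy is a common choice for imbalanced problems such as fraud detection because plain accuracy can look high for a model that simply labels every claim as legitimate.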