OAR@UM Collection: /library/oar/handle/123456789/25578 2026-06-22T22:52:48Z Application and improvement of genetic algorithms and genetic programming towards the fight against spam and other internet malware /library/oar/handle/123456789/73653 Title: Application and improvement of genetic algorithms and genetic programming towards the fight against spam and other internet malware Abstract: While spam is increasingly acknowledged as a very expensive problem on the Internet, spam filters attempt to detect spam from emails with the highest possible accuracy. Reverse Polish Notation (RPN) expressions are proposed as a means to combine a range of evaluated features from emails for the detection of spam. Theoretical arguments for the use of RPN expressions applied to spam detection are proposed, together with a new RPN Block Representation. Theoretical comparisons of RPN expressions with Naïve Bayes and Support Vector Machine are also given. It is shown that such RPN expressions are more expressive. A proof that email spam detection is NP-complete is given by mapping groups of email spams onto malware virus families. Seventy-two features, ranging from Subject-line, Header-based, Message Body-based, URL-based and stylistic, have been used to evolve RPN expressions using Linear Genetic Programming. New features and specifically the application of a group of URL features to spam detection are proposed (since many spams contain links to domains which are at times even malicious). These new features are shown to be useful for spam detection theoretically and in practice. Linear Genetic Programming is a subset of Genetic Programming where chromosomes are computer programs represented using imperative computer language instructions or machine code instead of trees made up of symbolic expressions. Such machine code can encode RPN expressions. The Linear Genetic Programming system is used to “learn” an RPN expression consisting of a combination of features which can be used to detect spam. A number of feature selection algorithms are used to identify which subsets of features are most relevant to classification. The feature selection techniques Minimum Redundancy Maximum Relevance method (this filter technique finds features which are mutually far from each other while still having high “correlation” to classification), the conventional Maximum Relevance method, Principle Component Analysis and using the entire feature set are investigated using the Linear Genetic Programming system applied to the SpamAssassin Spam Corpus, which is a standard ham and spam archive. Theoretical and practical comparisons with literature results and industry open source applications are given for the proposed system. A new Block Crossover operator which helps preserve and keep intact RPN Blocks as well as a new RPN First Subexpression algorithm are also proposed. This algorithm helps to maintain a double check for spam and ham and results in improvement of detection accuracy and other metrics. The system was very demanding computationally, taking a long time to run on a supercomputer. The proposed system achieved 99.17% accuracy and an F1-score of 0.9868, which compare very well with results given in the literature as well as with performance of industrial Bogofilter and SpamAssassin Spam Filters. Description: PH.D. 2017-01-01T00:00:00Z Genetic algorithm based metaheuristic optimisation of machine learning algorithm parameters /library/oar/handle/123456789/27791 Title: Genetic algorithm based metaheuristic optimisation of machine learning algorithm parameters Abstract: through design and experimentation. The search for optimal or near optimal algorithm configuration parameters is one way to improve performance and presents a research area of significant interest not only to researchers in the field, but also to developers of new algorithms and data analysts in diverse scientific fields. This optimisation process involves a search of very often a large, possibly infinite, unique and irregular parameter search space which renders the search for optimality or near optimality difficult in feasible time. Metaheuristic algorithms with properties, such as exploration, exploitation, use of acquired knowledge gained and stochastic elements, render themselves them a suitable class of solutions for this problem. Various studies have explored different metaheuristic or meta-optimisation approaches in the search for ever better machine learner optimisers and more generally applicable ones. The aim of this study was thus to explore and validate the use of a Genetic Algorithm based approach, often selected in studies of specific optimisation problems, as a generally applicable solution to the Machine Learning Algorithm Parameter optimisation problem. The Simple Genetic Algorithm (SGA), in particular, was chosen as the focus for this study because of its clear metaheuristic qualities and its relatively simple and well known evolutionary architecture. The method used to measure the performance of the SGA as a general metaoptimiser was involved the carrying out of a series of experiments, in which the SGA and other meta-optimiser algorithms were applied to the optimisation of a select test base of machine learner algorithms and datasets, with different characteristics and under various experimental conditions. In a novel take over other studies, a close measured look was also taken at the efficiency and effectiveness of the SGA in the meta-optimisation process. This necessitated the design and setup of a significant number of different experiment runs in a set of phased studies. These experiments were run using multiple fold cross-validation at machine learner and meta-optimiser levels for statistical validity, involving around 200 million machine learner evaluations executed over a set of ten standard workstations. The duration of evaluations of individual configurations ranged from a few milliseconds to over one and a half hours. An automated system was designed and developed by the author to run the planned experimentation and gather valuable performance data for subsequent analysis and reporting. The results showed that there were consistent, though not statistically significant, indications that the SGA was on average a good, though not optimal optimiser of machine learning algorithm parameters over the selected test base. It was also found that without tuning, the SGA suffered from inefficiencies which reduced its overall effectiveness. Other secondary results and methodological insights obtained include: 1. the visualisation and measurement of the different accuracy parameter space landscapes, 2. the measurement of the overheads and inefficiencies of search, 3. the effect of evaluation time on machine learner performance, 4. the effect of dataset size and machine learner processing costs on metaoptimiser performance, 5. the effect of meta-optimiser tuning on its performance, 6. the value of low-cost pre-optimisation exploration of the optimisation problem, and 7. the general applicability of the measures developed or adapted for this study. Two meta-optimiser algorithms were developed for comparative analysis. One was based on Iterated Local search and employed a sampling of the parameter space neighbourhood of the current candidate for the local search. The other was a hybrid SGA with refinement of the epoch's best candidate through local search using the above neighbourhood sampling. The contribution of this study lies in the sum of its outcomes and the potential it holds for further research opportunities. Description: PH.D.IT 2017-01-01T00:00:00Z Link prediction in social graph databases /library/oar/handle/123456789/25817 Title: Link prediction in social graph databases Abstract: The graph structure is widely researched due to its significance in various areas. The topology of such graphs offers a basis of pattern and knowledge extraction by using various mining algorithms. Domains like social activity, continued to give rise to the popularity of graphs, especially in the field of predictive analysis. The dynamic nature of social graphs motivated various researchers to anticipate the evolution of these networks through time. Predicting the likelihood of social interactions formulating at a future time period is based on the Link Prediction Problem. Realizing these links, help to discover future and hidden activities which are very useful for different sectors. A selection of social activities and interactions are not only dynamic, but their strength and reach evolve over time too. Due to this, a number of studies suggest that considering time as an additional dimension improves the results obtained in link prediction. This study evaluates the effect of this consideration by comparing results obtained from static methods with those returned from temporal methods. A supervised binary classification technique is used on three different social datasets with features describing popular graph metrics representing the similarities and proximities between nodes. This study also proposes and implements a method to assign time-based weights which describe the activeness of the network nodes based on how recent their adjacent interactions are. Various performance measures such as accuracy, precision, and recall are used to aid with the comparative analysis of the results. The results of this study show that the consideration of time-based aspects helps improve the link predictions. The Katz metric yielded the best performance when compared to the other graph metrics. This result on one of the datasets managed to correctly classify seventeen additional links when the time-based method is used. Description: M.SC.IT 2017-01-01T00:00:00Z E-voting system using blockchain technology /library/oar/handle/123456789/25816 Title: E-voting system using blockchain technology Abstract: In recent years, cryptocurrencies have revolutionised the financial world with the introduction of Bitcoin. The Bitcoin operates in a decentralised network, where everyone is an owner, and no central server exists. This decentralised network is called a blockchain, and since its conception has evolved beyond its primary use. The idea of decentralised applications gave birth to new and innovative ways of designing applications. This idea continued growing with the introduction of smart contracts on another the Ethereum blockchain, giving new possibilities to developers. This dissertation provides one such application of blockchain technology, which area is blockchain based electronic voting. This application is analysed in the context of Malta and is aimed at replacing the currently used paper-based vote casting system. The system is designed under the constraint that the users have to be physically present at the polling booths The voting system is finally evaluated in terms of security and efficiency. Description: B.SC.IT(HONS) 2017-01-01T00:00:00Z