OAR@UM Collection:
/library/oar/handle/123456789/25578
2026-06-22T22:52:48ZApplication and improvement of genetic algorithms and genetic programming towards the fight against spam and other internet malware
/library/oar/handle/123456789/73653
Title: Application and improvement of genetic algorithms and genetic programming towards the fight against spam and other internet malware
Abstract: While spam is increasingly acknowledged as a very expensive problem on the Internet,
spam filters attempt to detect spam from emails with the highest possible accuracy.
Reverse Polish Notation (RPN) expressions are proposed as a means to combine a range
of evaluated features from emails for the detection of spam. Theoretical arguments for
the use of RPN expressions applied to spam detection are proposed, together with a new
RPN Block Representation. Theoretical comparisons of RPN expressions with Naïve
Bayes and Support Vector Machine are also given. It is shown that such RPN expressions
are more expressive. A proof that email spam detection is NP-complete is given by
mapping groups of email spams onto malware virus families. Seventy-two features,
ranging from Subject-line, Header-based, Message Body-based, URL-based and
stylistic, have been used to evolve RPN expressions using Linear Genetic Programming.
New features and specifically the application of a group of URL features to spam
detection are proposed (since many spams contain links to domains which are at times
even malicious). These new features are shown to be useful for spam detection
theoretically and in practice. Linear Genetic Programming is a subset of Genetic
Programming where chromosomes are computer programs represented using imperative
computer language instructions or machine code instead of trees made up of symbolic
expressions. Such machine code can encode RPN expressions. The Linear Genetic
Programming system is used to “learn” an RPN expression consisting of a combination
of features which can be used to detect spam.
A number of feature selection algorithms are used to identify which subsets of features
are most relevant to classification. The feature selection techniques Minimum
Redundancy Maximum Relevance method (this filter technique finds features which are
mutually far from each other while still having high “correlation” to classification), the
conventional Maximum Relevance method, Principle Component Analysis and using
the entire feature set are investigated using the Linear Genetic Programming system
applied to the SpamAssassin Spam Corpus, which is a standard ham and spam archive.
Theoretical and practical comparisons with literature results and industry open source
applications are given for the proposed system.
A new Block Crossover operator which helps preserve and keep intact RPN Blocks as
well as a new RPN First Subexpression algorithm are also proposed. This algorithm
helps to maintain a double check for spam and ham and results in improvement of
detection accuracy and other metrics. The system was very demanding computationally,
taking a long time to run on a supercomputer. The proposed system achieved 99.17%
accuracy and an F1-score of 0.9868, which compare very well with results given in the
literature as well as with performance of industrial Bogofilter and SpamAssassin Spam
Filters.
Description: PH.D.2017-01-01T00:00:00ZGenetic algorithm based metaheuristic optimisation of machine learning algorithm parameters
/library/oar/handle/123456789/27791
Title: Genetic algorithm based metaheuristic optimisation of machine learning algorithm parameters
Abstract: through design and experimentation. The search for optimal or near optimal algorithm
configuration parameters is one way to improve performance and presents a research
area of significant interest not only to researchers in the field, but also to developers of
new algorithms and data analysts in diverse scientific fields.
This optimisation process involves a search of very often a large, possibly infinite,
unique and irregular parameter search space which renders the search for optimality or
near optimality difficult in feasible time. Metaheuristic algorithms with properties,
such as exploration, exploitation, use of acquired knowledge gained and stochastic
elements, render themselves them a suitable class of solutions for this problem.
Various studies have explored different metaheuristic or meta-optimisation approaches
in the search for ever better machine learner optimisers and more generally applicable
ones.
The aim of this study was thus to explore and validate the use of a Genetic
Algorithm based approach, often selected in studies of specific optimisation problems,
as a generally applicable solution to the Machine Learning Algorithm Parameter
optimisation problem. The Simple Genetic Algorithm (SGA), in particular, was
chosen as the focus for this study because of its clear metaheuristic qualities and its
relatively simple and well known evolutionary architecture.
The method used to measure the performance of the SGA as a general metaoptimiser
was involved the carrying out of a series of experiments, in which the SGA
and other meta-optimiser algorithms were applied to the optimisation of a select test
base of machine learner algorithms and datasets, with different characteristics and
under various experimental conditions. In a novel take over other studies, a close
measured look was also taken at the efficiency and effectiveness of the SGA in the
meta-optimisation process.
This necessitated the design and setup of a significant number of different
experiment runs in a set of phased studies. These experiments were run using multiple
fold cross-validation at machine learner and meta-optimiser levels for statistical
validity, involving around 200 million machine learner evaluations executed over a set
of ten standard workstations. The duration of evaluations of individual configurations ranged from a few milliseconds to over one and a half hours. An automated system
was designed and developed by the author to run the planned experimentation and
gather valuable performance data for subsequent analysis and reporting.
The results showed that there were consistent, though not statistically significant,
indications that the SGA was on average a good, though not optimal optimiser of
machine learning algorithm parameters over the selected test base. It was also found
that without tuning, the SGA suffered from inefficiencies which reduced its overall
effectiveness.
Other secondary results and methodological insights obtained include:
1. the visualisation and measurement of the different accuracy parameter space
landscapes,
2. the measurement of the overheads and inefficiencies of search,
3. the effect of evaluation time on machine learner performance,
4. the effect of dataset size and machine learner processing costs on metaoptimiser
performance,
5. the effect of meta-optimiser tuning on its performance,
6. the value of low-cost pre-optimisation exploration of the optimisation problem,
and
7. the general applicability of the measures developed or adapted for this study.
Two meta-optimiser algorithms were developed for comparative analysis. One was
based on Iterated Local search and employed a sampling of the parameter space
neighbourhood of the current candidate for the local search. The other was a hybrid
SGA with refinement of the epoch's best candidate through local search using the
above neighbourhood sampling.
The contribution of this study lies in the sum of its outcomes and the potential it
holds for further research opportunities.
Description: PH.D.IT2017-01-01T00:00:00ZLink prediction in social graph databases
/library/oar/handle/123456789/25817
Title: Link prediction in social graph databases
Abstract: The graph structure is widely researched due to its significance in various areas. The
topology of such graphs offers a basis of pattern and knowledge extraction by using
various mining algorithms. Domains like social activity, continued to give rise to the
popularity of graphs, especially in the field of predictive analysis. The dynamic nature
of social graphs motivated various researchers to anticipate the evolution of these
networks through time. Predicting the likelihood of social interactions formulating at a
future time period is based on the Link Prediction Problem. Realizing these links, help
to discover future and hidden activities which are very useful for different sectors. A
selection of social activities and interactions are not only dynamic, but their strength and
reach evolve over time too. Due to this, a number of studies suggest that considering
time as an additional dimension improves the results obtained in link prediction. This
study evaluates the effect of this consideration by comparing results obtained from static
methods with those returned from temporal methods. A supervised binary classification
technique is used on three different social datasets with features describing popular
graph metrics representing the similarities and proximities between nodes. This study
also proposes and implements a method to assign time-based weights which describe
the activeness of the network nodes based on how recent their adjacent interactions are.
Various performance measures such as accuracy, precision, and recall are used to aid
with the comparative analysis of the results. The results of this study show that the
consideration of time-based aspects helps improve the link predictions. The Katz metric
yielded the best performance when compared to the other graph metrics. This result on
one of the datasets managed to correctly classify seventeen additional links when the
time-based method is used.
Description: M.SC.IT2017-01-01T00:00:00ZE-voting system using blockchain technology
/library/oar/handle/123456789/25816
Title: E-voting system using blockchain technology
Abstract: In recent years, cryptocurrencies have revolutionised the financial world with the
introduction of Bitcoin. The Bitcoin operates in a decentralised network, where
everyone is an owner, and no central server exists. This decentralised network is called
a blockchain, and since its conception has evolved beyond its primary use. The idea of
decentralised applications gave birth to new and innovative ways of designing
applications. This idea continued growing with the introduction of smart contracts on
another the Ethereum blockchain, giving new possibilities to developers. This
dissertation provides one such application of blockchain technology, which area is
blockchain based electronic voting. This application is analysed in the context of Malta
and is aimed at replacing the currently used paper-based vote casting system. The
system is designed under the constraint that the users have to be physically present at
the polling booths The voting system is finally evaluated in terms of security and
efficiency.
Description: B.SC.IT(HONS)2017-01-01T00:00:00Z