Please use this identifier to cite or link to this item: /library/oar/handle/123456789/122825
Title: SADIP : semi-automated data integration system for protein databases
Authors: Aquilina, Jurgen (2022)
Keywords: Bioinformatics
Databases
SQL (Computer program language)
Issue Date: 2022
Citation: Aquilina, J. (2022). SADIP: semi-automated data integration system for protein databases (Bachelor's dissertation).
Abstract: Biologists commonly need to combine information from different biological databases by manually following cross-references (hyperlinks) and using the distinct access methods and data formats that each database provides. Past research in data integration has outlined several approaches that can integrate biological databases to provide a unified view. One such approach is data warehousing. The current state of the art in biological data warehousing requires bespoke software development and maintenance for each database. In our view, this is infeasible given the large number of constantly changing biological databases with varying access methods and data formats. This project aims to develop a tool that can automatically integrate biological information from different databases into a data warehouse, using user-defined configurations. The tool was applied to construct a property graph database with integrated information from 10 protein databases. The resulting database allows bioinformaticians to specify complex queries through the Structured Query Language (SQL). In addition, a web-based user interface was developed that provides biologists with all integrated information related to a single protein identified by a UniProtKB identifier. The results obtained for the configuration used show that developing such a tool is feasible. However, the prototype requires further amendments to improve its flexibility, robustness, and security. Further results show that the data warehouse provides biologists with a considerable amount of valuable information but should be extended to incorporate a wider variety of biological information. Finally, the results highlighted performance deficiencies for nested information and structural domains.
Description: B.Sc. IT (Hons)(Melit.)
URI: https://www.um.edu.mt/library/oar/handle/123456789/122825
Appears in Collections: Dissertations - FacICT - 2022
Dissertations - FacICTCIS - 2022

Files in This Item:
File: 2208ICTICT391305069209_1.PDF (Restricted Access)
Size: 3.09 MB
Format: Adobe PDF


Items in OAR@UM are protected by copyright, with all rights reserved, unless otherwise indicated.