Contributing towards the development of digital resources for the Maltese language

The NOMOCRAT project has come to a close, delivering resources and benchmarks for Maltese-language document analysis and OCR research and adding to the digital resources available for the Maltese language.

The project focused on extracting structured Maltese text from PDFs and scanned documents using document layout analysis (DLA) and optical character recognition (OCR) techniques.

As part of the project, the team collected over 300 Maltese-language documents and created annotated datasets for OCR, document layout analysis, and reading-order detection. Existing OCR and DLA models were evaluated on Maltese data, and fine-tuned versions of these models were developed and tested. The project also produced algorithms for paragraph reconstruction and reading-order identification, improving the accuracy of extracted text compared to conventional PDF text extraction methods.

The project demonstrated that computer-vision-based text extraction for Maltese documents is feasible and identified several areas for further work, including the creation of larger annotated datasets and improved OCR models for Maltese text. Beyond the technical achievements, the project highlighted the importance of creating high-quality Maltese digital corpora to support accessibility technologies, language technologies, AI systems, and future research in document engineering.

The work initiated through NOMOCRAT will continue to support future research in Maltese OCR and document analysis. As part of this ongoing effort, the ACM Symposium on Document Engineering 2026 will feature a competition focused on OCR, encouraging researchers and developers to keep advancing technologies for document understanding and language accessibility.

Funded by Xjenza Malta under grant REP-2024-057, the project was carried out under the principal investigation of Dr Marc Tanti from the Institute of Linguistics and Language Technology, with Prof. Alexandra Bonnici and Dr Stefania Cristina from the Department of Systems and Control Engineering serving as co-investigators.

Details about the OCR competition are available on the .

