Fusing and recommending news reports using graph-based entity-relation representations

Please use this identifier to cite or link to this item: /library/oar/handle/123456789/100948

Title:	Fusing and recommending news reports using graph-based entity-relation representations
Authors:	Azzopardi, Joel (2012)
Keywords:	News Web sites; Data mining; Computer algorithms; Journalism -- Technological innovations
Issue Date:	2012
Citation:	Azzopardi, J. (2012). Fusing and recommending news reports using graph-based entity-relation representations (Doctoral dissertation).
Abstract:	When an event occurs in the real world, news reports describing it start to appear on news sites on the World Wide Web within a few minutes. If the event is significant, numerous news reports will appear on different sites, and each report will give its own description of the event based on the sources of information available to its author. Moreover, as time passes, each news site may publish new reports related to that event that contain information that has just been discovered. For a person to obtain all the details related to a particular event, he/she will have to read through all the reports covering that event. The multitude of news reports being published continuously on the World Wide Web also exposes users to information overload. A user would need to sift through a huge number of news reports to identify those that are of interest to him/her. News aggregator web sites may cluster related news reports, but they do not attempt to fuse the reports into a single document that contains all of the pertinent information about a single event without any repetition. Such web sites also tend to display news reports chronologically, and a user who tracks an event over the course of several days must sift through them to identify previously unread material. Tracker news reports tend to repeat information that the user may have already read. Some news aggregator web sites and other web services will alert users to breaking news about types of events, or more typically, about news involving a named entity or event type (e.g. earthquake). However, the user must generally intervene to provide details of the entity or event type to track.
In this thesis, we tackle a number of research problems: in theory, a user can identify any RSS feed as a source of news he/she would like to receive; we then cluster reports about related news received from the separate RSS feeds as they arrive; we fuse the reports into a single document, trying to preserve a logical order in which sub-events occur and eliminating repetition; new reports related to an existing cluster are integrated into the fused document; the user's interaction with a fused report is monitored in such a way that information that the user has already read is summarised so that in the next visit the user can focus on the new (novel) news; a user model is maintained to automatically identify entities and event types that appear to be of interest to the user so that he/she can be automatically alerted if a related new event occurs. We have developed the JNews news portal to implement our approach and to provide an evaluation platform to measure its ability to: i) cluster related news reports from disparate sources; ii) fuse related reports into a coherent document with minimal or no repetition but preserving all the information contained in the source reports; iii) provide an adaptive reading environment that automatically summarises information in reports that have already been read; iv) automatically identify entities and event types that the user is likely to be interested in based on their past interactions with JNews to make personalised recommendations about previously unread breaking news. As we do not know the number of clusters in advance, JNews uses a modified K-Means clustering algorithm. We represent information contained in news reports using a simplified version of Sowa's Conceptual Graphs. The graph representing a news event contains entities and their relationships. Information from related news reports is merged into a single graph. We keep track of the source sentences that express the relationships.
The fused report is generated using the maximally expressive set of sentences, i.e. the sentences that contain the most information about the entities and their relationships in the news report, while ensuring that all entities and relationships are expressed in the fused document. The advantage of using a simplified conceptual graph as the logical representation is that the entities and their relationships are represented canonically. We use the same graph to extract underlying patterns in information about types of events and/or entities. If a user tends to read different fused reports about the same entity or event type then we can recommend similar breaking news to the user. In addition, we can recommend news using collaborative techniques. The user model is represented as a vector of weighted keywords. We use a summarisation technique, whereby the repetition of information across different documents is considered to be an indication of the salience of that information, to present summaries of fused reports (containing only the most important information) that have already been read by a user. All components of JNews were designed to run fast without excessive computational resources so as to function well in an operational environment and be able to handle large amounts of data. The evaluation of JNews is performed on its three main components: the Document Clustering Component, the Document Fusion Component, and the Information Filtering (recommendation) Component. The Document Clustering Component was evaluated using three different datasets. We found that our Document Clustering Component performs very well at fine-grained clustering, but rather poorly at coarser-grained clustering. The Document Fusion Component was evaluated using a set of news reports downloaded from MSNBC News that cite their sources, and also using human evaluation. We show that the Document Fusion Component is able to capture most of the information found across different source documents whilst maintaining readability. A corpus of news reports downloaded from Yahoo! News is used to evaluate the Information Filtering Component. The results obtained are better than the baseline Rocchio algorithm without negative feedback.
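The fusion step described in the abstract (merging entity-relation information from related reports into one graph, keeping track of source sentences, then choosing sentences so that every entity and relationship is expressed) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the thesis: relationships are modelled as plain (entity, relation, entity) triples, the function names `merge_reports` and `select_sentences` are hypothetical, and a greedy set-cover heuristic stands in for whatever selection procedure JNews actually uses.

```python
from collections import defaultdict


def merge_reports(reports):
    """Merge reports into one graph mapping each triple to its source sentences.

    Each report is a list of (sentence, triples) pairs, where a triple is a
    hypothetical (entity, relation, entity) tuple.
    """
    graph = defaultdict(set)
    for report in reports:
        for sentence, triples in report:
            for triple in triples:
                graph[triple].add(sentence)
    return graph


def select_sentences(graph):
    """Greedy cover (an assumed heuristic): repeatedly pick the sentence that
    expresses the most triples not yet covered, until all are covered."""
    by_sentence = defaultdict(set)
    for triple, sentences in graph.items():
        for sentence in sentences:
            by_sentence[sentence].add(triple)

    uncovered = set(graph)
    chosen = []
    while uncovered:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(by_sentence),
                   key=lambda s: len(by_sentence[s] & uncovered))
        chosen.append(best)
        uncovered -= by_sentence[best]
    return chosen


# Toy input: two reports about the same (invented) event.
report_a = [
    ("A quake hit Lima on Monday.",
     {("quake", "hit", "Lima"), ("quake", "on", "Monday")}),
    ("Officials confirmed the quake.",
     {("officials", "confirmed", "quake")}),
]
report_b = [
    ("A quake hit Lima.", {("quake", "hit", "Lima")}),
    ("The quake injured 12 people.", {("quake", "injured", "12 people")}),
]

graph = merge_reports([report_a, report_b])
fused = select_sentences(graph)
```

In this toy input the shorter sentence "A quake hit Lima." is redundant, because its only relationship is already expressed by a more informative sentence, so it is left out of `fused`. The sketch does not attempt to preserve the logical order of sub-events, which the abstract names as a separate goal.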
Description:	PH.D.ARTIFICIAL INTELLIGENCE
URI:	https://www.um.edu.mt/library/oar/handle/123456789/100948
Appears in Collections:	Dissertations - FacICT - 2012; Dissertations - FacICTAI - 2002-2014

Files in This Item:

File	Description	Size	Format
PH.D._Azzopardi Joel_2012.pdf	Restricted Access	13.43 MB	Adobe PDF
