Improving Text Retrieval Accuracy Using a Graph of Terms
Fabio Clarizia; Francesco Colace; Luca Greco; Massimo De Santo
2011
Abstract
Expanding an informational query with additional knowledge, for instance other topic-related terms, has been shown to increase the number of relevant documents returned from a Web repository. In this paper we propose a new technique that, given a small set of documents on a topic, automatically builds a query expansion using the probabilistic topic model. The expansion is based on a mixed Graph of Terms (mGT) representation composed of two levels: the conceptual level, a set of interconnected terms representing concepts (undirected edges), and the word level, the cloud of interconnected words that specify each concept (directed edges). An mGT is learnt automatically from the documents in two learning stages. We evaluated performance by comparing our search methodology with a classic one in which the query expansion consists only of the list of concepts and words in the graph, so that the relations between them are not considered. The results show that our system, independently of the topic, retrieves more relevant web pages.
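As a rough illustration of the two-level structure described in the abstract, the sketch below stores undirected concept-concept edges and directed concept-to-word edges, then contrasts a flat term list (the baseline, relations ignored) with a relation-aware list of weighted term pairs. The class name `MixedGraphOfTerms`, its methods, the example terms and all weights are assumptions made for illustration. The two learning stages and the probabilistic topic model that would produce the weights are not shown, and this is not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Tuple


@dataclass
class MixedGraphOfTerms:
    """Illustrative container for an mGT; names and weights are assumptions."""

    # Conceptual level: undirected concept-concept edges -> weight.
    concept_edges: Dict[FrozenSet[str], float] = field(default_factory=dict)
    # Word level: directed (concept, word) edges -> weight.
    word_edges: Dict[Tuple[str, str], float] = field(default_factory=dict)

    def add_concept_edge(self, a: str, b: str, weight: float) -> None:
        if a == b:
            raise ValueError("a concept edge must join two distinct concepts")
        # frozenset makes (a, b) and (b, a) the same undirected edge.
        self.concept_edges[frozenset((a, b))] = weight

    def add_word_edge(self, concept: str, word: str, weight: float) -> None:
        # Tuple order encodes the direction: concept -> word.
        self.word_edges[(concept, word)] = weight

    def flat_terms(self) -> List[str]:
        """Baseline expansion: concepts and words only, relations discarded."""
        terms = set()
        for pair in self.concept_edges:
            terms.update(pair)
        for concept, word in self.word_edges:
            terms.update((concept, word))
        return sorted(terms)

    def related_pairs(self) -> List[Tuple[str, str, float]]:
        """Relation-aware expansion: one weighted term pair per edge."""
        pairs = [(*sorted(pair), w) for pair, w in self.concept_edges.items()]
        pairs += [(c, word, w) for (c, word), w in self.word_edges.items()]
        # Strongest relations first.
        return sorted(pairs, key=lambda p: -p[2])


# Hypothetical toy graph; in the paper the structure is learnt from documents.
g = MixedGraphOfTerms()
g.add_concept_edge("retrieval", "query", 0.8)
g.add_word_edge("query", "expansion", 0.6)
g.add_word_edge("retrieval", "ranking", 0.9)

print(g.flat_terms())
print(g.related_pairs())
```

How the weighted pairs are turned into an actual search-engine query (for example, boosted term pairs) depends on the engine's query syntax, which the abstract does not specify.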