Text Classification Using a Graph of Terms
Napoletano, Paolo; Colace, Francesco; De Santo, Massimo; Greco, Luca
2012-01-01
Abstract
Supervised text classification methods typically need many labeled examples to achieve high accuracy. In practice, however, sufficient labeled examples are not always available, which has prompted recent interest in methods that remain accurate when the training set is small. Text mining techniques aim to identify common patterns by observing feature vectors and then to use those patterns to make predictions. Most existing methods use feature vectors of weighted words, which are insufficiently discriminative when the number of features is much higher than the number of labeled examples. In this paper we demonstrate that features more complex than simple weighted words can yield greater accuracy in identifying common patterns. The proposed feature vector is based on a hierarchical structure, called a mixed Graph of Terms, composed of a directed and an undirected sub-graph of words, which can be constructed automatically from a set of documents using a probabilistic topic model. The method was tested on the top 10 classes of the ModApte split of the Reuters-21578 dataset, with training on several subsets of the original training set, and performed better than a baseline that combines a list of weighted words as the feature vector with linear support vector machines.
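To make the idea of a mixed graph with a directed and an undirected sub-graph of words concrete, the following is a minimal sketch in Python. It is not the authors' method: the abstract says the graph is learned with a probabilistic topic model, whereas this toy version swaps in plain document co-occurrence counts as a stand-in. The function names (`build_mixed_graph`, `graph_features`), the `top_k` parameter, and the rule that the most frequent words act as hub terms are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def build_mixed_graph(docs, top_k=3):
    """Toy stand-in for the topic-model step: weights come from co-occurrence."""
    doc_sets = [set(d.lower().split()) for d in docs]
    df = Counter(w for s in doc_sets for w in s)  # document frequency per word
    pair_df = Counter(
        frozenset(p) for s in doc_sets for p in combinations(sorted(s), 2)
    )
    # Assumption: the top_k most frequent words play the role of hub terms.
    ranked = sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))
    hubs = [w for w, _ in ranked[:top_k]]

    # Undirected sub-graph: symmetric weights between pairs of hub terms.
    undirected = {}
    for a, b in combinations(hubs, 2):
        n = pair_df[frozenset((a, b))]
        if n:
            undirected[frozenset((a, b))] = n / len(doc_sets)

    # Directed sub-graph: hub -> word, weighted by an estimate of P(word | hub).
    directed = {}
    for h in hubs:
        for w in df:
            if w not in hubs:
                n = pair_df[frozenset((h, w))]
                if n:
                    directed[(h, w)] = n / df[h]
    return undirected, directed


def graph_features(graph, text):
    """Keep the weight of every edge whose two endpoints both occur in text."""
    undirected, directed = graph
    words = set(text.lower().split())
    feats = {}
    for pair, weight in undirected.items():
        if pair <= words:
            feats["~".join(sorted(pair))] = weight
    for (hub, word), weight in directed.items():
        if hub in words and word in words:
            feats[f"{hub}->{word}"] = weight
    return feats


docs = ["oil price rises", "oil price falls", "oil export grows"]
graph = build_mixed_graph(docs, top_k=2)
print(graph_features(graph, "price of oil rises"))
```

Each feature here describes a relation between two words rather than a single weighted word, which is the contrast the abstract draws with the weighted-word baseline. A real implementation would replace the co-occurrence counts with probabilities estimated by the topic model.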