From Vectors to Networks: Comparing conventional and graph-based approaches to Unsupervised Text Categorisation
Michelangelo Misuraca
2025
Abstract
One of the primary tasks of text mining is to organise a large number of unlabelled documents into a smaller set of meaningful, coherent clusters whose members are similar in content. Clustering algorithms typically operate on document × term matrices, where each document is represented as a vector of term weights. Alternatively, a collection of documents can be represented with a document × document structure, which can be read as an adjacency matrix and depicted as a graph. In network analysis, community detection is applied to such graphs to identify groups of nodes that share common characteristics and perform similar functions. This paper evaluates different data structures and grouping criteria, showing the effectiveness of the various alternatives in a text categorisation strategy. We conduct a comparative study of classical text clustering methods and community detection approaches, examining and discussing their performance.
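
The two representations contrasted in the abstract can be illustrated with a minimal sketch. This is not the paper's method. The toy corpus, the raw term counts, the cosine similarity, the 0.3 threshold and all function names are assumptions made for illustration. Connected components are used only as a crude stand-in for a real community detection algorithm.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
docs = [
    "graph community detection on networks",
    "community detection finds groups in a graph",
    "term weighting for document clustering",
    "clustering documents with term vectors",
]


def document_term_matrix(texts):
    """Return (vocab, rows), where rows[i][j] counts term j in document i."""
    tokenised = [text.lower().split() for text in texts]
    vocab = sorted({word for tokens in tokenised for word in tokens})
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocab])
    return vocab, rows


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def adjacency(rows, threshold):
    """Document x document 0/1 matrix: link documents whose similarity
    reaches the threshold (an assumed, arbitrary cut-off)."""
    n = len(rows)
    return [
        [1 if i != j and cosine(rows[i], rows[j]) >= threshold else 0
         for j in range(n)]
        for i in range(n)
    ]


def connected_components(adj):
    """Label each node with the index of its connected component.
    A simple stand-in for community detection, not an actual
    modularity-based or similar algorithm."""
    n = len(adj)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if adj[i][j] and labels[j] == -1:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


vocab, dtm = document_term_matrix(docs)   # vector view: documents x terms
adj = adjacency(dtm, threshold=0.3)       # network view: documents x documents
groups = connected_components(adj)
print(groups)  # [0, 0, 1, 1]
```

In this toy case the first two documents share three terms and the last two share two, so thresholding the similarities yields two disconnected pairs. On a real corpus the graph would normally be connected, and a dedicated community detection algorithm would be needed to split it.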