Abstract In the era of the Internet of “everything”, natural language text is still the undisputed medium for representing information, as evidenced by the pervasiveness of tweets, instant messages, posts, and documents. There is an increasing need for innovative technologies targeted at more machine-oriented communication. Many keyword-based and statistical approaches have supported information retrieval, data mining, and natural language processing systems, but a deeper understanding of text remains an urgent challenge: identifying concepts, the semantic relationships among them, and the contextual information needed for concept disambiguation requires further progress in textual-information management. This work introduces a novel technique for extracting the main concepts from text. Concepts are described by word-based connections arranged in a semantic topological space, built on a formal model, the simplicial complex. The model links points, i.e., the words appearing in the text, and incrementally creates a geometrical structure that describes concepts that are more or less specialized, depending on the aggregation distance of the words. The conceptual network is context-aware, since it reveals unambiguous concepts, specialized through the analysis of the surrounding text. The framework that implements the approach discovers basic concepts, composed of the minimal number of words needed to describe a finite-sense concept, and richer extended concepts, built by adding further relations among terms. The final topological space provides a multi-granule concept representation, ranging from a local, word-closeness view to a highly refined description. Experiments and comparative analysis validate the effectiveness of the approach, showing satisfactory performance in concept identification: precision is greater than 80% in most of the experiments, and recall is, on average, around 60–70%, with peaks of 90% for some specific concept categories.
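To make the idea of words as points joined into simplices more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a Vietoris-Rips-style construction, a standard way to build a simplicial complex from points and a distance threshold, which the abstract does not confirm as the method used. The function name `rips_complex`, the `radius` parameter (standing in for the aggregation distance), and the toy 2-D word coordinates are all hypothetical.

```python
from itertools import combinations
from math import dist


def rips_complex(points, radius, max_dim=2):
    """Return simplices (tuples of words) whose vertices are pairwise within `radius`.

    `points` maps each word to a coordinate tuple (a hypothetical embedding).
    """
    words = sorted(points)
    # Two words are "close" when their coordinates lie within `radius`.
    close = {
        frozenset((a, b))
        for a, b in combinations(words, 2)
        if dist(points[a], points[b]) <= radius
    }
    simplices = [(w,) for w in words]  # 0-simplices: the words themselves
    for k in range(2, max_dim + 2):  # k vertices form a (k-1)-simplex
        for cand in combinations(words, k):
            # Keep the candidate only if every pair of its words is close.
            if all(frozenset(p) in close for p in combinations(cand, 2)):
                simplices.append(cand)
    return simplices


# Toy coordinates, invented for illustration only.
points = {
    "bank": (0.0, 0.0),
    "loan": (1.0, 0.0),
    "money": (0.0, 1.0),
    "river": (5.0, 5.0),
}

small = rips_complex(points, radius=1.0)  # tighter aggregation distance
large = rips_complex(points, radius=1.5)  # looser aggregation distance
```

Under these assumptions, the tighter radius links "bank" to "loan" and to "money" by edges only, while the looser radius also joins "loan" and "money" and so fills in the triangle over all three words. This loosely mirrors how a larger aggregation distance could yield a richer, extended concept; "river" stays isolated in both cases.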