The exponential increase in scientific output makes it ever harder to quickly identify the relevant contributions in the literature. In this context, automatic text summarization methods offer promising solutions, allowing informative summaries to be generated from long, structured documents. This study introduces Integrated Text Summarization (ITS), a new unsupervised extractive approach designed specifically for scientific texts. The algorithm combines structural analysis of the document with the integration of keywords provided by the authors and/or extracted automatically from the text, in order to select the most relevant sentences in each section. ITS was evaluated on a multidisciplinary sample of articles by comparing its results with sentences indicated by the articles' authors. Its performance was also compared with two reference methods: the TextRank algorithm and the GPT-4o model. The results show that ITS achieves greater accuracy and stability in selecting relevant content, even across different disciplinary contexts. The approach thus stands as a transparent, interpretable, and effective solution for the automatic summarization of scientific knowledge.
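Synthesizing knowledge: an integrated approach for extracting relevant content from the scientific literature (original title: Sintetizzare la conoscenza: un approccio integrato per l’estrazione di contenuti rilevanti nella letteratura scientifica)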
Michelangelo Misuraca
2026
Abstract
The exponential growth of scientific production has made it increasingly difficult to rapidly identify the most relevant contributions within the literature. In this context, automatic text summarization methods offer promising solutions, enabling the generation of informative summaries from long and structured documents. This study introduces Integrated Text Summarization (ITS), a novel unsupervised extractive approach specifically designed for scientific texts. The algorithm combines structural analysis of the document with the integration of keywords provided by the authors and terms automatically extracted from the text, in order to identify the most relevant sentences in each section. ITS was evaluated on a multidisciplinary sample of scientific articles by comparing the extracted sentences with those selected by the original authors. Its performance was further benchmarked against two reference methods: the classical TextRank algorithm and the generative model GPT-4o. The results show that ITS achieves greater accuracy and stability in identifying relevant content, even across diverse disciplinary domains. The proposed approach thus emerges as a transparent, interpretable, and effective solution for the automatic summarization of scientific knowledge.
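The abstract describes the general idea of selecting, section by section, the sentences that best match author keywords and automatically extracted terms, but it does not give the scoring rule. The following is only a minimal sketch of that general idea, not the ITS algorithm itself. Everything in it is an assumption made for illustration: the function names (`select_sentences`, `top_terms`), the naive sentence splitter, the use of raw term frequency as a stand-in for automatic term extraction, and the score defined as the number of distinct keywords a sentence contains.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_terms(text, n_terms=5, min_len=4):
    """Hypothetical stand-in for automatic term extraction:
    the n_terms most frequent tokens of at least min_len characters."""
    counts = Counter(t for t in tokenize(text) if len(t) >= min_len)
    return {term for term, _ in counts.most_common(n_terms)}


def select_sentences(sections, author_keywords, per_section=1, n_terms=5):
    """Pick the highest-scoring sentences in each section.

    sections: dict mapping section name -> section text.
    author_keywords: single-word keywords supplied by the authors.
    Score (assumption): number of distinct keywords found in the sentence.
    """
    full_text = " ".join(sections.values())
    keywords = {k.lower() for k in author_keywords} | top_terms(full_text, n_terms)

    summary = {}
    for name, text in sections.items():
        scored = []
        for position, sentence in enumerate(split_sentences(text)):
            score = sum(1 for t in set(tokenize(sentence)) if t in keywords)
            # -position breaks ties in favour of the earlier sentence
            scored.append((score, -position, sentence))
        best = sorted(scored, reverse=True)[:per_section]
        best.sort(key=lambda item: -item[1])  # restore document order
        summary[name] = [sentence for _, _, sentence in best]
    return summary
```

Calling `select_sentences(sections, ["summarization", "keywords"])` on a dictionary of section texts returns, for each section name, the list of selected sentences in their original order. Because the selection is a plain keyword count, each choice can be traced back to the terms that produced it, which is the kind of transparency the abstract attributes to an extractive approach.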


