Text Mining with Finite State Automata via Compound Words Ontologies

Alberto Postiglione
2024-01-01

Abstract

The paper introduces an efficient text mining method that uses finite automata to extract knowledge domains from textual documents. It focuses on identifying the multi-word units listed in terminological ontologies. Unlike simple words, multi-word units (for example, "credit card") are monosemic, relatively few in number, and distinct from one another, so each one precisely pinpoints a semantic area. The algorithm is designed to handle even very long multi-word units composed of a variable number of simple words, and it integrates the selected ontologies into a single finite automaton. At runtime, it recognizes each multi-word unit and outputs the knowledge domain associated with it, even when units partially or completely overlap. Benefits of the system include minimal IT maintenance for the ontologies, continuous updates without additional computational cost, and no need for software training. The approach performs robustly on both short and long documents; it was validated on multiple textual documents, and one specific test is described in the paper.
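
To make the idea in the abstract concrete, the following is a minimal sketch, not the author's implementation. It assumes that each ontology is simply a list of multi-word units under a domain label, that the automaton can be approximated by a token-level trie, and that whitespace and punctuation delimit simple words. The names `build_automaton`, `find_units`, and the sample ontologies are hypothetical.

```python
import re


def tokenize(text):
    # Assumption: simple words are runs of word characters, compared case-insensitively.
    return re.findall(r"\w+", text.lower())


def build_automaton(ontologies):
    """Merge {domain: [multi-word units]} into one token-level trie."""
    root = {"next": {}, "domains": set()}
    for domain, units in ontologies.items():
        for unit in units:
            state = root
            for token in tokenize(unit):
                state = state["next"].setdefault(
                    token, {"next": {}, "domains": set()}
                )
            # A state reached at the end of a unit is accepting for that domain.
            state["domains"].add(domain)
    return root


def find_units(automaton, text):
    """Return (unit, domain) pairs, including overlapping and nested units."""
    tokens = tokenize(text)
    matches = []
    for start in range(len(tokens)):
        state = automaton
        for end in range(start, len(tokens)):
            state = state["next"].get(tokens[end])
            if state is None:
                break  # no unit in any ontology continues with this word
            for domain in sorted(state["domains"]):
                matches.append((" ".join(tokens[start:end + 1]), domain))
    return matches


# Hypothetical sample ontologies, for illustration only.
ontologies = {
    "finance": ["credit card", "credit card fraud"],
    "law": ["fraud detection law"],
}
automaton = build_automaton(ontologies)
result = find_units(automaton, "Credit card fraud detection law applies.")
```

In this sketch, adding a term to an ontology only adds a path to the trie, and restarting the scan at every word is what allows nested units ("credit card" inside "credit card fraud") and partially overlapping units ("credit card fraud" and "fraud detection law") to be reported together.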
Year: 2024
ISBN: 978-3-031-53554-3

Use this identifier to cite or link to this document: https://hdl.handle.net/11386/4855311