Text Mining with Finite State Automata via Compound Words Ontologies
Alberto Postiglione
2024-01-01
Abstract
The paper introduces an efficient text mining method that uses finite automata to extract knowledge domains from textual documents. It focuses on identifying the multi-word units listed in terminological ontologies. Unlike simple words, multi-word units ("credit card", for example) are monosemic, relatively few in number, and clearly distinct from one another, so each one precisely pinpoints a semantic area. The algorithm is designed to handle even very long multi-word units composed of a variable number of simple words, and it integrates the selected ontologies into a single finite automaton. At runtime, it efficiently recognizes each multi-word unit and outputs the associated knowledge domain, even when units partially or completely overlap. Benefits of the system include minimal IT maintenance for ontologies, continuous updates without additional computational costs, and no need for software training. The proposed approach performs robustly on both short and long documents. It was validated through tests on multiple textual documents, one of which is described in the paper.
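
The idea of merging several ontologies into one automaton and reporting overlapping multi-word units can be illustrated with a minimal Python sketch. It is not the paper's algorithm: it assumes a plain word-level trie, a simple regular-expression tokenizer, and a restart of the scan at every token, which is easy to read but not optimized. The names `build_automaton`, `find_units`, and the sample ontologies are hypothetical and chosen only for illustration.

```python
import re


def build_automaton(ontologies):
    """Merge {domain: [multi-word unit, ...]} ontologies into one word-level trie.

    Each state is a dict with outgoing transitions ("next") and the set of
    knowledge domains recognized when the state is reached ("domains").
    """
    root = {"next": {}, "domains": set()}
    for domain, units in ontologies.items():
        for unit in units:
            state = root
            for word in re.findall(r"[a-z0-9]+", unit.lower()):
                state = state["next"].setdefault(
                    word, {"next": {}, "domains": set()}
                )
            state["domains"].add(domain)
    return root


def find_units(automaton, text):
    """Return (multi-word unit, domain) pairs found in text, overlaps included."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    matches = []
    # Illustrative assumption: restart from the root at every token so that
    # partially and completely overlapping units are all reported.
    for start in range(len(tokens)):
        state = automaton
        for end in range(start, len(tokens)):
            state = state["next"].get(tokens[end])
            if state is None:
                break
            for domain in sorted(state["domains"]):
                matches.append((" ".join(tokens[start:end + 1]), domain))
    return matches


# Hypothetical sample ontologies, not taken from the paper.
ONTOLOGIES = {
    "finance": ["credit card", "credit card fraud"],
    "law": ["fraud prevention act"],
}

automaton = build_automaton(ONTOLOGIES)
text = "The bank reported credit card fraud prevention act violations."
for unit, domain in find_units(automaton, text):
    print(f"{unit} -> {domain}")
```

In this sketch, "credit card" is completely contained in "credit card fraud", and "credit card fraud" partially overlaps "fraud prevention act"; all three are reported with their domains. Adding or updating an ontology only requires rebuilding the trie, with no training step.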