Linguistic Text Mining

Postiglione, Alberto

doi:10.3233/FAIA241433

This paper explores the challenges of processing the growing volume of natural language text, which often exceeds the real-time processing capacity of traditional methods. These texts are typically written by people with diverse educational, cultural, and experiential backgrounds. The paper highlights the main linguistic and semantic issues that arise in analyzing natural language text. Linguistic Text Mining (LTM) is a computational approach that combines linguistic principles with computational techniques to extract high-quality information from natural language texts. Although the terms "Linguistic" and "Text-Mining" appear frequently in the scientific literature, no formal definition of Linguistic Text Mining exists; this paper proposes one. It further explores LTM’s potential to enhance knowledge extraction by emphasizing linguistic features such as multi-word units (MWUs). Traditional text analysis relies heavily on statistical methods that focus either on simple words, which are often polysemous, or on aggregations of words without semantic context; this limits systems' ability to interpret domain-specific semantics. By contrast, MWUs such as "credit card" convey specific, unambiguous meanings that are critical for identifying specialized domains. MWUs are typically organized within ontologies that represent distinct knowledge domains. Building on previous work, the study compares AUTOMETA, an ontology-based approach that uses finite automata for MWU identification, with large language model (LLM)-based and other ontology-driven Linguistic Text Mining methods. The findings suggest that integrating linguistic frameworks significantly improves information extraction and offers a deeper understanding of complex language structures.
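
To make the idea of automaton-based MWU identification concrete, the following is a minimal sketch and not AUTOMETA's actual implementation, which the abstract does not describe. It assumes a hypothetical toy ontology (`TOY_ONTOLOGY`) that maps each MWU to a domain label, compiles it into a token-level trie (a simple deterministic finite automaton), and scans a text with longest-match semantics. All names, entries, and the tokenization rule are illustrative assumptions.

```python
import re

# Hypothetical toy "ontology": multi-word unit -> knowledge domain.
# These entries are illustrative only; they do not come from the paper.
TOY_ONTOLOGY = {
    "credit card": "finance",
    "credit card fraud": "finance",
    "hard disk": "computing",
}

# Marker key for accepting states; it cannot collide with a token because
# tokens contain only lowercase letters and digits.
ACCEPT = "$domain"


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_trie(ontology):
    """Compile the MWUs into a trie whose edges are tokens (a simple DFA)."""
    root = {}
    for mwu, domain in ontology.items():
        node = root
        for token in tokenize(mwu):
            node = node.setdefault(token, {})
        node[ACCEPT] = domain  # mark this state as accepting
    return root


def find_mwus(text, trie):
    """Scan left to right and report the longest MWU starting at each position."""
    tokens = tokenize(text)
    matches = []
    i = 0
    while i < len(tokens):
        node, best, j = trie, None, i
        # Follow transitions as long as the next token is a valid edge.
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if ACCEPT in node:
                best = (j, node[ACCEPT])  # remember the longest match so far
        if best is not None:
            end, domain = best
            matches.append((" ".join(tokens[i:end]), domain))
            i = end  # resume after the matched unit
        else:
            i += 1
    return matches


if __name__ == "__main__":
    trie = build_trie(TOY_ONTOLOGY)
    print(find_mwus("She reported credit card fraud after her hard disk failed.", trie))
```

The longest-match rule is one possible design choice: it prefers "credit card fraud" over the shorter "credit card" when both are present in the ontology, on the assumption that the longer unit is the more specific one.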
