Linguistic Text Mining

Postiglione, Alberto
2024-01-01

Abstract

This paper explores the challenges of processing the growing volume of natural language text, which often exceeds the real-time processing capacity of traditional methods. These texts are typically authored by individuals from diverse educational, cultural, and experiential backgrounds. The paper highlights the main linguistic and semantic issues that arise in the analysis of natural language text. Linguistic Text Mining (LTM) is a computational approach that combines linguistic principles with computational techniques to extract high-quality information from natural language texts. Although the terms "Linguistic" and "Text Mining" appear frequently in the scientific literature, no formal definition of Linguistic Text Mining exists; this paper proposes one. It further explores LTM's potential to enhance knowledge extraction by emphasizing linguistic features such as multi-word units (MWUs). Traditional text analysis relies heavily on statistical methods that focus either on simple words, which are often polysemous, or on aggregations of words without semantic context; this limits a system's ability to interpret domain-specific semantics. By contrast, MWUs such as "credit card" convey specific, unambiguous meanings, which are critical for identifying specialized domains. MWUs are typically organized within ontologies that represent distinct knowledge domains. Building on previous work, the study compares AUTOMETA, an ontology-based approach that uses finite automata for MWU identification, with large language model (LLM)-based and other ontology-driven Linguistic Text Mining methods. The findings suggest that integrating linguistic frameworks significantly improves information extraction and offers a deeper understanding of complex language structures.
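To make the idea of automaton-based MWU identification concrete, the following is a minimal sketch only, not AUTOMETA's actual implementation, which this record does not describe. It assumes a hypothetical three-entry ontology (`ONTOLOGY`), a simple word-level tokenizer, and a longest-match policy; the names `State`, `build_automaton`, and `find_mwus` are illustrative assumptions.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical mini-ontology mapping each MWU to a knowledge domain.
# These entries are illustrative assumptions, not data from the paper.
ONTOLOGY: Dict[str, str] = {
    "credit card": "finance",
    "credit card number": "finance",
    "text mining": "computer science",
}


@dataclass
class State:
    """One state of a word-level finite automaton (a trie over tokens)."""
    transitions: Dict[str, "State"] = field(default_factory=dict)
    domain: Optional[str] = None  # set only on accepting states


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"\w+", text.lower())


def build_automaton(entries: Dict[str, str]) -> State:
    """Build the automaton: one path of token transitions per MWU."""
    root = State()
    for phrase, domain in entries.items():
        state = root
        for token in tokenize(phrase):
            state = state.transitions.setdefault(token, State())
        state.domain = domain  # mark the end of the MWU as accepting
    return root


def find_mwus(text: str, root: State) -> List[Tuple[str, str]]:
    """Scan left to right, keeping the longest MWU that starts at each position."""
    tokens = tokenize(text)
    matches: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        state = root
        best: Optional[Tuple[int, str]] = None  # (end index, domain)
        j = i
        while j < len(tokens) and tokens[j] in state.transitions:
            state = state.transitions[tokens[j]]
            j += 1
            if state.domain is not None:
                best = (j, state.domain)
        if best is not None:
            end, domain = best
            matches.append((" ".join(tokens[i:end]), domain))
            i = end  # resume after the matched MWU
        else:
            i += 1
    return matches


automaton = build_automaton(ONTOLOGY)
result = find_mwus("Enter your credit card number for text mining.", automaton)
```

In this sketch the polysemous word "credit" on its own produces no match, whereas the full unit "credit card number" is recognized and tagged with its assumed domain, which mirrors the abstract's point that MWUs carry domain-specific meaning that single words do not.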
2024
978-1-64368-569-4
Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11386/4892164