
Some Brief Considerations on Computational Statistics Effectiveness and Appropriateness in Natural Language Processing Applications

Monteleone, Mario; Postiglione, Alberto
2024-01-01

Abstract

This paper addresses the challenges of managing and processing unstructured or semi-structured text, particularly in the context of increasing data volumes that traditional linguistic databases and algorithms struggle to handle in real-time scenarios. While humans can easily navigate linguistic complexities, computational systems face significant difficulties due to algorithmic limitations and the shortcomings of Large Language Models (LLMs). These challenges often result in issues such as a lack of standardized formats, malformed expressions, semantic and lexical ambiguities, hallucinations, and failures to produce outputs aligned with the intricate meaning layers present in human language. As for the automatic analysis of linguistic data, it is well known that Natural Language Processing (NLP) uses two different approaches, coming from diverse cultural and experiential backgrounds. The first approach is based on probabilistic computational statistics (PCS), which underpins most Machine Learning (ML), LLM, and Artificial Intelligence (AI) techniques. The second approach is based, for each specific language, on the formalization of the morpho-syntactic features and constraints used by humans in ordinary communication activities. At first glance, the second approach appears more effective in addressing linguistic phenomena such as polysemy and the formation of meaningful distributional sequences or, more precisely, acceptable and grammatical morpho-syntactic contexts. In this paper, we initiate a scientific discussion of the differences between these two approaches, aiming to shed light on their respective advantages and limitations.
ISBN: 978-1-64368-569-4


Use this identifier to cite or link to this document: https://hdl.handle.net/11386/4892122
