Some Brief Considerations on Computational Statistics Effectiveness and Appropriateness in Natural Language Processing Applications
Monteleone, Mario; Postiglione, Alberto
2024-01-01
Abstract
This paper addresses the challenges of managing and processing unstructured or semi-structured text, particularly in the context of increasing data volumes that traditional linguistic databases and algorithms struggle to handle in real-time scenarios. While humans can easily navigate linguistic complexities, computational systems face significant difficulties due to algorithmic limitations and the shortcomings of Large Language Models (LLMs). These challenges often result in issues such as a lack of standardized formats, malformed expressions, semantic and lexical ambiguities, hallucinations, and failures to produce outputs aligned with the intricate meaning layers present in human language.

As for the automatic analysis of linguistic data, it is well known that Natural Language Processing (NLP) uses two different approaches, coming from diverse cultural and experiential backgrounds. The first approach is based on probabilistic computational statistics (PCS), which underpins most Machine Learning (ML), LLM, and Artificial Intelligence (AI) techniques. The second approach is based, for each specific language, on the formalization of the morpho-syntactic features and constraints used by humans in ordinary communication activities. At first glance, the second approach appears more effective in addressing linguistic phenomena such as polysemy and the formation of meaningful distributional sequences or, more precisely, acceptable and grammatical morpho-syntactic contexts. In this paper, we initiate a scientific discussion on the differences between these two approaches, aiming to shed light on their respective advantages and limitations.