Building a Pos Tagger and Lemmatizer for the Italian Language
Maisto, Alessandro; Balzano, Walter
2021-01-01
Abstract
In this work, we present two modules for an open-source Python library for the analysis of the Italian language. The modules are a POS tagger based on the Averaged Perceptron Tagger and a Lemmatizer based on the vast collection of linguistic data held by the Department of Politics and Communication Science of the University of Salerno. While the Averaged Perceptron Tagger algorithm is mostly used for the English language by well-known Python libraries such as NLTK or spaCy, the Lemmatizer is an entirely original module that relies on a vast electronic dictionary annotated with syntactic, morphological, and semantic tags. We present our approach and a preliminary experiment in which we compare the results of our modules with those of another widely used POS tagger and lemmatizer, TreeTagger.
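As a rough illustration of how a dictionary-based lemmatizer can use the output of a POS tagger, the following minimal sketch looks up each token by its form and tag. It is not the library's actual code: the `LEXICON` entries, the coarse tag names, the `lemmatize` function, and the fallback to the surface form are all assumptions made for this example.

```python
# Hypothetical miniature dictionary: (lowercased form, POS tag) -> lemma.
# Entries and tag names are illustrative; they are not taken from the
# electronic dictionary described in the abstract.
LEXICON = {
    ("porta", "NOUN"): "porta",
    ("porta", "VERB"): "portare",
    ("libri", "NOUN"): "libro",
    ("la", "DET"): "il",
    ("i", "DET"): "il",
}


def lemmatize(tagged_tokens, lexicon=LEXICON):
    """Turn (form, tag) pairs into (form, tag, lemma) triples.

    Pairs missing from the lexicon keep their surface form as the lemma;
    this fallback is an assumption of the sketch.
    """
    result = []
    for form, tag in tagged_tokens:
        lemma = lexicon.get((form.lower(), tag), form)
        result.append((form, tag, lemma))
    return result


# The tags below are written by hand; in a full pipeline they would be
# produced by the POS tagger.
sentence_a = [("La", "DET"), ("porta", "NOUN")]
sentence_b = [("Maria", "PROPN"), ("porta", "VERB"), ("i", "DET"), ("libri", "NOUN")]

print(lemmatize(sentence_a))
print(lemmatize(sentence_b))
```

Keying the lookup on both the form and the tag is what lets the same surface form, here "porta", resolve to the noun "porta" in one sentence and to the verb "portare" in the other.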