CATALOGA®: a Software for Semantic and Terminological Information Retrieval
ELIA, Annibale; POSTIGLIONE, Alberto; MONTELEONE, Mario; MONTI, Johanna; GUGLIELMO, Daniela
2011-01-01
Abstract
One of the most significant problems with Information Retrieval (IR) software is the correct processing of complex lexical units, now also known as multiword units. The shortcomings arise mainly because such units are often treated as ad hoc combinations of words, retrievable by means of statistical routines. By contrast, several linguistic studies, some dating back to the 1960s, show that multiword units, and compound nouns in particular, are almost always units of fixed meaning, with specific formal, morphological, grammatical and semantic characteristics. Furthermore, these units can be processed as dictionary entries, thus becoming concrete lingware tools for efficient semantic information retrieval. In this paper we therefore focus on CATALOGA®, an automatic IR software tool that retrieves terminological information from digitized texts without any human intervention. CATALOGA® is currently configured as a stand-alone application that can be integrated into Web sites and portals for online use. More specifically, we describe its lingware and software characteristics and discuss their use as a possible solution to the limitations of current IR software. The analytical procedure described here proves appropriate for any type of digitized text and also provides relevant support for building and implementing Semantic Web (SW) interactive platforms.