Data Integration and Automatic Text Summarization: A path to more informed Business Decisions / Marcello Barbella, 2023 Apr 12. Academic Year 2021-2022. [10.14273/unisa-5358].
Data Integration and Automatic Text Summarization: A path to more informed Business Decisions
Barbella, Marcello
2023
Abstract
In recent years, there has been an explosion of data shared online. The majority of this internet information is in text format and can be used as a source to create new knowledge. These data are frequently unstructured and, in their raw state, cannot be used for any type of analysis, making them challenging to manage from an Information Technology (IT) perspective. In addition to these data, most companies own a huge collection of structured data, acquired and built over time. The union of these two types of information therefore represents a gold mine from which to draw as much knowledge as possible. For this reason, Data Pre-processing (DPP), an important stage in the Data Mining process, applies significant manipulations to the data in order to make them usable for any subsequent processing procedure. The general DPP steps are Data Cleansing, Data Integration, Data Reduction, and Data Transformation, all carried out while guaranteeing the protection of privacy.

This research focuses on two applications, one for structured and one for unstructured data: a Data Integration (DI) challenge, and the Automatic Text Summarization (ATS) task, for which algorithm evaluation metrics were explored. One of the most challenging issues in DI is the search for automatic or semi-automatic methodologies, since these techniques often require the expertise of a domain specialist who can direct the process and improve the results. However, the literature offers few fully or semi-automatic DI approaches that do not rely on experts with specific IT skills. In this study, with the assistance of an intermediary figure (the Company Manager), who is not necessarily skilled in IT, we built a semi-automatic DI process using an Information Retrieval methodology, clustering methods, and a trained neural network. This process is capable of reducing persistent conflicts in the data and ensuring a unified view of them, while respecting the original constraints of the datasets and guaranteeing a high-quality outcome for Business Intelligence evaluations.

At the same time, when textual data sources are involved, the ability to reduce the amount of text from which information is extracted is essential, both to recover the key concepts and to speed up the analysis systems. In particular, ATS is an interesting challenge in Natural Language Processing. A number of algorithms currently attempt to condense documents, using both statistical techniques (Extractive algorithms) and Artificial Intelligence methods (Abstractive algorithms). However, the quality of their results is assessed with metrics primarily based on the overlap analysis of n-grams, of which ROUGE is the most widely used. Determining whether these metrics are effective, and whether they really allow the quality of the outcomes of the various Text Summarization (TS) algorithms to be compared, is the focus of the second research topic. [edited by Author]
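As a rough illustration of the kind of semi-automatic step the abstract describes, the sketch below matches the attributes of two toy schemas by textual similarity and defers low-confidence pairs to a human reviewer, mirroring the Company Manager role. This is not the thesis's actual pipeline: the column names, the character n-gram TF-IDF representation, and the 0.6 acceptance threshold are all invented for illustration.

# Hedged sketch of one step a semi-automatic DI pipeline might take:
# matching attributes of two schemas by textual similarity, and deferring
# uncertain pairs to a human reviewer. All names and thresholds are
# illustrative assumptions, not the method from the thesis.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

schema_a = ["customer_name", "customer_address", "order_total"]
schema_b = ["client_full_name", "client_addr", "total_amount"]

# Character n-grams tolerate abbreviations such as "addr" vs "address".
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
tfidf = vec.fit_transform(schema_a + schema_b)
sim = cosine_similarity(tfidf[:len(schema_a)], tfidf[len(schema_a):])

for i, col_a in enumerate(schema_a):
    j = sim[i].argmax()
    score = sim[i, j]
    if score >= 0.6:   # confident: accept the match automatically
        print(f"match:  {col_a} -> {schema_b[j]} ({score:.2f})")
    else:              # uncertain: route to the human reviewer
        print(f"review: {col_a} ~ {schema_b[j]} ({score:.2f})")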
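On the evaluation side, the following is a minimal sketch of ROUGE-N recall, the n-gram overlap idea the abstract refers to, assuming simple whitespace tokenization. Real evaluations typically rely on the official ROUGE toolkit or an equivalent package rather than a hand-rolled function like this one.

# Minimal sketch of ROUGE-N recall: the fraction of the reference
# summary's n-grams that also appear in the candidate summary.
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    if not ref:
        return 0.0
    # Clipped overlap: each reference n-gram counts at most as often
    # as it occurs in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())

# Example: 3 of the 5 reference bigrams also occur in the candidate -> 0.6
print(rouge_n_recall("the cat sat on the mat", "the cat lay on the mat", n=2))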


