Detecting Data Accuracy Issues in Textual Geographical Data by a Clustering-based Approach

Pellegrino M. A.;Postiglione L.;Scarano V.
2020-01-01

Abstract

Data are published to encourage their exploitation. However, data quality issues threaten data consumption and require data consumers to invest time and effort in data cleansing. Focusing on textual geographical data, we aim to detect inaccurate values, such as typos and truncated values, and to propose corrections through a clustering-based approach. Our method relies mainly on a dictionary of correct values, agglomerative clustering to group the data into clusters, and Levenshtein distance and fuzzy string searching to compute word similarity. We test our approach on real open datasets published by the Campania region, which are heterogeneous in topic, size, and type of errors. The results show that Levenshtein distance and fuzzy matching, combined with clustering, are effective in detecting and correcting quality issues in textual geographical data. The achieved results are useful for data producers and consumers, in both academia and industry, in any application domain.
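
The pipeline outlined in the abstract (a dictionary of correct values, agglomerative clustering, and Levenshtein plus fuzzy matching) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the single-linkage merge rule, the 0.3 distance threshold, the 0.6 similarity cutoff, and the use of `difflib.SequenceMatcher` as the fuzzy matcher are all hypothetical choices made for the example, and the sample values are invented.

```python
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Case-insensitive Levenshtein distance scaled to the range [0, 1]."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def agglomerative_clusters(values, threshold=0.3):
    """Single-linkage agglomerative clustering (assumed linkage).

    Starts from one cluster per distinct value and repeatedly merges the
    two closest clusters until the closest pair exceeds `threshold`.
    """
    clusters = [[v] for v in dict.fromkeys(values)]

    def linkage(c1, c2):
        return min(normalized_distance(x, y) for x in c1 for y in c2)

    while len(clusters) > 1:
        dist, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if dist > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def suggest_corrections(values, dictionary, threshold=0.3, min_ratio=0.6):
    """Map each value missing from the dictionary to a suggested correction.

    Candidates are the dictionary entries found in the same cluster; if the
    cluster contains none, the whole dictionary is used. A value whose best
    fuzzy match scores below `min_ratio` is flagged with None.
    """
    known = {d.lower(): d for d in dictionary}
    suggestions = {}
    for cluster in agglomerative_clusters(values, threshold):
        anchors = [known[v.lower()] for v in cluster if v.lower() in known]
        candidates = anchors or list(dictionary)
        for value in cluster:
            if value.lower() in known:
                continue  # already a correct value
            best = max(
                candidates,
                key=lambda d: SequenceMatcher(None, value.lower(), d.lower()).ratio(),
            )
            score = SequenceMatcher(None, value.lower(), best.lower()).ratio()
            suggestions[value] = best if score >= min_ratio else None
    return suggestions


# Invented sample: provinces of Campania as the dictionary of correct values.
dictionary = ["Napoli", "Salerno", "Avellino", "Benevento", "Caserta"]
observed = ["Napoli", "Napol", "Salerno", "Salermo", "Avellino", "Avelino", "Caserta"]
print(suggest_corrections(observed, dictionary))
```

In this sketch, clustering narrows the candidate corrections for each suspicious value to the dictionary entries that landed in the same cluster, and the fuzzy match then picks among them. Both thresholds would need tuning on real data.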
ISBN: 9781450388177

Use this identifier to cite or link to this document: https://hdl.handle.net/11386/4860157