Detecting and generalizing quasi-identifiers by affecting singletons

Pellegrino M. A.; Scarano V.
2020-01-01

Abstract

To adhere to Open Government doctrine, Public Administrations (PAs) are asked to publish Open Data while preventing the disclosure of their citizens' personal information. It is therefore crucial for PAs to employ privacy-preserving data publishing methods that distribute useful data while protecting individual privacy. In this paper, we study this problem with a two-phase approach. First, we detect privacy issues by identifying, as the quasi-identifier, the minimum number of attributes that expose the highest number of unique values (referred to as singletons). We test our approach on real datasets openly published by the Italian government and find that the quasi-identifier (year_of_birth, sex, ZIP_ofresidence) discloses up to 2% unique values in already anonymized datasets. Once the detection phase is complete, we propose an anonymization approach to limit the privacy leakage. We investigate which combination of attributes must be generalized to achieve the minimum number of singletons while minimising the number of modified and removed rows. We tested this approach on the same real datasets and observed that, by generalizing only the rows corresponding to singletons, we achieve nearly no singletons while affecting only 2% of the rows.
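The two phases described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the brute-force search, the tie-breaking rule (prefer the smaller attribute set), the function names, the `zip` column name, the decade-based generalization, and the toy rows are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def singleton_rows(rows, attrs):
    """Return indices of rows whose values on `attrs` occur exactly once."""
    keys = [tuple(row[a] for a in attrs) for row in rows]
    counts = Counter(keys)
    return [i for i, key in enumerate(keys) if counts[key] == 1]


def best_quasi_identifier(rows, candidates, max_size):
    """Brute-force search (an assumption, not the paper's algorithm):
    smallest attribute set that exposes the most singletons."""
    best, best_count = (), 0
    for size in range(1, max_size + 1):
        for attrs in combinations(candidates, size):
            n = len(singleton_rows(rows, attrs))
            if n > best_count:  # strict comparison keeps the smaller set on ties
                best, best_count = attrs, n
    return best, best_count


def generalize_singletons(rows, attrs, generalizers):
    """Apply the generalizers only to rows that are singletons on `attrs`."""
    targets = set(singleton_rows(rows, attrs))
    result = []
    for i, row in enumerate(rows):
        new_row = dict(row)  # copy so the input rows stay untouched
        if i in targets:
            for attr, generalize in generalizers.items():
                new_row[attr] = generalize(new_row[attr])
        result.append(new_row)
    return result


def decade(year):
    """Hypothetical generalization: replace a year with its decade range."""
    start = year // 10 * 10
    return f"{start}-{start + 9}"


# Toy data, invented for illustration only.
rows = [
    {"year_of_birth": 1980, "sex": "F", "zip": "84084"},
    {"year_of_birth": 1980, "sex": "F", "zip": "84084"},
    {"year_of_birth": 1981, "sex": "M", "zip": "84084"},
    {"year_of_birth": 1985, "sex": "M", "zip": "84084"},
    {"year_of_birth": 1990, "sex": "F", "zip": "84100"},
]
qi = ("year_of_birth", "sex", "zip")

before = singleton_rows(rows, qi)
generalized = generalize_singletons(rows, qi, {"year_of_birth": decade})
after = singleton_rows(generalized, qi)
print(len(before), "singletons before,", len(after), "after")
```

Because only singleton rows are rewritten, the non-singleton rows keep their original precision; a row that remains unique after generalization would need a coarser generalization or removal, which this sketch does not attempt.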
Use this identifier to cite or link to this document: https://hdl.handle.net/11386/4860152