Decentralized and Incremental Discovery of Relaxed Functional Dependencies Using Bitwise Similarity
Breve B.; Caruccio L.; Cirillo S.; Deufemia V.; Polese G.
2024
Abstract
Over the past decade, the definition of Functional Dependency (FD) has been extended in numerous ways, culminating in the introduction of Relaxed Functional Dependencies (RFDs), which offer more flexible constraints than traditional FDs. This flexibility makes RFDs well suited to exploring and profiling datasets with lower data quality. However, efficiently identifying RFDs in dynamic data sources is a significant challenge, because it requires processing the entire dataset from scratch whenever modifications occur. To tackle this problem, incremental discovery algorithms have been defined, but their performance often degrades as the frequency and size of update batches increase. This paper presents a new algorithm, D-INDIBITS, which relies on a new decentralized architecture to balance the workload of the incremental discovery process of INDIBITS, an algorithm based on bitwise operators for computing attribute similarities. Experiments demonstrate D-INDIBITS's effectiveness compared to FD and RFD discovery algorithms on both static and dynamic real-world data. With batches of modifications of sizes 10k and 100k, D-INDIBITS is capable of updating the set of RFDs in a few seconds, whereas all other approaches often take more than 3 hours.
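
To make the idea of bitwise attribute similarity concrete, the following is a minimal sketch and not the authors' implementation. It assumes numeric attributes, an absolute-difference distance, and one threshold per attribute. The function names (`pair_mask`, `similarity_masks`, `holds`, `masks_for_insert`) are hypothetical. Each pair of tuples is encoded as an integer whose bit i is set when the two tuples are similar on attribute i, so an RFD candidate can be checked with bitwise AND. When a tuple is inserted, only the pairs involving the new tuple need to be computed.

```python
from itertools import combinations

def pair_mask(r, s, thresholds):
    """Bit i is 1 when attribute i of r and s differs by at most thresholds[i]."""
    mask = 0
    for i, (a, b) in enumerate(zip(r, s)):
        if abs(a - b) <= thresholds[i]:
            mask |= 1 << i
    return mask

def similarity_masks(rows, thresholds):
    """Similarity bitmask for every unordered pair of rows."""
    return [pair_mask(r, s, thresholds) for r, s in combinations(rows, 2)]

def holds(masks, lhs, rhs):
    """True when every pair similar on all LHS attributes is also similar on RHS."""
    lhs_mask = sum(1 << i for i in lhs)
    rhs_bit = 1 << rhs
    return all((m & lhs_mask) != lhs_mask or (m & rhs_bit) != 0 for m in masks)

def masks_for_insert(rows, new_row, thresholds):
    """Assumed incremental step: compare only the new row with the existing rows."""
    return [pair_mask(new_row, r, thresholds) for r in rows]

# Illustrative data (not from the paper): columns are age, height, weight.
rows = [(30, 170, 65), (31, 171, 66), (50, 170, 90)]
thresholds = [1, 2, 3]

masks = similarity_masks(rows, thresholds)
print(masks)                    # [7, 2, 2]
print(holds(masks, [0], 2))     # True: similar age implies similar weight
print(holds(masks, [1], 2))     # False: similar height does not

# Inserting a row only requires checking the new pairs against valid RFDs.
new_masks = masks_for_insert(rows, (30, 180, 90), thresholds)
print(holds(new_masks, [0], 2)) # False: the insert invalidates age -> weight
```

The sketch covers insertions only. Deletions can make previously invalid dependencies valid again and would need additional bookkeeping that is not shown here.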