Evolutionary mining of relaxed dependencies from big data collections
CARUCCIO, LOREDANA;DEUFEMIA, Vincenzo;POLESE, Giuseppe
2017
Abstract
Many modern application contexts, especially those related to the Semantic Web, call for automatic techniques capable of extracting relationships between semi-structured data, for purposes such as identifying inconsistencies or patterns of semantically related data, query rewriting, and so forth. One way to represent such relationships is to use relaxed functional dependencies (rfds), since they can embed approximate matching paradigms to compare unstructured data and can admit exceptions. Thresholds might therefore need to be specified in order to limit the similarity degree in approximate comparisons or the occurrence of exceptions. Thanks to the availability of huge amounts of data, including unstructured data available on the Web, it is now possible to automatically discover rfds from data. However, due to the many different combinations of similarity and exception thresholds, the discovery process has exponential complexity. Thus, it is vital to devise proper optimization strategies in order to make the discovery process feasible. To this end, in this paper we propose a genetic algorithm to discover rfds from data, and we provide an empirical evaluation demonstrating its effectiveness.
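To make the two kinds of threshold mentioned in the abstract concrete, the following is a minimal sketch of how one might test whether a single candidate rfd holds on a small table. It is not the genetic algorithm proposed in the paper and not the authors' code. The function names (`similar`, `rfd_holds`), the use of `difflib.SequenceMatcher` as the similarity measure, and the definition of the exception ratio as the share of violating row pairs are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar(a, b, threshold):
    """Return True when the normalized similarity of two values reaches the threshold.

    Assumption: SequenceMatcher.ratio() stands in for whatever similarity
    function a real rfd would use on each attribute.
    """
    return SequenceMatcher(None, str(a), str(b)).ratio() >= threshold


def rfd_holds(rows, lhs, rhs, lhs_threshold, rhs_threshold, max_violation_ratio):
    """Check a hypothetical relaxed dependency lhs -> rhs over a list of dicts.

    A pair of rows violates the dependency when the rows are similar on `lhs`
    but not similar on `rhs`. The dependency is accepted when the share of
    violating pairs among all pairs is at most `max_violation_ratio`
    (an assumed way of bounding exceptions).
    """
    pairs = list(combinations(rows, 2))
    if not pairs:
        return True  # nothing to compare, so nothing can be violated
    violations = sum(
        1
        for r1, r2 in pairs
        if similar(r1[lhs], r2[lhs], lhs_threshold)
        and not similar(r1[rhs], r2[rhs], rhs_threshold)
    )
    return violations / len(pairs) <= max_violation_ratio


if __name__ == "__main__":
    rows = [
        {"city": "Salerno", "zip": "84100"},
        {"city": "Salerno", "zip": "84121"},
        {"city": "Napoli", "zip": "80100"},
    ]
    # Strict comparison on zip and no exceptions allowed.
    print(rfd_holds(rows, "city", "zip", 0.8, 0.9, 0.0))
    # Same comparison, but tolerating some violating pairs.
    print(rfd_holds(rows, "city", "zip", 0.8, 0.9, 0.5))
    # Looser similarity on zip and no exceptions allowed.
    print(rfd_holds(rows, "city", "zip", 0.8, 0.5, 0.0))
```

Even in this toy setting, each attribute pair can be combined with many similarity thresholds and exception bounds, which hints at why an exhaustive search over all combinations grows so quickly and why a heuristic search strategy is attractive.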