Discovering Functional Dependencies: Can We Use ChatGPT to Generate Algorithms?
Caruccio Loredana;Cirillo Stefano;Pizzuti Tullio;Polese Giuseppe
2023-01-01
Abstract
The rise of Large Language Models (LLMs) has allowed people to interact with tools that can answer, in natural language, many kinds of questions across a very wide range of topics. Although natural language generation still has to address several issues (e.g., keeping content focused on the query, composing unambiguous text, and so forth), models and tools are becoming increasingly capable of producing syntactically and semantically correct answers, independently of topic and language. This has made it possible for an algorithm to write other algorithms together with their implementations, which is an even more complex task: programming languages are more rigid and precise than natural ones, and the generated code must also capture the reasoning behind the methodologies used to solve problems at different levels of complexity. At present, the most representative example of such a tool is ChatGPT. Based on the GPT-3.5 model and trained on more than 300 billion tokens, ChatGPT has gained considerable notoriety and is starting to impact society because of its widespread use in people's daily lives. This paper evaluates to what extent ChatGPT and its underlying model are capable of generating algorithms for the discovery of Functional Dependencies (FDs) from data. FD discovery is a very complex problem to which the scientific literature has devoted much effort. Solutions from the literature are considered effective only if they infer a correct, minimal, and complete set of FDs holding on a given dataset, which raises the question of whether solutions generated by ChatGPT can satisfy the same constraints.
In particular, by following a prompt-based approach, we had ChatGPT provide 7 different solutions to the FD discovery problem and compared their results with those of the HyFD discovery algorithm, one of the most efficient solutions in the literature.
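
To make the three constraints concrete, the following is a minimal illustrative sketch of a brute-force discovery procedure. It is neither HyFD nor any of the ChatGPT-generated solutions evaluated in the paper. The function names `holds` and `discover_fds` and the toy dataset are assumptions introduced only for illustration, and the sketch considers only non-empty left-hand sides with a single right-hand-side attribute. Each reported dependency is checked against every row (correct), is skipped when a subset of its left-hand side already determines the same attribute (minimal), and comes from an exhaustive enumeration of candidates (complete).

```python
from itertools import combinations


def holds(rows, lhs, rhs):
    """Return True if the FD lhs -> rhs holds on rows (a list of dicts)."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        # Two rows that agree on lhs must also agree on rhs.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def discover_fds(rows, attributes):
    """Enumerate minimal FDs with a non-empty LHS and a single RHS attribute.

    The search is exponential in the number of attributes, so it is only
    suitable for tiny illustrative datasets.
    """
    fds = []
    for rhs in attributes:
        others = [a for a in attributes if a != rhs]
        minimal_lhs = []
        # Visit candidates by increasing size so that smaller LHSs are found first.
        for size in range(1, len(others) + 1):
            for lhs in combinations(others, size):
                # Skip candidates that contain an already valid LHS: not minimal.
                if any(set(m) <= set(lhs) for m in minimal_lhs):
                    continue
                if holds(rows, lhs, rhs):
                    minimal_lhs.append(lhs)
        fds.extend((lhs, rhs) for lhs in minimal_lhs)
    return fds


# Hypothetical toy dataset, not taken from the paper.
rows = [
    {"id": 1, "city": "Rome", "zip": "00100", "country": "IT"},
    {"id": 2, "city": "Rome", "zip": "00100", "country": "IT"},
    {"id": 3, "city": "Milan", "zip": "20100", "country": "IT"},
    {"id": 4, "city": "Paris", "zip": "75001", "country": "FR"},
]
fds = discover_fds(rows, ["id", "city", "zip", "country"])
for lhs, rhs in fds:
    print(", ".join(lhs), "->", rhs)
```

On this toy dataset, `zip -> city` is reported, whereas `country -> city` is not, because two rows share the country `IT` but differ in city. Candidates such as `id, zip -> city` are pruned because `id -> city` already holds.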