Artificial intelligence for predicting volatile fatty acids rejection in nanofiltration membranes

Cairone, Stefano; Zarra, Tiziano; Belgiorno, Vincenzo; Naddeo, Vincenzo
2026

Abstract

Artificial intelligence (AI) modeling shows strong potential for predicting process performance in environmental technologies, but its application is often constrained by the scarcity of high-quality, comprehensive datasets, which limits model accuracy and generalizability. Synthetic data generation has emerged as a promising strategy to address data scarcity, yet its effectiveness and reliability remain underexplored. In this study, we systematically examined the impact of partially synthetic datasets on the predictive performance of a widely used model, the categorical boosting regressor (CatBoost), for volatile fatty acid (VFA) rejection in nanofiltration processes. Ten synthetic data generation algorithms were tested while varying the ratio of synthetic to actual data in the training datasets. Moderate augmentation (synthetic/actual data ratio from 0.36 to 1.33) improved predictive accuracy (R² up to 0.937 versus 0.885 for the baseline trained on actual data only), whereas excessive reliance on synthetic data degraded performance. A maximum mean discrepancy (MMD)-controlled approach was introduced to constrain the synthetic data, further improving accuracy and model explainability. SHapley Additive exPlanations (SHAP) analysis showed that the most influential features identified by the models (feed pH, membrane zeta potential, and applied pressure) were consistent with experimental evidence reported in the literature, indicating domain-consistent and reliable model behavior. Overall, this study provides evidence of the feasibility and benefits of using synthetic data to enhance AI modeling of pressure-driven membrane processes for sustainable resource recovery. The proposed framework offers a pathway to mitigate data scarcity while improving model robustness and interpretability.
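
The abstract does not specify how the MMD constraint or the synthetic/actual ratio was implemented. As a rough illustration only, the sketch below computes a biased estimate of the squared MMD between actual and synthetic feature matrices and accepts a synthetic batch only if it stays under a threshold. The RBF kernel, the `gamma` bandwidth, the `mmd_threshold` value, and the function names `mmd2` and `augment` are all assumptions made for this example, not the authors' method.

```python
import numpy as np


def rbf_kernel(a, b, gamma):
    """Pairwise RBF kernel values exp(-gamma * ||a_i - b_j||^2)."""
    sq = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-gamma * np.maximum(sq, 0.0))


def mmd2(x, y, gamma=1.0):
    """Biased (V-statistic) estimate of the squared MMD between samples x and y."""
    return (
        rbf_kernel(x, x, gamma).mean()
        + rbf_kernel(y, y, gamma).mean()
        - 2.0 * rbf_kernel(x, y, gamma).mean()
    )


def augment(actual, synthetic, ratio, mmd_threshold, gamma=1.0):
    """Append up to ratio * len(actual) synthetic rows if they pass the MMD check.

    Hypothetical helper: returns the actual data unchanged when the candidate
    synthetic batch is empty or too far from the actual distribution.
    """
    n_keep = int(round(ratio * len(actual)))
    candidate = synthetic[:n_keep]
    if n_keep == 0 or mmd2(actual, candidate, gamma) > mmd_threshold:
        return actual
    return np.vstack([actual, candidate])
```

In this sketch, `ratio` plays the role of the synthetic/actual data ratio mentioned above (for example, a value between 0.36 and 1.33), and the augmented array would then be passed to the regressor for training. In practice the features would likely need scaling before the kernel is applied, and the bandwidth and threshold would have to be tuned.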

Use this identifier to cite or link to this document: https://hdl.handle.net/11386/4946238