Context: Early security requirements identification is crucial in software development, facilitating the integration of security measures into IT networks and reducing time and costs throughout software life-cycle. Objectives: This paper addresses the limitations of existing methods that leverage Natural Language Processing (NLP) and machine learning techniques for detecting security requirements. These methods often fall short in capturing syntactic and semantic relationships, face challenges in adapting across domains, and rely heavily on extensive domain-specific data. In this paper we focus on identifying the most effective approaches for this task, highlighting both domain-specific and domain-independent strategies. Method: Our methodology encompasses two primary streams of investigation. First, we explore shallow machine learning techniques, leveraging word embeddings. We test ensemble methods and grid search within and across domains, evaluating on three industrial datasets. Next, we develop several domain-independent models based on BERT, tailored to better detect security requirements by incorporating data on software weaknesses and vulnerabilities. Results: Our findings reveal that ensemble and grid search methods prove effective in domain-specific and domain-independent experiments, respectively. However, our custom BERT models showcase domain independence and adaptability. Notably, the CweCveCodeBERT model excels in Precision and F1-score, outperforming existing approaches significantly. It improves F1-score by ∼3% and Precision by ∼14% over the best approach currently in the literature. Conclusion: BERT-based models, especially with specialized pre-training, show promise for automating security requirement detection. This establishes a foundation for software engineering researchers and practitioners to utilize advanced NLP to improve security in early development phases, fostering the adoption of these state-of-the-art methods in real-world scenarios.
Beyond domain dependency in security requirements identification
Casillo F.;Deufemia V.;Gravino C.
2025
Abstract
Context: Early security requirements identification is crucial in software development, facilitating the integration of security measures into IT networks and reducing time and costs throughout software life-cycle. Objectives: This paper addresses the limitations of existing methods that leverage Natural Language Processing (NLP) and machine learning techniques for detecting security requirements. These methods often fall short in capturing syntactic and semantic relationships, face challenges in adapting across domains, and rely heavily on extensive domain-specific data. In this paper we focus on identifying the most effective approaches for this task, highlighting both domain-specific and domain-independent strategies. Method: Our methodology encompasses two primary streams of investigation. First, we explore shallow machine learning techniques, leveraging word embeddings. We test ensemble methods and grid search within and across domains, evaluating on three industrial datasets. Next, we develop several domain-independent models based on BERT, tailored to better detect security requirements by incorporating data on software weaknesses and vulnerabilities. Results: Our findings reveal that ensemble and grid search methods prove effective in domain-specific and domain-independent experiments, respectively. However, our custom BERT models showcase domain independence and adaptability. Notably, the CweCveCodeBERT model excels in Precision and F1-score, outperforming existing approaches significantly. It improves F1-score by ∼3% and Precision by ∼14% over the best approach currently in the literature. Conclusion: BERT-based models, especially with specialized pre-training, show promise for automating security requirement detection. This establishes a foundation for software engineering researchers and practitioners to utilize advanced NLP to improve security in early development phases, fostering the adoption of these state-of-the-art methods in real-world scenarios.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.