Machine Learning Methodologies for Preventing Malware Obfuscation
Carletti V.; Saggese A.; Foggia P.; Greco A.; Vento M.
2023
Abstract
Malware is a serious threat in a world where IoT devices are increasingly pervasive: every day, new and more sophisticated malware can exploit an attack surface that grows with the number of new devices coming to market. Malware detection systems, which must adapt their knowledge base and heuristics day by day, are in constant competition with malware writers, who must find new techniques to evade them. In this scenario, machine learning methods are the best candidates to cope with the continuous evolution of malware, which explains the increasing interest in using such approaches to build antimalware systems able to learn and adapt. However, it is still an open question how robust machine learning-based systems are against obfuscation techniques: methods whose effectiveness depends on what they learn from a training set are potentially vulnerable to code modifications that alter the probability distribution of the features observed during training. In this paper we compare seven different methods trained to classify malware, paying specific attention to recent image-based approaches. The comparison was conducted on SOREL-20M, one of the largest malware datasets publicly released to date, composed of more than 20 million samples divided into 11 malware families. In the analysis, we considered four basic obfuscation techniques based on appending a sequence of bytes to the end of the executable; they are very easy for a malware writer to implement. All the tested methods achieved very high accuracy on unmodified test samples, but only a few of them proved able to withstand the considered obfuscation techniques.
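To make the distribution-shift argument concrete, the following minimal sketch shows how appending bytes to the end of a buffer can change a simple learned-from feature. Everything in it is an illustrative assumption rather than the paper's setup: the abstract does not specify the four obfuscation techniques, so the `"zeros"` and `"random"` padding kinds, the byte-histogram feature, the 64-byte image width, and all function names are hypothetical, and the "executable" is just a synthetic byte buffer.

```python
import os
from collections import Counter
from typing import List


def byte_histogram(data: bytes) -> List[float]:
    """Normalized frequency of each of the 256 possible byte values."""
    counts = Counter(data)
    total = len(data) or 1  # avoid division by zero on empty input
    return [counts.get(value, 0) / total for value in range(256)]


def append_padding(data: bytes, n: int, kind: str = "zeros") -> bytes:
    """Return a copy of data with n extra bytes appended at the end.

    The padding kinds are assumptions chosen for illustration only.
    """
    if kind == "zeros":
        pad = bytes(n)          # n bytes, all 0x00
    elif kind == "random":
        pad = os.urandom(n)     # n random bytes
    else:
        raise ValueError(f"unknown padding kind: {kind}")
    return data + pad


def l1_distance(p: List[float], q: List[float]) -> float:
    """Sum of absolute differences between two histograms."""
    return sum(abs(a - b) for a, b in zip(p, q))


def image_rows(data: bytes, width: int = 64) -> int:
    """Rows needed when the bytes are laid out as a grayscale image
    of fixed width (ceiling division)."""
    return -(-len(data) // width)


# Synthetic stand-in for a file: every byte value appears 16 times.
original = bytes(range(256)) * 16
padded = append_padding(original, len(original), kind="zeros")

# The original content is untouched; only the tail grows.
unchanged_prefix = padded[:len(original)] == original

# Feature shift seen by a histogram-based model.
shift = l1_distance(byte_histogram(original), byte_histogram(padded))

# Shape change seen by an image-based model with fixed width.
rows_before, rows_after = image_rows(original), image_rows(padded)

print(unchanged_prefix, shift, rows_before, rows_after)
```

The original bytes are left intact, yet the global byte statistics and the image height both change. That is the kind of train/test mismatch the abstract warns about for models that rely on whole-file features.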