Unsupervised Domain Adaptation for Static Malware Detection based on Gradient Boosting Trees

Panpan Qi, Wei Wang, Lei Zhu, See Kiong Ng
2021 Proceedings of the 30th ACM International Conference on Information & Knowledge Management  
Static malware detection is important for protection against malware because it allows malicious files to be detected prior to execution. It is also especially suitable for machine learning-based approaches. Recently, gradient boosting decision tree (GBDT) models, e.g., LightGBM (a popular implementation of GBDT), have shown outstanding performance for malware detection. However, as malware programs are known to evolve rapidly, malware classification models trained on the (source) training data
often fail to generalize to the target domain, i.e., the deployed environment. To handle the underlying data distribution drifts, unsupervised domain adaptation techniques have been proposed for machine learning models including deep learning models. However, unsupervised domain adaptation for GBDT has remained challenging. In this paper, we adapt the adversarial learning framework for unsupervised domain adaptation to enable GBDT to learn domain-invariant features and alleviate performance degradation in the target domain. In addition, to fully exploit the unlabelled target data, we merge them into the training dataset after pseudo-labelling. We propose a new weighting scheme integrated into GBDT for sampling instances in each boosting round to reduce the negative impact of wrongly labelled target instances. Experiments on two large malware datasets demonstrate the superiority of our proposed method.
CCS CONCEPTS • Computing methodologies → Transfer learning; • Security and privacy → Malware and its mitigation.
doi:10.1145/3459637.3482400 fatcat:tahy77gksbbtro4j5r5lhf57mu
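
The pseudo-labelling and per-round instance sampling mentioned in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the paper's weighting scheme, so the confidence-based weights below are an assumption made purely for illustration, and the helper names (`pseudo_label`, `instance_weights`, `sample_round`) are hypothetical. The adversarial feature-learning step is not shown.

```python
import random


def pseudo_label(probs, threshold=0.5):
    """Turn predicted malware probabilities into (label, confidence) pairs."""
    labelled = []
    for p in probs:
        label = 1 if p >= threshold else 0
        # Confidence is the probability assigned to the chosen label.
        confidence = p if label == 1 else 1.0 - p
        labelled.append((label, confidence))
    return labelled


def instance_weights(n_source, target_pseudo, min_conf=0.5):
    """Assumed scheme: source instances get weight 1.0; each pseudo-labelled
    target instance gets a weight growing linearly from 0 (at min_conf)
    to 1 (at confidence 1.0). Requires min_conf < 1."""
    weights = [1.0] * n_source
    for _, conf in target_pseudo:
        weights.append(max(0.0, (conf - min_conf) / (1.0 - min_conf)))
    return weights


def sample_round(weights, k, rng):
    """Draw k instance indices for one boosting round, with replacement,
    with probability proportional to each instance's weight."""
    return rng.choices(range(len(weights)), weights=weights, k=k)


# Two labelled source instances followed by three unlabelled target instances
# whose probabilities come from a model trained on the source data.
target_probs = [0.98, 0.55, 0.03]
pseudo = pseudo_label(target_probs)
weights = instance_weights(n_source=2, target_pseudo=pseudo)
round_indices = sample_round(weights, k=4, rng=random.Random(0))
```

Under this assumed scheme, a target instance predicted with probability 0.55 is drawn far less often than one predicted with 0.98, so a likely-wrong pseudo-label has little influence on any single boosting round.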