Using LSI to Detect Unknown Malicious VBA Macros
Journal of Information Processing
Targeted email attacks are one of the main threats to organizations of all sizes and in all fields. In these attacks, malicious VBA (Visual Basic for Applications) macros are often embedded in attachment files to compromise the target computers. These macros are obfuscated in several ways to deceive anti-virus programs, so pattern-based detection is limited in its ability to detect unknown malicious VBA macros. To detect unknown malicious VBA
... some methods based on machine learning techniques are applicable. One such method extracts words from the source code and constructs a language model that represents VBA macros for machine learning. Because this method builds its language model from all extracted words, including trivial ones, there still seems to be room for improvement. To construct an efficient language model, this paper focuses on LSI (Latent Semantic Indexing), a fundamental technique in topic modeling that calculates the similarity of documents. Our method extracts words from the source code and converts them into feature vectors using several natural language processing techniques. It then trains a classifier on benign and malicious VBA macros and uses it to detect unknown malicious VBA macros. Several thousand samples for evaluation were obtained from VirusTotal. The experimental results show that our method can detect unknown malicious VBA macros more efficiently, and they reveal the advantages and disadvantages of each language model.
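The LSI step described above can be illustrated with a minimal sketch. It is not the paper's implementation: the four VBA snippets, the regular-expression tokenizer, the log(N/df) weighting, and the helper names (tokenize, tfidf_matrix, lsi_embed, cosine) are all assumptions made for illustration, and NumPy is assumed to be available. The sketch builds a TF-IDF term-document matrix, truncates its singular value decomposition to k dimensions, and compares macros by cosine similarity in the resulting latent space. The classifier training step is omitted.

```python
import re
import numpy as np

# Hypothetical toy corpus, not samples from the paper's dataset.
MACROS = [
    'Sub AutoOpen()\n Shell "powershell -enc AAAA", vbHide\nEnd Sub',
    'Sub Document_Open()\n Shell "cmd /c powershell", vbHide\nEnd Sub',
    'Sub FormatTable()\n Range("A1").Font.Bold = True\nEnd Sub',
    'Sub SumColumn()\n Range("B1").Value = 10\nEnd Sub',
]

def tokenize(source):
    """Extract lowercase identifier-like words from VBA source code."""
    return re.findall(r"[A-Za-z_]\w*", source.lower())

def tfidf_matrix(docs):
    """Return a (terms x documents) TF-IDF matrix and the sorted vocabulary."""
    tokens = [tokenize(d) for d in docs]
    vocab = sorted({t for doc in tokens for t in doc})
    index = {t: i for i, t in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(tokens):
        for t in doc:
            counts[index[t], j] += 1
    # Document frequency: number of documents containing each term.
    df = np.count_nonzero(counts, axis=1)
    # Words that occur in every document (e.g. "sub", "end") get weight 0.
    idf = np.log(len(docs) / df)
    return counts * idf[:, None], vocab

def lsi_embed(matrix, k):
    """Project each document onto the top-k latent dimensions via SVD."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (np.diag(s[:k]) @ vt[:k]).T  # shape: (documents, k)

def cosine(a, b):
    """Cosine similarity between two latent-space vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

matrix, vocab = tfidf_matrix(MACROS)
embeddings = lsi_embed(matrix, k=2)
print("shell-like pair:  ", cosine(embeddings[0], embeddings[1]))
print("shell vs. benign: ", cosine(embeddings[0], embeddings[2]))
```

In this toy corpus the two macros that launch a shell end up much closer to each other in the latent space than either is to the formatting macros, even though they share only a few words. The resulting vectors could then be passed to any classifier as features. The choice of k and of the weighting scheme would need tuning on real data.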