A novel evasive PDF malware detection model based on stacking learning


University of New Brunswick


Over the last few years, Portable Document Format (PDF) has become the most popular format for presenting content, owing to its flexibility and ease of use. However, advanced PDF features such as JavaScript injection and file embedding make PDF files an attractive target for attackers. Because of the complex PDF structure and the sophistication of attacks, traditional detection approaches such as antivirus software are ineffective, as they rely on signature-based techniques. Various research works take a different direction and use AI technologies such as Machine Learning (ML) and Deep Learning (DL) to detect malicious PDF files. Despite the results reported by the research community, evasive malicious PDF files remain a security threat. This research addresses this gap by proposing a novel framework that stacks ML models to detect evasive malicious PDF files. We evaluated our solution on two datasets: Contagio and a newly generated evasive PDF dataset. In the first evaluation, on Contagio, we achieved an accuracy of 99.89% and an F1-score of 99.86%, which is better than the performance of existing models. We then re-evaluated our framework on the new evasive PDF dataset, an improved version of Contagio, to verify our solution further. In this second evaluation we achieved an accuracy of 98.69% and an F1-score of 98.77%, demonstrating the effectiveness of our approach. The experimental results, together with the new dataset, show that our model can be applied in practice.
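
To make the idea of stacking concrete, the following is a minimal, illustrative sketch in plain Python. It is not the framework evaluated in this work: the abstract does not name the base models, the meta-learner, or the features, so everything here is an assumption. The two features (a count of JavaScript objects and a count of embedded files), the toy data, the decision-stump base learners, and the lookup-table meta-learner are all hypothetical. The sketch only shows the general pattern: base learners are trained on folds, their out-of-fold predictions become the inputs of a meta-learner, and the meta-learner makes the final decision.

```python
from collections import Counter, defaultdict


def fit_stump(X, y, feature):
    """Base learner: pick the threshold on one feature with the best training accuracy."""
    best_hits, best_t = -1, None
    for t in sorted({row[feature] for row in X}):
        hits = sum(int(row[feature] >= t) == label for row, label in zip(X, y))
        if hits > best_hits:
            best_hits, best_t = hits, t
    return feature, best_t


def predict_stump(stump, row):
    feature, t = stump
    return int(row[feature] >= t)


def fit_meta(Z, y):
    """Meta-learner: majority label observed for each tuple of base predictions."""
    votes = defaultdict(Counter)
    for z, label in zip(Z, y):
        votes[tuple(z)][label] += 1
    return {z: counts.most_common(1)[0][0] for z, counts in votes.items()}


def fit_stack(X, y, n_folds=3):
    n_features = len(X[0])
    Z = [None] * len(X)  # out-of-fold base predictions, one row per sample
    for k in range(n_folds):
        train = [i for i in range(len(X)) if i % n_folds != k]
        stumps = [fit_stump([X[i] for i in train], [y[i] for i in train], f)
                  for f in range(n_features)]
        # Predict only the held-out samples, so the meta-learner never sees
        # predictions made by a base learner that was trained on that sample.
        for i in range(k, len(X), n_folds):
            Z[i] = [predict_stump(s, X[i]) for s in stumps]
    meta = fit_meta(Z, y)
    # Refit the base learners on all data for use at prediction time.
    stumps = [fit_stump(X, y, f) for f in range(n_features)]
    return stumps, meta


def predict_stack(model, row):
    stumps, meta = model
    z = tuple(predict_stump(s, row) for s in stumps)
    # Unseen combination of base predictions: fall back to a strict majority vote.
    return meta.get(z, int(2 * sum(z) > len(z)))


# Hypothetical toy data: (javascript_objects, embedded_files); 1 = malicious, 0 = benign.
X = [(0, 0), (3, 2), (1, 0), (3, 0), (0, 1), (4, 3),
     (1, 1), (5, 1), (0, 0), (4, 3), (1, 0), (3, 2)]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

model = fit_stack(X, y)
print(predict_stack(model, (4, 0)), predict_stack(model, (0, 2)))
```

In this toy setting the meta-learner learns to rely on the more reliable base learner when the two disagree, which is the main motivation for stacking over a plain majority vote. A practical system would use stronger base models and a properly held-out test set.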