Evasion-Resistant PDF Malware Detection Using Statistical Byte-Level Features

Date of Award

5-9-2026

Degree Name

M.S. in Computer Science

Department

Department of Computer Science

Advisor/Chair

Phu Phung

Abstract

The widespread adoption of the Portable Document Format across enterprise and academic settings has made it a primary vehicle for malware delivery. Existing machine learning detectors, including the widely adopted Mimicus system, rely on keyword-based features that are fundamentally vulnerable to semantics-preserving structural mutations. This thesis presents an alternative detection approach grounded in statistical byte-level features computed from raw and decoded PDF stream content, with particular emphasis on seventeen novel features designed through systematic adversarial reasoning. A five-stage processing pipeline extracts ninety-seven features per document: eighty-two base statistical measurements and fifteen engineered features derived from structural balance ratios, inter-layer entropy differentials, and decoded content statistics. A Random Forest classifier trained on ten thousand balanced Contagio samples achieves 98.55 percent test accuracy, an F1 score of 0.9853, and a ROC-AUC of 0.9981 under five-fold stratified cross-validation. The seventeen novel features are evaluated in isolation across four classifiers (Random Forest, Gradient Boosting, Support Vector Machine, and Logistic Regression) to establish algorithm-independent discriminative validity. On the novel feature subset alone, Random Forest achieves 97.60 percent test accuracy, Gradient Boosting 97.35 percent, SVM 96.35 percent, and Logistic Regression 90.80 percent. Seven of the top ten features by consensus permutation importance belong to the Tier 1 group, which has no equivalent in prior PDF detection literature. The top-ranked features are feat_decoded_printable_min, feat_entropy_delta_raw_dec, and decoded_stream_entropy.p99, all computed on decompressed stream content and therefore invariant to stream repacking transforms, because zlib decompression is lossless.
A theoretical evasion resistance analysis demonstrates that decoded stream features are mathematically invariant to Transform T9 (stream repacking), and a feature-level simulation across all nine Yudin et al. structural mutations produces zero evasion when the complete ninety-seven-feature pipeline is correctly evaluated. The thesis also reproduces the Yudin et al. MCTS evasion framework, completing eleven skeleton stub methods, reconstructing the mPDF writer class from scratch, and implementing all nine transforms independently using pikepdf.
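
The invariance argument can be illustrated with a minimal sketch, which is not the thesis pipeline. It assumes a single zlib-compressed (FlateDecode-style) stream, stands in for a repacking transform by recompressing the same payload at a different compression level, and uses hypothetical helper names (shannon_entropy, printable_ratio, decoded_features, repack) that do not come from the thesis. Because decompression recovers identical bytes, any statistic computed only on the decoded content comes out the same before and after repacking.

```python
import math
import zlib
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string in bits per byte (0.0 for empty input)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())


def printable_ratio(data: bytes) -> float:
    """Fraction of bytes in the printable ASCII range 0x20..0x7E."""
    if not data:
        return 0.0
    return sum(0x20 <= b <= 0x7E for b in data) / len(data)


def decoded_features(raw_stream: bytes) -> dict:
    """Hypothetical features computed only on the decompressed stream content."""
    decoded = zlib.decompress(raw_stream)
    return {
        "decoded_entropy": shannon_entropy(decoded),
        "decoded_printable": printable_ratio(decoded),
    }


def repack(raw_stream: bytes, level: int) -> bytes:
    """Stand-in for stream repacking: decompress, then recompress at another level."""
    return zlib.compress(zlib.decompress(raw_stream), level)


# Illustrative payload only; real PDF streams would be read from a parsed document.
payload = b"var unescape_me = 'AAAA'; " * 200
original = zlib.compress(payload, 9)
repacked = repack(original, 0)

before = decoded_features(original)
after = decoded_features(repacked)

print("raw bytes changed:", original != repacked)
print("decoded features unchanged:", before == after)
```

Features computed on the raw (still compressed) bytes carry no such guarantee, since the compressed representation itself changes under repacking.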

Keywords

Computer Science

Comments

OCLC No. 1591628020

Rights Statement

Copyright 2026, author.
