This project aims to detect whether a PDF file is clean or malicious.
To build your dataset, you can generate malicious PDF files from clean ones using the project https://github.com/jonaslejon/malicious-pdf. That project is by jonaslejon (Jonas Lejon), maggick (maggick), and tonyarris (Tony Harris). For issues with generating malicious PDF files, please contact them or raise an issue on their repository.
Create two directories, maliciouspdf and cleanpdf, and place your malicious and clean PDF files in them respectively.
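As a rough illustration of that layout, the sketch below collects the PDF files from the two directories and pairs each with a label. The function name list_labeled_pdfs and the label encoding (1 for malicious, 0 for clean) are assumptions made for this example, not something taken from the project's scripts.

```python
from pathlib import Path


def list_labeled_pdfs(root="."):
    """Return (path, label) pairs for the PDFs under root.

    Assumed encoding for this sketch: 1 = maliciouspdf, 0 = cleanpdf.
    """
    labeled = []
    for dirname, label in (("maliciouspdf", 1), ("cleanpdf", 0)):
        folder = Path(root) / dirname
        if not folder.is_dir():
            # Skip a directory that has not been created yet.
            continue
        for pdf in sorted(folder.glob("*.pdf")):
            labeled.append((pdf, label))
    return labeled


if __name__ == "__main__":
    for path, label in list_labeled_pdfs():
        print(label, path)
```

Files that do not end in .pdf are ignored, so stray notes or archives in either directory do not end up in the dataset.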
- command_exec.py iterates through every file in the two folders, namely maliciouspdf and cleanpdf.
- feature_extraction.py extracts features from each PDF file based on its file structure. It uses the pdfid.py script, which is an open-source file and part of peepdf.
- classifier.py implements a Random Forest classifier and trains it on the data in pdfdataset_n.csv. 30% of the data is held out for testing. Accuracy is observed to be around 99%.
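The training step can be sketched roughly as follows with scikit-learn. This is not the code in classifier.py: the column layout of pdfdataset_n.csv is not described here, so the example uses synthetic data from make_classification as a stand-in, and the hyperparameters (n_estimators=100, random_state=0) are assumptions. Only the 30% test split comes from the description above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the normalized feature table (assumption):
# 500 samples, 20 numeric features, binary label (malicious vs clean).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=8, random_state=0
)

# Hold out 30% of the rows for testing, keeping the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Fit a Random Forest on the training portion only.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Evaluate on the held-out portion.
accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"Test accuracy: {accuracy:.3f}")
```

To try this on the real data, you would replace the synthetic X and y with the feature columns and label column loaded from pdfdataset_n.csv. The accuracy printed for the synthetic data says nothing about the accuracy on the PDF dataset.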
We have already extracted the necessary features from these files and built a dataset, pdfdataset.csv. pdfdataset_n.csv is the min-max normalized version of it.
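For reference, min-max normalization rescales each feature column to the range [0, 1] using (value - min) / (max - min). The sketch below shows the idea in plain Python on a small made-up table. The function name min_max_normalize, the sample values, and the choice to map constant columns to 0.0 are assumptions for this example and may differ from how pdfdataset_n.csv was actually produced.

```python
def min_max_normalize(rows):
    """Scale each column of a list of numeric rows to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        normalized.append([
            # A constant column has no spread, so map it to 0.0 (assumption).
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ])
    return normalized


# Made-up feature counts: three files, three features each.
counts = [[0, 2, 5], [4, 2, 10], [2, 2, 20]]
for row in min_max_normalize(counts):
    print(row)
```

Normalizing this way keeps features with large raw counts from dominating features with small ones when the values are compared or plotted.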
Please raise a PR if you have improvements for the project.