PE-ML

If you had 200,000 executables and you were tasked with finding which were malicious and which were benign, how would you go about it?

For this challenge, I set a few restrictions on how I could build the model:

Can't use any data from sample.csv (except the malicious/benign label)
Can't use any external APIs (e.g. VirusTotal)
Must be static; no dynamic analysis
Must be completely automated; I'm not manually digging through 250k files.
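
To make the "static" and "automated" restrictions concrete, here is a minimal sketch of reading features straight from a file's bytes without ever running it. This is not the collection code from this project. The function names (`byte_entropy`, `static_features`) and the features chosen (file size, byte entropy, section count) are assumptions picked purely for illustration, and it uses only the Python standard library.

```python
import math
import struct
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def static_features(data: bytes) -> dict:
    """Extract a few illustrative (hypothetical) features without executing the file."""
    features = {
        "size": len(data),
        "entropy": byte_entropy(data),
        "is_pe": False,
        "num_sections": 0,
    }
    # A PE file starts with the DOS magic "MZ"; the 4-byte little-endian
    # value at offset 0x3C (e_lfanew) points to the "PE\0\0" signature.
    if len(data) >= 0x40 and data[:2] == b"MZ":
        (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
        has_room = pe_offset + 24 <= len(data)
        if has_room and data[pe_offset:pe_offset + 4] == b"PE\x00\x00":
            features["is_pe"] = True
            # The COFF header follows the signature: Machine (2 bytes),
            # then NumberOfSections (2 bytes).
            (features["num_sections"],) = struct.unpack_from(
                "<H", data, pe_offset + 6
            )
    return features
```

Because the function only needs raw bytes, it could be applied to every file in a directory in a loop (for example `static_features(path.read_bytes())` over `pathlib.Path(...).iterdir()`), which keeps the whole process hands-off.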

The Features

I collected the following data for this ML project:
