This is a mini-project for the EIE4121 Machine Learning for Cyber-security course at The Hong Kong Polytechnic University. The goal is to develop a machine learning model to classify password strength on a scale from 0 to 4, where 0 represents very weak passwords and 4 represents very strong passwords.
- 21106181D Chen Chen
- JamesHsu-porcupine
├── data/ # Password datasets
│ └── password_Set1.csv # Main dataset file
├── docs/ # Project documentation and reports
│ ├── MiniProject.doc # Project document (Word format)
│ └── Miniproject.pdf # Project document (PDF format)
├── model/MachineLearning/ # Traditional ML models
│ ├── KNN # K-Nearest Neighbors model
│ └── RF # Random Forest model
├── notebooks/DeepLearning/ # Deep learning implementation notebooks
│ ├── GP19project_EIE4121_DEEPLEA... # Deep learning implementation
│ └── GP19project_EIE4121_EDA_Che... # EDA notebook
├── .gitattributes # Git attributes file
└── README.md # This file
This project focuses on developing and comparing different machine learning approaches for password strength classification. We implement both traditional machine learning algorithms (KNN, Random Forest) and deep learning models to classify passwords into five strength categories.
The dataset (password_Set1.csv) contains password samples with the following features:
- password: The password string
- strength: Password strength level (0-4)
  - 0: Very Weak
  - 1: Weak
  - 2: Average
  - 3: Strong
  - 4: Very Strong
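As a rough illustration of the dataset layout described above, the sketch below reads `password`/`strength` rows and counts samples per class. It assumes the CSV has a header row with exactly those two column names, which is not confirmed here; the inline sample is made up for demonstration and is not taken from password_Set1.csv.

```python
import csv
import io
from collections import Counter

# Made-up sample standing in for data/password_Set1.csv.
# Assumed layout: a header row with `password` and `strength` columns.
SAMPLE_CSV = """password,strength
123456,0
sunshine1,1
Tr0ub4dor,2
c0rrect-Horse7,3
xK9#mQ2$vL8@pZ5!,4
qwerty,0
"""

STRENGTH_LABELS = {
    0: "Very Weak",
    1: "Weak",
    2: "Average",
    3: "Strong",
    4: "Very Strong",
}


def load_rows(handle):
    """Read (password, strength) pairs, skipping rows with a missing or invalid label."""
    rows = []
    for record in csv.DictReader(handle):
        try:
            strength = int(record["strength"])
        except (KeyError, TypeError, ValueError):
            continue
        if strength in STRENGTH_LABELS and record.get("password"):
            rows.append((record["password"], strength))
    return rows


rows = load_rows(io.StringIO(SAMPLE_CSV))
class_counts = Counter(strength for _, strength in rows)

# Print the class distribution, including classes with no samples.
for level in sorted(STRENGTH_LABELS):
    print(f"{level} ({STRENGTH_LABELS[level]}): {class_counts.get(level, 0)}")
```

To run this against the real file, replace `io.StringIO(SAMPLE_CSV)` with an opened file handle, for example `open("data/password_Set1.csv", newline="", encoding="utf-8")`.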
Our approach involves:
- Exploratory Data Analysis (EDA) to understand password characteristics
- Feature Engineering to extract meaningful features from passwords:
  - Length, character diversity, entropy
  - Character type counts and ratios
  - Pattern detection (sequential and repeated characters)
- Model Implementation:
  - Traditional ML: K-Nearest Neighbors, Random Forest
  - Deep Learning: Hybrid CNN-LSTM model with character embeddings
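The feature engineering step listed above could look roughly like the following. This is a minimal sketch, not the code from the notebooks: the function names, the feature names, and the exact definitions (per-character Shannon entropy, runs measured on Unicode code points) are assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(password):
    """Shannon entropy, in bits per character, of the password's own character distribution."""
    if not password:
        return 0.0
    total = len(password)
    counts = Counter(password)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def longest_repeat_run(password):
    """Length of the longest run of one repeated character, e.g. 3 for 'aaab'."""
    best = run = 0
    previous = None
    for ch in password:
        run = run + 1 if ch == previous else 1
        best = max(best, run)
        previous = ch
    return best


def longest_sequential_run(password):
    """Length of the longest run of consecutive code points (ascending or descending)."""
    if not password:
        return 0
    best = run = 1
    direction = 0
    for prev, cur in zip(password, password[1:]):
        step = ord(cur) - ord(prev)
        if step in (1, -1):
            # Extend the run if it keeps the same direction, otherwise start a new pair.
            run = run + 1 if step == direction else 2
            direction = step
        else:
            run, direction = 1, 0
        best = max(best, run)
    return best


def extract_features(password):
    """Return a dict of hand-crafted numerical features for one password."""
    length = len(password)
    lower = sum(ch.islower() for ch in password)
    upper = sum(ch.isupper() for ch in password)
    digits = sum(ch.isdigit() for ch in password)
    special = length - lower - upper - digits  # anything that is not a cased letter or digit
    safe_length = max(length, 1)  # avoid division by zero for empty strings
    return {
        "length": length,
        "unique_chars": len(set(password)),
        "entropy": shannon_entropy(password),
        "lower_count": lower,
        "upper_count": upper,
        "digit_count": digits,
        "special_count": special,
        "lower_ratio": lower / safe_length,
        "upper_ratio": upper / safe_length,
        "digit_ratio": digits / safe_length,
        "special_ratio": special / safe_length,
        "longest_repeat": longest_repeat_run(password),
        "longest_sequence": longest_sequential_run(password),
    }


features = extract_features("aaa123")
print(features)
```

A feature dictionary like this can be turned into one row of the numeric matrix that KNN or Random Forest consumes; for distance-based models such as KNN, scaling the columns first is usually advisable.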
Our deep learning approach combines:
- Character-level embeddings to capture semantic information
- CNN layers to detect local patterns
- LSTM layers to understand sequential patterns
- Numerical features to incorporate password characteristics
- Class weighting to handle imbalanced data
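Two of the ingredients above, character-level inputs and class weighting, need some preparation before any model is trained. The sketch below shows one plausible way to do it in plain Python: passwords become fixed-length integer sequences suitable for an embedding layer, and class weights follow the common "balanced" heuristic `n_samples / (n_classes * class_count)`. The helper names, the reserved ids, `max_length=12`, and the toy data are all assumptions, not values from the notebooks.

```python
from collections import Counter

PAD_ID = 0      # reserved for padding positions
UNKNOWN_ID = 1  # reserved for characters not seen when the vocabulary was built


def build_vocabulary(passwords):
    """Map each distinct character to an integer id, starting at 2."""
    chars = sorted({ch for pw in passwords for ch in pw})
    return {ch: idx for idx, ch in enumerate(chars, start=2)}


def encode(password, vocabulary, max_length):
    """Convert a password to a fixed-length id list: truncate, then right-pad with PAD_ID."""
    ids = [vocabulary.get(ch, UNKNOWN_ID) for ch in password[:max_length]]
    return ids + [PAD_ID] * (max_length - len(ids))


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count), so rare classes count more."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in sorted(counts.items())
    }


# Toy data for illustration only.
passwords = ["123456", "qwerty", "Tr0ub4dor", "xK9#mQ2$"]
labels = [0, 0, 2, 4]

vocabulary = build_vocabulary(passwords)
encoded = [encode(pw, vocabulary, max_length=12) for pw in passwords]
class_weights = balanced_class_weights(labels)

print(encoded[0])
print(class_weights)
```

The encoded sequences would feed the embedding, CNN, and LSTM branch, while the hand-crafted numerical features would enter through a separate input. A dictionary in this shape is what Keras accepts as the `class_weight` argument of `Model.fit`.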
Each model is evaluated using:
- Accuracy
- Precision, Recall, F1-score
- Confusion matrix
- Per-class performance
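To make the metrics above concrete, here is a small dependency-free sketch that builds a 5x5 confusion matrix and derives accuracy and per-class precision, recall, and F1 from it. The labels and predictions are invented for illustration and are not results from this project; in practice scikit-learn's `confusion_matrix` and `classification_report` would typically be used instead.

```python
def confusion_matrix(y_true, y_pred, n_classes=5):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for truth, prediction in zip(y_true, y_pred):
        matrix[truth][prediction] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    return correct / total if total else 0.0


def per_class_report(matrix):
    """Precision, recall, F1, and support for each class."""
    report = {}
    for k in range(len(matrix)):
        true_positives = matrix[k][k]
        predicted = sum(row[k] for row in matrix)  # column sum: predicted as class k
        actual = sum(matrix[k])                    # row sum: truly class k
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / actual if actual else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        report[k] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": actual,
        }
    return report


# Invented labels and predictions, for illustration only.
y_true = [0, 0, 1, 2, 2, 3, 4, 4]
y_pred = [0, 1, 1, 2, 3, 3, 4, 4]

matrix = confusion_matrix(y_true, y_pred)
report = per_class_report(matrix)

print("accuracy:", accuracy(matrix))
for label, scores in report.items():
    print(label, scores)
```

Per-class scores matter here because the classes are imbalanced: a model can reach high overall accuracy while still performing poorly on the rarer strength levels.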
To use the notebooks:
- Clone the repository
- Install required dependencies:
  pip install pandas numpy tensorflow scikit-learn matplotlib seaborn
- Run the notebooks in the following order:
  - EDA notebook
  - Deep learning implementation
Possible future improvements:
- Implement ensemble methods
- Explore additional feature engineering techniques
- Optimize hyperparameters
- Develop a user-friendly password strength checker
References and resources:
- Course materials from EIE4121
- Relevant research papers on password strength classification
- Documentation for scikit-learn, TensorFlow, and other libraries used