This repository contains code for predicting software defects using machine learning models. It includes data preprocessing, exploratory data analysis, feature selection, model training, hyperparameter tuning, model stacking, and result analysis.
- Introduction
- Data Preparation
- Data Exploration
- Feature Selection
- Model Training
- Hyperparameter Tuning
- Model Stacking
- Results
Software defect prediction is essential for improving software quality and reducing maintenance costs. In this project, we aim to predict defects in software using various machine learning algorithms.
The dataset used in this project is available in the `data` directory. It includes `train.csv` and `test.csv` for training and testing, respectively.
As with the other Kaggle Playground Series competitions, the dataset was synthetically generated; more information is available here: https://www.kaggle.com/competitions/playground-series-s3e23/data
We start by loading the dataset and checking for missing values. Fortunately, there are no missing values in the dataset.
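A minimal sketch of this step, assuming the files sit in the `data` directory as described above:

```python
import pandas as pd

# Load the competition files from the repository's data/ directory.
train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")

# Count missing values per column; both files come back with zero NaNs.
print(train.isna().sum())
print(test.isna().sum())
```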
We perform data exploration, including visualizing the distribution of the target variable. This helps us understand the class distribution and the balance between defect and non-defect samples.
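A short sketch of this check, assuming the target column is named `defects` as in the Playground dataset:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Relative frequency of each class in the target column.
print(train["defects"].value_counts(normalize=True))

# Bar plot of the class distribution.
sns.countplot(x="defects", data=train)
plt.title("Class distribution of the target")
plt.show()
```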
To reduce the dimensionality of the dataset, we compute the correlation matrix and identify highly correlated features. We also use Principal Component Analysis (PCA) to select the most important components for modeling and to better understand the variables.
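A sketch of both checks; the 0.9 correlation threshold is illustrative, and the `id` and `defects` column names are assumptions about the Kaggle files:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = train.drop(columns=["id", "defects"], errors="ignore")

# Flag feature pairs whose absolute correlation exceeds 0.9.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_corr = [col for col in upper.columns if (upper[col] > 0.9).any()]
print("Highly correlated features:", high_corr)

# PCA on standardized features to inspect cumulative explained variance.
pca = PCA()
pca.fit(StandardScaler().fit_transform(X))
print(pca.explained_variance_ratio_.cumsum())
```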
We train several machine learning models, including Random Forest, LightGBM, XGBoost, and CatBoost. These models are trained on the preprocessed dataset and evaluated for their predictive performance.
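A baseline training loop along these lines (default hyperparameters shown here for brevity; the actual settings differ):

```python
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

y = train["defects"].astype(int)

models = {
    "random_forest": RandomForestClassifier(n_estimators=500, random_state=42),
    "lightgbm": LGBMClassifier(random_state=42),
    "xgboost": XGBClassifier(eval_metric="auc", random_state=42),
    "catboost": CatBoostClassifier(verbose=0, random_state=42),
}

# 5-fold cross-validated ROC AUC for each baseline model.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: {scores.mean():.4f}")
```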
To improve model performance, we use hyperparameter tuning with the Optuna library to find the best hyperparameters for each model.
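A minimal Optuna study for one of the models (LightGBM shown; the search space below is illustrative, not the exact ranges used in the project):

```python
import optuna
from sklearn.model_selection import cross_val_score
from lightgbm import LGBMClassifier

def objective(trial):
    # Sample candidate hyperparameters for this trial.
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 200, 1000),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 16, 256),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
    }
    model = LGBMClassifier(**params, random_state=42)
    # Maximize cross-validated ROC AUC, the competition metric.
    return cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```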
We apply a stacking technique to combine the predictions from multiple models with different weights. This ensemble approach helps to further improve predictive accuracy.
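A sketch of the weighted blend, reusing the `models` dictionary from above (the weights here are placeholders; in the project they are chosen from validation ROC AUC, and the tuned models are used rather than the baselines):

```python
X_test = test.drop(columns=["id"], errors="ignore")

# Fit each base model on the full training data and predict defect probabilities on the test set.
test_preds = {}
for name, model in models.items():
    model.fit(X, y)
    test_preds[name] = model.predict_proba(X_test)[:, 1]

# Illustrative weights for the blend; they must sum to 1.
weights = {"random_forest": 0.10, "lightgbm": 0.40, "xgboost": 0.25, "catboost": 0.25}
blended = sum(weights[name] * test_preds[name] for name in test_preds)
```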
After extensive model training and hyperparameter tuning, we evaluate the best model using the ROC AUC score, the competition's evaluation metric: submissions are scored on the area under the ROC curve between the predicted probability and the observed target.
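A sketch of how this metric can be reported on a held-out split, reusing `study.best_params` from the tuning example above:

```python
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier

# Hold out a stratified validation split to report the leaderboard metric.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

best_model = LGBMClassifier(**study.best_params, random_state=42)
best_model.fit(X_tr, y_tr)

val_auc = roc_auc_score(y_val, best_model.predict_proba(X_val)[:, 1])
print(f"Validation ROC AUC: {val_auc:.4f}")
```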