
Semiconductor Manufacturing Fault Detection

Project Objective

This project analyzes sensor data from a semiconductor manufacturing process to detect faulty products using unsupervised anomaly detection. The goal is to navigate a complex, real-world dataset, handle challenges like missing values and class imbalance, and evaluate machine learning models. A final interactive dashboard was created to present the findings.

Interactive Dashboard

An interactive dashboard summarizing the project's key findings and model performance was created using Tableau.

View the Interactive Dashboard on Tableau Public

Dashboard Screenshot

Data Source & Technologies

Project Workflow

  1. Data Cleaning: Addressed significant missing data by removing features with over 50% null values and imputing the rest with the median.
  2. Exploratory Data Analysis (EDA): Discovered a severe class imbalance, with only ~7% of products labeled as 'Pass' (+1) — meaning accuracy alone would be a misleading metric.
  3. Data Preparation: The data was split into stratified training (70%) and test (30%) sets. Features were then scaled using StandardScaler.
  4. Modeling: Trained two unsupervised models, Isolation Forest and One-Class SVM.
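The workflow above can be sketched end-to-end in scikit-learn. The snippet below runs on synthetic stand-in data (the real project reads the SECOM sensor file, and column names here are hypothetical); thresholds and random seeds are illustrative assumptions, not the project's exact settings.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

# Synthetic stand-in for the SECOM sensor matrix (hypothetical columns)
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 20)),
                 columns=[f"sensor_{i}" for i in range(20)])
X = X.mask(rng.random(X.shape) < 0.1)   # scatter ~10% missing values
X["sensor_0"] = np.nan                  # one feature that is mostly null
y = pd.Series(rng.choice([-1, 1], size=300, p=[0.93, 0.07]))

# 1. Data cleaning: drop features with >50% nulls, median-impute the rest
X = X.loc[:, X.isna().mean() <= 0.5]
X = X.fillna(X.median())

# 3. Data preparation: stratified 70/30 split, then StandardScaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

# 4. Modeling: two unsupervised detectors; contamination/nu set near
#    the observed minority rate (an assumption for this sketch)
iso = IsolationForest(contamination=0.07, random_state=42).fit(X_tr_s)
ocsvm = OneClassSVM(nu=0.07).fit(X_tr_s)
pred_iso = iso.predict(X_te_s)   # +1 = inlier, -1 = outlier
pred_svm = ocsvm.predict(X_te_s)
```

Both detectors label each test row as an inlier (+1) or outlier (-1), which is then compared against the true pass/fail labels.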

Results & Analysis

The project followed an iterative modeling process. A baseline was established with unsupervised models, followed by advanced techniques to address the severe class imbalance.

Performance was evaluated on Recall for the minority 'Pass' class.
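Minority-class recall and F1 can be computed with scikit-learn by setting `pos_label` to the 'Pass' label (+1). The labels and predictions below are hypothetical toy values, not results from the project:

```python
from sklearn.metrics import recall_score, f1_score

# Hypothetical ground truth and predictions; +1 = minority 'Pass' class
y_true = [-1, -1, -1, -1, -1, -1, 1, 1, -1, 1]
y_pred = [-1, -1, 1, -1, -1, -1, 1, -1, -1, -1]

recall = recall_score(y_true, y_pred, pos_label=1)  # fraction of 'Pass' caught
f1 = f1_score(y_true, y_pred, pos_label=1)          # balances recall/precision
```

With a pos_label pinned to the minority class, a model that predicts everything as the majority class scores 0, which is exactly why this metric was chosen over accuracy.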

| Approach  | Model                 | 'Pass' Class Recall | 'Pass' Class F1-Score |
|-----------|-----------------------|---------------------|-----------------------|
| Baseline  | Isolation Forest      | 16%                 | 0.17                  |
| Attempt 1 | XGBoost + SMOTE       | ~10%                | ~0.16                 |
| Attempt 2 | XGBoost + PCA + SMOTE | 16%                 | 0.18                  |
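The core idea behind SMOTE (used in Attempts 1 and 2 via imbalanced-learn) is to synthesize minority samples by interpolating between a real minority point and one of its nearest neighbours. The plain-NumPy sketch below illustrates that interpolation step only; it is a simplified stand-in, not imbalanced-learn's implementation, and `smote_like` is a hypothetical helper name:

```python
import numpy as np

def smote_like(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between
    a randomly picked sample and one of its k nearest neighbours."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nn = np.argsort(d)[1:k + 1]          # nearest neighbours, excluding i
        j = rng.choice(nn)
        lam = rng.random()                   # interpolation factor in [0, 1)
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

X_min = np.random.default_rng(1).normal(size=(10, 4))  # toy minority samples
X_new = smote_like(X_min, n_new=20)                    # 20 synthetic samples
```

Because each synthetic point lies on a segment between two real minority points, oversampling densifies the minority region rather than duplicating rows — though, as the table shows, this did not improve recall here.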

Conclusion

The analysis revealed that even advanced techniques like SMOTE and PCA did not significantly improve performance over the simple baseline model. This highlights the inherent difficulty and noisy nature of this high-dimensional dataset. The project concludes that achieving a high recall rate would likely require extensive hyperparameter tuning or domain-specific feature engineering beyond the scope of this initial analysis.

This iterative process itself is a key finding, demonstrating a realistic approach to complex data science problems where solutions are not always straightforward.

How to Run

  1. Clone the repository and install dependencies: `pip install -r requirements.txt`
  2. Run the Jupyter notebooks in order.
