
Semiconductor Manufacturing Fault Detection

Project Objective

This project analyzes sensor data from a semiconductor manufacturing process to detect faulty products using unsupervised anomaly detection. The goal is to navigate a complex, real-world dataset, handle challenges like missing values and class imbalance, and evaluate machine learning models. A final interactive dashboard was created to present the findings.

Interactive Dashboard

An interactive dashboard summarizing the project's key findings and model performance was created using Tableau.

View the Interactive Dashboard on Tableau Public

Dashboard Screenshot

Data Source & Technologies

Project Workflow

  1. Data Cleaning: Addressed significant missing data by removing features with over 50% null values and imputing the rest with the median.
  2. Exploratory Data Analysis (EDA): Discovered a severe class imbalance, with only ~7% of products labeled as 'Pass' (+1) — meaning accuracy alone would be a misleading metric.
  3. Data Preparation: The data was split into stratified training (70%) and test (30%) sets. Features were then scaled using StandardScaler.
  4. Modeling: Trained two unsupervised models, Isolation Forest and One-Class SVM.
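The workflow above can be sketched end-to-end in scikit-learn. The snippet below runs on synthetic stand-in data (the real project reads the SECOM sensor file, and column names here are hypothetical); thresholds and random seeds are illustrative assumptions, not the project's exact settings.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

# Synthetic stand-in for the SECOM sensor matrix (hypothetical columns)
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 20)),
                 columns=[f"sensor_{i}" for i in range(20)])
X = X.mask(rng.random(X.shape) < 0.1)   # scatter ~10% missing values
X["sensor_0"] = np.nan                  # one feature that is mostly null
y = pd.Series(rng.choice([-1, 1], size=300, p=[0.93, 0.07]))

# 1. Data cleaning: drop features with >50% nulls, median-impute the rest
X = X.loc[:, X.isna().mean() <= 0.5]
X = X.fillna(X.median())

# 3. Data preparation: stratified 70/30 split, then StandardScaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

# 4. Modeling: two unsupervised detectors; contamination/nu set near
#    the observed minority rate (an assumption for this sketch)
iso = IsolationForest(contamination=0.07, random_state=42).fit(X_tr_s)
ocsvm = OneClassSVM(nu=0.07).fit(X_tr_s)
pred_iso = iso.predict(X_te_s)   # +1 = inlier, -1 = outlier
pred_svm = ocsvm.predict(X_te_s)
```

Both detectors label each test row as an inlier (+1) or outlier (-1), which is then compared against the true pass/fail labels.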

Results & Analysis

The project followed an iterative modeling process. A baseline was established with unsupervised models, followed by advanced techniques to address the severe class imbalance.

Performance was evaluated on Recall for the minority 'Pass' class.
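Minority-class recall and F1 can be computed with scikit-learn by setting `pos_label` to the 'Pass' label (+1). The labels and predictions below are hypothetical toy values, not results from the project:

```python
from sklearn.metrics import recall_score, f1_score

# Hypothetical ground truth and predictions; +1 = minority 'Pass' class
y_true = [-1, -1, -1, -1, -1, -1, 1, 1, -1, 1]
y_pred = [-1, -1, 1, -1, -1, -1, 1, -1, -1, -1]

recall = recall_score(y_true, y_pred, pos_label=1)  # fraction of 'Pass' caught
f1 = f1_score(y_true, y_pred, pos_label=1)          # balances recall/precision
```

With a pos_label pinned to the minority class, a model that predicts everything as the majority class scores 0, which is exactly why this metric was chosen over accuracy.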

| Approach  | Model                 | 'Pass' Class Recall | 'Pass' Class F1-Score |
|-----------|-----------------------|---------------------|-----------------------|
| Baseline  | Isolation Forest      | 16%                 | 0.17                  |
| Attempt 1 | XGBoost + SMOTE       | ~10%                | ~0.16                 |
| Attempt 2 | XGBoost + PCA + SMOTE | 16%                 | 0.18                  |
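The core idea behind SMOTE (used in Attempts 1 and 2 via imbalanced-learn) is to synthesize minority samples by interpolating between a real minority point and one of its nearest neighbours. The plain-NumPy sketch below illustrates that interpolation step only; it is a simplified stand-in, not imbalanced-learn's implementation, and `smote_like` is a hypothetical helper name:

```python
import numpy as np

def smote_like(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between
    a randomly picked sample and one of its k nearest neighbours."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nn = np.argsort(d)[1:k + 1]          # nearest neighbours, excluding i
        j = rng.choice(nn)
        lam = rng.random()                   # interpolation factor in [0, 1)
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

X_min = np.random.default_rng(1).normal(size=(10, 4))  # toy minority samples
X_new = smote_like(X_min, n_new=20)                    # 20 synthetic samples
```

Because each synthetic point lies on a segment between two real minority points, oversampling densifies the minority region rather than duplicating rows — though, as the table shows, this did not improve recall here.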

Conclusion

The analysis revealed that even advanced techniques like SMOTE and PCA did not significantly improve performance over the simple baseline model. This highlights the inherent difficulty and noisy nature of this high-dimensional dataset. The project concludes that achieving a high recall rate would likely require extensive hyperparameter tuning or domain-specific feature engineering beyond the scope of this initial analysis.

This iterative process itself is a key finding, demonstrating a realistic approach to complex data science problems where solutions are not always straightforward.

How to Run

  1. Clone the repository and install dependencies: `pip install -r requirements.txt`
  2. Run the Jupyter notebooks in order.
