This project will use machine learning to develop personalized air quality recommendations for users based on their personal information, such as location, age, health conditions, and activities. The goal of this project is to help users make informed decisions about their exposure to air pollution.
This project was geared towards utilizing machine learning techniques to provide personalized air quality recommendations to users based on their health status, age, activities, and pm2.5 values.
⚠️ The project is still in development and the user interface is not yet complete. We aimed at getting a predicting model first.
The project used a machine learning approach to develop personalized air quality recommendations. The following steps were taken:
- Collect data on air quality, user current location (we can use coordinates), and use personal information (age, gender, “health conditions'', and activities). We can think of what health conditions that are most affected by air quality e.g., respiratory diseases, mental health, cardiovascular diseases to mention but a few.
- Clean and prepare the data
- Personalize air quality recommendations for users based on their personal information
- Evaluate the effectiveness of the personalized air quality recommendations
By leveraging user data and real-time pm2.5 readings, the aim was to empower individuals with actionable insights to make informed decisions regarding their outdoor activities and exposure to air pollutants.
The project utilized two main datasets: site data and user data.
The site data encompassed hourly readings from AirQo and non-AirQo devices for the early months of 2023. It included attributes such as site names, pm values, and geographic coordinates. A selection of columns including site name, site latitude, site longitude, and pm2.5 values was made for further analysis.
The user data was synthesized, encompassing attributes like age, health conditions, and activities. This data was merged with the site data, forming the foundation for personalized recommendations.
To obtain the target feature column, a custom python script was written which considered the quality category of air and the age of the user to assign a recommendation. More can be found in the notebook
The data was then cleaned and preprocessed to remove missing values, duplicates, and outliers. The data was then split into training and testing sets. The training set was used to train the models, while the testing set was used to evaluate the performance of the models.
Age distribution of users
Counts of pm categories
To create an effective recommendation system, different machine learning algorithms were explored, and these include logistic regression, support vector machine, decision tree, random forest and xgboost classifier.
The model training process involved steps, such as feature selection, encoding categorical features, and splitting the dataset into training (70%) and testing sets (30%).
After the training, the models didn’t quite perform well. So, in-order to improve the accuracy, techniques like hyperparameter tuning and optimization had to be employed for the high performing algorithms.
Hyperparameter tuning for the Random Forest Classifier was performed using the Random Search CV method, resulting in optimal parameters of 156 estimators and a maximum depth of 15. For the decision tree tuning, we used the gridsearchCV with two sets of hyperparameters in which case the best parameters were ‘entropy’ as criterion, max depth of 30 and min samples split as 15.
Feature importance was also performed to determine the most important features in the model. The feature importance plot is shown below:
Upon evaluating the models, the Decision Tree model achieved an accuracy of 0.88. Its classification report exhibited varying precision, recall, and f1-score values, reflecting a range of 0.40 to 1.00 for different classes. The tuned Random Forest Classifier attained an accuracy of 0.8911, demonstrating the efficacy of the machine learning approach.
The models were then evaluated using the accuracy metric. The model with the highest accuracy was selected as the final model.
The confusion matrices for the models are shown below:
Decision Tree Classifier Confusion Matrix
Random Forest Classifier Confusion Matrix
The table showing the accuracy of the models used is shown below:
The project successfully developed a machine learning model to provide personalized air quality recommendations to users based on their health status, age, activities, and pm2.5 values.
For other information, take a look at the model card designed by @Nakacwa Olivia which was designed to provide a summary of the model's performance and limitations. The model card can also be found here
Detailed information about the project can be found in the notebook.
⚠️ The project is still in development and the user interface is not yet complete. We aimed at getting a prediction first.
Created by Bruce Mugizi - @bruceMug - [email protected] - feel free to contact me!
Project Link: https://github.com/bruceMug/airquality_recommender
Special thanks to the following people who have contributed to this project:
Nakacwa Olivia 📧 📖💡 📢 |
Marvin Satulo 📧 📖💡 📢 |