Personalizing Air Quality Recommendations using Machine Learning

Introduction

This project uses machine learning to develop personalized air quality recommendations for users based on their location, age, health conditions, activities, and local pm2.5 values. The goal is to help users make informed decisions about their exposure to air pollution.

⚠️ The project is still in development and the user interface is not yet complete. The first priority was a working prediction model.

Scope

The project used a machine learning approach to develop personalized air quality recommendations. The following steps were taken:

  • Collect data on air quality, the user's current location (coordinates can be used), and the user's personal information (age, gender, health conditions, and activities). Health conditions most affected by air quality include respiratory diseases, cardiovascular diseases, and mental health conditions, to mention but a few.
  • Clean and prepare the data
  • Personalize air quality recommendations for users based on their personal information
  • Evaluate the effectiveness of the personalized air quality recommendations

By leveraging user data and real-time pm2.5 readings, the aim was to empower individuals with actionable insights to make informed decisions regarding their outdoor activities and exposure to air pollutants.

Methodology

Data and Preprocessing

The project utilized two main datasets: site data and user data.

The site data encompassed hourly readings from AirQo and non-AirQo devices for the early months of 2023. It included attributes such as site names, pm values, and geographic coordinates. A selection of columns including site name, site latitude, site longitude, and pm2.5 values was made for further analysis.

The user data was synthesized, encompassing attributes like age, health conditions, and activities. This data was merged with the site data, forming the foundation for personalized recommendations.

To obtain the target feature column, a custom Python script assigned a recommendation to each record based on the air quality category and the age of the user. More details can be found in the notebook.
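The labeling step can be sketched roughly as follows. This is a minimal illustration, not the script from the notebook: the pm2.5 breakpoints, the age cut-offs, the recommendation strings, and the function names `pm25_category` and `recommend` are all assumptions chosen for the example.

```python
from bisect import bisect_left

# Illustrative pm2.5 upper bounds in µg/m³ (an assumption, not taken from the notebook).
_UPPER_BOUNDS = [12.0, 35.4, 55.4, 150.4, 250.4]
_CATEGORIES = [
    "Good",
    "Moderate",
    "Unhealthy for Sensitive Groups",
    "Unhealthy",
    "Very Unhealthy",
    "Hazardous",
]


def pm25_category(pm25: float) -> str:
    """Map a pm2.5 reading to an air quality category (upper bounds are inclusive)."""
    if pm25 < 0:
        raise ValueError("pm2.5 cannot be negative")
    return _CATEGORIES[bisect_left(_UPPER_BOUNDS, pm25)]


def recommend(pm25: float, age: int) -> str:
    """Assign a recommendation label from the category and the user's age (hypothetical rules)."""
    category = pm25_category(pm25)
    sensitive = age < 12 or age >= 65  # hypothetical age cut-offs
    if category == "Good":
        return "No restrictions"
    if category == "Moderate":
        return "Limit prolonged exertion" if sensitive else "No restrictions"
    if category == "Unhealthy for Sensitive Groups":
        return "Avoid outdoor activity" if sensitive else "Limit prolonged exertion"
    return "Stay indoors" if sensitive else "Avoid outdoor activity"
```

Applying such a function to every merged row yields the target column that the classifiers are trained to predict.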

The data was then cleaned and preprocessed to remove missing values, duplicates, and outliers, and split into training and testing sets. The training set was used to train the models, while the testing set was used to evaluate their performance.
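A dependency-free sketch of the three cleaning steps is shown here. The README does not say how outliers were detected, so the 1.5 × IQR rule, the `pm2_5` field name, and the `clean_readings` helper are assumptions made for illustration.

```python
from statistics import quantiles


def clean_readings(rows):
    """Drop rows with missing values, exact duplicates, and pm2.5 outliers (1.5 * IQR rule)."""
    # 1. Remove rows where any field is missing.
    complete = [r for r in rows if all(v is not None for v in r.values())]

    # 2. Remove exact duplicates while keeping the first occurrence.
    seen, unique = set(), []
    for r in complete:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)

    # 3. Remove pm2.5 values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
    q1, _, q3 = quantiles([r["pm2_5"] for r in unique], n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [r for r in unique if low <= r["pm2_5"] <= high]
```

The steps run in this order so that missing values and repeated rows do not distort the quartiles used for the outlier bounds.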

Age distribution of users

[Figure: age distribution]

Counts of pm categories

[Figure: air quality category counts]

Model Development

To create an effective recommendation system, several machine learning algorithms were explored: logistic regression, support vector machine, decision tree, random forest, and the XGBoost classifier.

Model training involved feature selection, encoding of categorical features, and splitting the dataset into training (70%) and testing (30%) sets.
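The encoding and splitting steps can be illustrated with plain Python. The README does not state which encoding was used, so one-hot encoding is an assumption here, and the helper names `one_hot` and `train_test_split_70_30` are hypothetical.

```python
import random


def one_hot(values):
    """One-hot encode a list of categorical values; returns (columns, encoded rows)."""
    columns = sorted(set(values))
    return columns, [[int(v == c) for c in columns] for v in values]


def train_test_split_70_30(features, labels, seed=42):
    """Shuffle the rows and split them 70% / 30% into training and testing sets."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    cut = int(round(0.7 * len(indices)))
    train, test = indices[:cut], indices[cut:]
    return (
        [features[i] for i in train], [features[i] for i in test],
        [labels[i] for i in train], [labels[i] for i in test],
    )
```

In a scikit-learn workflow, `train_test_split` with `test_size=0.3` performs the same kind of split and can additionally stratify by label.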

After initial training, the models did not perform well. To improve accuracy, hyperparameter tuning was applied to the best-performing algorithms.

Hyperparameter tuning for the Random Forest Classifier was performed using randomized search with cross-validation (RandomizedSearchCV), resulting in optimal parameters of 156 estimators and a maximum depth of 15. The decision tree was tuned using GridSearchCV with two sets of hyperparameters; the best parameters were 'entropy' as the criterion, a maximum depth of 30, and a minimum samples split of 15.
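The two searches could look roughly like the sketch below, assuming scikit-learn and SciPy are installed. The data is synthetic, and the search ranges, number of iterations, and fold count are assumptions; only the tuned parameter names come from the description above.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the real features come from the merged site and user data.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Randomized search over a random forest (search ranges are assumptions).
rf_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": randint(50, 200), "max_depth": randint(5, 31)},
    n_iter=3, cv=3, scoring="accuracy", random_state=0,
)
rf_search.fit(X_train, y_train)

# Grid search over a decision tree, using two separate hyperparameter grids.
dt_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid=[
        {"criterion": ["gini"], "max_depth": [10, 20, 30], "min_samples_split": [2, 15]},
        {"criterion": ["entropy"], "max_depth": [10, 20, 30], "min_samples_split": [2, 15]},
    ],
    cv=3, scoring="accuracy",
)
dt_search.fit(X_train, y_train)

print(rf_search.best_params_, rf_search.score(X_test, y_test))
print(dt_search.best_params_, dt_search.score(X_test, y_test))
```

Randomized search samples a fixed number of combinations, which keeps the cost manageable for the forest, while the exhaustive grid is affordable for a single decision tree.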

Feature importance was also computed to determine which features contribute most to the model's predictions. The feature importance plot is shown below:

[Figure: feature importance plot]
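One common way to obtain such a ranking is the impurity-based `feature_importances_` attribute of a fitted tree ensemble in scikit-learn. The sketch below assumes that approach; the README does not say which method or model was used, and the feature names and synthetic data are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature names; the synthetic data only stands in for the real dataset.
feature_names = ["pm2_5", "age", "health_condition", "activity"]
X, y = make_classification(n_samples=300, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ holds one impurity-based score per feature; the scores sum to 1.
ranked = sorted(zip(feature_names, model.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```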

Results

On the testing set, the Decision Tree model achieved an accuracy of 0.88. Its classification report showed per-class precision, recall, and f1-score values ranging from 0.40 to 1.00, so performance varied considerably between classes. The tuned Random Forest Classifier attained an accuracy of 0.8911.

Accuracy was the metric used to compare the models, and the model with the highest accuracy was selected as the final model.
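A high overall accuracy can coexist with a low f1-score on an individual class, typically when that class has few examples. The sketch below computes accuracy, per-class metrics, and confusion counts from scratch to make that relationship concrete; the function names are illustrative and the notebook most likely relied on library routines instead.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs; this is a sparse confusion matrix."""
    return Counter(zip(y_true, y_pred))


def classification_summary(y_true, y_pred):
    """Return overall accuracy and per-class precision, recall, and F1."""
    counts = confusion_counts(y_true, y_pred)
    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(counts[(c, c)] for c in labels) / len(y_true)
    per_class = {}
    for c in labels:
        tp = counts[(c, c)]
        predicted = sum(counts[(t, c)] for t in labels)  # column total
        actual = sum(counts[(c, p)] for p in labels)     # row total
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        per_class[c] = {"precision": precision, "recall": recall, "f1": f1}
    return accuracy, per_class
```

For example, with eight correctly predicted samples of a majority class and one of two minority samples misclassified, accuracy is 0.9 while the minority class recall is only 0.5.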

The confusion matrices for the models are shown below:

Decision Tree Classifier Confusion Matrix

[Figure: decision tree confusion matrix]

Random Forest Classifier Confusion Matrix

[Figure: random forest confusion matrix]

The accuracy of the models used is shown in the table below:

[Figure: model accuracy table]

Conclusion

The project successfully developed a machine learning model to provide personalized air quality recommendations to users based on their health status, age, activities, and pm2.5 values.

For more information, see the model card designed by @Nakacwa Olivia, which summarizes the model's performance and limitations. The model card can also be found here

Detailed information about the project can be found in the notebook.

Contact

Created by Bruce Mugizi - @bruceMug - feel free to contact me!

Project Link: https://github.com/bruceMug/airquality_recommender

Contributors ✨

Special thanks to the following people who have contributed to this project:

  • Nakacwa Olivia 📧 📖 💡 📢
  • Marvin Satulo 📧 📖 💡 📢
