This repository contains a single Jupyter/Colab notebook, `Topicmod.ipynb`, that runs an end-to-end experiment on Reddit and Twitter reply data discussing privacy and compliance topics. The notebook cleans incoming text, trains Latent Dirichlet Allocation (LDA) topic models, visualizes topic distributions with pyLDAvis, and performs VADER sentiment analysis with supporting bar charts.
- Provide a CSV named `replies.csv` when prompted by the notebook. The file should include at least the columns `Reply ID`, `Tweet ID`, `Text`, `Date`, `Author ID`, and `Author Name`.
- Both the Reddit and Twitter pipelines expect a `Text` column containing the raw reply body. Adjust the preprocessing cell if your schema differs.
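
Before uploading, it can help to confirm that the header row matches the columns listed above. The following is a minimal sketch using only the standard library; the helper name `missing_columns` is hypothetical and is not part of the notebook.

```python
import csv
import io

# Column names as listed in this README; adjust if your export differs.
REQUIRED_COLUMNS = ["Reply ID", "Tweet ID", "Text", "Date", "Author ID", "Author Name"]


def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header row."""
    reader = csv.reader(csv_file)
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# In-memory sample for illustration. For a real file, pass
# open("replies.csv", newline="", encoding="utf-8") instead.
sample = io.StringIO("Reply ID,Tweet ID,Text,Date\n1,10,hello,2024-01-01\n")
print(missing_columns(sample))  # ['Author ID', 'Author Name']
```

An empty list means every expected column is present.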
- Create and activate a virtual environment (recommended):

  ```bash
  python3 -m venv .venv
  source .venv/bin/activate
  ```

- Install notebook dependencies:

  ```bash
  pip install -r requirements.txt
  # or install manually:
  pip install pandas numpy nltk gensim scikit-learn matplotlib seaborn pyLDAvis
  python -m nltk.downloader stopwords vader_lexicon
  ```

- Launch Jupyter Lab or upload the notebook to Google Colab.
- Data upload – Run the first cell to upload `replies.csv` (Colab) or place it beside the notebook when running locally.
- Preprocessing – The notebook removes stop words, tokenizes, and builds gensim dictionaries/corpora for Reddit and Twitter subsets.
- Topic modeling – Two independent LDA models are trained (one per platform) and their topics printed to stdout.
- Visualization – Interactive topic dashboards are built with pyLDAvis for each dataset.
- Sentiment analysis – NLTK's VADER analyzer scores each reply and generates matplotlib bar charts that summarize the polarity distribution per platform.
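
As an illustration of the sentiment step, the sketch below buckets VADER compound scores into polarity labels and counts them, which is the kind of summary a bar chart would plot. The ±0.05 cut-offs are the thresholds conventionally used with VADER; this README does not state the notebook's own thresholds, so treat them as an assumption. The function names are hypothetical.

```python
from collections import Counter

# Conventional VADER cut-offs (assumed; the notebook may use different values).
POSITIVE_CUTOFF = 0.05
NEGATIVE_CUTOFF = -0.05


def polarity_label(compound):
    """Map a VADER compound score in [-1, 1] to a polarity label."""
    if compound >= POSITIVE_CUTOFF:
        return "positive"
    if compound <= NEGATIVE_CUTOFF:
        return "negative"
    return "neutral"


def polarity_counts(compound_scores):
    """Count how many scores fall into each polarity bucket."""
    return Counter(polarity_label(score) for score in compound_scores)


# In the notebook, each score would come from NLTK, e.g.
# SentimentIntensityAnalyzer().polarity_scores(text)["compound"].
example_scores = [0.62, -0.41, 0.0, 0.03, -0.05]
print(polarity_counts(example_scores))
```

Running this once per platform gives the per-platform counts that can be passed to a matplotlib bar chart.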
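- If you want to adjust the number of topics or passes, edit the `gensim.models.LdaModel` constructor in the "Run LDA" section.
- For large datasets, note that the preprocessing and gensim LDA steps run on the CPU, so enabling GPU/TPU acceleration in Colab is unlikely to speed them up. Consider `gensim.models.LdaMulticore` to parallelize training across CPU cores instead.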
- Exclude IDE metadata (e.g., `.idea/`) by listing it in `.gitignore` before committing.