
Privacy Compliance Topic Modeling

This repository contains a single Jupyter/Colab notebook, Topicmod.ipynb, that runs an end-to-end experiment on Reddit and Twitter reply data discussing privacy and compliance topics. The notebook cleans incoming text, trains Latent Dirichlet Allocation (LDA) topic models, visualizes topic distributions with pyLDAvis, and performs VADER sentiment analysis with supporting bar charts.

Data requirements

  • Provide a CSV named replies.csv when prompted by the notebook. The file should include at least the columns Reply ID, Tweet ID, Text, Date, Author ID, and Author Name.
  • Both the Reddit and Twitter pipelines expect a Text column containing the raw reply body. Adjust the preprocessing cell if your schema differs.
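The expected schema can be sanity-checked before running the rest of the notebook. A minimal, standard-library-only sketch; the two sample rows here are hypothetical stand-ins for real replies.csv content:

```python
import csv
import io

# Columns the notebook's pipelines rely on
REQUIRED = {"Reply ID", "Tweet ID", "Text", "Date", "Author ID", "Author Name"}

# Hypothetical two-row sample standing in for replies.csv
sample = io.StringIO(
    "Reply ID,Tweet ID,Text,Date,Author ID,Author Name\n"
    "r1,t1,Worried about GDPR compliance,2023-01-05,a1,Alice\n"
    "r2,t2,Privacy settings are confusing,2023-01-06,a2,Bob\n"
)

reader = csv.DictReader(sample)
rows = list(reader)

missing = REQUIRED - set(reader.fieldnames)
if missing:
    raise ValueError(f"replies.csv is missing columns: {sorted(missing)}")

# The preprocessing cell works from the raw reply bodies
texts = [row["Text"] for row in rows]
print(texts)
```

When running locally, replace the in-memory sample with `open("replies.csv", newline="")`.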

Environment setup

  1. Create and activate a virtual environment (recommended):
    python3 -m venv .venv
    source .venv/bin/activate
  2. Install notebook dependencies:
    pip install -r requirements.txt  # or install manually: pandas numpy nltk gensim scikit-learn matplotlib seaborn pyLDAvis
    python -m nltk.downloader stopwords vader_lexicon
  3. Launch Jupyter Lab or upload the notebook to Google Colab.

Notebook workflow

  1. Data upload – Run the first cell to upload replies.csv (Colab) or place it beside the notebook when running locally.
  2. Preprocessing – The notebook removes stop words, tokenizes, and builds gensim dictionaries/corpora for Reddit and Twitter subsets.
  3. Topic modeling – Two independent LDA models are trained (one per platform) and their topics printed to stdout.
  4. Visualization – Interactive topic dashboards are built with pyLDAvis for each dataset.
  5. Sentiment analysis – NLTK's VADER analyzer scores each reply and generates matplotlib bar charts that summarize the polarity distribution per platform.

Tips

  • If you want to adjust the number of topics or passes, edit the gensim.models.LdaModel constructor in the "Run LDA" section.
  • For large datasets, sample the replies or reduce the number of LDA passes; gensim's LDA training and the preprocessing steps are CPU-bound, so a Colab GPU/TPU runtime will not speed them up.
  • Exclude IDE metadata (e.g., .idea/) by listing it in .gitignore before committing.