
Privacy Compliance Topic Modeling

This repository contains a single Jupyter/Colab notebook, Topicmod.ipynb, that runs an end-to-end experiment on Reddit and Twitter reply data discussing privacy and compliance topics. The notebook cleans incoming text, trains Latent Dirichlet Allocation (LDA) topic models, visualizes topic distributions with pyLDAvis, and performs VADER sentiment analysis with supporting bar charts.

Data requirements

  • Provide a CSV named replies.csv when prompted by the notebook. The file should include at least the columns Reply ID, Tweet ID, Text, Date, Author ID, and Author Name.
  • Both the Reddit and Twitter pipelines expect a Text column containing the raw reply body. Adjust the preprocessing cell if your schema differs.
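The expected schema can be sanity-checked before running the rest of the notebook. A minimal, standard-library-only sketch; the two sample rows here are hypothetical stand-ins for real replies.csv content:

```python
import csv
import io

# Columns the notebook's pipelines rely on
REQUIRED = {"Reply ID", "Tweet ID", "Text", "Date", "Author ID", "Author Name"}

# Hypothetical two-row sample standing in for replies.csv
sample = io.StringIO(
    "Reply ID,Tweet ID,Text,Date,Author ID,Author Name\n"
    "r1,t1,Worried about GDPR compliance,2023-01-05,a1,Alice\n"
    "r2,t2,Privacy settings are confusing,2023-01-06,a2,Bob\n"
)

reader = csv.DictReader(sample)
rows = list(reader)

missing = REQUIRED - set(reader.fieldnames)
if missing:
    raise ValueError(f"replies.csv is missing columns: {sorted(missing)}")

# The preprocessing cell works from the raw reply bodies
texts = [row["Text"] for row in rows]
print(texts)
```

When running locally, replace the in-memory sample with `open("replies.csv", newline="")`.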

Environment setup

  1. Create and activate a virtual environment (recommended):
    python3 -m venv .venv
    source .venv/bin/activate
  2. Install notebook dependencies:
    pip install -r requirements.txt  # or install manually: pandas numpy nltk gensim scikit-learn matplotlib seaborn pyLDAvis
    python -m nltk.downloader stopwords vader_lexicon
  3. Launch Jupyter Lab or upload the notebook to Google Colab.

Notebook workflow

  1. Data upload – Run the first cell to upload replies.csv (Colab) or place it beside the notebook when running locally.
  2. Preprocessing – The notebook removes stop words, tokenizes, and builds gensim dictionaries/corpora for Reddit and Twitter subsets.
  3. Topic modeling – Two independent LDA models are trained (one per platform) and their topics printed to stdout.
  4. Visualization – Interactive topic dashboards are built with pyLDAvis for each dataset.
  5. Sentiment analysis – NLTK's VADER analyzer scores each reply and generates matplotlib bar charts that summarize the polarity distribution per platform.

Tips

  • If you want to adjust the number of topics or passes, edit the gensim.models.LdaModel constructor in the "Run LDA" section.
  • For large datasets, sample the replies or reduce the number of LDA passes; gensim's LDA training and the preprocessing steps are CPU-bound, so a Colab GPU/TPU runtime will not speed them up.
  • Exclude IDE metadata (e.g., .idea/) by listing it in .gitignore before committing.