SaladbarAlex/PrivacyCompliance
Privacy Compliance Topic Modeling

This repository contains a single Jupyter/Colab notebook, Topicmod.ipynb, that runs an end-to-end experiment on Reddit and Twitter reply data discussing privacy and compliance topics. The notebook cleans incoming text, trains Latent Dirichlet Allocation (LDA) topic models, visualizes topic distributions with pyLDAvis, and performs VADER sentiment analysis with supporting bar charts.

Data requirements

  • Provide a CSV named replies.csv when prompted by the notebook. The file should include at least the columns Reply ID, Tweet ID, Text, Date, Author ID, and Author Name.
  • Both the Reddit and Twitter pipelines expect a Text column containing the raw reply body. Adjust the preprocessing cell if your schema differs.
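Before running the full notebook, it can be worth confirming that your CSV actually carries the columns listed above. A minimal sketch (the inline sample rows are illustrative, not from the real dataset):

```python
import io
import pandas as pd

# Hypothetical sample matching the expected replies.csv schema.
sample = io.StringIO(
    "Reply ID,Tweet ID,Text,Date,Author ID,Author Name\n"
    "1,100,Great point about GDPR,2023-01-01,42,alice\n"
)
df = pd.read_csv(sample)  # with real data: pd.read_csv("replies.csv")

# Columns the notebook's pipelines rely on.
required = {"Reply ID", "Tweet ID", "Text", "Date", "Author ID", "Author Name"}
missing = required - set(df.columns)
assert not missing, f"replies.csv is missing columns: {missing}"
```

If your export uses different column names, rename them here rather than editing every downstream cell.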

Environment setup

  1. Create and activate a virtual environment (recommended):
    python3 -m venv .venv
    source .venv/bin/activate
  2. Install notebook dependencies:
    pip install -r requirements.txt  # or install manually: pandas numpy nltk gensim scikit-learn matplotlib seaborn pyLDAvis
    python -m nltk.downloader stopwords vader_lexicon
  3. Launch Jupyter Lab or upload the notebook to Google Colab.

Notebook workflow

  1. Data upload – Run the first cell to upload replies.csv (Colab) or place it beside the notebook when running locally.
  2. Preprocessing – The notebook removes stop words, tokenizes, and builds gensim dictionaries/corpora for Reddit and Twitter subsets.
  3. Topic modeling – Two independent LDA models are trained (one per platform) and their topics printed to stdout.
  4. Visualization – Interactive topic dashboards are built with pyLDAvis for each dataset.
  5. Sentiment analysis – NLTK's VADER analyzer scores each reply and generates matplotlib bar charts that summarize the polarity distribution per platform.

Tips

  • If you want to adjust the number of topics or passes, edit the gensim.models.LdaModel constructor in the "Run LDA" section.
  • For large datasets, consider sampling replies or reducing the number of LDA passes; gensim's LDA and the text preprocessing run on CPU, so Colab's GPU/TPU runtimes will not speed them up.
  • Exclude IDE metadata (e.g., .idea/) by listing it in .gitignore before committing.
