
Sparse Autoencoder for Language Model Interpretability

Creating a sparse autoencoder for researching universality amongst instances.

🔬 Attempting to replicate findings from Sparse Autoencoders for Language Model Interpretability by Wes Gurnee et al.


Project Description

This project implements a sparse autoencoder to analyze activations from GPT-2 small, following the methodology described in the referenced research paper. My goals are to:

  • 🎯 Replicate key findings about feature decomposition in transformer models
  • 🔍 Identify interpretable directions in activation space

Features

  • 🧩 Activation extraction from GPT-2's intermediate layers (a sketch follows this list)
  • 🏗️ Custom sparse autoencoder architecture (sketched after the configuration note below)
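
A minimal sketch of one way to pull activations out of a GPT-2 block with a forward hook is below. The layer index, the hook point (the block's output hidden state), and the single-sentence batch are illustrative assumptions; the notebook may extract activations differently.

import torch
from transformers import GPT2Tokenizer, GPT2Model

LAYER = 6  # assumed: which transformer block to tap (GPT-2 small has 12)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2").eval()

captured = []

def save_hidden_state(module, inputs, output):
    # GPT2Block returns a tuple; element 0 is the hidden state,
    # shape (batch, seq_len, 768) for GPT-2 small.
    captured.append(output[0].detach())

handle = model.h[LAYER].register_forward_hook(save_hidden_state)
with torch.no_grad():
    batch = tokenizer(["The quick brown fox jumps over the lazy dog."], return_tensors="pt")
    model(**batch)
handle.remove()

activations = torch.cat(captured, dim=0)  # (1, seq_len, 768)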

The training configuration does not yet match the settings used in the research paper.
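
For reference, a minimal sparse autoencoder of the kind used for dictionary learning on activations looks roughly like the sketch below. The expansion factor and L1 coefficient are placeholder values, not the paper's (or the notebook's) settings.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    # Overcomplete autoencoder: encode 768-d activations into a wider,
    # non-negative code, then reconstruct them.
    def __init__(self, d_model=768, expansion=8):  # expansion factor is an assumed value
        super().__init__()
        self.encoder = nn.Linear(d_model, d_model * expansion)
        self.decoder = nn.Linear(d_model * expansion, d_model)

    def forward(self, x):
        z = F.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(z)       # reconstruction of the input activation
        return x_hat, z

def sae_loss(x, x_hat, z, l1_coeff=1e-3):  # l1_coeff is an assumed value
    # Reconstruction error plus an L1 penalty that encourages sparse codes.
    return F.mse_loss(x_hat, x) + l1_coeff * z.abs().mean()

The L1 term is what pushes each activation to be explained by a small number of dictionary features.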

Installation ⚙️

# Clone the repository
git clone https://github.com/SohamD1/GPT2_SparseAE.git
cd GPT2_SparseAE

# Install required packages
pip install torch transformers datasets tqdm numpy

# Launch the notebook
jupyter notebook sparse_autoencoder.ipynb

Trained models are saved as my_sparse_ae.pth.
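
Assuming my_sparse_ae.pth holds a state_dict for a SparseAutoencoder like the one sketched above, it can be reloaded and its learned directions inspected along these lines (the feature index here is arbitrary):

import torch

sae = SparseAutoencoder(d_model=768, expansion=8)  # class from the sketch above
sae.load_state_dict(torch.load("my_sparse_ae.pth", map_location="cpu"))
sae.eval()

# Each column of the decoder weight is one learned direction in activation space.
feature_idx = 0  # arbitrary example feature
direction = sae.decoder.weight[:, feature_idx]  # shape: (768,)

# Given GPT-2 activations `acts` of shape (n_tokens, 768), the code for this
# feature ranks tokens by how strongly the feature fires:
#   _, codes = sae(acts)
#   top_tokens = codes[:, feature_idx].topk(10).indices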
