
Sparse Autoencoder for Language Model Interpretability

Creating a sparse autoencoder for researching universality amongst instances.

🔬 Attempting to replicate findings from Sparse Autoencoders for Language Model Interpretability by Wes Gurnee et al.


Project Description

This project implements a sparse autoencoder to analyze activations from GPT-2 small, following the methodology described in the referenced research paper. My goals are to:

  • 🎯 Replicate key findings about feature decomposition in transformer models
  • 🔍 Identify interpretable directions in activation space

Features

  • 🧩 Activation extraction from GPT-2's intermediate layers (a sketch follows this list)
  • 🏗️ Custom sparse autoencoder architecture (sketched after the configuration note below)
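
A minimal sketch of one way to pull activations out of a GPT-2 block with a forward hook is below. The layer index, the hook point (the block's output hidden state), and the single-sentence batch are illustrative assumptions; the notebook may extract activations differently.

import torch
from transformers import GPT2Tokenizer, GPT2Model

LAYER = 6  # assumed: which transformer block to tap (GPT-2 small has 12)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2").eval()

captured = []

def save_hidden_state(module, inputs, output):
    # GPT2Block returns a tuple; element 0 is the hidden state,
    # shape (batch, seq_len, 768) for GPT-2 small.
    captured.append(output[0].detach())

handle = model.h[LAYER].register_forward_hook(save_hidden_state)
with torch.no_grad():
    batch = tokenizer(["The quick brown fox jumps over the lazy dog."], return_tensors="pt")
    model(**batch)
handle.remove()

activations = torch.cat(captured, dim=0)  # (1, seq_len, 768)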

The training configuration does not yet match the settings used in the research paper.
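
For reference, a minimal sparse autoencoder of the kind used for dictionary learning on activations looks roughly like the sketch below. The expansion factor and L1 coefficient are placeholder values, not the paper's (or the notebook's) settings.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    # Overcomplete autoencoder: encode 768-d activations into a wider,
    # non-negative code, then reconstruct them.
    def __init__(self, d_model=768, expansion=8):  # expansion factor is an assumed value
        super().__init__()
        self.encoder = nn.Linear(d_model, d_model * expansion)
        self.decoder = nn.Linear(d_model * expansion, d_model)

    def forward(self, x):
        z = F.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(z)       # reconstruction of the input activation
        return x_hat, z

def sae_loss(x, x_hat, z, l1_coeff=1e-3):  # l1_coeff is an assumed value
    # Reconstruction error plus an L1 penalty that encourages sparse codes.
    return F.mse_loss(x_hat, x) + l1_coeff * z.abs().mean()

The L1 term is what pushes each activation to be explained by a small number of dictionary features.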

Installation ⚙️

# Clone the repository
git clone https://github.com/SohamD1/GPT2_SparseAE.git
cd GPT2_SparseAE

# Install required packages
pip install torch transformers datasets tqdm numpy

# Launch the notebook
jupyter notebook sparse_autoencoder.ipynb

Trained models are saved as my_sparse_ae.pth.
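
Assuming my_sparse_ae.pth holds a state_dict for a SparseAutoencoder like the one sketched above, it can be reloaded and its learned directions inspected along these lines (the feature index here is arbitrary):

import torch

sae = SparseAutoencoder(d_model=768, expansion=8)  # class from the sketch above
sae.load_state_dict(torch.load("my_sparse_ae.pth", map_location="cpu"))
sae.eval()

# Each column of the decoder weight is one learned direction in activation space.
feature_idx = 0  # arbitrary example feature
direction = sae.decoder.weight[:, feature_idx]  # shape: (768,)

# Given GPT-2 activations `acts` of shape (n_tokens, 768), the code for this
# feature ranks tokens by how strongly the feature fires:
#   _, codes = sae(acts)
#   top_tokens = codes[:, feature_idx].topk(10).indices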
