🔬 Attempting to replicate findings from *Sparse Autoencoders for Language Model Interpretability* by Wes Gurnee et al.
This project implements a sparse autoencoder to analyze activations from GPT-2-small, following the methodology described in the referenced paper. My goals are to:
- 🎯 Replicate key findings about feature decomposition in transformer models
- 🔍 Identify interpretable directions in activation space
- 🧩 Extract activations from GPT-2's intermediate layers (see the extraction sketch below)
- 🏗️ Build a custom sparse autoencoder architecture (see the model sketch below)
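For the extraction step, a forward hook on one of GPT-2's transformer blocks is enough to capture intermediate activations. Below is a minimal sketch using the `transformers` library; the layer index, prompt, and the `hook` helper are illustrative assumptions, not settings taken from the paper.

```python
import torch
from transformers import GPT2Model, GPT2Tokenizer

LAYER = 6  # assumption: which transformer block to hook

model = GPT2Model.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

activations = []

def hook(module, inputs, output):
    # output[0] holds the block's hidden states: (batch, seq_len, 768)
    activations.append(output[0].detach())

# Register the hook on a single transformer block
handle = model.h[LAYER].register_forward_hook(hook)

with torch.no_grad():
    batch = tokenizer("The quick brown fox", return_tensors="pt")
    model(**batch)

handle.remove()

# Flatten to one row per token: (n_tokens, 768)
acts = torch.cat([a.reshape(-1, a.shape[-1]) for a in activations])
```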
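The autoencoder itself is an overcomplete linear encoder/decoder pair trained with a reconstruction loss plus an L1 penalty on the hidden code. The sketch below is my reading of that recipe; the expansion factor, L1 coefficient, and the `sae_loss` helper are placeholder assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder with a sparse hidden code.

    Sizes are illustrative placeholders, not the paper's settings.
    """

    def __init__(self, d_model: int = 768, d_hidden: int = 768 * 8):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        code = torch.relu(self.encoder(x))  # non-negative feature activations
        recon = self.decoder(code)
        return recon, code

def sae_loss(x, recon, code, l1_coeff=1e-3):
    # Reconstruction error plus an L1 term that pushes the code toward sparsity
    return torch.mean((recon - x) ** 2) + l1_coeff * code.abs().mean()
```

The ReLU encoder keeps feature activations non-negative, and the L1 term drives most of them to zero on any given input, which is what makes the learned directions candidates for interpretation.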
⚠️ Note: the current training configuration has not yet been tuned to match the settings reported in the paper, so results should be treated as preliminary.
```bash
# Install required packages
pip install torch transformers datasets tqdm numpy

# Clone and enter the repository
git clone https://github.com/yourusername/sparse-ae-gpt2.git
cd sparse-ae-gpt2

# Launch the notebook
jupyter notebook sparse_autoencoder.ipynb
```

Saved models will appear as `my_sparse_ae.pth`.
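To reuse a trained model later, something like the following should work, assuming the checkpoint was saved with `torch.save(sae.state_dict(), "my_sparse_ae.pth")`:

```python
import torch

# Assumes the SparseAutoencoder class sketched above, with the same sizes as training
sae = SparseAutoencoder()
sae.load_state_dict(torch.load("my_sparse_ae.pth", map_location="cpu"))
sae.eval()
```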