
Hi, I'm Samuel.

I build systems that autonomously find failure modes in Large Language Models using Reinforcement Learning.

Currently, I am working at KachmanLab on an automated jailbreaking framework. My work focuses on the intersection of theoretical alignment research and high-performance engineering.

Current Focus

  • Reinforcement Learning: Implementing verifiable reward frameworks (using GRPO and Verifiers) to train adversarial agents.
  • Inference Optimization: Designing asynchronous generation pipelines using asyncio and vLLM to maximize throughput across multi-GPU environments.
  • Evaluation: Curating adversarial datasets and benchmarking model robustness using AgentDojo and custom environments.
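
The asynchronous pipeline pattern described above can be sketched roughly as follows. This is a minimal illustration, not the lab's actual code: `generate` is a hypothetical stand-in for a real vLLM async engine call, and `max_in_flight` is an illustrative concurrency cap.

```python
import asyncio

async def generate(prompt: str) -> str:
    # Placeholder for a real model call (e.g. vLLM's async engine).
    await asyncio.sleep(0)
    return f"completion for: {prompt}"

async def run_pipeline(prompts, max_in_flight=8):
    # Cap in-flight requests so the serving backend stays saturated
    # without being flooded.
    sem = asyncio.Semaphore(max_in_flight)

    async def bounded(prompt):
        async with sem:
            return await generate(prompt)

    # Fire all requests concurrently; gather preserves input order.
    return await asyncio.gather(*(bounded(p) for p in prompts))

results = asyncio.run(run_pipeline([f"prompt {i}" for i in range(4)]))
```

The semaphore-plus-gather shape is the usual way to get bounded concurrency over a batch of prompts while keeping results aligned with their inputs.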

Selected Work

  • KachmanLab (Current): End-to-end RL training for automated jailbreaking.
  • Prime Intellect: Developed RL environments for decentralized training (e.g., Gutenberg literary analysis).
  • RoboLearn: Built Bayesian models of controllability in depression and data-analysis pipelines for computational psychiatry, in collaboration with Prof. Dr. Roshan Cools (Donders Institute / RadboudUMC).

Interests

Adversarial Robustness • Model Evals • Alignment Faking • Model Organisms • Mechanistic Interpretability • Steganography • Jailbreaking • Multi-Agent Systems


Wanna talk? Book a 1-on-1 or do it the old-fashioned way: samuelgerrit.nellessen{at}gmail.com

Pinned

  1. UKGovernmentBEIS/inspect_ai — Inspect: A framework for large language model evaluations (Python)

  2. styx-interchange — Mechanistic interpretability experiments on refusal behavior using activation patching and input gradients (Python)

  3. PrimeIntellect-ai/community-environments — Lightly reviewed collection of community environments (Python)

  4. ARENA-SONAR-MechInterp — Capstone project investigating the interpretability of Text AutoEncoders like SONAR (HTML)