Felix Hofstätter

About Me

I am an AI safety researcher with an interest in evaluations, AI control, and model organisms of misalignment.

Previously, I was a software engineer but I decided to work on AI safety when I realized that ensuring the safety of powerful AI systems is one the most important challenges of our time. I spent time upskilling independently, via ARENA, and via AISH (now LASR) labs. I was part of the MATS 2024 winter cohort where I conducted research on AI sandbagging under Francis Rhys Ward. My work focused on demonstrating sandbagging capabilities and stress-testing elicitation techniques on LLMs with hidden capabilities (to be published soon). Currently, I am building an evaluation testing if AI systems can sandbag in spite of control mechanisms.

From April 2024 to September I was generously supported by a grant from the Long Term Future Fund. Currently, I am funded by the Frontier Model Forum's AI Safety Fund.

You can find a list of all my blog posts on the AlignmentForum, LessWrong, and Medium here.

Download Full CV

Experience & Education

MATS Scholar

2024 | Berkeley, USA

Conduct research on AI sandbagging and create novel model-organisms to stress-test elicitation of hidden capabilities in LLMs.

AI Safety Hub Labs

2023 | Vienna, Austria

Demonstrate scaling trends for consistency of stated beliefs and the capability to exploit flawed evaluators in LLM systems.

ARENA Research Engineering Bootcamp

2023 | London, UK

Upskill in core AI-safety research engineering skills, such as fine-tuning, evaluating LMs, and Mechanistic Interpretability.

Software Consultant, TNG Technology Consulting

2021-2022 | Vienna, Austria

Development of an E-commerce platform.

MEng Mathematics & Computer Science (1st class Hons)

2017-2021 | Imperial College London, UK

Technical Projects

Open Source Contributions

TransformerLens
Implement Grouped Query Attention.
SteeringVectors
Add support for granular steering and batching.

Mechanistic Interpretability

Measuring Information Flow in LLMsUsing interventions to measure information flow between token positions in transformers. (ARENA Capstone)
Explaining the Transformer Circuits Framework by Example
I illustrate Anthropic's mathematical framework for transformer circuits by investigating how one-layer transformers do addition.

Publications

AI Sandbagging: Language Models can Strategically Underperform on Evaluations

2024

We present a definition of sandbagging as "strategic underperformance on evaluations". We then show how models can be made to hide capabilities from evaluators by e.g. implanting backdoors and fine-tuning them to emulate weaker models.

Recent Blog Posts

View All→

[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

June 13, 2024

This post is a brief summary of the findings from our paper on AI sandbagging.

Skills

Large Language Models

PEFT Fine-Tuning, Synthetic Data Generation, Prompt Engineering

Machine Learning

PyTorch, HF transformers, NumPy, pandas

Software Development

Test Driven Development, Agile Methods, Git

Programming Languages

Python, TypeScript, Java