Felix Hofstätter

About Me

I am an AI safety researcher with an interest in evaluations, AI control, and model organisms of misalignment.

Previously, I was a software engineer but I decided to work on AI safety when I realized that ensuring the safety of powerful AI systems is one the most important challenges of our time. I spent time upskilling independently, via ARENA, and via AISH (now LASR) labs. I was part of the MATS 2024 winter cohort where I conducted research on AI sandbagging under Francis Rhys Ward. My work focused on demonstrating sandbagging capabilities and stress-testing elicitation techniques on LLMs with hidden capabilities (to be published soon). Currently, I am building an evaluation testing if AI systems can sandbag in spite of control mechanisms.

From April 2024 to September I was generously supported by a grant from the Long Term Future Fund. Currently, I am funded by the Frontier Model Forum's AI Safety Fund.

You can find a list of all my blog posts on the AlignmentForum, LessWrong, and Medium here.

Download Full CV

Experience & Education

MATS Scholar

2024 | Berkeley, USA

Conduct research on AI sandbagging and create novel model-organisms to stress-test elicitation of hidden capabilities in LLMs.

AI Safety Hub Labs

2023 | Vienna, Austria

Demonstrate scaling trends for consistency of stated beliefs and the capability to exploit flawed evaluators in LLM systems.

ARENA Research Engineering Bootcamp

2023 | London, UK

Upskill in core AI-safety research engineering skills, such as fine-tuning, evaluating LMs, and Mechanistic Interpretability.

Software Consultant, TNG Technology Consulting

2021-2022 | Vienna, Austria

Development of an E-commerce platform.

MEng Mathematics & Computer Science (1st class Hons)

2017-2021 | Imperial College London, UK

Technical Projects

Open Source Contributions

Mechanistic Interpretability

Publications

AI Sandbagging: Language Models can Strategically Underperform on Evaluations

2024

We present a definition of sandbagging as "strategic underperformance on evaluations". We then show how models can be made to hide capabilities from evaluators by e.g. implanting backdoors and fine-tuning them to emulate weaker models.

Recent Blog Posts

View All
[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

June 13, 2024

This post is a brief summary of the findings from our paper on AI sandbagging.

Skills

Large Language Models

PEFT Fine-Tuning, Synthetic Data Generation, Prompt Engineering

Machine Learning

PyTorch, HF transformers, NumPy, pandas

Software Development

Test Driven Development, Agile Methods, Git

Programming Languages

Python, TypeScript, Java