I am an AI safety researcher with an interest in evaluations, AI control, and model organisms of misalignment.
Previously, I was a software engineer but I decided to work on AI safety when I realized that ensuring the safety of powerful AI systems is one the most important challenges of our time. I spent time upskilling independently, via ARENA, and via AISH (now LASR) labs. I was part of the MATS 2024 winter cohort where I conducted research on AI sandbagging under Francis Rhys Ward. My work focused on demonstrating sandbagging capabilities and stress-testing elicitation techniques on LLMs with hidden capabilities (to be published soon). Currently, I am building an evaluation testing if AI systems can sandbag in spite of control mechanisms.
From April 2024 to September I was generously supported by a grant from the Long Term Future Fund. Currently, I am funded by the Frontier Model Forum's AI Safety Fund.
You can find a list of all my blog posts on the AlignmentForum, LessWrong, and Medium here.
2024 | Berkeley, USA
Conduct research on AI sandbagging and create novel model-organisms to stress-test elicitation of hidden capabilities in LLMs.
2023 | Vienna, Austria
Demonstrate scaling trends for consistency of stated beliefs and the capability to exploit flawed evaluators in LLM systems.
2023 | London, UK
Upskill in core AI-safety research engineering skills, such as fine-tuning, evaluating LMs, and Mechanistic Interpretability.
2021-2022 | Vienna, Austria
Development of an E-commerce platform.
2017-2021 | Imperial College London, UK
Implement Grouped Query Attention.
Add support for granular steering and batching.
I illustrate Anthropic's mathematical framework for transformer circuits by investigating how one-layer transformers do addition.
2024
We present a definition of sandbagging as "strategic underperformance on evaluations". We then show how models can be made to hide capabilities from evaluators by e.g. implanting backdoors and fine-tuning them to emulate weaker models.
June 13, 2024
This post is a brief summary of the findings from our paper on AI sandbagging.
PEFT Fine-Tuning, Synthetic Data Generation, Prompt Engineering
PyTorch, HF transformers, NumPy, pandas
Test Driven Development, Agile Methods, Git
Python, TypeScript, Java