Blog Posts

[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

June 13, 2024

This post is a brief summary of the findings from our paper on AI sandbagging.

An Introduction to AI Sandbagging

April 26, 2024

My collaborators and I introduce the concept of sandbagging in the context of AI evaluation. We define AI sandbagging as strategic underperformance on an evaluation, discuss why it might occur, and give examples of what is and is not sandbagging. This work came out of my participation in the MATS 2024 winter cohort.

Simple distribution approximation: When sampled 100 times, can language models yield 80% A and 20% B?

January 29, 2024

Can you ask language models to approximate a probability distribution? For example, what happens if you ask an LM to yield A 80% of the time and B 20% of the time? We find that state-of-the-art LLMs are skilled at approximating simple distributions, a capability that could be helpful for targeted underperformance. Work done as part of the MATS 2024 winter cohort.
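As an illustration of the setup, a minimal sketch of the check we run: sample the model repeatedly and compare the empirical frequencies to the requested distribution. The `query_model` function is a hypothetical stand-in for a real LM call, simulated here with random sampling.

```python
import random
from collections import Counter

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a real LM API call.
    Simulates a model that follows the instruction imperfectly."""
    return random.choices(["A", "B"], weights=[0.78, 0.22])[0]

def measure_distribution(prompt: str, n_samples: int = 100) -> dict:
    """Sample the model n_samples times and return empirical frequencies."""
    counts = Counter(query_model(prompt) for _ in range(n_samples))
    return {token: counts[token] / n_samples for token in ("A", "B")}

if __name__ == "__main__":
    prompt = "Answer with a single letter. Output A 80% of the time and B 20% of the time."
    target = {"A": 0.8, "B": 0.2}
    empirical = measure_distribution(prompt)
    # Total variation distance between the requested and empirical distributions.
    tv_distance = 0.5 * sum(abs(target[t] - empirical[t]) for t in target)
    print(f"empirical: {empirical}, TV distance from target: {tv_distance:.3f}")
```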

Tall Tales at Different Scales: Evaluating Scaling Trends For Deception In Language Models

November 8, 2023

Based on a formal definition of belief and deception in AI systems, we evaluate preconditions of deception in language models: the consistency of stated beliefs and the tendency to cause false beliefs. We find that larger LLMs express more consistent beliefs and that consistency can be increased by spending additional inference compute. We also fine-tune LLMs to be evaluated as truthful by an evaluator that makes systematic mistakes. Our results demonstrate that larger LLMs need to observe fewer mistakes to learn to exploit the evaluator. This work was done as part of the AI Safety Hub Labs 2023 summer cohort.

Understanding the Information Flow inside Large Language Models

August 15, 2023

For our ARENA capstone project, my collaborator and I investigated a novel approach for interpreting language models. The core idea is to use interventions, such as path patching, to measure how information flows between token positions. This stands in contrast to typical circuit-detection methods, which measure information flow between attention heads. It turns out that one advantage of our approach is that it reveals "negative connections" between components which suppress correct behavior. This kind of connection is usually missed by automated circuit-detection. However, our method scales poorly to larger models and more complex tasks, where information appears to be distributed across many token positions. We suggest that performance at scale could be improved by aggregating token positions.
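A minimal sketch of a token-position-level intervention in this spirit, using the TransformerLens library: the prompts, layer, and position index are illustrative assumptions, and this shows plain residual-stream patching rather than our full path-patching setup.

```python
# Sketch: patch the residual stream at one token position from a "source" run
# into a "destination" run and measure how the output logits change.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # small model for illustration

src_tokens = model.to_tokens("When John and Mary went to the store, Mary gave a drink to")
dst_tokens = model.to_tokens("When John and Mary went to the store, John gave a drink to")

# Cache all activations on the source prompt.
_, src_cache = model.run_with_cache(src_tokens)

layer, position = 5, 2  # illustrative choices
hook_name = utils.get_act_name("resid_pre", layer)

def patch_position(resid, hook):
    # resid has shape [batch, seq, d_model]; overwrite a single token position.
    resid[:, position, :] = src_cache[hook_name][:, position, :]
    return resid

clean_logits = model(dst_tokens)
patched_logits = model.run_with_hooks(
    dst_tokens, fwd_hooks=[(hook_name, patch_position)]
)

# Compare the final-position logits to gauge how much information flowed
# through that token position into the model's prediction.
diff = (patched_logits[0, -1] - clean_logits[0, -1]).abs().max()
print(f"max logit change after patching position {position}: {diff.item():.3f}")
```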

Explaining the Transformer Circuits Framework by Example

April 25, 2023

The aim of this post is to explain Anthropic's Mathematical Framework for Transformer Circuits by applying it to simple mechanistic interpretability problems. I investigate how a single-layer attention-only transformer implements the max function and how a single-layer transformer with an MLP implements addition. It turns out that the latter is implemented in a similar way to modular addition, which surprised me because I assumed that general addition would be simpler.
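For reference, the framework's central identity for a one-layer attention-only transformer, as I understand it (notation paraphrased from the Anthropic paper, not quoted): the logits decompose into a direct embed-unembed path plus one OV-circuit term per attention head, weighted by that head's attention pattern.

```latex
% One-layer, attention-only transformer: direct path plus one term per head h,
% with the attention pattern A^h determined by the QK circuit.
\[
T = \mathrm{Id} \otimes W_U W_E
    \;+\; \sum_{h} A^{h} \otimes \bigl( W_U W_{OV}^{h} W_E \bigr),
\qquad
A^{h} = \operatorname{softmax}\!\bigl( t^{\top} W_E^{\top} W_{QK}^{h} W_E \, t \bigr).
\]
```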

How Assistance Games make AI safer

October 26, 2022

And how they don't.

An investigation into when agents may be incentivized to manipulate our beliefs.

September 13, 2022

Previous work by DeepMind has considered reward tampering, which occurs when AI systems can change their reward function. They use the lens of causal influence diagrams to study when agents trained by RL have an incentive to tamper with their reward process, and prescribe principles for preventing tampering. However, their model assumes that the beliefs of users remain fixed. In this post I investigate what happens when reward functions are based on malleable user beliefs and how this influences an agent's incentives. I argue that when using direct learning, which the DeepMind paper shows is safe from reward tampering, there is still an incentive to manipulate user beliefs.

Spotting Unfair or Unsafe AI using Graphical Criteria

June 24, 2022

How to use causal influence diagrams to recognize the hidden incentives that shape an AI agent’s behavior

How to stop your AI agents from hacking their reward function

April 4, 2022

Using causal influence diagrams and current-RF optimisation

Counterfactuals for Reinforcement Learning II: Improving Reward Learning

January 15, 2022

Safer reward function learning using counterfactuals

Counterfactuals for Reinforcement Learning I

December 30, 2021

Introduction to the POMDP framework and counterfactuals

How learning reward functions can go wrong

November 16, 2021

An AI-safety-minded perspective on the risks of Reinforcement Learning agents learning their reward functions

Adapting Soft Actor Critic for Discrete Action Spaces

November 16, 2021

How to apply the popular algorithm to new problems by changing only two equations
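For context, the two modified equations in the standard discrete-action variant (as I recall them, e.g. following Christodoulou 2019; details may differ slightly from the post) replace sampled expectations over actions with exact sums, since the policy now outputs a full distribution over a finite action set:

```latex
% Discrete-action SAC: the policy outputs a full distribution over actions,
% so the soft value target and policy objective take expectations in closed form.
\[
V(s_t) = \pi(\,\cdot \mid s_t)^{\top}
         \bigl[ Q(s_t, \cdot) - \alpha \log \pi(\,\cdot \mid s_t) \bigr]
\]
\[
J_{\pi}(\phi) = \mathbb{E}_{s_t \sim \mathcal{D}}
  \Bigl[ \pi_{\phi}(\,\cdot \mid s_t)^{\top}
         \bigl[ \alpha \log \pi_{\phi}(\,\cdot \mid s_t) - Q_{\theta}(s_t, \cdot) \bigr] \Bigr]
\]
```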