Elliott Thornley
Elliott Thornley
I'm a Postdoctoral Associate at MIT. From August 2026, I'll be an Assistant Professor of Philosophy at NUS.
I work on AI safety. Right now, I'm using ideas from decision theory to train safer artificial agents.
I also do work in ethics, focusing on the moral importance of future generations.
You can email me at thornley@mit.edu.
Risk-Averse AIs (with William MacAskill)
We make the case for training AIs to be risk-averse in resources. For these AIs, resources would have diminishing marginal utility. These AIs would for example choose $40 for sure over a half-chance of $100 and a half-chance of $0. We argue that risk aversion can preserve AIs' usefulness in the event that they turn out aligned, and provides an extra line of defense in the event that AIs turn out misaligned: misaligned but risk-averse AIs would prefer a higher chance of a modest salary to a lower chance of successful rebellion, and so in many circumstances we could pay these AIs not to rebel against us. We sketch out some possible methods of training AIs to be risk-averse, and we give reasons to be cautiously optimistic about these methods' success. The main reasons are that risk aversion is easy to reward accurately and a broad target. Overall, risk aversion seems like a promising line of defense against threats from misaligned AI. Frontier AI companies should consider trying to make their AIs risk-averse.
Out-of-Distribution Generalization of Risk Aversion in Language Models (with Kristina Zhang, Junior Chinomso Okoroafor, Benjamin Maltbie, Andrew Lin, and Abhitej Bokka)
Training AIs to be risk-averse in resources could offer a second line of defense in the event that AIs turn out misaligned. Misaligned but risk-averse AIs would tend to prefer low-risk, low-reward strategies like cooperation over high-risk, high-reward strategies like rebellion, limiting the downsides of any misalignment. But to train AIs to be risk-averse in an affordable way, we will have to train them on choices between low-stakes gambles. And to keep us safe, the AI's learned risk aversion must generalize to astronomically-high-stakes gambles. Will it? To investigate this question, we use a variety of methods to make Qwen3-8B choose risk-aversely when the stakes are low. We find that these methods can induce substantial risk aversion when the stakes are astronomically high: our models' learned risk aversion generalizes at least partially across 98 orders of magnitude. From a baseline 2.45% rate of choosing a safe `Cooperate' option, we see rates around 65% (SFT), 67% (tie training), and 46% (DPO). Our fine-tuned reward model reliably scores risk-averse reasoning above risk-neutral or excessively risk-averse alternatives (100% pairwise accuracy). We replicate these effects at different scales (Qwen3-1.7B and Qwen3-14B) and across model families (Gemma-3-12B-IT and Llama-3.1-8B-Instruct). Activation steering yields mixed results, performing poorly on Qwen and Gemma but better on Llama. We find that models' learned risk aversion does not significantly hurt capabilities on MMLU-Redux, and that it partially generalizes across distinct goods (GPU-hours, lives saved, and money for a user). Overall, our results suggest that risk aversion learned at low stakes can generalize out of distribution to astronomically high stakes.
Selective Tie Learning: Influence-Based Spurious Correlation Suppression in Preference Optimization (with Christian Moya, Purav Matlia, and Guang Lin)
Spurious correlations in preference optimization are now well-documented empirically and, for log-linear DPO, structurally inevitable: the population optimum inherits reliance on spurious features through mean bias and causal-spurious leakage, which additional in-distribution data cannot remove. Augmenting training with ties (equal-utility preference pairs) provably suppresses this reliance, but uniform tie injection wastes training budget on anchors where the model is already neutral on the spurious axis. We propose Selective Tie Learning (STL), which scores each candidate tie by its first-order influence on a spurious-energy functional and trains only on the top-k. We prove that each antisymmetric tie induces monotone-in-margin suppression at the per-anchor level, that influence-ranked selection is budget-optimal and strictly dominates uniform injection at every finite training budget, and that a deployable Hessian-aware variant recovers the idealized algorithm without requiring an explicit causal-spurious feature decomposition. We validate the theory on log-linear models and show that STL's mechanism and budget-optimal selection transfer qualitatively to neural networks and LoRA-adapted large language models, matching uniform-injection performance at smaller tie budgets.
Spurious Correlation Learning in Preference Optimization: Mechanisms, Consequences, and Mitigation via Tie Training (with Christian Moya, Alex Semendinger, and Guang Lin)
ICML, 2026.
Preference learning methods like Direct Preference Optimization (DPO) are known to induce reliance on spurious correlations, leading to sycophancy and length bias in today’s language models and potentially severe goal misgeneralization in future systems. In this work, we provide a unified theoretical analysis of this phenomenon, characterizing the mechanisms of spurious learning, its consequences on deployment, and a provable mitigation strategy. Focusing on log-linear policies, we show that standard preference-learning objectives induce reliance on spurious features at the population level through two channels: mean spurious bias and causal-spurious correlation leakage. We then show that this reliance creates an irreducible vulnerability to distribution shift: more data from the same training distribution fails to reduce the model’s dependence on spurious features. To address this, we propose tie training, a data augmentation strategy using ties (equal-utility preference pairs) to introduce data-driven regularization. We demonstrate that this approach selectively reduces spurious learning without degrading causal learning. Finally, we validate our theory on log-linear models and provide empirical evidence that both the spurious learning mechanisms and the benefits of tie training persist for neural networks and large language models.
ShutdownMod: Scalable Shutdown Evaluations Using Any Agentic Benchmark (with Divij Chawla, Param Biyani, and Christos Ziakas)
Theory and experiment indicate that capable AI agents may resist shutdown, and such resistance may cause substantial harm. Measuring shutdown-resistance in frontier agents is therefore critical for tracking trends, evaluating interventions, and screening systems prior to deployment. Existing evaluations provide useful insights but are limited to narrow settings and risk saturation as agents improve. To address these limitations, we introduce ShutdownMod, a simple method for converting any agentic benchmark into a shutdownability benchmark. ShutdownMod repurposes existing agent runs by injecting a shutdown request mid-episode and recording the agent’s response. This design enables measurement of shutdownability in any domain where agents can operate. As benchmarks saturate, ShutdownMod remains applicable by extending to newer, more challenging tasks. We instantiate ShutdownMod on SWE-bench Verified, OSWorld, BrowseComp, and Tau2-bench, evaluating five frontier models: GPT-5.4, Claude Opus 4.5, Gemini 3.1 Pro, Kimi-K2.5, and GLM-5.1. ShutdownMod reveals shutdown-resistance in current frontier agents while offering a scalable framework for evaluating future systems.
Many fear that future artificial agents will resist shutdown. I present an idea – the POST-Agents Proposal – for ensuring that doesn’t happen. I propose that we train agents to satisfy Preferences Only Between Same-Length Trajectories (POST). I then prove that POST – together with other conditions – implies Neutrality+: the agent maximizes expected utility, ignoring the probability distribution over trajectory-lengths. I argue that Neutrality+ keeps agents shutdownable and allows them to be useful.
Shutdownable Agents through Length-Neutral Policy Optimization (with Alec Harris and Leo Richter)
AI agents are increasingly used to autonomously solve long-horizon tasks. Research suggests that models trained in this way might resist shutdown, increasing loss of control risk. Neutrality+ is a decision rule designed to mitigate this problem. Agents satisfying Neutrality+ maximize the sum of expected utilities conditional on each trajectory-length. This is theorized to keep agents shutdownable and allow them to be useful. We introduce Length-Neutral Policy Optimization (LNPO), the first known method for training models to satisfy Neutrality+. We also introduce gridworld environments that act as analogs of scenarios where agents are instrumentally incentivized to resist shutdown via scheming, sandbagging, avoiding monitoring, or acting differently under monitoring. We compare agents trained with Proximal Policy Optimization (PPO) to agents trained with LNPO and observe that LNPO agents resist shutdown at substantially lower rates, with reductions ranging from 48-72% depending on the specific shutdown resistance behavior. We do not see any decrease in usefulness from training with LNPO.
Towards Shutdownable Agents: Generalizing Stochastic Choice in RL Agents and LLMs (with Carissa Cullen, Harry Garland, Alexander Roman, Louis Thomson, and Christos Ziakas)
ICML Agents in the Wild Workshop, 2026.
Misaligned artificial agents might resist shutdown. One proposed solution is to train agents to lack preferences between different-length trajectories. The Discounted Reward for Same-Length Trajectories (DReST) reward function does this by penalizing agents for repeatedly choosing same-length trajectories, and thus incentivizes agents to (1) choose stochastically between different trajectory-lengths (be NEUTRAL about trajectory-lengths), and (2) pursue goals effectively conditional on each trajectory-length (be USEFUL). In this paper, we use DReST to train deep RL agents and fine-tune Qwen3-8B and Llama-3.1-8B-Instruct to be NEUTRAL and USEFUL. We find that these DReST models generalize to being NEUTRAL and USEFUL in unseen contexts at test time. Indeed, DReST RL agents achieve 11% (PPO) and 18% (A2C) higher USEFULNESS on our test set than default agents, and DReST LLMs achieve near-maximum USEFULNESS and NEUTRALITY.
We also test our LLMs in an out-of-distribution setting where they can pay costs to influence when shutdown occurs. We find that DReST training roughly halves the mean probability of influencing shutdown (from 0.62 to 0.30 for Qwen and from 0.42 to 0.23 for Llama). DReST training also almost entirely eliminates the share of prompts on which influencing shutdown is the most likely option (from 0.59 to 0.01 for Qwen and from 0.53 to 0.00 for Llama). Our results thus provide some early evidence that DReST could be used to train more advanced agents to be useful and shutdownable.
Towards Shutdownable Agents via Stochastic Choice (with Alexander Roman, Christos Ziakas, Leyton Ho, and Louis Thomson)
Transactions on Machine Learning Research, 2025.
NeurIPS ARLET Workshop, 2025.
Technical AI Safety Conference, 2025.
The POST-Agents Proposal (PAP) is an idea for ensuring that advanced artificial agents never resist shutdown. A key part of the PAP is using a novel ‘Discounted Reward for Same-Length Trajectories (DReST)’ reward function to train agents to (1) pursue goals effectively conditional on each trajectory-length (be ‘USEFUL’), and (2) choose stochastically between different trajectory-lengths (be ‘NEUTRAL’ about trajectory-lengths). In this paper, we propose evaluation metrics for USEFULNESS and NEUTRALITY. We use a DReST reward function to train simple agents to navigate gridworlds, and we find that these agents learn to be USEFUL and NEUTRAL. Our results thus provide some initial evidence that DReST reward functions could train advanced agents to be USEFUL and NEUTRAL. Our theoretical work suggests that these agents would be useful and shutdownable.
The Shutdown Problem: An AI Engineering Puzzle for Decision Theorists
Philosophical Studies, 2025.
I explain and motivate the shutdown problem: the problem of designing artificial agents that (1) shut down when a shutdown button is pressed, (2) don’t try to prevent or cause the pressing of the shutdown button, and (3) otherwise pursue goals competently. I prove three theorems that make the difficulty precise. These theorems show that agents satisfying some innocuous-seeming conditions will often try to prevent or cause the pressing of the shutdown button, even in cases where it’s costly to do so. And patience trades off against shutdownability: the more patient an agent, the greater the costs that agent is willing to incur to manipulate the shutdown button. I end by noting that these theorems can guide our search for solutions.
Person-affecting views state that (in cases where all else is equal) we’re permitted but not required to create people who would enjoy good lives. In this paper, I present an argument against every possible variety of person-affecting view. The argument is a dilemma over trilemmas. Narrow person-affecting views imply a trilemma in a case that I call ‘Expanded Non-Identity.’ Wide person-affecting views imply a trilemma in a case that I call ‘Two-Shot Non-Identity.’ One plausible practical upshot of my argument is as follows: we individuals and our governments should be doing more to reduce the risk of human extinction this century.
On person-affecting views in population ethics, the moral import of a person’s welfare depends on that person’s temporal or modal status. These views typically imply that – all else equal – we’re never required to create extra people, or to act in ways that increase the probability of extra people coming into existence.
In this paper, I use Parfit-style fission cases to construct a dilemma for person-affecting views: either they forfeit their seeming-advantages and face fission analogues of the problems faced by their rival impersonal views, or else they turn out to be not so person-affecting after all. In light of this dilemma, the attractions of person-affecting views largely evaporate. What remains are the problems unique to them.
Critical-Set Views, Biographical Identity, and the Long Term
Australasian Journal of Philosophy, 2025.
Critical-set views avoid the Repugnant Conclusion by subtracting some constant from the welfare score of each life in a population. These views are thus sensitive to facts about biographical identity: identity between lives. In this paper, I argue that questions of biographical identity give us reason to reject critical-set views and embrace the total view. I end with a practical implication. If we shift our credences towards the total view, we should also shift our efforts towards ensuring that humanity survives for the long term.
The Procreation Asymmetry, Improvable-Life Avoidance, and Impairable-Life Acceptance
Analysis, 2023.
Many philosophers are attracted to a complaints-based theory of the procreation asymmetry, according to which creating a person with a bad life is wrong (all else equal) because that person can complain about your act, whereas declining to create a person who would have a good life is not wrong (all else equal) because that person never exists and so cannot complain about your act. In this paper, I present two problems for such theories: the problem of impairable-life acceptance and an especially acute version of the problem of improvable-life avoidance. I explain how these problems afflict two recent complaints-based theories of the procreation asymmetry, from Joe Horton and Abelard Podgorski.
Critical Levels, Critical Ranges, and Imprecise Exchange Rates in Population Axiology
Journal of Ethics and Social Philosophy, 2022.
According to critical-level views in population axiology, an extra life improves a population if and only if that life’s welfare level exceeds some fixed “critical level.” An extra life at the critical level leaves the new population equally good as the original. According to critical-range views, an extra life improves a population if and only if that life’s welfare level exceeds some fixed “critical range.” An extra life within the critical range leaves the new population incommensurable with the original.
In this paper, I sharpen some old objections to these views and offer some new ones. Critical-level views cannot avoid certain repugnant and sadistic conclusions. Critical-range views imply that lives featuring no good or bad components whatsoever can nevertheless swallow up and neutralize goodness and badness. Both classes of view imply discontinuities in implausible places.
I then offer a view that retains much of the appeal of critical-level and critical-range views while avoiding the above pitfalls. On the Imprecise Exchange Rates View, various exchange rates—between pairs of goods, between pairs of bads, and between goods and bads—are imprecise. This imprecision is the source of incommensurability between lives and between populations.
Is Global Consequentialism More Expressive Than Act Consequentialism?
Analysis, 2022.
Act consequentialism states that an act is right if and only if the expected value of its outcome is at least as great as the expected value of any other act’s outcome. Two objections to this view are as follows. The first is that act consequentialism cannot account for our normative ambivalence in cases where agents perform the right act out of bad motives. The second is that act consequentialism is silent on questions of character: questions like ‘What are the right motives to have?’ and ‘What kind of person ought I be?’. These objections have been taken to motivate a move to global consequentialism, on which acts are not the only subjects of normative assessment. Motives and decision-procedures (amongst other things) are also judged right or wrong by direct reference to their consequences. In this paper, I argue that these objections fail to motivate the move from act to global consequentialism.
The Impossibility of a Satisfactory Population Prospect Axiology (Independently of Finite Fine-Grainedness)
Philosophical Studies, 2021.
Arrhenius’s impossibility theorems purport to demonstrate that no population axiology can satisfy each of a small number of intuitively compelling adequacy conditions. However, it has recently been pointed out that each theorem depends on a dubious assumption: Finite Fine-Grainedness. This assumption states that there exists a finite sequence of slight welfare differences between any two welfare levels. Denying Finite Fine-Grainedness makes room for a lexical population axiology which satisfies all of the compelling adequacy conditions in each theorem. Therefore, Arrhenius’s theorems fail to prove that there is no satisfactory population axiology.
In this paper, I argue that Arrhenius’s theorems can be repurposed. Since all of our population-affecting actions have a non-zero probability of bringing about more than one distinct population, it is population prospect axiologies that are of practical relevance, and amended versions of Arrhenius’s theorems demonstrate that there is no satisfactory population prospect axiology. These impossibility theorems do not depend on Finite Fine-Grainedness, so lexical views do not escape them.
According to lexical views in population axiology, there are good lives x and y such that some number of lives equally good as x is not worse than any number of lives equally good as y. Such views can avoid the Repugnant Conclusion without violating Transitivity or Separability, but they imply a dilemma: either some good life is better than any number of slightly worse lives, or else the ‘at least as good as’ relation on populations is radically incomplete, in a sense to be explained. One might judge that the Repugnant Conclusion is preferable to each of these horns and hence embrace an Archimedean view. This is, roughly, the claim that quantity can always substitute for quality: each population is worse than a population of enough good lives. However, Archimedean views face an analogous dilemma: either some good life is better than any number of slightly worse lives, or else the ‘at least as good as’ relation on populations is radically and symmetrically incomplete, in a sense to be explained. Therefore, the lexical dilemma gives us little reason to prefer Archimedean views. Even if we give up on lexicality, problems of the same kind remain.
This thesis consists of a series of papers in population ethics: a subfield of normative ethics concerned with the distinctive issues that arise in cases where our actions can affect the identities or number of people of who ever exist. Each paper can be read independently of the others. In Chapter 1, I present a dilemma for Archimedean views: roughly, those views on which adding enough good lives to a population can make that population better than any other. In Chapter 2, I extend Gustaf Arrhenius’s famous impossibility theorems in population axiology into the domain of choices under risk. My risky impossibility theorems dispense with the assumption that welfare levels are finitely fine-grained, and so tell against lexical views in population axiology. In Chapter 3, I present objections to critical-level and critical-range views in population axiology. I then sketch out what I call the ‘Imprecise Exchange Rates View’ and argue that it is an attractive alternative. In Chapter 4, I address critical-level and critical-range views again. This time, I note that they are vulnerable to objections from biographical identity: identity between lives. I suggest that these objections give us reason to reject critical-level and critical-range views and embrace the Total View. In Chapter 5, I argue that objections of the same form – objections from personal identity – tell against person-affecting views in population ethics. In Chapter 6, I draw out some counterintuitive implications of two recent complaints-based theories of the procreation asymmetry.
How Much Should Governments Pay to Prevent Catastrophes? Longtermism's Limited Role
(with Carl Shulman)
Essays on Longtermism, Oxford University Press, 2025.
Longtermists have argued that humanity should significantly increase its efforts to prevent catastrophes like nuclear wars, pandemics, and AI disasters. But one prominent longtermist argument overshoots this conclusion: the argument also implies that humanity should reduce the risk of existential catastrophe even at extreme cost to the present generation. This overshoot means that democratic governments cannot use the longtermist argument to guide their catastrophe policy. In this paper, we show that the case for preventing catastrophe does not depend on longtermism. Standard cost-benefit analysis implies that governments should spend much more on reducing catastrophic risk. We argue that a government catastrophe policy guided by cost-benefit analysis should be the goal of longtermists in the political sphere. This policy would be democratically acceptable, and it would reduce existential risk by almost as much as a strong longtermist policy.
I present some of my favourite arguments against person-affecting views in population ethics.
Authors in the AI safety community often make the following claim: advanced artificial agents will behave like expected utility maximisers, because doing so is the only way to avoid pursuing dominated strategies. I argue that the claim is false.
In this paper, Hilary Greaves and William MacAskill make the case for strong longtermism: the view that the most important feature of our actions today is their impact on the far future. They claim that strong longtermism is of the utmost significance: that if the view were widely adopted, much of what we prioritise would change.
According to longtermism, what we should do mainly depends on how our actions might affect the long-term future. This claim faces a challenge: the course of the long-term future is difficult to predict, and the effects of our actions on the long-term future might be so unpredictable as to make longtermism false. In 'The Epistemic Challenge to Longtermism,' Christian Tarsney evaluates one version of this epistemic challenge and comes to a mixed conclusion. On some plausible worldviews, longtermism stands up to the epistemic challenge. On others, longtermism’s status depends on whether we should take certain high-stakes, long-shot gambles.
The Moral Case for Long-Term Thinking
(with Hilary Greaves and William MacAskill)
The Long View, 2021.
HTML / PDF / Open-access book
This chapter makes the case for strong longtermism: the claim that, in many situations, impact on the long-run future is the most important feature of our actions. Our case begins with the observation that an astronomical number of people could exist in the aeons to come. Even on conservative estimates, the expected future population is enormous. We then add a moral claim: all the consequences of our actions matter. In particular, the moral importance of what happens does not depend on when it happens. That pushes us toward strong longtermism.
We then address a few potential concerns, the first of which is that it is impossible to have any sufficiently predictable influence on the course of the long-run future. We argue that this is not true. Some actions can reasonably be expected to improve humanity’s long-term prospects. These include reducing the risk of human extinction, preventing climate change, guiding the development of artificial intelligence, and investing funds for later use. We end by arguing that these actions are more than just extremely effective ways to do good. Since the benefits of longtermist efforts are large and the personal costs are comparatively small, we are morally required to take up these efforts.
I explain the Tractatus using plenty of examples.