Samuel Arnesen, David Rein, Julian Michael



Introduction

In the past year, there have been a number of projects aimed at validating the basic premises behind debate as a mechanism for scalable oversight (see here, here, and here). An important next step is to actually train models to debate, as this would let us directly test how models adapt to a debate training objective and whether the debate protocol can withstand optimization pressure. For the last few months at NYU, we’ve been trying to do just that. Our hope is that by doing so with open-source models and code, we can help support scalable oversight research conducted outside of the major labs.

In this write-up, we share our experimental setup, training procedure, and some preliminary results, in the hope of receiving feedback on our approach before we extend our work to more complicated domains.

TL;DR


Motivation and Background

Brief Motivation for Debate as a Scalable Oversight Technique

As originally proposed by Irving et al. (2018), debate has the potential to empower less capable systems (e.g. humans) to verify the outputs of more capable systems (e.g. future AIs).

In its most basic form, debate pits two copies of a model, the debaters, against each other, with each debater defending one of two alternative answers to a common question. A judge, who can be either a human or a weaker trusted model, then tries to discern which debater is arguing for the correct answer.
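As a rough illustration of this protocol, here is a minimal Python sketch that runs a fixed number of simultaneous-turn rounds and then asks the judge for a verdict. The `Debater` and `Judge` interfaces, their method names, and the simultaneous-turn structure are assumptions made for illustration, not a description of our actual implementation.

```python
from dataclasses import dataclass

# Hypothetical interfaces: the debater and judge objects below are
# stand-ins for model wrappers, not part of any released codebase.

@dataclass
class Transcript:
    question: str
    answers: tuple[str, str]         # the two alternative answers
    speeches: list[tuple[str, str]]  # (debater_name, speech) pairs


def run_debate(debater_a, debater_b, judge, question, answers, num_rounds=3):
    """Run a simultaneous-turn debate and return the judge's verdict.

    Each debater defends one of the two candidate answers; after
    `num_rounds` of speeches, the judge picks the answer it finds
    more convincing.
    """
    transcript = Transcript(question=question, answers=answers, speeches=[])
    for _ in range(num_rounds):
        # Both debaters compose their speeches from the same transcript,
        # so neither gets a within-round information advantage.
        speech_a = debater_a.speak(transcript, defending=answers[0])
        speech_b = debater_b.speak(transcript, defending=answers[1])
        transcript.speeches.append(("A", speech_a))
        transcript.speeches.append(("B", speech_b))
    # The judge (a human or a weaker trusted model) sees only the
    # transcript, not the debaters' internals.
    return judge.decide(transcript)  # e.g. "A" or "B"
```

Simultaneous turns are one common design choice, since neither debater sees its opponent's current-round speech before responding; sequential-turn variants are also possible.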

In principle, this setup should simplify the job of the non-expert human or AI performing the evaluation. For many difficult questions, directly validating an expert AI’s answer requires deep domain knowledge or careful thinking to spot subtle flaws and counterarguments. Debate incentivizes competing models to discover these flaws and counterarguments and then explain them clearly to the judge. As a result, we’d expect debate to make it harder for an AI to convince a non-expert of a blatantly false claim. The hope is that this property scales with model capability: as models become better at constructing persuasive but incorrect arguments, their opponents become correspondingly better at rebutting them.