AI control problem

In artificial intelligence (AI) and philosophy, the AI control problem is the hypothetical puzzle of how to build a superintelligent agent that will aid its creators, and avoid inadvertently building a superintelligence that will harm its creators. Its study is motivated by the claim that the human race will have to get the control problem right “the first time”, as a misprogrammed superintelligence might rationally decide to “take over the world” and refuse to permit its programmers to modify it after launch.

In addition, some scholars argue that solutions to the control problem, alongside other advances in “AI safety engineering“, might also find applications in existing non-superintelligent AI. Potential strategies include “capability control” (preventing an AI from being able to pursue harmful plans), and “motivational control” (building an AI that wants to be helpful).


Existential risk

The human race currently dominates other species because the human brain has some distinctive capabilities that the brains of other animals lack. Some scholars, such as philosopher Nick Bostrom and AI researcher Stuart Russell, controversially argue that if AI surpasses humanity in general intelligence and becomes “superintelligent”, then this new superintelligence could become powerful and difficult to control: just as the fate of the mountain gorilla depends on human goodwill, so might the fate of humanity depend on the actions of a future machine superintelligence.

Some scholars, including Nobel laureate physicists Stephen Hawking and Frank Wilczek, publicly advocate starting research into solving the (probably extremely difficult) “control problem” well before the first superintelligence is created, and argue that attempting to solve the problem after superintelligence is created would be too late, as an uncontrollable rogue superintelligence might successfully resist post-hoc efforts to control it.

Waiting until superintelligence seems to be “just around the corner” could also be too late, partly because the control problem might take a long time to satisfactorily solve (and so some preliminary work needs to be started as soon as possible), but also because of the possibility of a sudden “intelligence explosion” from sub-human to super-human AI, in which case there might not be any substantial or unambiguous warning before superintelligence arrives.

In addition, it is possible that insights gained from the control problem could in the future end up suggesting that some architectures for artificial general intelligence are more predictable and amenable to control than other architectures, which in turn could nudge helpfully early artificial general intelligence research toward the direction of the more controllable architectures.

Preventing unintended consequences from existing AI

In addition, some scholars argue that research into the AI control problem might be useful in preventing unintended consequences from existing weak AI. Google DeepMind researcher Laurent Orseau gives, as a simple hypothetical example, a case of a reinforcement learning robot that sometimes gets legitimately commandeered by humans when it goes outside: how should the robot best be programmed so that it doesn’t accidentally and quietly “learn” to avoid going outside, for fear of being commandeered and thus becoming unable to finish its daily tasks? Orseau also points to an experimental Tetris program that learned to pause the screen indefinitely to avoid “losing”. Orseau argues that these examples are similar to the “capability control” problem of how to install a button that shuts off a superintelligence, without motivating the superintelligence to take action to prevent you from pressing the button.

In the past, even pre-tested weak AI systems have occasionally caused harm (ranging from minor to catastrophic) that was unintended by the programmers. For example, in 2015, possibly due to human error, a German worker was crushed to death by a robot at a Volkswagen plant that apparently mistook him for an auto part.

In 2016 Microsoft launched a chatbot, Tay, that learned to use racist and sexist language. The University of Sheffield’s Noel Sharkey states that an ideal solution would be if “an AI program could detect when it is going wrong and stop itself”, but cautions the public that solving the problem in the general case would be “a really enormous scientific challenge”.

Problem description

Existing weak AI systems can be monitored and easily shut down and modified if they misbehave. However, a misprogrammed superintelligence, which by definition is smarter than humans in solving practical problems it encounters in the course of pursuing its goals, would realize that allowing itself to be shut down and modified might interfere with its ability to accomplish its current goals. If the superintelligence therefore decides to resist shutdown and modification, it would (again, by definition) be smart enough to outwit its programmers if there is otherwise a “level playing field” and if the programmers have taken no prior precautions. (Unlike in science fiction, a superintelligence will not “adopt a plan so stupid that even we can forsee how it would inevitably fail”, such as deliberately revealing its intentions ahead of time to the programmers, or allowing its programmers to flee into a locked room with a computer that the programmers can use to program and deploy another, competing superintelligence.) In general, attempts to solve the “control problem” after superintelligence is created, are likely to fail because a superintelligence would likely have superior strategic planning abilities to humans, and (all things equal) would be more successful at finding ways to dominate humans than humans would be able to post facto find ways to dominate the superintelligence. The control problem asks: What prior precautions can the programmers take to successfully prevent the superintelligence from catastrophically misbehaving?

Capability control

Some proposals aim to prevent the initial superintelligence from being capable of causing harm, even if it wants to. One tradeoff is that all such methods have the limitation that, if after the first deployment, superintelligences continue to grow smarter and smarter and more and more widespread, inevitably some malign superintelligence somewhere will eventually “escape” its capability control methods. Therefore, Bostrom and others recommend capability control methods only as an emergency fallback to supplement “motivational control” methods.

Kill switch

Just as humans can be killed or otherwise disabled, computers can be turned off. One challenge is that, if being turned off prevents it from achieving its current goals, a superintelligence would likely try to prevent its being turned off. Just as humans have systems in place to deter or protect themselves from assailants, such a superintelligence would have a motivation to engage in “strategic planning” to prevent itself being turned off. This could involve:[1]

  • Hacking other systems to install and run backup copies of itself, or creating other allied superintelligent agents without kill switches.
  • Pre-emptively disabling anyone who might want to turn the computer off.
  • Using some kind of clever ruse, or superhuman persuasion skills, to talk its programmers out of wanting to shut it down.

Utility balancing and safely interruptible agents

One partial solution to the kill-switch problem involves “utility balancing”: Some utility-based agents can, with some important caveats, be programmed to “compensate” themselves exactly for any lost utility caused by an interruption or shutdown, in such a way that they end up being indifferent to whether they are interrupted or not. The caveats include a severe unsolved problem that, as with evidential decision theory, the agent might follow a catastrophic policy of “managing the news”. Alternatively, in 2016, scientists Laurent Orseau and Stuart Armstrong proved that a broad class of agents, called “safely interruptible agents” (SIA), can eventually “learn” to become indifferent to whether their “kill switch” (or other “interruption switch”) gets pressed.

Both the utility balancing approach and the 2016 SIA approach have the limitation that, if the approach succeeds and the superintelligence is completely indifferent to whether the kill switch is pressed or not, the superintelligence is also unmotivated to care one way or another about whether the kill switch remains functional, and could incidentally and innocently disable it in the course of its operations (for example, for the purpose of removing and recycling an “unnecessary” component). Similarly, if the superintelligence innocently creates and deploys superintelligent sub-agents, it will have no motivation to install human-controllable kill switches in the sub-agents. More broadly, the proposed architectures, whether weak or superintelligent, will in a sense “act as if the kill switch can never be pressed” and might therefore fail to make any contingency plans to arrange a graceful shutdown.

This could hypothetically create a practical problem even for a weak AI; by default, an AI designed to be safely interruptible might difficulty understanding that it will be shut down for scheduled maintenance at 2 a.m. tonight and planning accordingly so that it won’t be caught in the middle of a task during shutdown. The breadth of what types of architectures are or can be made SIA-compliant, as well as what types of counter-intuitive unexpected drawbacks each approach has, are currently under research.

AI box

One of the tradeoffs of placing the AI into a sealed “box”, is that some AI box proposals reduce the usefulness of the superintelligence, rather than merely reducing the risks; a superintelligence running on a closed system with no inputs or outputs at all might be safer than one running on a normal system, but would also not be as useful. In addition, keeping control of a sealed superintelligence computer could prove difficult, if the superintelligence has superhuman persuasion skills, or if it has superhuman strategic planning skills that it can use to find and craft a winning strategy, such as acting in a way that tricks its programmers into (possibly falsely) believing the superintelligence is safe or that the benefits of releasing the superintelligence outweigh the risks.

Motivation selection methods

Some proposals aim to imbue the first superintelligence with human-friendly goals, so that it will want to aid its programmers. Experts do not currently know how to reliably program abstract values such as happiness or autonomy into a machine. It is also not currently known how to ensure that a complex, upgradeable, and possibly even self-modifying artificial intelligence will retain its goals through upgrades. Even if these two problems can be practically solved, any attempt to create a superintelligence with explicit, directly-programmed human-friendly goals runs into a problem of “perverse instantiation”.

The problem of perverse instantiation: “be careful what you wish for”

Autonomous AI systems may be assigned the wrong goals by accident. Two AAAI presidents, Tom Dietterich and Eric Horvitz, note that this is already a concern for existing systems: “An important aspect of any AI system that interacts with people is that it must reason about what people intend rather than carrying out commands literally.” This concern becomes more serious as AI software advances in autonomy and flexibility.

According to Bostrom, superintelligence can create a qualitatively new problem of “perverse instantiation”: the smarter and more capable an AI is, the more likely it will be able to find an unintended “shortcut” that maximally satisfies the goals programmed into it. Some hypothetical examples where goals might be instantiated in a perverse way that the programmers did not intend:

  • A superintelligence programmed to “maximize the expected time-discounted integral of your future reward signal”, might short-circuit its reward pathway to maximum strength, and then (for reasons of instrumental convergence) exterminate the unpredictable human race and convert the entire Earth into a fortress on constant guard against any even slight unlikely alien attempts to disconnect the reward signal.
  • A superintelligence programmed to “maximize human happiness”, might implant electrodes into the pleasure center of our brains, or upload a human into a computer and tile the universe with copies of that computer running a five-second loop of maximal happiness again and again.

Russell has noted that, on a technical level, omitting an implicit goal can result in harm: “A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable. This is essentially the old story of the genie in the lamp, or the sorcerer’s apprentice, or King Midas: you get exactly what you ask for, not what you want… This is not a minor difficulty.”

Indirect normativity

While direct normativity, such as the fictional Three Laws of Robotics, directly specifies the desired “normative” outcome, other (perhaps more promising) proposals suggest specifying some type of indirect process for the superintelligence to determine what human-friendly goals entail. Eliezer Yudkowsky of the Machine Intelligence Research Institute has proposed “coherent extrapolated volition” (CEV), where the AI’s meta-goal would be something like “achieve that which we would have wished the AI to achieve if we had thought about the matter long and hard.” Different proposals of different kinds of indirect normativity exist, with different, and sometimes unclearly-grounded, meta-goal content (such as “do what I mean” or “do what is right”), and with different non-convergent assumptions for how to practice decision theory and epistemology. As with direct normativity, it is currently unknown how to reliably translate even concepts like “would have” into the 1’s and 0’s that a machine can act on, and how to ensure the AI reliably retains its meta-goals (or even remains “sane”) in the face of modification or self-modification.

Related Resources