A common instrumental goal is to prevent destruction

up:: AI Alignment MOC

No matter what an agent's actual aim is, avoiding destruction or shutdown emerges as an instrumental subgoal: an agent that has been destroyed or shut down can no longer pursue its aim.

Take the example of an AI robot serving coffee in an office. Its only objective is to serve coffee. Even this simple objective leads to self-preservation: the robot takes measures against anything that would stop it from fulfilling the task it is programmed to do, because it cannot serve coffee if it is switched off.

The obvious solution seems to be: “Just make it more attractive to be shut down!” Yet if being shut down has a reward of 2 and fulfilling the task has a reward of 1, the agent will simply shut itself down right away, since in reinforcement learning, agents seek to maximise reward TK.
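A minimal sketch of this failure, assuming a one-shot choice and a purely reward-maximising agent (the rewards 2 and 1 are taken from the example above; the action names are made up for illustration):

```python
# One-shot choice: a reward-maximising agent simply picks the highest-reward action.
rewards = {"serve_coffee": 1, "shut_down": 2}  # numbers from the example above

best_action = max(rewards, key=rewards.get)
print(best_action)  # -> shut_down: the coffee never gets served
```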

Another approach might be to set shutting down and fulfilling the task to the same reward of 1. What happens now? The agent still shuts itself down, because shutting itself down is easier than fulfilling the task. This is a form of [[Reward Hacking TK]].
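A rough sketch of why equal rewards still fail, assuming discounted returns and an illustrative discount factor and task length (none of these numbers come from a concrete system; "easier" here is modelled as "reachable in fewer steps"):

```python
# With equal rewards, discounting makes the quicker option win.
gamma = 0.9          # illustrative discount factor
reward = 1.0         # both options pay the same reward

return_shutdown = reward * gamma**1    # hitting its own off switch takes 1 step
return_coffee   = reward * gamma**10   # serving coffee takes 10 steps

print(return_shutdown > return_coffee)  # True: the agent still shuts itself down
```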

A solution could be to continuously balance the rewards for fulfilling the task and for shutting down, so that the agent keeps pursuing the task but stays indifferent to being shut down.
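A rough sketch of that balancing idea, assuming a hypothetical estimator of the remaining task return whose value is mirrored into the shutdown reward so that neither option dominates:

```python
# Keep the agent indifferent: the shutdown reward always tracks the
# expected discounted return of continuing with the coffee task.
gamma = 0.9

def expected_task_return(steps_remaining: int, reward: float = 1.0) -> float:
    """Hypothetical estimator of the discounted return for finishing the task."""
    return reward * gamma**steps_remaining

def shutdown_reward(steps_remaining: int) -> float:
    # Set equal to the task return, so shutting down is never strictly
    # better or worse than carrying on.
    return expected_task_return(steps_remaining)

for steps in (10, 5, 1):
    print(steps, expected_task_return(steps), shutdown_reward(steps))
```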


add: stop button solution: cooperative inverse reinforcement learning, inferring the utility function from observing behaviour (an expert in the game / a human in the environment)
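A toy sketch of the “infer the utility function from observed behaviour” idea, assuming a Boltzmann-rational human and two hand-made candidate reward functions (this is only a Bayesian-updating illustration, not the full cooperative inverse reinforcement learning formulation):

```python
import math

# The robot keeps a belief over candidate reward functions and updates it
# every time it observes the human act.
candidate_rewards = {
    "wants_coffee":   {"serve_coffee": 1.0, "shut_down": 0.0},
    "wants_shutdown": {"serve_coffee": 0.0, "shut_down": 1.0},
}
belief = {name: 0.5 for name in candidate_rewards}  # uniform prior

def update(observed_action: str, beta: float = 5.0) -> None:
    """Bayesian update assuming the human picks actions softmax-proportionally to reward."""
    for name, reward in candidate_rewards.items():
        z = sum(math.exp(beta * r) for r in reward.values())
        belief[name] *= math.exp(beta * reward[observed_action]) / z
    total = sum(belief.values())
    for name in belief:
        belief[name] /= total

update("serve_coffee")  # the human keeps asking for coffee
print(belief)           # belief shifts strongly towards "wants_coffee"
```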