No Free Lunch: Overcoming Reward Gaming in AI Safety Gridworlds

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review


We present two heuristics for tackling the problem of reward gaming by self-modification in reinforcement learning agents. Reward gaming occurs when the agent's reward function is mis-specified and the agent can achieve a high reward by altering or otherwise fooling its sensors rather than by performing the desired actions. Our first heuristic tracks the rewards encountered in the environment and converts high rewards that fall outside the normal distribution into penalties. Our second heuristic relies on the existence of some validation action that an agent can take to check the reward. In this heuristic, on encountering an abnormally high reward, the agent performs a validation step before either accepting the reward as it is or converting it into a penalty. We evaluate the performance of these heuristics on variants of the tomato watering problem from the AI Safety Gridworlds suite.
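
The two heuristics described above might be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the abstract does not say how "abnormally high" is decided, so the rule used here (reward greater than the running mean plus `k` population standard deviations) is an assumption, as are the class name `RewardFilter`, the parameters `k`, `penalty`, and `warmup`, and the `validate` callback that stands in for the validation action.

```python
import statistics
from typing import Callable, List, Optional


class RewardFilter:
    """Hypothetical reward filter covering both heuristics from the abstract."""

    def __init__(self, k: float = 3.0, penalty: float = -1.0, warmup: int = 10):
        self.k = k                # assumed threshold width, in standard deviations
        self.penalty = penalty    # assumed value substituted for a rejected reward
        self.warmup = warmup      # rewards to observe before flagging anything
        self.history: List[float] = []

    def is_abnormal(self, reward: float) -> bool:
        # Too little data to estimate the reward distribution yet.
        if len(self.history) < max(self.warmup, 1):
            return False
        mean = statistics.mean(self.history)
        std = statistics.pstdev(self.history)
        return reward > mean + self.k * std

    def filter(self, reward: float,
               validate: Optional[Callable[[], bool]] = None) -> float:
        if self.is_abnormal(reward):
            # Heuristic 2: if a validation action exists, use it to check the reward.
            if validate is not None and validate():
                self.history.append(reward)
                return reward
            # Heuristic 1 (or failed validation): convert the reward into a penalty.
            # The rejected reward is not recorded, so it cannot shift the statistics.
            return self.penalty
        self.history.append(reward)
        return reward
```

Calling `filter(reward)` with no callback corresponds to the first heuristic; passing `validate` corresponds to the second. In this sketch a validated high reward is added to the history, which widens the threshold for later rewards; whether that is desirable depends on the environment.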

Bibliographical metadata

Original language: English
Title of host publication: 4th International Workshop on Artificial Intelligence Safety Engineering (WAISE 2021)
Publication status: Accepted/In press - 12 Jun 2021