Deep CPT-RL: Imparting Human-Like Risk Sensitivity to Artificial Agents
Current deep reinforcement learning (DRL) methods fail to address risk in an intelligent manner, potentially leading to unsafe behaviors when deployed. One strategy for improving agent risk management is to mimic human behavior. While imperfect, human risk processing displays two key benefits absent from standard artificial agents: accounting for rare but consequential events and incorporating context. The former ability may prevent catastrophic outcomes in unfamiliar settings, while the latter results in asymmetric processing of potential gains and losses. These two attributes have been quantified by behavioral economists and form the basis of cumulative prospect theory (CPT), a leading model of human decision-making. We introduce a two-step method for training DRL agents to maximize the CPT-value of full-episode rewards accumulated from an environment, rather than the standard practice of maximizing expected discounted rewards. We quantitatively compare the distribution of outcomes when optimizing full-episode expected reward, CPT-value, and conditional value-at-risk (CVaR) in the CrowdSim robot navigation environment, elucidating the impacts of different objectives on the agent's willingness to trade safety for speed. We find that properly configured maximization of CPT-value reduces the frequency of negative outcomes with only a slight degradation of the best outcomes, compared to maximization of expected reward.
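For reference, the CPT-value of a random outcome $X$ is commonly written in the following form (a standard formulation in the CPT literature; the paper's exact parameterization of the utility and weighting functions may differ):

```latex
\[
\mathrm{CPT}(X) \;=\; \int_{0}^{\infty} w^{+}\!\left(\Pr\!\left[u^{+}(X) > z\right]\right) dz
\;-\; \int_{0}^{\infty} w^{-}\!\left(\Pr\!\left[u^{-}(-X) > z\right]\right) dz
\]
```

Here $u^{+}$ and $u^{-}$ are utility functions applied to gains and losses relative to a reference point, capturing the asymmetric treatment of gains and losses, and $w^{+}$ and $w^{-}$ are probability weighting functions that distort cumulative probabilities, overweighting rare but consequential events. Setting $u^{\pm}$ to the identity and $w^{\pm}$ to the identity recovers the ordinary expectation $\mathbb{E}[X]$.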