Affiliations: The Institute of Scientific and Industrial Research, Osaka University 8-1, Mihogaoka, Ibaraki, Osaka, 567-0047, Japan E-mail: koichi@ai.sanken.osaka-u.ac.jp
Abstract: This work deals with Q-learning in a multiagent environment. There are many multiagent Q-learning methods, and most of them aim to converge to a Nash equilibrium, which is not desirable in games like the Prisoner's Dilemma (PD). However, normal Q-learning agents that use a stochastic method in choosing actions to avoid local optima may yield mutual cooperation in a PD game. Although such mutual cooperation usually occurs singly, it can be facilitated if the Q-function of cooperation becomes larger than that of defection after the cooperation. This work derives a theorem on how many consecutive repetitions of mutual cooperation are needed to make the Q-function of cooperation larger than that of defection. In addition, from the perspective of the author's previous works that discriminate utilities from rewards and use utilities for learning in PD games, this work also derives a corollary on how much utility is necessary to make the Q-function larger by one-shot mutual cooperation.