On the Study of Cooperative Multi-Agent Policy Gradient
Abstract
Reinforcement Learning (RL) for decentralized partially observable Markov decision
processes (Dec-POMDPs) is lagging behind the spectacular breakthroughs of single-agent RL.
This is because assumptions that hold in single-agent settings are often invalid in decentralized
multi-agent systems. To tackle this issue, we investigate the foundations of policy gradient methods
within the centralized training for decentralized control (CTDC) paradigm. In this paradigm,
learning can be accomplished in a centralized manner while execution remains decentralized.
Using this insight, we establish the policy gradient theorem and compatible function approximations
for decentralized multi-agent systems. The resulting actor-critic methods preserve decentralized
control at the execution phase, while estimating the policy gradient from collective experience
guided by a centralized critic at the training phase. Experiments demonstrate that our policy gradient
methods compare favorably with standard RL techniques on benchmarks from the literature.
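To make the centralized-critic / decentralized-actor idea concrete, below is a minimal sketch in plain NumPy. It is not the thesis's actual algorithm: the tabular softmax actors, the joint action-value table, the synthetic reward, and the learning rates `alpha` and `beta` are all illustrative assumptions. Each actor conditions only on its own local observation (so execution stays decentralized), while the critic is indexed by the joint observation and joint action (so training is centralized).

```python
import numpy as np

rng = np.random.default_rng(0)

n_agents, n_obs, n_actions = 2, 3, 2

# Decentralized actors: one softmax policy table per agent,
# conditioned only on that agent's local observation.
theta = [rng.normal(scale=0.1, size=(n_obs, n_actions)) for _ in range(n_agents)]

# Centralized critic: a joint action-value table Q(o_joint, a_joint),
# indexed by the tuple of all observations followed by all actions.
Q = np.zeros((n_obs,) * n_agents + (n_actions,) * n_agents)

def policy(theta_i, o_i):
    """Softmax over agent i's actions given its local observation."""
    logits = theta_i[o_i]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def sample_joint_action(obs):
    """Each agent acts independently from its own observation."""
    return tuple(rng.choice(n_actions, p=policy(theta[i], obs[i]))
                 for i in range(n_agents))

def actor_gradients(obs, acts, q_value):
    """Per-agent estimate: grad log pi_i(a_i | o_i) * Q(o_joint, a_joint)."""
    grads = []
    for i in range(n_agents):
        p = policy(theta[i], obs[i])
        g = np.zeros_like(theta[i])
        g[obs[i]] = -p              # softmax log-likelihood gradient ...
        g[obs[i], acts[i]] += 1.0   # ... for the chosen action
        grads.append(g * q_value)
    return grads

# One illustrative update with a synthetic reward (stands in for an environment step).
obs = tuple(rng.integers(n_obs) for _ in range(n_agents))
acts = sample_joint_action(obs)
reward = 1.0                        # placeholder return
alpha, beta = 0.1, 0.5              # assumed actor / critic learning rates

# Centralized critic update (a plain Monte-Carlo target, for brevity).
Q[obs + acts] += beta * (reward - Q[obs + acts])

# Decentralized actor updates guided by the centralized critic.
for i, g in enumerate(actor_gradients(obs, acts, Q[obs + acts])):
    theta[i] += alpha * g
```

The design choice the sketch highlights is the asymmetry of information: the critic sees everything and is discarded after training, whereas each actor only ever receives its own observation and can therefore be executed independently.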