Facebook's artificial intelligence team has made what it describes as a
superhuman poker champion, a bot with the ability to beat
world-leading human pros. The company is hailing its AI bot, named Pluribus,
as a major breakthrough: the first capable of beating as many as six players
in a game that involves what is
known as hidden information - the other players' hole cards.
In recent years there have been great strides in artificial intelligence
(AI), with games often serving as challenge problems, benchmarks, and
milestones for progress. Poker has served for decades as such a challenge
problem. Past successes in such benchmarks, including poker, have been limited
to two-player games. However, poker in particular is traditionally played with
more than two players. Multiplayer games present fundamental additional issues
beyond those in two-player games, and multiplayer poker is a recognized AI
milestone. The past two decades have witnessed rapid progress in the
ability of AI systems to play increasingly complex forms of poker, but all prior
breakthroughs have been limited to settings involving only two players. Here,
however, we have Pluribus, an AI capable of defeating elite human professionals
in six-player no-limit Texas hold'em poker, the most commonly played poker
format in the world.
The strength of Pluribus's strategy was
computed via self-play, in which it played against copies of itself,
without any data of human or prior AI play used as input. Pluribus started from
scratch by playing randomly, and gradually improved as it determined which
actions, and which probability distribution over those actions, led to better
outcomes against earlier versions of its strategy.
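The excerpt does not spell out the update rule, but the self-play loop described above (maintain a probability distribution over actions, shift weight toward actions that would have done better against earlier play) can be illustrated with regret matching, the core update in the counterfactual-regret-minimization family of algorithms that poker bots of this kind build on. The sketch below is a toy in plain Python, not Facebook's code: two copies of the same learner play rock-paper-scissors against each other, starting from uniform random play, and their time-averaged strategies drift toward the game's equilibrium.

```python
import random

ACTIONS = 3  # 0 = rock, 1 = paper, 2 = scissors

def payoff(a, b):
    # +1 if action a beats action b, -1 if it loses, 0 on a tie
    return [0, 1, -1][(a - b) % 3]

def strategy(regret):
    # Regret matching: play positive-regret actions in proportion
    # to their accumulated regret; with no positive regret, play uniformly.
    pos = [max(r, 0.0) for r in regret]
    total = sum(pos)
    if total > 0:
        return [p / total for p in pos]
    return [1.0 / ACTIONS] * ACTIONS

def train(iterations=20000):
    regret = [[0.0] * ACTIONS for _ in range(2)]
    strat_sum = [[0.0] * ACTIONS for _ in range(2)]
    for _ in range(iterations):
        strats = [strategy(regret[p]) for p in range(2)]
        acts = [random.choices(range(ACTIONS), weights=strats[p])[0]
                for p in range(2)]
        for p in range(2):
            opp = acts[1 - p]
            for a in range(ACTIONS):
                # regret: how much better alternative a would have done
                regret[p][a] += payoff(a, opp) - payoff(acts[p], opp)
                strat_sum[p][a] += strats[p][a]
    # The time-averaged strategy, not the final one, is what converges.
    return [[s / iterations for s in strat_sum[p]] for p in range(2)]

avg = train()
print([round(p, 2) for p in avg[0]])  # each entry near 0.33
```

In rock-paper-scissors the equilibrium is uniform play; Pluribus runs a far more elaborate version of this idea over the vastly larger decision tree of no-limit hold'em.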
One of the
remarkable things about creating the blueprint strategy for Pluribus was that
it was calculated in a mere 8 days on a 64-core server for a total of 12,400
CPU core hours. It required less than 512 GB of memory. At standard cloud
computing rates that would cost about $100 to produce. This is in sharp
contrast to all the other recent superhuman AI milestones for games, which used
large numbers of servers and/or farms of GPUs or supercomputers at great
expense. For instance, efforts from Google's AI research shop DeepMind
have relied on supercomputers consisting of more than 5,000 specialist
processors, at a reported cost of millions of dollars.
The computing power necessary for AI experiments is seen as a key hurdle to the
technology's development, with the demand for computing currently growing
faster than processors are becoming more efficient.
The Facebook team's research also makes humbling reading for any poker players
proud of their ability to spot a tell.
"We think of bluffing as this very human trait," explained Noam Brown, the lead
researcher from Facebook's AI team, speaking to BBC News.
"But what we see is that bluffing is actually mathematical behaviour.
"When the bot bluffs, it doesn't view it as deceptive or
dishonest, it's just the way to make the most money."
Brown went on to say that neither he nor Facebook had any plans to use the AI
in real poker games. Indeed, the firm has said it is not publicly disclosing
much of the code for fear of having a negative impact on the poker community.
It would, a spokesman said, provide examples of techniques to other researchers
working on AI.
Beyond poker, Mr Brown would not be drawn on what
practical use Facebook might have in mind for the technology.
Pluribus played against elite human professionals in two formats: five human
professionals playing with one copy of Pluribus (5H+1AI), and one human
professional playing with five copies of Pluribus (1H+5AI). Performance was
measured using the standard metric in this field of AI, milli big blinds per
game (mbb/game). This measures how many big blinds (the initial money the
second player must put into the pot) were won on average per thousand hands of
poker.
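As a quick illustration of the unit (a hypothetical helper, not code from the study): winning a total of 480 big blinds over 10,000 hands works out to 48 mbb/game.

```python
def mbb_per_game(big_blinds_won, hands_played):
    """Milli big blinds per game: thousandths of a big blind won per hand,
    i.e. big blinds won per 1,000 hands."""
    return 1000.0 * big_blinds_won / hands_played

print(mbb_per_game(480, 10000))  # 48.0
```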
The human participants in the 5H+1AI experiment were Jimmy Chou,
Seth Davies, Michael Gagliano, Anthony Gregg, Dong Kim, Jason Les, Linus
Loeliger, Daniel McAulay, Greg Merson, Nicholas Petrangelo, Sean Ruane, Trevor
Savage, and Jacob Toole. In this experiment, 10,000 hands of poker were played
over 12 days. Each day, five volunteers from the pool of professionals were
selected to participate based on availability. The participants were not told
who else was participating in the experiment. Instead, each participant was
assigned an alias that remained constant throughout the experiment. The alias
of each player in each game was known, so that players could track the
tendencies of each player throughout the experiment. $50,000 was divided among
the human participants based on their performance to incentivize them to play
their best. Each player was guaranteed a minimum of $0.40 per hand for
participating, but this could increase to as much as $1.60 per hand based on
performance.
After applying a variance-reduction technique (AIVAT),
Pluribus won an average of 48 mbb/game (with a standard error of 25 mbb/game).
This is considered a very high win rate in six-player no-limit Texas
hold'em poker, especially against a collection of elite professionals, and
implies that Pluribus is stronger than the human opponents. A top pro would
expect to win 50 mbb/game against average players (a game is a full lap of the
button, so that every player posts the big and small blinds).
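The win rate and standard error together are enough to gauge statistical significance. As a back-of-the-envelope check (an illustrative calculation assuming an approximately normal estimate, not a reproduction of the paper's own analysis), a one-sided test of the null hypothesis "true win rate is at most zero" gives:

```python
import math

def one_sided_p(win_rate, std_err):
    """One-sided p-value against 'true win rate <= 0', assuming the
    estimated win rate is approximately normally distributed."""
    z = win_rate / std_err  # number of standard errors above zero
    return 0.5 * math.erfc(z / math.sqrt(2))

# 48 mbb/game with a standard error of 25 mbb/game
p = one_sided_p(48, 25)
print(round(p, 3))  # about 0.027, significant at the 5% level
```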
The human participants
in the 1H+5AI experiment were Chris "Jesus" Ferguson and Darren
Elias. Each of the two humans separately played 5,000 hands of poker against
five copies of Pluribus. Pluribus does not adapt its strategy to its opponents
and does not know the identity of its opponents, so the copies of Pluribus
could not intentionally collude against the human player. To incentivize strong
play, we offered each human $2,000 for participation and an additional $2,000
if he performed better against the AI than the other human player did. The
players did not know who the other participant was and were not told how the
other human was performing during the experiment. For the 10,000 hands played,
Pluribus beat the humans by an average of 32 mbb/game.
Because Pluribus's strategy
was determined entirely from self-play without any human data, it also provides
an outside perspective on what optimal play should look like in multiplayer
no-limit Texas hold'em. Pluribus confirms the conventional human wisdom
that "limping" (calling the big blind rather than folding or raising)
is suboptimal for any player except the small blind player, who
already has half the big blind in the pot by the rules, and thus has to invest
only half as much as the other players to call. While Pluribus initially
experimented with limping when computing its blueprint strategy offline through
self-play, it gradually discarded this action from its strategy as self-play
continued. However, Pluribus disagrees with the folk wisdom that "donk
betting" (starting a round by betting when one ended the previous betting
round with a call) is a mistake; Pluribus does this far more often than
professional humans do.
Forms of self-play
combined with forms of search have led to a number of high-profile successes in
perfect-information two-player zero-sum games. However, most real-world
strategic interactions involve hidden information and more than two players.
This makes the problem very different and significantly more difficult both
theoretically and practically. Developing a superhuman AI for multiplayer poker
was a widely recognized milestone in this area and the major remaining
milestone in computer poker. In this paper we described Pluribus, an AI capable
of defeating elite human professionals in six-player no-limit Texas
hold'em poker, the most commonly played poker format in the world.
Pluribus's success shows that despite the lack of known strong theoretical
guarantees on performance in multiplayer games, there are large-scale, complex
multiplayer imperfect-information settings in which a carefully constructed
self-play-with-search algorithm can produce superhuman strategies.