How to get better at video games, according to babies
Brian Christian

In 2013, a group of researchers
at DeepMind in London

set their sights on a grand challenge.

They wanted to create an AI system
that could beat,

not just a single Atari game,
but every Atari game.

They developed a system they called
Deep Q Networks, or DQN,

and less than two years later,
it was superhuman.

DQN was getting scores 13 times better

than professional human games testers
at “Breakout,”

17 times better at “Boxing,”
and 25 times better at “Video Pinball.”

But there was one notable, and glaring,
exception.

When playing “Montezuma’s Revenge,”
DQN couldn’t score a single point,

even after playing for weeks.

What was it that made this particular game
so vexingly difficult for AI?

And what would it take to solve it?

Spoiler alert: babies.

We’ll come back to that in a minute.

Playing Atari games with AI involves
what’s called reinforcement learning,

where the system is designed to maximize
some kind of numerical reward.

In this case, those rewards were
simply the game’s points.

This underlying goal drives the system
to learn which buttons to press

and when to press them
to get the most points.
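
In code, that loop is simple to sketch. Below is a minimal Python outline of the idea; the env and agent objects are hypothetical placeholders, not DeepMind's actual implementation. The only reward the agent ever sees is the change in the game's score.

def play_episode(env, agent):
    # The agent watches the screen, presses buttons, and is told only
    # how many points each press earned. Nothing else guides it.
    observation = env.reset()                  # first frame of the game
    total_score = 0
    done = False
    while not done:
        action = agent.choose_action(observation)            # which button to press
        next_observation, reward, done = env.step(action)    # reward = points gained
        agent.learn(observation, action, reward, next_observation, done)
        total_score += reward
        observation = next_observation
    return total_score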

Some systems use model-based approaches,
where they have a model of the environment

that they can use to predict
what will happen next

once they take a certain action.

DQN, however, is model-free.

Instead of explicitly modeling
its environment,

it just learns to predict,
based on the images on screen,

how many future points it can expect
to earn by pressing different buttons.

For instance, “if the ball is here
and I move left, more points,

but if I move right, no more points.”
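
That is the idea behind a Q-network: a function that maps the current screen to one predicted "future points" number per button. The sketch below is a simplified, hypothetical version in PyTorch; DQN's real network and training procedure have more pieces (experience replay, a target network), but the input-to-output shape is the same.

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    # Takes a stack of 4 recent screen frames (84x84 pixels each) and outputs
    # one number per button: the predicted future points for pressing it now.
    def __init__(self, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256), nn.ReLU(),
            nn.Linear(256, num_actions),
        )

    def forward(self, frames):                 # frames: (batch, 4, 84, 84)
        return self.net(frames)

# Acting greedily: press whichever button is predicted to earn the most points.
# q_values = QNetwork(num_actions=18)(frames)
# best_button = q_values.argmax(dim=1)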

But learning these connections requires
a lot of trial and error.

The DQN system would start
by mashing buttons randomly,

and then slowly piece together
which buttons to mash when

in order to maximize its score.
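
A standard way to organize that trial and error is an epsilon-greedy rule: early on, the system almost always presses a random button, and as training goes on it increasingly presses whichever button its own predictions rate highest. A sketch, with the schedule numbers chosen purely for illustration:

import random

def choose_button(q_values, step, num_actions,
                  start_eps=1.0, end_eps=0.1, decay_steps=1_000_000):
    # Epsilon starts at 1.0 (pure random mashing) and decays toward 0.1,
    # so the system slowly shifts from exploring to exploiting what it has learned.
    epsilon = max(end_eps,
                  start_eps - (start_eps - end_eps) * step / decay_steps)
    if random.random() < epsilon:
        return random.randrange(num_actions)    # explore: press a random button
    return int(q_values.argmax())               # exploit: press the best-rated button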

But in playing “Montezuma’s Revenge,”

this approach of random button-mashing
fell flat on its face.

A player would have to perform
a long, exact sequence of moves

just to score their first points
at the very end.

A mistake? Game over.

So how could DQN even know
it was on the right track?

This is where babies come in.

In studies, infants consistently look
longer at pictures

they haven’t seen before
than ones they have.

There just seems to be something
intrinsically rewarding about novelty.

This behavior has been essential
in understanding the infant mind.

It also turned out to be the secret
to beating “Montezuma’s Revenge.”

The DeepMind researchers worked
out an ingenious way

to plug this preference for novelty
into reinforcement learning.

They made it so that unusual or new images
appearing on the screen

were every bit as rewarding
as real in-game points.
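
One simple way to wire that in is a count-based novelty bonus: keep track of how often each screen has been seen, and pay out a bonus that shrinks as a screen becomes familiar, added straight onto the game's points. This is a hedged sketch of the general idea, not DeepMind's exact density model:

import math
from collections import Counter

class NoveltyBonus:
    # Screens the system has rarely seen pay a large bonus; familiar ones pay
    # almost nothing. The hash function and scale here are illustrative choices.
    def __init__(self, scale=1.0):
        self.counts = Counter()
        self.scale = scale

    def __call__(self, screen_hash):
        self.counts[screen_hash] += 1
        return self.scale / math.sqrt(self.counts[screen_hash])

# During training, the system then maximizes the combined signal:
# total_reward = game_points + novelty_bonus(hash(screen.tobytes()))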

Suddenly, DQN was behaving totally
differently from before.

It wanted to explore the room it was in,

to grab the key and escape
through the locked door—

not because it was worth 100 points,

but for the same reason we would:
to see what was on the other side.

With this new drive, DQN not only
managed to grab that first key—

it explored all the way through 15
of the temple’s 24 chambers.

But emphasizing novelty-based rewards
can sometimes create more problems

than it solves.

A novelty-seeking system that’s played
a game too long

will eventually lose motivation.

If it’s seen it all before,
why go anywhere?

Alternatively, if it encounters, say,
a television, it will freeze.

The constant novel images
are essentially paralyzing.

The ideas and inspiration here
go in both directions.

AI researchers stuck
on a practical problem,

like how to get DQN to beat
a difficult game,

are turning increasingly to experts
in human intelligence for ideas.

At the same time,

AI is giving us new insights
into the ways we get stuck and unstuck:

into boredom, depression, and addiction,

along with curiosity, creativity,
and play.
