How computers learn to recognize objects instantly

Joseph Redmon

Ten years ago,

computer vision researchers
thought that getting a computer

to tell the difference
between a cat and a dog

would be almost impossible,

even with significant advances
in the state of artificial intelligence.

Now we can do it at a level
greater than 99 percent accuracy.

This is called image classification –

give it an image,
and it puts a label on that image –

and computers know
thousands of other categories as well.
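
To make the idea concrete, here is a minimal sketch of what classification means: the model assigns a score to every category, and the label is whichever category scores highest. The `LABELS` list and the toy scores below are hypothetical stand-ins, not Darknet's API.

```python
import numpy as np

# Hypothetical label set; a real classifier covers thousands of categories.
LABELS = ["malamute", "siberian husky", "tabby cat", "golden retriever"]

def classify(scores: np.ndarray, top_k: int = 3):
    """Return the top-k (label, probability) pairs from raw class scores."""
    probs = np.exp(scores - scores.max())  # softmax, numerically stable
    probs /= probs.sum()
    best = np.argsort(probs)[::-1][:top_k]
    return [(LABELS[i], float(probs[i])) for i in best]

# Toy scores standing in for a real network's output on a dog photo.
print(classify(np.array([4.1, 2.3, 0.2, 1.0])))
```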

I’m a graduate student
at the University of Washington,

and I work on a project called Darknet,

which is a neural network framework

for training and testing
computer vision models.

So let’s just see what Darknet thinks

of this image that we have.

When we run our classifier

on this image,

we see we don’t just get
a prediction of dog or cat,

we actually get
specific breed predictions.

That’s the level
of granularity we have now.

And it’s correct.

My dog is in fact a malamute.

So we’ve made amazing strides
in image classification,

but what happens
when we run our classifier

on an image that looks like this?

Well …

We see that the classifier comes back
with a pretty similar prediction.

And it’s correct,
there is a malamute in the image,

but just given this label,
we don’t actually know that much

about what’s going on in the image.

We need something more powerful.

I work on a problem
called object detection,

where we look at an image
and try to find all of the objects,

put bounding boxes around them

and say what those objects are.
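
Concretely, a detector's output can be pictured as a list of labeled, scored boxes. The `Detection` type and the relative-coordinate convention below are illustrative assumptions, not Darknet's actual data structures.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str         # what the object is
    confidence: float  # how sure the detector is, in [0, 1]
    x: float           # box center, as a fraction of image width
    y: float           # box center, as a fraction of image height
    w: float           # box width, as a fraction of image width
    h: float           # box height, as a fraction of image height

# A plausible result for the cat-and-dog image described above.
detections = [
    Detection("dog", 0.94, 0.40, 0.60, 0.35, 0.50),
    Detection("cat", 0.88, 0.70, 0.55, 0.20, 0.30),
    Detection("book", 0.61, 0.85, 0.20, 0.10, 0.15),
]
```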

So here’s what happens
when we run a detector on this image.

Now, with this kind of result,

we can do a lot more
with our computer vision algorithms.

We see that it knows
that there’s a cat and a dog.

It knows their relative locations,

their size.

It may even know some extra information.

There’s a book sitting in the background.

And if you want to build a system
on top of computer vision,

say a self-driving vehicle
or a robotic system,

this is the kind
of information that you want.

You want something so that
you can interact with the physical world.

Now, when I started working
on object detection,

it took 20 seconds
to process a single image.

And to get a feel for why
speed is so important in this domain,

here’s an example of an object detector

that takes two seconds
to process an image.

So this is 10 times faster

than the 20-seconds-per-image detector,

and you can see that by the time
it makes predictions,

the entire state of the world has changed,

and this wouldn’t be very useful

for an application.

If we speed this up
by another factor of 10,

this is a detector running
at five frames per second.

This is a lot better,

but for example,

if there’s any significant movement,

I wouldn’t want a system
like this driving my car.

This is our detection system
running in real time on my laptop.

So it smoothly tracks me
as I move around the frame,

and it’s robust to a wide variety
of changes in size,

pose,

forward, backward.

This is great.

This is what we really need

if we’re going to build systems
on top of computer vision.

(Applause)

So in just a few years,

we’ve gone from 20 seconds per image

to 20 milliseconds per image,
a thousand times faster.
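
Frame rate is just the reciprocal of per-image latency, so the progression described here works out as follows:

```python
# Each per-image latency from the talk, converted to frames per second.
for latency_s in [20.0, 2.0, 0.2, 0.02]:
    print(f"{latency_s * 1000:>8.0f} ms/image -> {1 / latency_s:>5.2f} fps")
# 20 s/image is 0.05 fps; 20 ms/image is 50 fps: a thousandfold speedup.
```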

How did we get there?

Well, in the past,
object detection systems

would take an image like this

and split it into a bunch of regions

and then run a classifier
on each of these regions,

and high scores for that classifier

would be considered
detections in the image.

But this involved running a classifier
thousands of times over an image,

thousands of neural network evaluations
to produce detections.
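
A sketch of that older region-based approach: slide a window over the image and run the classifier on every crop. `classify_region` below is a toy stand-in for one full neural-network evaluation; the point is only how fast the evaluation count grows.

```python
import numpy as np

def classify_region(crop: np.ndarray):
    """Toy stand-in for one full neural-network evaluation on a crop."""
    return "dog", float(crop.mean()) / 255.0  # not a real model

def detect_by_regions(image, window=64, stride=32, threshold=0.9):
    detections, evaluations = [], 0
    height, width = image.shape[:2]
    for y in range(0, height - window, stride):
        for x in range(0, width - window, stride):
            label, score = classify_region(image[y:y + window, x:x + window])
            evaluations += 1
            if score > threshold:
                detections.append((label, score, x, y, window, window))
    return detections, evaluations

# A 480 x 640 image at this one window size already needs ~230 evaluations;
# finer strides and multiple window sizes push this into the thousands.
_, n = detect_by_regions(np.zeros((480, 640)))
print(n, "network evaluations for one image")
```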

Instead, we trained a single network
to do all of the detection for us.

It produces all of the bounding boxes
and class probabilities simultaneously.

With our system, instead of looking
at an image thousands of times

to produce detections,

you only look once,

and that’s why we call it
the YOLO method of object detection.
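
A hedged sketch of the decoding step, following the output layout described in the YOLOv1 paper: one forward pass emits an S x S grid where each cell predicts B boxes (x, y, w, h, confidence) plus C class probabilities, so every box and every class score falls out of a single network evaluation. This is a simplified illustration, not Darknet's actual decode code.

```python
import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes, as in YOLOv1

def decode(output: np.ndarray, threshold: float = 0.2):
    """Turn one S x S x (B*5 + C) network output into scored boxes."""
    detections = []
    for row in range(S):
        for col in range(S):
            cell = output[row, col]
            class_probs = cell[B * 5:]  # shared across the cell's boxes
            for b in range(B):
                x, y, w, h, conf = cell[b * 5:b * 5 + 5]
                cls = int(np.argmax(class_probs))
                score = conf * class_probs[cls]  # box confidence x class prob
                if score > threshold:
                    detections.append((cls, float(score), x, y, w, h))
    return detections

# A random tensor standing in for a real forward pass.
print(len(decode(np.random.rand(S, S, B * 5 + C))), "boxes above threshold")
```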

So with this speed,
we’re not just limited to images;

we can process video in real time.

And now, instead of just seeing
that cat and dog,

we can see them move around
and interact with each other.

This is a detector that we trained

on 80 different classes

in Microsoft’s COCO dataset.

It has all sorts of things
like spoon and fork, bowl,

common objects like that.

It has a variety of more exotic things:

animals, cars, zebras, giraffes.

And now we’re going to do something fun.

We’re just going to go
out into the audience

and see what kind of things we can detect.

Does anyone want a stuffed animal?

There are some teddy bears out there.

And we can turn down
our threshold for detection a little bit,

so we can find more of you guys
out in the audience.
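
The threshold itself is just a cutoff on each box's confidence, so lowering it trades false alarms for extra finds. A minimal sketch with toy values:

```python
# Toy detections as (label, confidence) pairs; lowering the cutoff keeps
# lower-confidence boxes, so more objects (and more mistakes) appear.
detections = [("person", 0.91), ("person", 0.48), ("backpack", 0.37),
              ("teddy bear", 0.29), ("stop sign", 0.12)]

def keep(dets, threshold):
    return [(label, conf) for label, conf in dets if conf >= threshold]

print(keep(detections, 0.5))   # strict: only the confident detection
print(keep(detections, 0.25))  # relaxed: backpacks and teddy bears appear
```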

Let’s see if we can get these stop signs.

We find some backpacks.

Let’s just zoom in a little bit.

And this is great.

And all of the processing
is happening in real time

on the laptop.

And it’s important to remember

that this is a general-purpose
object detection system,

so we can train this for any image domain.

The same code that we use

to find stop signs or pedestrians,

bicycles in a self-driving vehicle,

can be used to find cancer cells

in a tissue biopsy.

And there are researchers around the globe
already using this technology

for advances in fields
like medicine and robotics.

This morning, I read a paper

where they were taking a census
of animals in Nairobi National Park

with YOLO as part
of their detection system.

And that’s because Darknet is open source

and in the public domain,
free for anyone to use.

(Applause)

But we wanted to make detection
even more accessible and usable,

so through a combination
of model optimization,

network binarization and approximation,

we actually have object detection
running on a phone.

(Applause)
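
As one concrete illustration of the binarization idea mentioned above (in the style of XNOR-Net, not necessarily the exact scheme used on the phone): approximate a float weight tensor W as alpha * sign(W), with alpha the mean absolute weight, so each weight costs one bit plus a shared scale.

```python
import numpy as np

def binarize(weights: np.ndarray):
    """Approximate weights as alpha * sign(weights), XNOR-Net style."""
    alpha = float(np.abs(weights).mean())  # per-tensor scaling factor
    binary = np.sign(weights)              # each weight becomes +1 or -1
    return alpha, binary

w = np.random.randn(3, 3)
alpha, b = binarize(w)
print("mean approximation error:", float(np.abs(w - alpha * b).mean()))
```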

And I’m really excited because
now we have a pretty powerful solution

to this low-level computer vision problem,

and anyone can take it
and build something with it.

So now the rest is up to all of you

and people around the world
with access to this software,

and I can’t wait to see what people
will build with this technology.

Thank you.

(Applause)
