Abe Davis: New video technology that reveals an object's hidden properties

Most of us think of motion
as a very visual thing.

If I walk across this stage
or gesture with my hands while I speak,

that motion is something that you can see.

But there’s a world of important motion
that’s too subtle for the human eye,

and over the past few years,

we’ve started to find that cameras

can often see this motion
even when humans can’t.

So let me show you what I mean.

On the left here, you see video
of a person’s wrist,

and on the right, you see video
of a sleeping infant,

but if I didn’t tell you
that these were videos,

you might assume that you were looking
at two regular images,

because in both cases,

these videos appear to be
almost completely still.

But there’s actually a lot
of subtle motion going on here,

and if you were to touch
the wrist on the left,

you would feel a pulse,

and if you were to hold
the infant on the right,

you would feel the rise
and fall of her chest

as she took each breath.

And these motions carry
a lot of significance,

but they’re usually
too subtle for us to see,

so instead, we have to observe them

through direct contact, through touch.

But a few years ago,

my colleagues at MIT developed
what they call a motion microscope,

which is software that finds
these subtle motions in video

and amplifies them so that they
become large enough for us to see.
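
To make the idea concrete, here is a minimal sketch of this kind of motion amplification in Python. It simply band-pass filters each pixel's intensity over time and adds the amplified result back; the actual motion microscope uses more sophisticated, phase-based processing, so treat this as an illustration of the principle rather than their method.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def amplify_motion(frames, fs, lo_hz, hi_hz, alpha=20.0):
    """Eulerian-style motion amplification (simplified sketch).

    frames : (T, H, W) float array of grayscale video
    fs     : frame rate in Hz
    lo_hz, hi_hz : temporal band containing the motion of interest
                   (e.g. around 1 Hz for a pulse)
    alpha  : amplification factor
    """
    # Band-pass filter every pixel's intensity signal over time.
    b, a = butter(2, [lo_hz, hi_hz], btype="band", fs=fs)
    subtle = filtfilt(b, a, frames, axis=0)

    # Add the amplified subtle variation back into the video.
    return frames + alpha * subtle
```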

And so, if we use their software
on the left video,

it lets us see the pulse in this wrist,

and if we were to count that pulse,

we could even figure out
this person’s heart rate.

And if we used the same software
on the right video,

it lets us see each breath
that this infant takes,

and we can use this as a contact-free way
to monitor her breathing.

And so this technology is really powerful
because it takes these phenomena

that we normally have
to experience through touch

and it lets us capture them visually
and non-invasively.

So a couple years ago, I started working
with the folks that created that software,

and we decided to pursue a crazy idea.

We thought, it’s cool
that we can use software

to visualize tiny motions like this,

and you can almost think of it
as a way to extend our sense of touch.

But what if we could do the same thing
with our ability to hear?

What if we could use video
to capture the vibrations of sound,

which are just another kind of motion,

and turn everything that we see
into a microphone?

Now, this is a bit of a strange idea,

so let me try to put it
in perspective for you.

Traditional microphones
work by converting the motion

of an internal diaphragm
into an electrical signal,

and that diaphragm is designed
to move readily with sound

so that its motion can be recorded
and interpreted as audio.

But sound causes all objects to vibrate.

Those vibrations are just usually
too subtle and too fast for us to see.

So what if we record them
with a high-speed camera

and then use software
to extract tiny motions

from our high-speed video,

and analyze those motions to figure out
what sounds created them?

This would let us turn visible objects
into visual microphones from a distance.
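
A minimal sketch of that pipeline, reduced to a single global motion signal: for each frame, fit the tiny sub-pixel shift relative to a reference frame by least squares, and treat the resulting sequence, sampled at the camera's frame rate, as audio. The published visual microphone combines local phase signals across many scales and orientations; this one-shift version only shows the shape of the computation.

```python
import numpy as np

def visual_microphone(frames, axis=1):
    """Recover a 1-D sound signal from high-speed video (sketch).

    frames : (T, H, W) float array; the camera's frame rate
             becomes the audio sample rate.

    For a tiny shift d along `axis`, frame ~= ref + d * dref/dx,
    so the least-squares estimate is
        d = sum(Ix * (frame - ref)) / sum(Ix**2).
    """
    ref = frames[0]
    grad = np.gradient(ref, axis=axis)   # spatial derivative Ix
    denom = np.sum(grad * grad)
    signal = np.array(
        [np.sum(grad * (f - ref)) / denom for f in frames]
    )
    # Remove the DC offset; what remains is played back as audio.
    return signal - signal.mean()
```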

And so we tried this out,

and here’s one of our experiments,

where we took this potted plant
that you see on the right

and we filmed it with a high-speed camera

while a nearby loudspeaker
played this sound.

(Music: “Mary Had a Little Lamb”)

And so here’s the video that we recorded,

and we recorded it at thousands
of frames per second,

but even if you look very closely,

all you’ll see are some leaves

that are pretty much
just sitting there doing nothing,

because our sound only moved those leaves
by about a micrometer.

That’s one ten-thousandth of a centimeter,

which spans somewhere between
a hundredth and a thousandth

of a pixel in this image.

So you can squint all you want,

but motion that small is pretty much
perceptually invisible.

But it turns out that something
can be perceptually invisible

and still be numerically significant,

because with the right algorithms,

we can take this silent,
seemingly still video

and we can recover this sound.

(Music: “Mary Had a Little Lamb”)

(Applause)

So how is this possible?

How can we get so much information
out of so little motion?

Well, let’s say that those leaves
move by just a single micrometer,

and let’s say that that shifts our image
by just a thousandth of a pixel.

That may not seem like much,

but a single frame of video

may have hundreds of thousands
of pixels in it,

and so if we combine all
of the tiny motions that we see

from across that entire image,

then suddenly a thousandth of a pixel

can start to add up
to something pretty significant.
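
A toy numerical example of why this works: each pixel gives a very noisy estimate of the same thousandth-of-a-pixel shift, and averaging N independent estimates shrinks the noise by roughly a factor of sqrt(N). The numbers below are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_shift = 0.001        # a thousandth of a pixel
n_pixels = 300_000        # measurements from a single frame
noise_std = 0.1           # each per-pixel estimate is 100x noisier

# Every pixel gives a noisy estimate of the same tiny shift.
estimates = true_shift + noise_std * rng.standard_normal(n_pixels)

# Averaging shrinks the noise by ~sqrt(300,000) ~= 550x,
# so the combined estimate resolves the true shift.
print(estimates.mean())   # ~0.001 +/- 0.0002
```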

On a personal note, we were pretty psyched
when we figured this out.

(Laughter)

But even with the right algorithm,

we were still missing
a pretty important piece of the puzzle.

You see, there are a lot of factors
that affect when and how well

this technique will work.

There’s the object and how far away it is;

there’s the camera
and the lens that you use;

how much light is shining on the object
and how loud your sound is.

And even with the right algorithm,

we had to be very careful
with our early experiments,

because if we got
any of these factors wrong,

there was no way to tell
what the problem was.

We would just get noise back.

And so a lot of our early
experiments looked like this.

And so here I am,

and on the bottom left, you can kind of
see our high-speed camera,

which is pointed at a bag of chips,

and the whole thing is lit
by these bright lamps.

And like I said, we had to be
very careful in these early experiments,

so this is how it went down.

(Video) Abe Davis: Three, two, one, go.

Mary had a little lamb!
Little lamb! Little lamb!

(Laughter)

AD: So this experiment
looks completely ridiculous.

(Laughter)

I mean, I’m screaming at a bag of chips –

(Laughter) –

and we’re blasting it with so much light,

we literally melted the first bag
we tried this on. (Laughter)

But ridiculous as this experiment looks,

it was actually really important,

because we were able
to recover this sound.

(Audio) Mary had a little lamb!
Little lamb! Little lamb!

(Applause)

AD: And this was really significant,

because it was the first time
we recovered intelligible human speech

from silent video of an object.

And so it gave us this point of reference,

and gradually we could start
to modify the experiment,

using different objects
or moving the object further away,

using less light or quieter sounds.

And we analyzed all of these experiments

until we really understood
the limits of our technique,

because once we understood those limits,

we could figure out how to push them.

And that led to experiments like this one,

where again, I’m going to speak
to a bag of chips,

but this time we’ve moved our camera
about 15 feet away,

outside, behind a soundproof window,

and the whole thing is lit
by only natural sunlight.

And so here’s the video that we captured.

And this is what things sounded like
from inside, next to the bag of chips.

(Audio) Mary had a little lamb
whose fleece was white as snow,

and everywhere that Mary went,
that lamb was sure to go.

AD: And here’s what we were able
to recover from our silent video

captured outside behind that window.

(Audio) Mary had a little lamb
whose fleece was white as snow,

and everywhere that Mary went,
that lamb was sure to go.

(Applause)

AD: And there are other ways
that we can push these limits as well.

So here’s a quieter experiment

where we filmed some earphones
plugged into a laptop computer,

and in this case, our goal was to recover
the music that was playing on that laptop

from just silent video

of these two little plastic earphones,

and we were able to do this so well

that I could even Shazam our results.

(Laughter)

(Music: “Under Pressure” by Queen)

(Applause)

And we can also push things
by changing the hardware that we use.

Because the experiments
I’ve shown you so far

were done with a camera,
a high-speed camera,

that can record video
about 100 times faster

than most cell phones,

but we’ve also found a way
to use this technique

with more regular cameras,

and we do that by taking advantage
of what’s called a rolling shutter.

You see, most cameras
record images one row at a time,

and so if an object moves
during the recording of a single image,

there’s a slight time delay
between each row,

and this causes slight artifacts

that get encoded into each frame of the video.

And so what we found
is that by analyzing these artifacts,

we can actually recover sound
using a modified version of our algorithm.
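
The key fact the modified algorithm exploits is that each row carries its own timestamp, so one frame contributes hundreds of time samples instead of one. Here is a sketch of that timing, assuming a known (calibrated) line delay between rows:

```python
def row_sample_times(n_frames, n_rows, frame_rate, line_delay):
    """Timestamps of each row in a rolling-shutter video (sketch).

    line_delay : time between the readout of consecutive rows,
                 a sensor property assumed calibrated here.

    Each frame contributes n_rows samples instead of one, which
    is what lets a regular camera capture audio-rate vibrations.
    """
    times = []
    for f in range(n_frames):
        frame_start = f / frame_rate
        for r in range(n_rows):
            times.append(frame_start + r * line_delay)
    return times
```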

So here’s an experiment we did

where we filmed a bag of candy

while a nearby loudspeaker played

the same “Mary Had a Little Lamb”
music from before,

but this time, we used just a regular
store-bought camera,

and so in a second, I’ll play for you
the sound that we recovered,

and it’s going to sound
distorted this time,

but listen and see if you can still
recognize the music.

(Audio: “Mary Had a Little Lamb”)

And so, again, that sounds distorted,

but what’s really amazing here
is that we were able to do this

with something
that you could literally run out

and pick up at a Best Buy.

So at this point,

a lot of people see this work,

and they immediately think
about surveillance.

And to be fair,

it’s not hard to imagine how you might use
this technology to spy on someone.

But keep in mind that there’s already
a lot of very mature technology

out there for surveillance.

In fact, people have been using lasers

to eavesdrop on objects
from a distance for decades.

But what’s really new here,

what’s really different,

is that now we have a way
to picture the vibrations of an object,

which gives us a new lens
through which to look at the world,

and we can use that lens

to learn not just about forces like sound
that cause an object to vibrate,

but also about the object itself.

And so I want to take a step back

and think about how that might change
the ways that we use video,

because we usually use video
to look at things,

and I’ve just shown you how we can use it

to listen to things.

But there’s another important way
that we learn about the world:

that’s by interacting with it.

We push and pull and poke and prod things.

We shake things and see what happens.

And that’s something that video
still won’t let us do,

at least not traditionally.

So I want to show you some new work,

and this is based on an idea I had
just a few months ago,

so this is actually the first time
I’ve shown it to a public audience.

And the basic idea is that we’re going
to use the vibrations in a video

to capture objects in a way
that will let us interact with them

and see how they react to us.

So here’s an object,

and in this case, it’s a wire figure
in the shape of a human,

and we’re going to film that object
with just a regular camera.

So there’s nothing special
about this camera.

In fact, I’ve actually done this
with my cell phone before.

But we do want to see the object vibrate,

so to make that happen,

we’re just going to bang a little bit
on the surface where it’s resting

while we record this video.

So that’s it: just five seconds
of regular video,

while we bang on this surface,

and we’re going to use
the vibrations in that video

to learn about the structural
and material properties of our object,

and we’re going to use that information
to create something new and interactive.

And so here’s what we’ve created.

And it looks like a regular image,

but this isn’t an image,
and it’s not a video,

because now I can take my mouse

and I can start interacting
with the object.

And so what you see here

is a simulation of how this object

would respond to new forces
that we’ve never seen before,

and we created it from just
five seconds of regular video.
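
One way to think about what is happening under the hood: the video's vibrations reveal the object's vibration modes, and each mode behaves like a damped harmonic oscillator, so a new force can be simulated by superposing the modes. Here is a minimal sketch along those lines; the mode shapes, frequencies, and dampings are assumed to have been recovered from the video already, and the actual system's details may differ.

```python
import numpy as np

def simulate_response(mode_shapes, freqs_hz, dampings, impulse, t):
    """Superpose damped harmonic modes to predict motion (sketch).

    mode_shapes : (M, H, W) per-pixel displacement pattern of each
                  mode (assumed recovered from the video)
    freqs_hz    : (M,) modal frequencies
    dampings    : (M,) damping ratios
    impulse     : (M,) how strongly a new force excites each mode
    t           : time in seconds

    Returns the predicted per-pixel displacement field at time t.
    """
    out = np.zeros_like(mode_shapes[0])
    for shape, f, zeta, q in zip(mode_shapes, freqs_hz,
                                 dampings, impulse):
        w = 2 * np.pi * f
        wd = w * np.sqrt(1 - zeta**2)   # damped natural frequency
        out += q * np.exp(-zeta * w * t) * np.sin(wd * t) * shape
    return out
```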

(Applause)

And so this is a really powerful
way to look at the world,

because it lets us predict
how objects will respond

to new situations,

and you could imagine, for instance,
looking at an old bridge

and wondering what would happen,
how would that bridge hold up

if I were to drive my car across it.

And that’s a question
that you probably want to answer

before you start driving
across that bridge.

And of course, there are going to be
limitations to this technique,

just like there were
with the visual microphone,

but we found that it works
in a lot of situations

that you might not expect,

especially if you give it longer videos.

So for example,
here’s a video that I captured

of a bush outside of my apartment,

and I didn’t do anything to this bush,

but by capturing a minute-long video,

a gentle breeze caused enough vibrations

that we could learn enough about this bush
to create this simulation.

(Applause)

And so you could imagine giving this
to a film director,

and letting him control, say,

the strength and direction of wind
in a shot after it’s been recorded.

Or, in this case, we pointed our camera
at a hanging curtain,

and you can’t even see
any motion in this video,

but by recording a two-minute-long video,

natural air currents in this room

created enough subtle,
imperceptible motions and vibrations

that we could learn enough
to create this simulation.

And ironically,

we’re kind of used to having
this kind of interactivity

when it comes to virtual objects,

when it comes to video games
and 3D models,

but to be able to capture this information
from real objects in the real world

using just simple, regular video,

is something new that has
a lot of potential.

So here are the amazing people
who worked with me on these projects.

(Applause)

And what I’ve shown you today
is only the beginning.

We’ve just started to scratch the surface

of what you can do
with this kind of imaging,

because it gives us a new way

to capture our surroundings
with common, accessible technology.

And so looking to the future,

it’s going to be
really exciting to explore

what this can tell us about the world.

Thank you.

(Applause)
