How we can store digital data in DNA | Dina Zielinski

I could fit all movies ever made
inside of this tube.

If you can’t see it,
that’s kind of the point.

(Laughter)

Before we understand how this is possible,

it’s important to understand
the value of this feat.

All of our thoughts
and actions these days,

through photos and videos –

even our fitness activities –

are stored as digital data.

Aside from running out of space

on our phones,

we rarely think about
our digital footprint.

But humanity has collectively
generated more data

in the last few years

than in all of preceding human history.

Big data has become a big problem.

Digital storage is really expensive,

and none of these devices that we have
really stand the test of time.

There’s this nonprofit website
called the Internet Archive.

In addition to free books and movies,

you can access web pages
as far back as 1996.

Now, this is very tempting,

but I decided to go back and look at
the TED website’s very humble beginnings.

As you can see, it’s changed
quite a bit in the last 30 years.

So this led me to the first-ever TED,

back in 1984,

and it just so happened
to be a Sony executive

explaining how a compact disc works.

(Laughter)

Now, it’s really incredible
to be able to go back in time

and access this moment.

It’s also really fascinating
that after 30 years, after that first TED,

we’re still talking about digital storage.

Now, if we look back another 30 years,

IBM released the first-ever hard drive

back in 1956.

Here it is being loaded for shipping
in front of a small audience.

It held the equivalent of one MP3 song

and weighed over one ton.

At 10,000 dollars a megabyte,

I don’t think anyone in this room
would be interested in buying this thing,

except maybe as a collector’s item.

But it’s the best we could do at the time.

We’ve come such a long way
in data storage.

Devices have evolved dramatically.

But all media eventually wear out
or become obsolete.

If someone handed you a floppy disk today
to back up your presentation,

you’d probably look at them
kind of strangely, maybe laugh,

but you’d have no way
to use the damn thing.

These devices can no longer meet
our storage needs,

although some of them can be repurposed.

All technology eventually dies or is lost,

along with our data,

all of our memories.

There’s this illusion that
the storage problem has been solved,

but really, we all just externalize it.

We don’t worry about storing
our emails and our photos.

They’re just in the cloud.

But behind the scenes,
storage is problematic.

After all, the cloud is just
a lot of hard drives.

Now, most digital data,
we could argue, is not really critical.

Surely, we could just delete it.

But how can we really know
what’s important today?

We’ve learned so much about human history

from drawings and writings in caves,

from stone tablets.

We’ve deciphered languages
from the Rosetta Stone.

You know, we’ll never really have
the whole story, though.

Our data is our story,

even more so today.

We won’t have our record
recorded on stone tablets.

But we don’t have to choose
what is important now.

There’s a way to store it all.

It turns out that there’s
a solution that’s been around

for a few billion years,

and it’s actually in this tube.

DNA is nature’s oldest storage device.

After all, it contains
all the information necessary

to build and maintain a human being.

But what makes DNA so great?

Well, let’s take our own genome

as an example.

If we were to print out
all three billion A’s, T’s, C’s and G’s

in a standard font, standard format,

and then we were
to stack all of those papers,

it would be about 130 meters high,

somewhere between the Statue of Liberty
and the Washington Monument.

Now, if we converted
all those A’s, T’s, C’s and G’s

to digital data, to zeroes and ones,

it would total a few gigs.

And that’s in each cell of our body.

We have more than 30 trillion cells.

You get the idea:

DNA can store a ton of information
in a minuscule space.
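
As a rough check on that figure, here is a back-of-the-envelope sketch in Python. The three billion letters come from the talk; the two encodings compared (one byte per letter as plain text, and a packed two bits per letter) are assumptions added for illustration, and the talk does not say which one "a few gigs" refers to.

```python
import math

BASES_IN_GENOME = 3_000_000_000  # figure quoted in the talk

# Four possible letters (A, T, C, G) need log2(4) = 2 bits each when packed.
PACKED_BITS_PER_BASE = int(math.log2(4))
# Assumption: plain text stores each letter as one 8-bit character.
TEXT_BITS_PER_BASE = 8


def genome_size_bytes(bases: int, bits_per_base: int) -> int:
    """Return the storage needed for `bases` letters, in bytes."""
    return bases * bits_per_base // 8


if __name__ == "__main__":
    for label, bits in (("plain text", TEXT_BITS_PER_BASE),
                        ("packed", PACKED_BITS_PER_BASE)):
        size_gb = genome_size_bytes(BASES_IN_GENOME, bits) / 1e9
        print(f"{label}: {size_gb:.2f} GB")
```

Under these assumptions the plain-text figure is 3 GB and the packed figure is 0.75 GB, so "a few gigs" is consistent with storing each letter as one character.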

DNA is also very durable,

and it doesn’t even require
electricity to store it.

We know this because scientists
have recovered DNA from ancient humans

that lived hundreds
of thousands of years ago.

One of those is Ötzi the Iceman.

Turns out, he’s Austrian.

(Laughter)

He was found high, well-preserved,

in the mountains
between Italy and Austria,

and it turns out that he has living
genetic relatives here in Austria today.

So one of you could be a cousin of Ötzi.

(Laughter)

The point is that we have a better chance
of recovering information

from an ancient human

than we do from an old phone.

It’s also much less likely
that we’ll lose the ability to read DNA

than any single man-made device.

Every single new storage format
requires a new way to read it.

We’ll always be able to read DNA.

If we can no longer sequence,
we have bigger problems

than worrying about data storage.

Storing data on DNA is not new.

Nature’s been doing it
for several billion years.

In fact, every living thing
is a DNA storage device.

But how do we store data on DNA?

This is Photo 51.

It’s the first-ever photo of DNA,

taken about 60 years ago.

This is around the time that
that same hard drive was released by IBM.

So really, our understanding of digital
storage and of DNA have coevolved.

We first learned to sequence, or read DNA,

and very soon after, how to write it,

or synthesize it.

This is much like how we learn
a new language.

And now we have the ability
to read, write and copy DNA.

We do it in the lab all the time.

So anything, really anything,
that can be stored as zeroes and ones

can be stored in DNA.

To store something digitally,
like this photo,

we convert it to bits, or binary digits.

Each pixel in a black-and-white photo
is simply a zero or a one.
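
To make the pixel-to-bit idea concrete, here is a minimal Python sketch. The tiny 4x4 image, the "#" and "." notation for black and white pixels, and the function names are all invented for illustration; real image formats add headers and compression on top of this.

```python
def image_to_bits(rows):
    """Turn rows of '#' (black) and '.' (white) pixels into a string of 1s and 0s."""
    return "".join("1" if pixel == "#" else "0" for row in rows for pixel in row)


def bits_to_bytes(bits):
    """Pack a bit string into bytes, padding the last byte with zeros if needed."""
    padded = bits + "0" * (-len(bits) % 8)
    return bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))


if __name__ == "__main__":
    image = ["#..#",
             ".##.",
             ".##.",
             "#..#"]
    bits = image_to_bits(image)
    print(bits)                       # 16 pixels become 16 bits
    print(bits_to_bytes(bits).hex())  # the same bits packed into 2 bytes
```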

And we can write DNA much like an inkjet
printer can print letters on a page.

We just have to convert our data,
all of those zeroes and ones,

to A’s, T’s, C’s and G’s,

and then we send this
to a synthesis company.

So we write it, we can store it,

and when we want to recover our data,
we just sequence it.
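
The conversion step can be sketched in a few lines of Python. The talk does not say which bits map to which letter, so the table below (00 to A, 01 to C, 10 to G, 11 to T) is an assumption, as are the function names. A practical scheme would also have to avoid sequences that are hard to synthesize or read, which this sketch ignores.

```python
BASES = "ACGT"  # assumed mapping: index 0b00 -> A, 0b01 -> C, 0b10 -> G, 0b11 -> T


def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as four bases, two bits per base, most significant bits first."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases)


def dna_to_bytes(dna: str) -> bytes:
    """Decode a base string produced by bytes_to_dna back into bytes."""
    if len(dna) % 4 != 0:
        raise ValueError("expected a multiple of four bases")
    out = bytearray()
    for i in range(0, len(dna), 4):
        byte = 0
        for base in dna[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


if __name__ == "__main__":
    message = b"TED"
    strand = bytes_to_dna(message)          # "write": the text a synthesis company would receive
    print(strand)
    assert dna_to_bytes(strand) == message  # "read": sequencing gives the letters back
```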

Now, the fun part of all of this
is deciding what files to include.

We’re serious scientists,
so we had to include a manuscript

for good posterity.

We also included a $50 Amazon gift card –

don’t get too excited, it’s already
been spent, someone decoded it –

as well as an operating system,

one of the first movies ever made

and a Pioneer plaque.

Some of you might have seen this.

It has a depiction of a typical –
apparently – male and female,

and our approximate location
in the Solar System,

in case the Pioneer spacecraft
ever encounters extraterrestrials.

So once we decided what sort of files
we wanted to encode,

we package up the data,

convert those zeroes and ones
to A’s, T’s, C’s and G’s,

and then we just send this file off
to a synthesis company.

And this is what we got back.

Our files were in this tube.

All we had to do was sequence it.

This all sounds pretty straightforward,

but the difference between
a really cool, fun idea

and something we can actually use

is overcoming these practical challenges.

Now, while DNA is more robust
than any man-made device,

it’s not perfect.

It does have some weaknesses.

We recover our message
by sequencing the DNA,

and every time data is retrieved,

we lose the DNA.

That’s just part
of the sequencing process.

We don’t want to run out of data,

but luckily, there’s a way to copy the DNA

that’s even cheaper and easier
than synthesizing it.

We actually tested a way to make
200 trillion copies of our files,

and we recovered
all the data without error.

So sequencing also introduces
errors into our DNA,

into the A’s, T’s, C’s and G’s.

Nature has a way
to deal with this in our cells.

But our data is stored
in synthetic DNA in a tube,

so we had to find our own way
to overcome this problem.

We decided to use an algorithm
that was used to stream videos.

When you’re streaming a video,

you’re essentially trying to recover
the original video, the original file.

When we’re trying to recover
our original files,

we’re simply sequencing.

But really, both of these processes are
about recovering enough zeroes and ones

to put our data back together.

And so, because of our coding strategy,

we were able to package up all of our data

in a way that allowed us to make
millions and trillions of copies

and still always recover
all of our files.
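
The talk does not name the streaming algorithm. The Python sketch below assumes a fountain-code style scheme, one family of codes used for streaming: each "droplet" is the XOR of a pseudo-random subset of file chunks, identified only by a seed, and the receiver rebuilds the file from whichever droplets arrive. The degree choices, chunk size, loss pattern, and all function names are invented for illustration; this is a toy, not the authors' implementation.

```python
import random

DEGREES = [1, 1, 2, 2, 3, 4]  # toy choice of how many chunks to mix (assumption)


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def split(data: bytes, size: int) -> list:
    """Cut data into equal chunks, zero-padding the last one."""
    padded = data + bytes(-len(data) % size)
    return [padded[i:i + size] for i in range(0, len(padded), size)]


def droplet_indices(seed: int, n_chunks: int) -> list:
    """Sender and receiver both derive the chunk subset from the seed alone."""
    rng = random.Random(seed)
    degree = min(rng.choice(DEGREES), n_chunks)
    return rng.sample(range(n_chunks), degree)


def make_droplet(chunks: list, seed: int) -> tuple:
    payload = bytes(len(chunks[0]))
    for i in droplet_indices(seed, len(chunks)):
        payload = xor(payload, chunks[i])
    return seed, payload


def decode(droplets: list, n_chunks: int):
    """Peeling decoder: solve droplets that have exactly one unknown chunk, repeat."""
    known = {}
    progress = True
    while progress and len(known) < n_chunks:
        progress = False
        for seed, payload in droplets:
            indices = droplet_indices(seed, n_chunks)
            unknown = [i for i in indices if i not in known]
            if len(unknown) == 1:
                for i in indices:
                    if i in known:
                        payload = xor(payload, known[i])
                known[unknown[0]] = payload
                progress = True
    if len(known) < n_chunks:
        return None  # not enough droplets yet
    return [known[i] for i in range(n_chunks)]


def simulate(data: bytes, chunk_size: int = 4, max_seeds: int = 10_000) -> tuple:
    """Send droplets, lose every third one, stop as soon as the data decodes."""
    chunks = split(data, chunk_size)
    received = []
    for seed in range(max_seeds):
        droplet = make_droplet(chunks, seed)
        if seed % 3 == 0:
            continue  # pretend this droplet was lost or unreadable
        received.append(droplet)
        result = decode(received, len(chunks))
        if result is not None:
            return b"".join(result)[:len(data)], len(received)
    raise RuntimeError("data not recovered within max_seeds droplets")


if __name__ == "__main__":
    original = b"Store everything in DNA, forever"
    recovered, used = simulate(original)
    print(recovered == original, "droplets used:", used)
```

The point of the design is that no particular droplet matters: any sufficiently large collection of them is enough, which is why losing some along the way does not prevent full recovery.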

This is the movie we encoded.

It’s one of the first movies ever made,

and now the first to be copied
more than 200 trillion times on DNA.

Soon after our work was published,

we participated in an “Ask Me Anything”
on the website reddit.

If you’re a fellow nerd,
you’re very familiar with this website.

Most questions were thoughtful.

Some were comical.

For example, one user wanted to know
when we would have a literal thumb drive.

Now, the thing is,

our DNA already stores everything
needed to make us who we are.

It’s a lot safer to store data

in synthetic DNA in a tube.

Writing and reading data from DNA
is obviously a lot more time-consuming

than just saving all your files
on a hard drive –

for now.

So initially, we should focus
on long-term storage.

Most data are ephemeral.

It’s really hard to grasp
what’s important today,

or what will be important
for future generations.

But the point is,
we don’t have to decide today.

There’s this great program by UNESCO
called the “Memory of the World” program.

It’s been created to preserve
historical materials

that are considered of value
to all of humanity.

Items are nominated
to be added to the collection,

including that film that we encoded.

While a wonderful way
to preserve human heritage,

it doesn’t have to be a choice.

Instead of asking
the current generation – us –

what might be important in the future,

we could store everything in DNA.

Storage is not just about how many bytes

but how well we can actually
store the data and recover it.

There’s always been this tension
between how much data we can generate

and how much we can recover

and how much we can store.

Every advance in writing data
has required a new way to read it.

We can no longer read old media.

How many of you even have
a disk drive in your laptop,

never mind a floppy drive?

This will never be the case with DNA.

As long as we’re around, DNA is around,

and we’ll find a way to sequence it.

Archiving the world around us
is part of human nature.

This is the progress we’ve made
in digital storage in 60 years,

at a time when we were only
beginning to understand DNA.

Yet, we’ve made similar progress
in half that time with DNA sequencers,

and as long as we’re around,
DNA will never be obsolete.

Thank you.

(Applause)
