What we learned from 5 million books Erez Lieberman Aiden and Jean-Baptiste Michel

[Music]

everyone knows that a picture is worth a

thousand words but we at Harvard were

wondering if this was really true so we

assembled a team of experts spanning

Harvard MIT the American Heritage

Dictionary the Encyclopedia Britannica

and even our proud sponsors the Google

and we cogitated about this for about

four years and we came to a startling

conclusion ladies and gentlemen a

picture is not worth a thousand words in

fact we found some pictures that are

worth 500 billion words so how did we

get to this conclusion

so Erez and I were thinking about ways to

get a big picture of human culture and

human history change over time so many

books actually have been written

over the years so we were thinking well

the best way to learn from them is to

read all of these millions of books now

of course if there’s a scale for how

awesome that is that has to rank

extremely extremely high now the problem

is there’s an x axis for that which is

the practical axis this is very very low

now people tend to use an

alternative approach which is to take a

few sources and read them very carefully

this is extremely practical but not too

awesome what you really want to do

is to get to the

awesome yet practical part of this space

so it turns out there’s a company across

the river called Google who has started

a digitization project a few years

back that might just enable this

approach they have digitized millions of

books so what that means is one could

use computational methods to read all

the books at the click of a button

that’s very practical and extremely

awesome

let me tell you a little bit about where

books come from since time immemorial

there have been authors these authors

have been striving to write books and

this became considerably easier with the

development of the printing press some

centuries ago since then the authors

have won on one hundred twenty nine

million distinct occasions publishing

books now if those books are not lost to

history then they are somewhere in a

library and many of those books have

been getting retrieved from the

libraries and digitized by Google which

has scanned 15 million books to date now

when Google digitizes a book they put it

into a really nice format now we’ve got

the data plus we have metadata we have

information about things like where was

it published who’s the author when was

it published and what we do is go

through all of those records and exclude

everything that’s not the highest

quality data what we’re left with is a

collection of five million books five

hundred billion words a string of

characters a thousand times longer than

the human genome a text which when

written out would stretch from here to

the moon and back ten times over a

veritable shard of our cultural genome

of course what we did when faced with

such outrageous hyperbole was what any

self-respecting researchers would have

done

we took a page out of xkcd and we said

stand back we’re going to try science

now of course we’re thinking well let’s

just first put the data out there for

people to do science to it now we’re

thinking what data can we release well of

course you want to take the books and

release the full text of these five

million books now Google and Jon

Orwant in particular told us a little

equation that we should learn so you have

five million books that is five million

authors and five million plaintiffs is

a massive lawsuit so although that would

be really really awesome again that’s

extremely extremely impractical now

again we kind of caved in and we did the

very practical approach which was a bit less

awesome we said well instead of

releasing the full text we’re going to

release statistics about the books so

we’re going to take for instance a

glimmer of happiness it’s four words we

call it a four-gram we’re going to tell

you how many times that particular

four-gram appeared in books in 1801 1802

1803 all the way up to 2008 that gives

us a time series of how frequently this

particular sentence was used over time

we do that for all the words and phrases

that appear in those books that gives us

a big table of two billion lines that

tell us about the way culture has been

changing so those two billion lines we

call them two billion n-grams
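
The counting described here can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used on the Google Books data: the function names ngrams and yearly_counts and the three-book toy corpus are assumptions invented for the example, and tokenization is simplified to splitting on whitespace.

```python
from collections import Counter, defaultdict


def ngrams(text, n):
    """Yield every run of n consecutive words in text, lowercased."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])


def yearly_counts(books, n):
    """Map each n-gram to a Counter of {year: occurrences}.

    books is an iterable of (year, text) pairs.
    """
    table = defaultdict(Counter)
    for year, text in books:
        for gram in ngrams(text, n):
            table[gram][year] += 1
    return table


# Toy corpus, made up for illustration only.
books = [
    (1801, "she felt a glimmer of happiness at last"),
    (1802, "a glimmer of happiness and a glimmer of happiness again"),
    (1803, "no such phrase appears here"),
]

table = yearly_counts(books, 4)

# One row of the table: the time series for a single four-gram.
series = [table["a glimmer of happiness"][year] for year in (1801, 1802, 1803)]
print(series)
```

Each key of table plays the role of one row in the big table from the talk, and reading the counts off year by year gives the time series for that phrase. Turning counts into frequencies would also need the total number of n-grams per year, which this sketch leaves out.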

what do they tell us well the individual

n-grams measure cultural trends let me

give you an example let’s suppose that I

am thriving then tomorrow I want to tell

you about how well I did and so I might

say yesterday I throve alternatively I

could say yesterday I thrived well which

one should I use hmmm how to know well

as of about six months ago the state

of the art in this field is that you

would for instance go up to the following

psychologist with fabulous hair and

you’d say

Steve you’re an expert on the irregular

verbs what should I do and he’d tell you

well most people say thrived but some

people say throve and you also knew more

or less that if you were to go back in

time 200 years and ask the following

statesman with equally fabulous hair Tom

what should I say he’d say well in my

day most people throve but some thrived

so now what I’m just going to show you

is raw data two rows from this table of

two billion entries

what you’re seeing is year by year

frequency of thrived and throve over

time now this is just two out of two

billion rows so the entire data set is a

billion times more awesome than this

slide now there are many other pictures

that are worth 500 billion words for

instance this one if you just type in

influenza you will see peaks at the time

where you knew big flu epidemics

were actually killing millions of people

around the globe if you were not yet

convinced sea levels are rising so is

atmospheric CO2 and global temperature

you might also want to have a look at

this particular n-gram and that’s to tell

Nietzsche that God is not dead although

you might agree that he might need a

better publicist you can get some

pretty abstract concepts with this sort

of thing for instance let me tell you

the history of the year 1950 pretty much

for the vast majority of history no one

gave a damn about 1950 in 1700 and 1800

and 1900 no one cared

through the 30s and 40s no one cared

suddenly in the mid 40s

there started to be a buzz people

realized that 1950 was going to happen

and it could be big but nothing got

people interested in 1950 like the year

1950 people were walking around obsessed

they couldn’t stop talking about all the

things they did in 1950 all the things

they were planning to do in 1950 all the

dreams of what they wanted to accomplish

in 1950 in fact 1950 was so fascinating

that for years thereafter people just

kept talking about all the amazing

things that happened in 51 52 53 finally

in 1954 someone woke up and realized

that 1950 had gotten somewhat passe and

just like that the bubble burst and the

story of 1950 is the story of every year

that we have on record with a little

twist because now we’ve got these nice

charts and because we have these nice

charts we can measure things we can say

wow how fast does the bubble burst and

it turns out that we can measure that

very precisely equations were derived

graphs were produced and the net result

is that we find that the bubble bursts

faster and faster with each passing year

we are losing interest in the past more

rapidly now a little piece of career

advice so for those of you who seek to

be famous you can learn from the

25 most famous political figures

authors actors and so on so if you

wanna become famous early on you should

be an actor because then fame starts

rising by the end of your 20s you’re

still young it’s really great now if you

can wait a little bit

you should be an author because then you

rise to very great heights like Mark

Twain for instance extremely famous but

if you want to reach the very top you

should delay gratification and of course

become a politician right so here you

will become famous by the end of your

50s and become very very famous

afterwards so scientists also tend to

get famous when they’re much

older

like for instance biologists and physicists

can be almost as famous as actors one

mistake you should not do is become a

mathematician

if you if you do that you might think oh

great I’m going to do my best work when

I’m in my 20s but guess what nobody will

really care

there are more sobering notes among the

n-grams for instance here’s the

trajectory of Marc Chagall an artist

born in 1887 and this looks like the

normal trajectory of a famous person he

gets more and more and more and more

famous except if you look in German if you

look in German you see something

completely bizarre something you pretty

much never see which is he becomes

extremely famous and then all of a

sudden plummets going through a nadir

between 1933 and 1945 before rebounding

afterwards and of course what we’re

seeing is the fact that Marc Chagall was

a Jewish artist in Nazi Germany now

these signals are actually so strong

that we don’t need to know that someone

was censored we can actually figure it

out using really basic signal processing

here’s a simple way to do it well a

reasonable expectation is that

somebody’s fame in a given period of

time should be roughly the average of

their fame before and their fame after

so that’s sort of what we expect now we

compare that to the fame that we observe

and we just divide one by the other to

produce something we call the

suppression index if the suppression

index is very very very small then you

very well might be being suppressed if

it’s very large

maybe you’re benefiting from propaganda
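
The ratio described here can be written out as a short Python sketch. It is an illustration under stated assumptions, not the authors' actual signal-processing code: the function name suppression_index, the window parameter (how many years before and after the period to average over) and the toy fame values are all invented for the example. The index is computed as observed fame divided by expected fame, so that a small value suggests suppression, as in the talk.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def suppression_index(fame_by_year, start, end, window):
    """Observed fame in [start, end] divided by expected fame.

    Expected fame is the average of the mean fame in the `window` years
    before the period and the mean fame in the `window` years after it.
    fame_by_year maps year -> frequency of mentions (any consistent unit).
    """
    during = [f for y, f in fame_by_year.items() if start <= y <= end]
    before = [f for y, f in fame_by_year.items() if start - window <= y < start]
    after = [f for y, f in fame_by_year.items() if end < y <= end + window]
    if not (during and before and after):
        raise ValueError("need data before, during and after the period")
    expected = (mean(before) + mean(after)) / 2
    return mean(during) / expected


# Toy trajectory, made up for illustration: fame collapses in 1933-1945.
fame = {year: 4.0 for year in range(1923, 1933)}
fame.update({year: 0.5 for year in range(1933, 1946)})
fame.update({year: 6.0 for year in range(1946, 1956)})

index = suppression_index(fame, 1933, 1945, window=10)
print(round(index, 3))
```

An index near one means observed fame matches what the surrounding years predict. Values far below one are candidates for suppression and values far above one are candidates for propaganda.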

now you can actually look at the

distribution of suppression indices over

whole populations so for instance here

the suppression indices for 5000 people picked

in the English books where there’s no

known suppression would be like this

basically tightly centered around one

what you expect is basically what you

observe this is the distribution seen in

Nazi Germany it’s very different it’s

shifted to the left

people were talked about twice less than they

should have been but much more

importantly the distribution is much

wider there are many people who end up

on the far left of this distribution who

are talked about ten times less than

they should have been but then also many

people on the far right who seem to

benefit from propaganda this picture

here is the hallmark of censorship in

the book record so culturomics is what

we call this method it’s kind of like

genomics except genomics is kind of a lens

on biology through the window of the

sequence of bases in the human genome

culturomics is similar it’s the

application of massive scale data

collection analysis to the study of

human culture here instead of through the

lens of a genome through the lens of

digitized pieces of the historical

record the great thing about culturomics

is that everyone can do it why can

everyone do it everyone can do it

because three guys Jon Orwant Matt

Gray and Will Brockman over at Google

saw the prototype of the Ngram Viewer

and they said this is so fun we have to

make this available for people and so in two

weeks flat the two weeks before our

paper came out they coded up a version of

the Ngram Viewer for the general public

and so you too can type in any word or

phrase that you’re interested in and see

its n-gram immediately and also

browse examples of all the various books

in which your n-gram appears now this

was used over a million times in the

first day and this is really the best of

all the queries so people want to

be their best put their best foot

forward but it turns out in the 18th

century people didn’t really care

about that at all they didn’t want to be

their best they wanted to be their beft so

what happened is of course this is just a

mistake right it’s not that they strove

for mediocrity it’s just that the S

used to be written differently kind

of like an F now of course OCR

didn’t pick this up at the time so

we reported this in the science

article that we wrote but it turns out

that it should just stand as a reminder

that although this is a lot of fun when

you interpret these graphs you have to

be very careful you have to adopt the

best standards in the sciences people have

been using this for all kinds of fun

purposes

actually we’re not going to have to

talk we’ll just show you all the slides

and remain silent this person was

interested in the history of frustration

there’s various types of

frustration if you stub your toe that’s a

one-A argh if the planet Earth is

annihilated by the Vogons to make room

for an interstellar bypass that’s an eight-A

argh

this person studied all the arghs from one

through eight A’s and it turns out that the less

frequent arghs are of course the ones

that correspond to things that are more

frustrating except oddly in the early

80s we think that might have something

to do with Reagan all right the bottom

line is ok there are many usages of this

data but the bottom line is that the

historical record is being digitized

Google has started to digitize 15 million

books that’s 12 percent of all the books that have

ever been published it’s a sizable

chunk of human culture

there’s much more in culture there’s

manuscripts there’s newspapers there’s

things that are not text like art and

paintings these will all happen to be on our

computers on computers across the world

and when that happens that will

transform the way we have to understand

our past our present and human culture

thank you very much

[Applause]
