What we learned from 5 million books Erez Lieberman Aiden and Jean-Baptiste Michel

[Music]

everyone knows that a picture is worth a

thousand words but we at Harvard were

wondering if this was really true so we

assembled a team of experts spanning

Harvard MIT the American Heritage

Dictionary the Encyclopedia Britannica

and even our proud sponsors the Google

and we cogitated about this for about

four years and we came to a startling

conclusion ladies and gentlemen a

picture is not worth a thousand words in

fact we found some pictures that are

worth 500 billion words so how did we

get to this conclusion

so Erez and I were thinking about ways to

get a big picture of human culture and

human history change over time so many

books actually have been written

over the years so we were thinking well

the best way to learn from them is to

read all of these millions of books now

of course if there’s a scale for how

awesome that is that has to rank

extremely extremely high now the problem

is there’s an x axis for that which is

the practical axis this is very very low

now people tend to use an

alternative approach which is to take a

few sources and read them very carefully

this is extremely practical but not too

awesome what you really want to do

is to get to the

awesome yet practical part of this space

so it turns out there’s a company across

the river called Google who had started

a digitization project a few years

back that might just enable this

approach they have digitized millions of

books so what that means is one could

use computational methods to read all

the books in the click of a button

that’s very practical and extremely

awesome

let me tell you a little bit about where

books come from since time immemorial

there have been authors these authors

have been striving to write books and

this became considerably easier with the

development of the printing press some

centuries ago since then the authors

have won on 129

million distinct occasions publishing

books now if those books are not lost to

history then they are somewhere in a

library and many of those books have

been getting retrieved from the

libraries and digitized by Google which

has scanned 15 million books to date now

when Google digitizes a book they put

it into a really nice format now we’ve got

the data plus we have metadata we have

information about things like where was

it published who’s the author when was

it published and what we do is go

through all of those records and exclude

everything that’s not the highest

quality data what we’re left with is a

collection of five million books five

hundred billion words a string of

characters a thousand times longer than

the human genome a text which when

written out would stretch from here to

the moon and back ten times over a

veritable shard of our cultural genome

of course what we did when faced with

such outrageous hyperbole was what any

self-respecting researchers would have

done

we took a page out of xkcd and we said

stand back we’re going to try science

now of course we’re thinking well let’s

just first put the data out there for

people to do science to it now we’re

thinking what data can we release well of

course you want to take the books and

release the full text of these five

million books now Google and Jon

Orwant in particular told us a little

equation that we should learn so you have

five million books that is five million

authors and five million plaintiffs is

a massive lawsuit so although that would

be really really awesome again that’s extremely

extremely impractical now

again we kind of caved in and we did the

very practical approach which was a bit less

awesome we said well instead of

releasing the full text we’re going to

release statistics about the books so

we’re going to take for instance a

glimmer of happiness it’s four words we

call it a four-gram we’re going to tell

you how many times that particular

four-gram appeared in books in 1801 1802

1803 all the way up to 2008 that gives

us a time series of how frequently this

particular sentence was used over time

we do that for all the words and phrases

that appear in those books that gives us

a big table of two billion lines that

tell us about the way culture has been

changing so those two billion lines we

call them two billion n-grams
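The table of n-gram counts described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline the speakers used: the function names `ngrams` and `ngram_table`, the lowercase whitespace tokenization, and the three-book toy corpus are all hypothetical.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a space-joined string."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])


def ngram_table(books, max_n=4):
    """Map each n-gram (n = 1..max_n) to a {year: count} time series.

    `books` is an iterable of (year, text) pairs. Tokenization is a plain
    lowercase whitespace split, which is an assumption made for brevity.
    """
    table = defaultdict(Counter)
    for year, text in books:
        tokens = text.lower().split()
        for n in range(1, max_n + 1):
            for gram in ngrams(tokens, n):
                table[gram][year] += 1
    return table


# Hypothetical toy corpus: (publication year, text).
toy_books = [
    (1801, "yesterday I throve and today I thrive"),
    (1801, "he throve on hard work"),
    (2008, "yesterday I thrived and he thrived too"),
]

table = ngram_table(toy_books)
print(dict(table["throve"]))       # {1801: 2}
print(dict(table["thrived"]))      # {2008: 2}
print(dict(table["yesterday i"]))  # {1801: 1, 2008: 1}
```

Each key of `table` corresponds to one line of the big table in the talk, and its per-year counts are the time series for that word or phrase.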

what do they tell us well the individual

n-grams measure cultural trends let me

give you an example let’s suppose that I

am thriving then tomorrow I want to tell

you about how well I did and so I might

say yesterday I throve alternatively I

could say yesterday I thrived well which

one should I use hmmm how to know well

as of about six months ago the state

of the art in this field is that you

would for instance go up to the following

psychologist with fabulous hair and

you’d say

Steve you’re an expert on the irregular

verbs what should I do and he’d tell you

well most people say thrived but some

people say throve and you also knew more

or less that if you were to go back in

time 200 years and ask the following

statesman with equally fabulous hair Tom

what should I say he’d say well in my

day most people throve but some thrived

so now what I’m just going to show you

is raw data two rows from this table of

two billion entries

what you’re seeing is year by year

frequency of thrived and throve over

time now this is just two out of two

billion rows so the entire data set is a

billion times more awesome than this

slide now there are many other pictures

that are worth 500 billion words for

instance this one if you just type in

influenza you will see peaks at the time

where you knew big flu epidemics

were actually killing millions of people

around the globe if you were not yet

convinced sea levels are rising so is

atmospheric CO2 and global temperature

you might also want to have a look at

this particular n-gram and that’s to tell

Nietzsche that God is not dead although

you might agree that he might need a

better publicist you can get at some

pretty abstract concepts with this sort

of thing for instance let me tell you

the history of the year 1950 pretty much

for the vast majority of history no one

gave a damn about 1950 in 1700 and 1800

and 1900 no one cared

through the 30s and 40s no one cared

suddenly in the mid 40s

there started to be a buzz people

realized that 1950 was going to happen

and it could be big but nothing got

people interested in 1950 like the year

1950 people were walking around obsessed

they couldn’t stop talking about all the

things they did in 1950 all the things

they were planning to do in 1950 all the

dreams of what they wanted to accomplish

in 1950 in fact 1950 was so fascinating

that for years thereafter people just

kept talking about all the amazing

things that happened in ’51 ’52 ’53 finally

in 1954 someone woke up and realized

that 1950 had gotten somewhat passé and

just like that the bubble burst and the

story of 1950 is the story of every year

that we have on record with a little

twist because now we’ve got these nice

charts and because we have these nice

charts we can measure things we can say

well how fast does the bubble burst and

it turns out that we can measure that

very precisely equations were derived

graphs were produced and the net result

is that we find that the bubble bursts

faster and faster with each passing year

we are losing interest in the past more

rapidly now a little piece of career

advice so for those of you who seek to

be famous we can learn from the

25 most famous political figures

authors actors and so on so if you

want to become famous early on you should

be an actor because then fame starts

rising by the end of your 20s you’re

still young it’s really great now if you

can wait a little bit

you should be an author because then you

rise to very great heights like Mark

Twain for instance extremely famous but

if you want to reach the very top you

should delay gratification and of course

become a politician right so here you

will become famous by the end of your

50s and become very very famous

afterwards so scientists also tend to

get famous when they’re much much

older

like for instance biologists and physicists

can be almost as famous as actors one

mistake you should not do is become a

mathematician

if you do that you might think oh

great I’m going to do my best work when I’m

in my 20s but guess what nobody will

really care

there are more sobering notes among the

n-grams for instance here’s the

trajectory of Marc Chagall an artist

born in 1887 and this looks like the

normal trajectory of a famous person he

gets more and more and more and more

famous except if you look in German if you

look in German you see something

completely bizarre something you pretty

much never see which is he becomes

extremely famous and then all of a

sudden plummets going through a nadir

between 1933 and 1945 before rebounding

afterwards and of course what we’re

seeing is the fact that Marc Chagall was

a Jewish artist in Nazi Germany now

these signals are actually so strong

that we don’t need to know that someone

was censored we can actually figure it

out using really basic signal processing

here’s a simple way to do it well a

reasonable expectation is that

somebody’s fame in a given period of

time should be roughly the average of

their fame before and their fame after

so that’s sort of what we expect now we

compare that to the fame that we observe

and we just divide one by the other to

produce something we call the

suppression index if the suppression

index is very very very small then you

very well might be being suppressed if

it’s very large

maybe you’re benefiting from propaganda
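The suppression index just described can be written out as a short Python sketch. It follows the talk's recipe (expected fame is the average of fame before and fame after, and the index is observed fame divided by expected fame), but the function name, the ten-year comparison window, and the toy fame values are assumptions made for illustration, not details given by the speakers.

```python
from statistics import mean


def suppression_index(fame_by_year, start, end, window=10):
    """Observed fame in [start, end] divided by expected fame.

    Expected fame is the average of the mean fame in the `window` years
    before `start` and the mean fame in the `window` years after `end`.
    The window length is an assumption; the talk does not specify one.
    """
    def avg(first, last):
        values = [fame_by_year[y] for y in range(first, last + 1)
                  if y in fame_by_year]
        if not values:
            raise ValueError(f"no data between {first} and {last}")
        return mean(values)

    before = avg(start - window, start - 1)
    after = avg(end + 1, end + window)
    expected = (before + after) / 2
    if expected == 0:
        raise ValueError("expected fame is zero; index is undefined")
    return avg(start, end) / expected


# Hypothetical fame series (arbitrary units), shaped like a name that
# drops sharply between 1933 and 1945 and rebounds afterwards.
fame = {}
fame.update({year: 4.0 for year in range(1923, 1933)})
fame.update({year: 0.5 for year in range(1933, 1946)})
fame.update({year: 6.0 for year in range(1946, 1956)})

index = suppression_index(fame, 1933, 1945)
print(round(index, 3))  # 0.1, far below 1: consistent with suppression
```

A series whose fame stays level across the period gives an index of 1, which is the "what you expect is what you observe" case.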

now you can actually look at the

distribution of suppression indices over

whole populations so for instance here

the suppression indices for 5000 people picked

in English books where there’s no

known suppression would be like this

basically tightly centered around one

what you expect is basically what you

observe this is the distribution as seen in

Nazi Germany it’s very different it’s

shifted to the left

people are talked about half as much as they

should have been but much more

importantly the distribution is much

wider there are many people who end up

on the far left of this distribution who

are talked about ten times less than

they should have been but then also many

people on the far right who seem to

benefit from propaganda this picture

here is the hallmark of censorship in

the book record so culturomics is what

we call this method it’s kind of like

genomics except genomics is a lens

on biology through the window of the

sequence of bases in the human genome

culturomics is similar it’s the

application of massive-scale data

collection and analysis to the study of

human culture here instead of through the

lens of a genome through the lens of

digitized pieces of the historical

record the great thing about

culturomics is that everyone can do it why can

everyone do it everyone can do it

because three guys Jon Orwant Matt

Gray and Will Brockman over at Google

saw the prototype of the Ngram Viewer

and they said this is so fun we have to

make this available for people and so in two

weeks flat the two weeks before our

paper came out they coded up a version of

the Ngram Viewer for the general public

and so you too can type in any word or

phrase that you’re interested in and see

its n-gram immediately and also

browse examples of all the various books

in which your n-gram appears now this

was used over a million times on the

first day and this is really the best of

all the queries right so people want to

be their best put their best foot

forward but it turns out in the 18th

century people didn’t really care

about that at all they didn’t want to be

their best they wanted to be their beft so

what happened is of course this is just a

mistake right it’s not that they strove

for mediocrity it’s just that the s

used to be written differently kind

of like an f now of course the OCR

didn’t pick this up at the time so

we reported this in the Science

article that we wrote but it turns out

that it should just stand as a reminder

that although this is a lot of fun when

you interpret these graphs you have to

be very careful you have to adopt the

best standards in the sciences people have

been using this for all kinds of fun

purposes

actually we’re not going to have to

talk we’ll just show you all the slides

and remain silent this person was

interested in the history of frustration

there’s various types of

frustration if you stub your toe that’s a

one-A argh if the planet Earth is

annihilated by the Vogons to make room

for an interstellar bypass that’s an eight-A

argh

this person studied all the arghs from one

through eight A’s and it turns out that the less

frequent arghs are of course the ones

that correspond to things that are more

frustrating except oddly in the early

80s we think that might have something

to do with Reagan all right the bottom

line is ok there are many usages of this

data but the bottom line is that the

historical record is being digitized

Google has started to digitize 15 million

books that’s 12% of all the books that have

ever been published it’s a pretty sizable

chunk of human culture

there’s much more in culture there’s

manuscripts there’s newspapers there’s

things that are not text like art and

paintings these will all happen to be on our

computers on computers across the world

and when that happens that will

transform the way we have to understand

our past our present and human culture

thank you very much

[Applause]