What we learned from 5 million books Erez Lieberman Aiden and Jean-Baptiste Michel
[Music]
everyone knows that a picture is worth a
thousand words but we at Harvard were
wondering if this was really true so we
assembled a team of experts spanning
Harvard MIT the American Heritage
Dictionary the Encyclopedia Britannica
and even our proud sponsors the Google
and we cogitated about this for about
four years and we came to a startling
conclusion ladies and gentlemen a
picture is not worth a thousand words in
fact we found some pictures that are
worth 500 billion words so how did we
get to this conclusion
so Erez and I were thinking about ways to
get a big picture of human culture and
human history change over time so many
books actually have been written
over the years so we were thinking well
the best way to learn from them is to
read all of these millions of books now
of course if there’s a scale for how
awesome that is that has to rank
extremely extremely high now the problem
is there’s an x axis for that which is
the practical axis this is very very low
now people tend to use an
alternative approach which is to take a
few sources and read them very carefully
this is extremely practical but not too
awesome what you really want to do what
you really want to do is to get to the
awesome yet practical part of this space
so it turns out there’s a company across
the river called Google who has started
a digitization project a few years
back that might just enable this
approach they have digitized millions of
books so what that means is one could
use computational methods to read all
the books in the click of a button
that’s very practical and extremely
awesome
let me tell you a little bit about where
books come from since time immemorial
there have been authors these authors
have been striving to write books and
this became considerably easier with the
development of the printing press some
centuries ago since then the authors
have won on one hundred twenty nine
million distinct occasions publishing
books now if those books are not lost to
history then they are somewhere in a
library and many of those books have
been getting retrieved from the
libraries and digitized by Google which
has scanned 15 million books to date now
when Google digitizes a book they put
it into a really nice format now we’ve got
the data plus we have metadata we have
information about things like where was
it published who’s the author when was
it published and what we do is go
through all of those records and exclude
everything that’s not the highest
quality data what we’re left with is a
collection of five million books five
hundred billion words a string of
characters a thousand times longer than
the human genome a text which when
written out would stretch from here to
the moon and back ten times over a
veritable shard of our cultural genome
of course what we did when faced with
such outrageous hyperbole was what any
self-respecting researchers would have
done
we took a page out of xkcd and we said
stand back we’re going to try science
now of course we were thinking well let’s
just first put the data out there for
people to do science to it now we’re
thinking what data can we release well of
course you want to take the books and
release the full text of these five
million books now Google and Jon
Orwant in particular told us a little
equation that we should learn so you have
five million books that is five million
authors and five million plaintiffs is
a massive lawsuit so although that would
be really really awesome again that’s
extremely extremely impractical now
again we kind of caved in and we did the
very practical approach which was a bit less
awesome we said well instead of
releasing the full text we’re going to
release statistics about the books so
we’re going to take for instance a
glimmer of happiness it’s four words we
call it a four gram we’re going to tell
you how many times that particular four
gram appeared in books in 1801 1802
1803 all the way up to 2008 that gives
us a time series of how frequently this
particular sentence was used over time
we do that for all the words and phrases
that appear in those books that gives us
a big table of two billion lines that
tell us about the way culture has been
changing so those two billion lines we
call them two billion n-grams
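A minimal sketch of how a table like this could be built, assuming a toy corpus of (year, text) pairs. The helper names `ngrams` and `yearly_counts` and the sample sentences are invented for illustration; this is not the pipeline that was run on the Google corpus, and it reports raw counts rather than frequencies normalized by how many words were printed each year.

```python
from collections import Counter, defaultdict


def ngrams(text, n):
    """Yield every run of n consecutive whitespace-separated words."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])


def yearly_counts(books, n):
    """Map each n-gram to a Counter of {year: occurrences}.

    `books` is an iterable of (year, text) pairs (a hypothetical input format).
    """
    table = defaultdict(Counter)
    for year, text in books:
        for gram in ngrams(text, n):
            table[gram][year] += 1
    return table


# Toy corpus, made up for this example.
books = [
    (1801, "a glimmer of happiness crossed her face"),
    (1802, "he felt a glimmer of happiness and then a glimmer of happiness again"),
    (1803, "no such phrase appears in this book"),
]

table = yearly_counts(books, 4)

# One row of the table: the time series for a single four gram.
series = [table["a glimmer of happiness"][year] for year in (1801, 1802, 1803)]
print(series)  # [1, 2, 0]
```

Each key of `table` plays the role of one line in the big table described above, and looking up a year that never saw the phrase simply gives zero.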
what do they tell us well the individual
n-grams measure cultural trends let me
give you an example let’s suppose that I
am thriving then tomorrow I want to tell
you about how well I did and so I might
say yesterday I throve alternatively I
could say yesterday I thrived well which
one should I use how to know well
as of about six months ago the state
of the art in this field is that you
would for instance go up to the following
psychologist with fabulous hair and
you’d say
Steve you’re an expert on the irregular
verbs what should I do and he’d tell you
well most people say thrived but some
people say throve and you also knew more
or less that if you were to go back in
time 200 years and ask the following
statesman with equally fabulous hair Tom
what should I say he’d say well in my
day most people throve but some thrived
so now what I’m just going to show you
is raw data two rows from this table of
two billion entries
what you’re seeing is year by year
frequency of thrived and throve over
time now this is just two out of two
billion rows so the entire data set is a
billion times more awesome than this
slide now there are many other pictures
that are worth 500 billion words for
instance this one if you just type in
influenza you will see peaks at the time
where you knew big flu epidemics
were actually killing millions of people
around the globe if you were not yet
convinced sea levels are rising so is
atmospheric CO2 and global temperature
you might also want to have a look at
this particular n-gram and that’s to tell
Nietzsche that God is not dead although
you might agree that he might need a
better publicist you can get at some
pretty abstract concepts with this sort
of thing for instance let me tell you
the history of the year 1950 pretty much
for the vast majority of history no one
gave a damn about 1950 in 1700 and 1800
and 1900 no one cared
through the 30s and 40s no one cared
suddenly in the mid 40s
there started to be a buzz people
realized that 1950 was going to happen
and it could be big but nothing got
people interested in 1950 like the year
1950 people were walking around obsessed
they couldn’t stop talking about all the
things they did in 1950 all the things
they were planning to do in 1950 all the
dreams of what they wanted to accomplish
in 1950 in fact 1950 was so fascinating
that for years thereafter people just
kept talking about all the amazing
things that happened in 51 52 53 finally
in 1954 someone woke up and realized
that 1950 had gotten somewhat passe and
just like that the bubble burst and the
story of 1950 is the story of every year
that we have on record with a little
twist because now we’ve got these nice
charts and because we have these nice
charts we can measure things we can say
well how fast does the bubble burst and
it turns out that we can measure that
very precisely equations were derived
graphs were produced and the net result
is that we find that the bubble bursts
faster and faster with each passing year
we are losing interest in the past more
rapidly now a little piece of career
advice so for those of you who seek to
be famous you can learn from the
25 most famous political figures
authors actors and so on so if you
wanna become famous early on you should
be an actor because then fame starts
rising by the end of your 20s you’re
still young it’s really great now if you
can wait a little bit
you should be an author because then you
rise to very great heights like Mark
Twain for instance extremely famous but
if you want to reach the very top you
should delay gratification and of course
become a politician right so here you
will become famous by the end of your
50s and become very very famous
afterwards so scientists also tend to
get famous when they’re much much
older
like for instance biologists and physicists
can be almost as famous as actors one
mistake you should not do is become a
mathematician
if you do that you might think oh
great I’m going to do my best work when I’m
in my 20s but guess what nobody will
really care
there are more sobering notes among the
n-grams for instance here’s the
trajectory of Marc Chagall an artist
born in 1887 and this looks like the
normal trajectory of a famous person he
gets more and more and more and more
famous except if you look in German if you
look in German you see something
completely bizarre something you pretty
much never see which is he becomes
extremely famous and then all of a
sudden plummets going through a nadir
between 1933 and 1945 before rebounding
afterwards and of course what we’re
seeing is the fact that Marc Chagall was
a Jewish artist in Nazi Germany now
these signals are actually so strong
that we don’t need to know that someone
was censored we can actually figure it
out using really basic signal processing
here’s a simple way to do it well a
reasonable expectation is that
somebody’s fame in a given period of
time should be roughly the average of
their fame before and their fame after
so that’s sort of what we expect now we
compare that to the fame that we observe
and we just divide one by the other to
produce something we call the
suppression index if the suppression
index is very very very small then you
very well might be being suppressed if
it’s very large
maybe you’re benefiting from propaganda
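The recipe just described can be sketched roughly as follows. The function name `suppression_index`, the ten-year comparison windows, and the fame numbers are all assumptions made for this illustration, not values from the talk. The ratio is taken as observed over expected, since that is the direction in which a very small index means suppression.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def suppression_index(fame_by_year, start, end, window=10):
    """Observed fame in [start, end] divided by expected fame.

    Expected fame is the average of the mean fame in the `window` years
    before `start` and the mean fame in the `window` years after `end`.
    The window length is an assumption made for this sketch.
    """
    before = [f for y, f in fame_by_year.items() if start - window <= y < start]
    during = [f for y, f in fame_by_year.items() if start <= y <= end]
    after = [f for y, f in fame_by_year.items() if end < y <= end + window]
    expected = (mean(before) + mean(after)) / 2
    observed = mean(during)
    return observed / expected


# Made-up yearly fame values: steady, then a sharp dip, then a rebound.
fame = {}
for year in range(1923, 1933):
    fame[year] = 4.0
for year in range(1933, 1946):
    fame[year] = 0.5
for year in range(1946, 1956):
    fame[year] = 6.0

index = suppression_index(fame, 1933, 1945)
print(f"{index:.2f}")  # 0.10
```

With these toy numbers the expected fame is 5.0 and the observed fame is 0.5, so the index comes out at one tenth, well below the value of one that a flat, unsuppressed series would give.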
now you can actually look at the
distribution of suppression indices over
whole populations so for instance here
the suppression indices for 5000 people picked
in English books where there’s no
known suppression would be like this
basically tightly centered around one
what you expect is basically what you
observe this is the distribution seen in
Nazi Germany it’s very different it’s
shifted to the left
people are talked about twice less than they
should have been but much more
importantly the distribution is much
wider there are many people who end up
on the far left of this distribution who
are talked about ten times less than
they should have been but then also many
people on the far right who seem to
benefit from propaganda this picture
here is the hallmark of censorship in
the book record so culturomics is what
we call this method it’s kind of like
genomics except genomics is kind of a lens
on biology through the window of the
sequence of bases in the human genome
culturomics is similar it’s the
application of massive scale data
collection and analysis to the study of
human culture here instead of through the
lens of a genome through the lens of
digitized pieces of the historical
record the great thing about
culturomics is that everyone can do it why can
everyone do it everyone can do it
because three guys Jon Orwant Matt
Gray and Will Brockman over at Google
saw the prototype of the Ngram Viewer
and they said this is so fun we have to
make this available for people and so in two
weeks flat the two weeks before our
paper came out they coded up a version of
the Ngram Viewer for the general public
and so you too can type in any word or
phrase that you’re interested in and see
its n-gram immediately and also
browse examples of all the various books
in which your n-gram appears now this
was used over a million times on the
first day and this is really the best of
all the queries right so people want to
be their best put their best foot
forward but it turns out in the 18th
century people didn’t really care
about that at all they didn’t want to be
their best they wanted to be their beft so
what happened is of course this is just a
mistake right it’s not that they strove
for mediocrity it’s just that the s
used to be written differently kind
of like an f now of course the OCR
didn’t pick this up at the time so
we reported this in the Science
article that we wrote but it turns out
that it should just stand as a reminder
that although this is a lot of fun when
you interpret these graphs you have to
be very careful you have to adopt the
best standards in the sciences people
have been using this for all kinds of fun
purposes
actually we’re not going to have to
talk we’ll just show you all the slides
and remain silent okay this person was
interested in the history of frustration
there’s various types of
frustration if you stub your toe that’s a
one a argh if the planet Earth is
annihilated by the Vogons to make room
for an interstellar bypass that’s an eight a
aaaaaaaargh
this person studied all the arghs from one
through eight a’s and it turns out that the less
frequent arghs are of course the ones
that correspond to things that are more
frustrating except oddly in the early
80s we think that might have something
to do with Reagan all right the bottom
line is ok there are many usages of this
data but the bottom line is that the
historical record is being digitized
Google has started to digitize 15 million
books that’s 12% of all the books that have
ever been published it’s a pretty
sizable chunk of human culture
there’s much more in culture there’s
manuscripts there’s newspapers there’s
things that are not text like art and
paintings these will all happen to be on our
computers on computers across the world
and when that happens that will
transform the way we have to understand
our past our present and human culture
thank you very much
[Applause]