Massive-scale online collaboration Luis von Ahn
how many of you have had to fill out
some sort of web form where you’ve been asked to
read a distorted sequence of characters
like this how many of you found it
really really annoying okay outstanding
so I invented that or I was one of the
people who did it that thing is called a
CAPTCHA and the reason it is there is to
make sure that you the entity filling
out the form are actually a human and
not some sort of computer program that
was written to submit the form millions
and millions of times the reason it
works is because humans at least
non-visually-impaired humans have no trouble
reading these distorted squiggly
characters whereas computer programs
simply can’t do it as well yet so for
example in the case of Ticketmaster the
reason you have to type these distorted
characters is to prevent scalpers from
writing a program that can buy millions
of tickets two at a time
CAPTCHAs are used all over the Internet
and since they’re used so often a lot of
times the precise sequence of random
characters that are shown to the user is
not so fortunate so this is an example
from the Yahoo registration page the
random characters that happen to be
shown to the user were WAIT which of
course spell a word but the best part is
the message that the Yahoo help desk got
about 20 minutes later
this person thought they needed to wait this
of course is not as bad as this poor
person who okay now the CAPTCHA project is
something that we did here at Carnegie
Mellon over ten years ago and it’s been
used everywhere let me now tell you
about a project that we did a few years
later which is sort of the next
evolution of CAPTCHA this is a project
that we call reCAPTCHA which is
something that we started here at
Carnegie Mellon then we turned it into a
startup company and then about a year
and a half ago Google actually acquired
this company so let me tell you how
this project started ok so this project
started from the following realization
it turns out that approximately 200
million CAPTCHAs are typed every day by
people around the world ok when I first
heard this I was quite proud of myself I
thought look at the impact that my
research has had but then I started
feeling bad so here’s the thing each
time you type a CAPTCHA essentially you
waste 10 seconds of your time and if you
multiply that by 200 million you get
that humanity as a whole is wasting
about 500,000 hours every day typing
these annoying CAPTCHAs so then I
started feeling bad
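A quick check of the arithmetic above, as a minimal sketch. The 200 million CAPTCHAs per day and 10 seconds each are the figures quoted in the talk; the function and constant names are illustrative only. The exact product comes out a little above the rounded figure of 500,000 hours.

```python
# Figures quoted in the talk.
CAPTCHAS_PER_DAY = 200_000_000
SECONDS_PER_CAPTCHA = 10
SECONDS_PER_HOUR = 3600


def wasted_hours_per_day(captchas_per_day=CAPTCHAS_PER_DAY,
                         seconds_each=SECONDS_PER_CAPTCHA):
    """Total human time spent typing CAPTCHAs in one day, in hours."""
    return captchas_per_day * seconds_each / SECONDS_PER_HOUR


if __name__ == "__main__":
    # Roughly 555,556 hours, which the talk rounds to "about 500,000".
    print(f"{wasted_hours_per_day():,.0f} hours per day")
```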
and then I started thinking well of
course we can’t just get rid of CAPTCHAs
because the security of the web sort
of depends on them but then I started
thinking is there any way we can use
this effort for something that is good
for humanity so see here’s the thing
while you’re typing a CAPTCHA during
those 10 seconds your brain is doing
something amazing your brain is doing
something that computers cannot yet do
so can we get you to do useful work for
those 10 seconds another way of putting
it is is there some humongous problem
that we cannot yet get computers to
solve but that somehow we can split into
tiny 10 second chunks such that each
time somebody solves a CAPTCHA they
solve a little bit of this problem and
the answer to that is yes and this is
what we’re doing now so what you may not
know is that nowadays while you’re
typing a CAPTCHA not only are you
authenticating yourself as a human but
in addition you’re actually helping us
to digitize books okay so let me explain
how this works so there’s a lot of
projects out there trying to digitize
books Google has one the Internet
Archive has one Amazon now with the
Kindle is trying to digitize books
basically the way this works is you
start with an old book like a physical
thing you’ve seen those things right
like a book so you start with a
book and then you scan it now scanning a
book is like taking a digital photograph
of every page of the book it gives you
an image for every page of the book this
is an image with text for every page of
the book the next step in the process is
that the computer needs to be able to
decipher all of the words in this image
that’s done using a technology called
OCR for optical character recognition
which takes a picture of text and tries
to figure out what the text is in there
now the problem is that OCR is not
perfect especially for older books where
the ink has faded and the pages have
turned yellow OCR cannot recognize a lot
of the words for example for things that
were written more than 50 years ago the
computer cannot recognize about thirty
percent of the words so what we’re doing
now is we’re taking all of the words
that the computer cannot recognize and
we’re getting people to read them for us
while they’re typing a CAPTCHA on the
Internet okay so next time you type a
CAPTCHA these words that you’re typing
are actually words that are coming from
books that are being digitized that the
computer could not recognize and now the
reason we have two words nowadays
instead of one is because you see one of
the words is a word that the
system just got out of a book it didn’t
know what it was and it’s going to
present it to you but since it doesn’t
know the answer for it it cannot grade
it for you so what we do is we give you
another word one for which the system
does know the answer okay we don’t tell
you which one’s which and we say please
type both and if you type the correct
word for the one for which the system
already knows the answer
it assumes you’re human and it also gets
some confidence that you typed the other
word correctly and if we repeat this
process to like 10 different people and
all of them agree on what the new word
is then we get one more word digitized
accurately so this is how the system
works and basically since we released it
about three or four years ago a lot of
websites have started switching from the
old CAPTCHA where people wasted their
time to the new CAPTCHA where people are
helping to digitize books so for example
Ticketmaster so every time you buy
tickets on Ticketmaster you help to
digitize a book Facebook every time you
add a friend or poke somebody you help
to digitize a book Twitter and about
350,000 other sites are all using
reCAPTCHA and in fact the number of
sites that are using reCAPTCHA is so high
that the number of words that we’re
digitizing per day is really really
large it’s about 100 million a day which
is the equivalent of about two and a
half million books a year and this is
all being done one word at a time by
just people typing CAPTCHAs on the
Internet
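The two-word scheme described above can be sketched roughly as follows. This is an illustration under stated assumptions, not reCAPTCHA's actual implementation: the names `UnknownWord` and `check_challenge`, the case-insensitive comparison, and the rule of requiring 10 unanimous answers are all assumptions made for the example (the talk only says "like 10 different people").

```python
from collections import Counter
from dataclasses import dataclass, field

# Assumed threshold; the talk gives "like 10 different people".
AGREEMENT_THRESHOLD = 10


def normalize(text: str) -> str:
    """Compare answers ignoring case and surrounding whitespace (an assumption)."""
    return text.strip().lower()


@dataclass
class UnknownWord:
    """A scanned word that OCR could not read, plus the human guesses so far."""
    word_id: str
    votes: Counter = field(default_factory=Counter)

    def resolved(self, threshold: int = AGREEMENT_THRESHOLD):
        """Return the agreed transcription once `threshold` people have
        answered and all of them gave the same answer, otherwise None."""
        if len(self.votes) == 1:
            answer, count = next(iter(self.votes.items()))
            if count >= threshold:
                return answer
        return None


def check_challenge(control_answer: str, typed_control: str,
                    unknown: UnknownWord, typed_unknown: str) -> bool:
    """Grade one two-word challenge.

    The user passes only if the control word (whose answer is already known)
    is typed correctly. Only then is the guess for the unknown word recorded.
    """
    if normalize(typed_control) != normalize(control_answer):
        return False  # failed the control word: treat as non-human, discard guess
    unknown.votes[normalize(typed_unknown)] += 1
    return True


if __name__ == "__main__":
    word = UnknownWord("scan-0001")  # hypothetical identifier
    for _ in range(10):
        check_challenge("morning", "morning", word, "overlooks")
    print(word.resolved())  # prints "overlooks"
```

A real system would likely weigh disagreeing answers rather than demand strict unanimity; the strict rule here simply mirrors the "all of them agree" wording.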
now of course since we’re doing so many
words per day funny things can happen
and this is especially true because see
now we’re giving people two randomly
chosen English words next to each other
ok so funny things can happen so for
example we presented this word it’s
the word Christians there’s nothing
wrong with it but if you present it along
with another randomly chosen word bad
things can happen but it’s even worse
because the particular website where we
showed this actually happened to be called
The Embassy of the Kingdom of God
yep here’s another really bad one at
JohnEdwards.com
so we keep on insulting people left
and right every day now of course we’re
not just insulting people see here’s the
thing since we’re presenting two
randomly chosen words
interesting things can happen so this
actually has given rise to a really big
internet meme that tens of thousands of
people have participated in which is
called CAPTCHA art I’m sure some of you
have heard about it here’s here’s how it
works okay imagine you’re
using the internet and you see a CAPTCHA
that you think is somewhat peculiar like
this CAPTCHA then what you’re supposed
to do is you take a screenshot of it
then of course you fill out the CAPTCHA
because you help us digitize a book
but then first you take a screenshot and
then you draw something that is related
to it that’s how it works there are tens
of thousands of these some of them are
very cute
some of them are funnier
and some of them like paleontological
shvisle contain Snoop Dogg okay so
this is my favorite number of reCAPTCHA
so this is the favorite thing that I
like about this whole project this is
the number of distinct people that have
helped us digitize at least one word out
of a book through reCAPTCHA 750 million which
is a little over ten percent of the
world’s population has helped us
digitize human knowledge and
it is numbers like these that motivate
my research agenda so the question that
motivates my research is the following
if you look at humanity’s large-scale
achievements these really big things
that humanity has gotten together and
done like historically like for example
building the pyramids of Egypt or the
Panama Canal or putting a man on the
moon there’s a curious fact about them
and it is that they were all done with
about the same number of people it’s
weird they were all done with about a
hundred thousand people and the reason
for that is because before the internet
coordinating more than 100,000 people
let alone paying them was essentially
impossible but see now with the internet
I’ve just shown you a project where
we’ve gotten 750 million people to help
us digitize human knowledge so the
question that motivates my research is
if we can put a man on the moon with a
hundred thousand what can we do with 100
million so based on this question we’ve
had a lot of different projects that
we’ve been working on let me tell you
about one that I’m most excited about
this is something we’ve been
semi-quietly working on for the last year and
a half or so it hasn’t yet been launched
it’s called Duolingo since it hasn’t been
launched
I can trust you’ll do that okay so this
is a project here’s how it started it
started with me posing a question to my
graduate student Severin Hacker okay
that’s Severin Hacker so I posed a
question to my graduate student by the
way you did hear me correctly
his last name is Hacker so I posed this
question to him how can we get a hundred
million people translating the web into
every major language for free okay so
there’s a lot of things to say about
this question first of all translating
the web so right now the web is
partitioned into multiple languages a
large fraction of it is in English if
you don’t know any English you can’t
access it but there’s large fractions in
other different languages and if you
don’t know those languages you can’t
access it so I would like to translate
all of the web or at least most of the
web into every major language okay so
that’s that’s what I would like to do
now some of you may say well why can’t
we use computers to translate so why
can’t we use machine translation machine
translation nowadays is starting to
translate some sentences here and there
why can’t we use it to translate the
whole web well the problem with that is
that it’s not yet good enough and it
probably won’t be for the next 15 to 20
years it makes a lot of mistakes even
when it doesn’t make a mistake since it
makes so many mistakes you don’t know
whether to trust it or not so let me
show you an example of something that
was translated with a machine
actually it was a forum post from
somebody who was trying to ask a question
about JavaScript it was translated from
Japanese into English so I’ll just let
you read this person starts apologizing
for the fact that it’s translated with
a computer okay so the next sentence is
going to be the preamble to the question
so he’s just explaining something
remember it’s a question about
JavaScript
okay then
then comes the first part of the
question
then comes my favorite part of the
question
and then comes the ending which is my
favorite part of the whole thing
okay so computer translation not yet
good enough okay so back to the question
so we need people to translate the whole
web okay so now the next question you
may have is well why can’t we just pay
people to do this we could pay
professional language translators to
translate the whole web we could do that
unfortunately it would be extremely
expensive for example translating a tiny
tiny fraction of the whole web Wikipedia
into one other language Spanish you
know Wikipedia exists in Spanish but
it’s very small compared to the size of
English it’s about twenty percent of the
size of English if we wanted to
translate the other eighty percent into
Spanish it would cost at least 50
million dollars and this is even at the
most exploitive outsourcing country out
there so it would be very expensive so what we
want to do is we want to get 100 million
people translating the web into every
major language for free okay now if this
is what you want to do you pretty
quickly realize you’re going to run into
two pretty big hurdles two big obstacles
okay the first one is a lack of
bilinguals okay so I don’t even know if
there exists a hundred million people
out there using the web who are
bilingual enough to help us translate
that’s a big problem the other problem
that you’re going to run into is a lack
of motivation how are we going to
motivate people to actually translate
the web for free normally you
have to pay people to do this so how are
we going to motivate them to do it for
free now when we were starting to think
about this we were blocked by these two
things but then we realized there’s
actually a way to solve both of these
problems with the same solution there’s
a way to kill two birds with one stone
and that is to transform language
translation into something that millions
of people want to do and that also helps
with the problem of lack of bilinguals and
that is language education okay so it
turns out that today there are over 1.2
billion people learning a foreign
language people really really want to
learn a foreign language and it’s not
just because they’re being forced to do
so in school for example in the United
States alone there are over 5 million people
who have paid over five hundred dollars
for software to learn a new language
okay so people really really want to
learn a new language so what we’ve been
working on for the last year and a half
is a new website it’s called Duolingo
where the basic idea is people learn a
new language for free while
simultaneously translating the web and
so basically they’re learning by doing
okay so the way this works is whenever
you’re just a beginner we give you very very
simple sentences there’s of course a lot
of very simple sentences on the web we
give you very very simple sentences
along with what each word means okay and
as you translate them and as you see how
other people translate them you start
learning the language and as you
get more and more advanced we give you
more and more complex sentences to
translate but at all times you’re
learning by doing ok now the crazy thing
about this method is that it
actually really works okay first of all
people are really really learning a
language we’re mostly done building it
and now we’re testing it people really
can learn a language with it and and
they learn it about as well as the
leading language learning software so
people really do learn a language and
not only do they learn it as well but
actually it’s way more interesting
because you see with Duolingo people
are actually learning with real content
as opposed to learning with made up
sentences people are learning with real
content which is inherently interesting
the other thing so people really do
learn a language but perhaps more
surprisingly the translations that we
get from people using the site even
though they’re just beginners the
translations that we get are as accurate
as those of professional language
translators which is very surprising so
let me show you one example this is a
sentence that was translated from German
into English the top is the German the
middle is an English translation that
was done by somebody who is a
professional language translator who we
paid twenty cents a word for this
translation and the bottom is a
translation by users of Duolingo none of
whom knew any German before they started
using the site if you can see it’s
pretty much perfect now of course we
play a trick here to make the
translations as good as professional
language translators we combine the
translations of multiple beginners to
get the quality of a single professional
translator ok now even though we’re
combining the translations the site
actually can translate pretty fast so
let me show you this is our estimates of
how fast we could translate Wikipedia
from English into Spanish remember this
is 50 million dollars worth of value ok
so if we wanted to translate Wikipedia
into Spanish we could do it in five
weeks with a hundred thousand active
users and we could do it in about 80
hours with a million active users since
all the projects that my group has
worked on so far have gotten millions of
users we’re hopeful that we’ll be able to
translate extremely fast with this
project now the thing that I’m most
excited about with Duolingo is that I
think this provides a fair business
model for language education so here’s
the thing the current business model for
language education is the student pays
in particular the student pays Rosetta
Stone five hundred dollars
that’s the current business model the
problem with this business model is that
ninety-five percent of the world’s
population doesn’t have five hundred
dollars so it’s extremely unfair towards
the poor okay this is totally biased towards
the rich now see in Duolingo
because while you learn you’re actually
creating value you are translating stuff
which for example we could charge
somebody for translations so this is how
we could monetize this since people are
creating value while they’re learning
they don’t have to pay with their money
they pay with their time but the magical
thing here is that they’re paying with
their time but that is time that would
have had to be spent anyways
learning the language okay so the nice
thing about Duolingo is I think it
provides a fair business model one that
doesn’t discriminate against poor people
so here’s the site
here’s the site we haven’t yet launched
but if you go there you can sign up to
be part of our private beta which is
probably going to start in about three
or four weeks we haven’t yet launched
this Duolingo by the way I’m the one
talking here but actually Duolingo is
the work of a really awesome team some
of whom are here so thank you