Massive-scale online collaboration Luis von Ahn

how many of you have had to fill out

some sort of web form where you’ve been asked to

read a distorted sequence of characters

like this how many of you found it

really really annoying okay outstanding

so I invented that or I was one of the

people who did it that thing is called a

CAPTCHA and the reason it is there is to

make sure that you the entity filling

out the form are actually a human and

not some sort of computer program that

was written to submit the form millions

and millions of times the reason it

works is because humans at least non

visually impaired humans have no trouble

reading these distorted squiggly

characters whereas computer programs

simply can’t do it as well yet so for

example in the case of Ticketmaster the

reason you have to type these distorted

characters is to prevent scalpers from

writing a program that can buy millions

of tickets two at a time

CAPTCHAs are used all over the Internet

and since they’re used so often a lot of

times the precise sequence of random

characters that are shown to the user is

not so fortunate so this is an example

from the Yahoo registration page the

random characters that happened to be

shown to the user were W A I T which of

course spell a word but the best part is

the message that the Yahoo help desk got

about 20 minutes later

this person thought they needed to wait this

of course is not as bad as this poor

person okay now the CAPTCHA project is

something that we did here at Carnegie

Mellon over ten years ago and it’s been

used everywhere let me now tell you

about a project that we did a few years

later which is sort of the next

evolution of CAPTCHAs this is a project

that we call reCAPTCHA which is

something that we started here at

Carnegie Mellon then we turned it into a

startup company and then about a year

and a half ago Google actually acquired

this company so let me tell you what

this project started ok so this project

started from the following realization

it turns out that approximately 200

million CAPTCHAs are typed every day by

people around the world ok when I first

heard this I was quite proud of myself I

thought look at the impact that my

research has had but then I started

feeling bad so here’s the thing each

time you type a CAPTCHA essentially you

waste 10 seconds of your time and if you

multiply that by 200 million you get

that humanity as a whole is wasting

about 500,000 hours every day typing

these annoying CAPTCHAs so then I

started feeling bad
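
The arithmetic behind that estimate can be checked with a short sketch. The two input figures are the ones quoted in the talk; the function and variable names are made up for illustration.

```python
# Figures quoted in the talk.
captchas_per_day = 200_000_000   # CAPTCHAs typed per day worldwide
seconds_per_captcha = 10         # time spent typing each one

SECONDS_PER_HOUR = 3600


def hours_spent_per_day(count: int, seconds_each: int) -> float:
    """Total human time spent per day, converted from seconds to hours."""
    return count * seconds_each / SECONDS_PER_HOUR


wasted_hours = hours_spent_per_day(captchas_per_day, seconds_per_captcha)
print(f"{wasted_hours:,.0f} hours per day")
```

The product comes out a little above half a million hours, which is consistent with the "about 500,000 hours" figure in the talk.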

and then I started thinking well of

course we can’t just get rid of CAPTCHAs

because the security of the web sort

of depends on them but then I started

thinking is there any way we can use

this effort for something that is good

for humanity so see here’s the thing

while you’re typing a CAPTCHA during

those 10 seconds your brain is doing

something amazing your brain is doing

something that computers cannot yet do

so can we get you to do useful work for

those 10 seconds another way of putting

it is is there some humongous problem

that we cannot yet get computers to

solve but that somehow we can split into

tiny 10 second chunks such that each

time somebody solves a CAPTCHA they

solve a little bit of this problem and

the answer to that is yes and this is

what we’re doing now so what you may not

know is that nowadays while you’re

typing a CAPTCHA not only are you

authenticating yourself as a human but

in addition you’re actually helping us

to digitize books okay so let me explain

how this works so there’s a lot of

projects out there trying to digitize

books Google has one the Internet

Archive has one Amazon now with the

Kindle is trying to digitize books

basically the way this works is you

start with an old book like a physical

thing you’ve seen those things right

like a book so you start with a

book and then you scan it now scanning a

book is like taking a digital photograph

of every page of the book it gives you

an image for every page of the book this

is an image with text for every page of

the book the next step in the process is

that the computer needs to be able to

decipher all of the words in this image

that’s done using a technology called

OCR for optical character recognition

which takes a picture of text and tries

to figure out what the text is in there

now the problem is that OCR is not

perfect especially for older books where

the ink has faded and the pages have

turned yellow OCR cannot recognize a lot

of the words for example for things that

were written more than 50 years ago the

computer cannot recognize about thirty

percent of the words so what we’re doing

now is we’re taking all of the words

that the computer cannot recognize and

we’re getting people to read them for us

while they’re typing a CAPTCHA on the

Internet okay so next time you type a

CAPTCHA these words that you’re typing

are actually words that are coming from

books that are being digitized that the

computer could not recognize now the

reason we have two words nowadays

instead of one is because you see one of

the words is a word that we that the

system just got out of a book it didn’t

know what it was and it’s going to

present it to you but since it doesn’t

know the answer for it it cannot grade

it for you so what we do is we give you

another word one for which the system

does know the answer okay we don’t tell

you which one’s which and we say please

type both and if you type the correct

word for the one for which the system

already knows the answer

it assumes you’re human and it also gets

some confidence that you typed the other

word correctly and if we repeat this

process to like 10 different people and

all of them agree on what the new word

is then we get one more word digitized

accurately so this is how the system

works and basically since we released it

about three or four years ago a lot of

websites have started switching from the

old CAPTCHA where people wasted their

time to the new CAPTCHA where people are

helping to digitize books so for example

Ticketmaster so every time you buy

tickets on Ticketmaster you help to

digitize a book Facebook every time you

add a friend or poke somebody you help

to digitize a book Twitter and about

350,000 other sites are all using

reCAPTCHA and in fact the number of

sites that are using reCAPTCHA is so high

that the number of words that we’re

digitizing per day is really really

large it’s about 100 million a day which

is the equivalent of about two and a

half million books a year and this is

all being done one word at a time by

just people typing CAPTCHAs on the

Internet
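
The two-word check described in the talk can be sketched roughly as follows. This is an illustration only, not the actual reCAPTCHA implementation: the class name, the word identifiers, and the strict "everyone must agree" rule are assumptions, and the threshold of 10 is taken from the "like 10 different people" remark.

```python
from collections import Counter, defaultdict

AGREEMENT_THRESHOLD = 10  # assumed from "like 10 different people" in the talk


def normalize(text: str) -> str:
    """Compare answers case-insensitively and ignore surrounding spaces."""
    return text.strip().lower()


class UnknownWordPool:
    """Collects typed guesses for words that OCR could not read (hypothetical)."""

    def __init__(self, threshold: int = AGREEMENT_THRESHOLD):
        self.threshold = threshold
        self.votes = defaultdict(Counter)  # word_id -> Counter of typed guesses
        self.digitized = {}                # word_id -> accepted text

    def submit(self, control_answer: str, typed_control: str,
               word_id: str, typed_unknown: str) -> bool:
        """Return True if the user passes as human.

        The user is graded only on the control word, whose answer is known.
        A guess for the unknown word is recorded only if the control word
        was typed correctly.
        """
        if normalize(typed_control) != normalize(control_answer):
            return False  # failed the known word: reject and discard the guess
        if word_id in self.digitized:
            return True   # this word has already been accepted
        guess = normalize(typed_unknown)
        counts = self.votes[word_id]
        counts[guess] += 1
        # Accept the word once enough people have answered and all of them agree.
        if len(counts) == 1 and counts[guess] >= self.threshold:
            self.digitized[word_id] = guess
        return True


# Usage with a lowered threshold so the example stays short.
pool = UnknownWordPool(threshold=3)
for _ in range(3):
    pool.submit("morning", "Morning", "scan-17", "overlooks")
print(pool.digitized)
```

Requiring unanimous agreement is the strictest reading of the description; a production system would presumably need some way to handle words on which people disagree, which the talk does not cover.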

now of course since we’re doing so many

words per day funny things can happen

and this is especially true because see

now we’re giving people two randomly

chosen English words next to each other

ok so funny things can happen so for

example we presented this word it’s

the word Christians there’s nothing

wrong with it but if you present it along

with another randomly chosen word bad

things can happen but it’s even worse

because the particular website where we

show this actually happened to be called

the Embassy of the Kingdom of God

oops here’s another really bad one

JohnEdwards.com

so we keep on insulting people left

and right every day now of course we’re

not just insulting people see here’s the

thing since we’re presenting two

randomly chosen words

interesting things can happen so this

actually has given rise to a really big

internet meme that tens of thousands of

people have participated in which is

called CAPTCHA art I’m sure some of you

have heard about it here’s how it

works okay imagine you’re

using the internet and you see a CAPTCHA

that you think is somewhat peculiar like

this CAPTCHA then what you’re supposed

to do is you take a screenshot of it

then of course you fill out the CAPTCHA

because you help us digitize a book

but then first you take a screenshot and

then you draw something that is related

to it that’s how it works there are tens

of thousands of these some of them are

very cute

some of them are funnier

and some of them like paleontological

shvisle contain Snoop Dogg okay so

this is my favorite number of reCAPTCHA

so this is the favorite thing that I

like about this whole project this is

the number of distinct people that have

helped us digitize at least one word out

of a book through reCAPTCHA 750 million which

is a little over ten percent of the

world’s population has helped us

digitize human knowledge and

it is numbers like these that motivate

my research agenda so the question that

motivates my research is the following

if you look at humanity’s large-scale

achievements these really big things

that humanity has gotten together and

done like historically like for example

building the pyramids of Egypt or the

Panama Canal or putting a man on the

moon there’s a curious fact about them

and it is that they were all done with

about the same number of people it’s

weird they’re all done with about a

hundred thousand people and the reason

for that is because before the internet

coordinating more than 100,000 people

let alone paying them was essentially

impossible but see now with the internet

I’ve just shown you a project where

we’ve gotten 750 million people to help

us digitize human knowledge so the

question that motivates my research is

if we can put a man on the moon with a

hundred thousand what can we do with 100

million so based on this question we’ve

had a lot of different projects that

we’ve been working on let me tell you

about one that I’m most excited about

this is something we’ve been

semi-quietly working on for the last year and

a half or so it hasn’t yet been launched

it’s called Duolingo since it hasn’t been

launched shh

I can trust you’ll do that okay so this

is a project here’s how it started it

started with me posing a question to my

graduate student Severin Hacker okay

that’s Severin Hacker so I posed the

question to my graduate student by the

way you did hear me correctly

his last name is Hacker so I posed this

question to him how can we get a hundred

million people translating the web into

every major language for free okay so

there’s a lot of things to say about

this question first of all translating

the web so right now the web is

partitioned into multiple languages a

large fraction of it is in English if

you don’t know any English you can’t

access it but there’s large fractions in

other different languages and if you

don’t know those languages you can’t

access it so I would like to translate

all of the web or at least most of the

web into every major language okay so

that’s what I would like to do

now some of you may say well why can’t

we use computers to translate so why

can’t we use machine translation machine

translation nowadays is starting to

translate some sentences here and there

why can’t we use it to translate the

whole web well the problem with that is

that it’s not yet good enough and it

probably won’t be for the next 15 to 20

years it makes a lot of mistakes even

when it doesn’t make a mistake since it

makes so many mistakes you don’t know

whether to trust it or not so let me

show you an example of something that

was translated with a machine it’s

actually it was a forum post it was

somebody who was trying to ask a question

about JavaScript it was translated from

Japanese into English so I’ll just let

you read this person starts apologizing

for the fact that it’s translated with

a computer okay so the next sentence is

going to be the preamble to the question

so he’s just explaining something

remember it’s a question about

JavaScript

okay then

then comes the first part of the

question

then comes my favorite part of the

question

and then comes the ending which is my

favorite part of the whole thing

okay so computer translation not yet

good enough okay so back to the question

so we need people to translate the whole

web okay so now the next question you

may have is well why can’t we just pay

people to do this we could pay

professional language translators to

translate the whole web we could do that

unfortunately it would be extremely

expensive for example translating a tiny

tiny fraction of the whole web Wikipedia

into one other language Spanish you

know Wikipedia exists in Spanish but

it’s very small compared to the size of

English it’s about twenty percent of the

size of English if we wanted to

translate the other eighty percent into

Spanish it would cost at least 50

million dollars and this is even at the

most exploitive outsourcing country out

there so it would be very expensive so what we

want to do is we want to get 100 million

people translating the web into every

major language for free okay now if this

is what you want to do you pretty

quickly realize you’re going to run into

two pretty big hurdles two big obstacles

okay the first one is a lack of

bilinguals okay so I don’t even know if

there exists a hundred million people

out there using the web who are

bilingual enough to help us translate

that’s a big problem the other problem

that you’re going to run into is a lack

of motivation how are we going to

motivate people to actually translate

the web for free normally you

have to pay people to do this so how are

we going to motivate them to do it for

free now when we were starting to think

about this we were blocked by these two

things but then we realized there’s

actually a way to solve both of these

problems with the same solution there’s

a way to kill two birds with one stone

and that is to transform language

translation into something that millions

of people want to do and that also helps

with the problem of lack of bilinguals and

that is language education okay so it

turns out that today there are over 1.2

billion people learning a foreign

language people really really want to

learn a foreign language and it’s not

just because they’re being forced to do

so in school for example in the United

States alone there are over 5 million people

who have paid over five hundred dollars

for software to learn a new language

okay so people really really want to

learn a new language so what we’ve been

working on for the last year and a half

is a new website it’s called Duolingo

where the basic idea is people learn a

new language for free while

simultaneously translating the web and

so basically they’re learning by doing

okay so the way this works is whenever you’re

just a beginner we give you very very

simple sentences there’s of course a lot

of very simple sentences on the web we

give you very very simple sentences

along with what each word means okay and

as you translate them and as you see how

other people translate them you start

learning the language and as you

get more and more advanced we give you

more and more complex sentences to

translate but at all times you’re

learning by doing ok now the crazy thing

about this method is that it

actually really works okay first of all

people are really really learning a

language we’re mostly done building it

and now we’re testing it people really

can learn a language with it and and

they learn it about as well as the

leading language learning software so

people really do learn a language and

not only do they learn it as well but

actually it’s way more interesting

because you see with Duolingo people

are actually learning with real content

as opposed to learning with made up

sentences people are learning with real

content which is inherently interesting

so people really do

learn a language but perhaps more

surprisingly the translations that we

get from people using the site even

though they’re just beginners the

translations that we get are as accurate

as those of professional language

translators which is very surprising so

let me show you one example this is a

sentence that was translated from German

into English the top is the German the

middle is an English translation that

was done by somebody who is a

professional language translator who we

paid twenty cents a word for this

translation and the bottom is a

translation by users of Duolingo none of

whom knew any German before they started

using the site if you can see it’s

pretty much perfect now of course we

play a trick here to make the

translations as good as professional

language translators we combine the

translations of multiple beginners to

get the quality of a single professional

translator ok now even though we’re

combining the translations the site

actually can translate pretty fast so

let me show you this is our estimate of

how fast we could translate Wikipedia

from English into Spanish remember this

is 50 million dollars worth of value ok
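
The talk does not say how the translations of multiple beginners are combined. One simple possibility, shown here purely as an assumed sketch, is a majority vote over lightly normalized submissions. The function names and the example sentences are made up for illustration and are not taken from Duolingo.

```python
import re
from collections import Counter


def normalize(sentence: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace so that
    near-identical submissions compare equal."""
    without_punctuation = re.sub(r"[^\w\s']", "", sentence.lower())
    return " ".join(without_punctuation.split())


def combine_translations(candidates: list[str]) -> str:
    """Return the submission whose normalized form was proposed most often.

    Ties go to the form that was submitted first. This is plain majority
    voting, an assumption rather than the method used by the real system.
    """
    if not candidates:
        raise ValueError("need at least one candidate translation")
    counts = Counter(normalize(c) for c in candidates)
    best_form, _ = counts.most_common(1)[0]
    # Return the first original submission that matches the winning form.
    for candidate in candidates:
        if normalize(candidate) == best_form:
            return candidate
    raise AssertionError("unreachable: the winning form came from candidates")


# Made-up beginner submissions for a single sentence.
submissions = [
    "The cat sleeps on the sofa.",
    "the cat sleeps on the sofa",
    "The cat is sleeping on sofa.",
    "Cat sleeps on the couch.",
]
print(combine_translations(submissions))
```

Exact-match voting only works when several learners converge on the same wording; a real system would likely need a softer notion of agreement, such as letting learners rate each other's translations.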

so if we wanted to translate Wikipedia

into Spanish we could do it in five

weeks with a hundred thousand active

users and we could do it in about 80

hours with a million active users since

all the projects that my group has

worked on so far have gotten millions of

users we’re hopeful that we’ll be able to

translate extremely fast with this

project now the thing that I’m most

excited about with Duolingo is that I

think this provides a fair business

model for language education so here’s

the thing the current business model for

language education is the student pays

in particular the student pays Rosetta

Stone five hundred dollars

that’s the current business model the

problem with this business model is that

ninety-five percent of the world’s

population doesn’t have five hundred

dollars so it’s extremely unfair towards

the poor okay this is totally biased towards

the rich now see in Duolingo

because while you learn you’re actually

creating value you are translating stuff

which for example we could charge

somebody for translations so this is how

we could monetize this since people are

creating value while they’re learning

they don’t have to pay with their money

they pay with their time but the magical

thing here is that they’re paying with

their time but that is time that would

have had to be spent anyways

learning the language okay so the nice

thing about Duolingo is I think it

provides a fair business model one that

doesn’t discriminate against poor people

so here’s the site

here’s the site we haven’t yet launched

but if you go there you can sign up to

be part of our private beta which is

probably going to start in about three

or four weeks we haven’t yet launched

this Duolingo by the way I’m the one

talking here but actually Duolingo is

the work of a really awesome team some

of whom are here so thank you