Kalika Bali: The giant leaps in language technology and who’s left behind | TED

I’m Kalika Bali.
I’m a linguist by training

and a technologist by profession.

I have worked in academia,

in startups, in small companies
and multinationals for over two decades,

doing research in and building
language technology systems.

My dream is to see technology work
across the language barrier.

As a researcher
at Microsoft Research Labs India

I work in the field of language technology
and speech technology.

And I worry about how
can we make technology accessible

to people across the board,

you know, irrespective
of the language that they speak.

So natural language processing,

artificial intelligence,
speech technology,

these are very big words,
they are buzzwords right now.

Everybody is talking about what exactly
is NLP or natural language processing.

So in very simple terms,

this is the part
of computer science engineering

that makes machines process,

understand and generate natural language,

which is the language that humans speak.

When you are interacting with a bot
trying to book your train tickets

or flight tickets,

when you are speaking to a voice-based
digital assistant in your phone,

it’s natural language processing

that underpins the entire technology
that makes that work.

But how does this work?

How does NLP work?

In a very, very basic way,

it’s about data.

So a huge amount of data
of how actually humans use language

is then processed
by certain algorithms and techniques

that make the machines learn the patterns

of natural language of humans, right?

These days, another buzzword that you
hear a lot about is deep neural networks.

And these are the advanced techniques

that underpin a lot of the NLP stuff
that happens right now.

And I will not go into the details
of how that works,

but the thing that you really
have to understand and keep in mind

is that all of this requires
a humungous amount of data,

natural language data.

If you want a speech system
to converse with you in Gujarati,

the first thing you require

is a lot of data of Gujarati people
speaking to each other

in their own language.

So in 2017, Microsoft came up
with a speech recognition system

which was able
to transcribe speech into text

better than a human did.

And this system was trained

on 200 million transcribed words.

In 2018, an English-Chinese
machine translation system

was able to translate
from English to Chinese

as well as any human bilingual could.

And this was trained
on 18 million bilingual sentence pairs.

This is a very, very exciting time
in natural language processing

and in technology as such.

You know, we are seeing science fiction,
which we had read about and watched,

kind of come true
in front of our own eyes.

We are making giant leaps
in technical advancement.

But these giant leaps
are limited to very few languages.

So Monojit Choudhury,

who’s like a very good friend of mine

and a colleague,

he has studied this in some detail

and he has looked at resource distribution
across languages in the world.

And he says that these follow
what is called a power-law distribution,

which essentially means
that there are four languages,

Arabic, Chinese, English and Spanish,

which have the maximum amount
of resources available.

There are another handful of languages
which can also benefit from, you know,

the resources and the technology
that’s available right now.

But there are 90 percent
of the world’s languages

which have no resources

or very little resources available.
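
To make the power-law idea concrete, here is a minimal sketch. It assumes that the language at rank r holds resources proportional to 1 / r**exponent; the exponent, the function name `zipf_shares` and the count of 5,000 languages are illustrative assumptions, not figures from the study mentioned above, and the numbers are synthetic.

```python
# Illustrative only: synthetic numbers, not measurements of any real language.

def zipf_shares(num_languages: int, exponent: float = 1.0) -> list[float]:
    """Return each rank's share of total resources under a power law.

    The language at rank r gets a weight proportional to 1 / r**exponent,
    and the weights are normalized so that all shares sum to 1.
    """
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_languages + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical setup: 5,000 languages and an assumed exponent of 2.
shares = zipf_shares(num_languages=5000, exponent=2.0)

top4 = sum(shares[:4])                 # the four best-resourced languages
bottom_90_percent = sum(shares[500:])  # ranks 501 to 5000

print(f"top 4 languages hold {top4:.1%} of the resources")
print(f"bottom 90% of languages hold {bottom_90_percent:.1%}")
```

Under these assumed values, a handful of top-ranked languages hold most of the total while the long tail shares a sliver, which is the shape described here.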

This revolution that we are talking about

has essentially bypassed
5,000 languages of the world.

Now, what this means is
that resource-rich languages

have technologies built for them,

so researchers and technologists
get attracted towards them.

They build more technologies for them.
They create more resources.

So it’s like a rich getting richer
kind of a cycle.

And the resource-poor languages stay poor,

there’s no technology for them,
nobody works for them.

And this divide,
digital divide between languages

is ever-expanding

and by implication also the divide
between the communities

that speak these languages is expanding.

So in Microsoft, in Project Ellora,
we aim to bridge this gap.

We are trying to see how can we create
more data by innovative methods,

have more techniques to build technology
without having a lot of resources,

and what are the applications
that can truly benefit these communities.

So at the moment,
this might seem very theoretical,

like what is she talking about,
data and techniques and technology.

So let me give you
a very concrete example here.

I’m a linguist at heart, I love languages,
and that’s what I love talking about.

So let me tell you about a language
that many of you might not know about.

Gondi.

Gondi is a South-Central
Dravidian language.

It is spoken by three million people
in five states of India.

And to put this
in some kind of perspective,

Norwegian is spoken by five million people

and Welsh by a little under a million.

So Gondi is actually a pretty robust
and pretty large community

of the Gond tribals in India.

But by UNESCO’s
Atlas of Languages in Danger,

Gondi is designated vulnerable status.

CGNet Swara is an NGO
that provides a citizen journalism portal

for the Gond community

by making local stories
accessible through mobile phones.

There’s absolutely
no tech support for Gondi.

There is no data available for Gondi,
no resources available for Gondi.

So all content that is created,
moderated and edited is done manually.

Now, under Project Ellora,

what we did was that we
brought together all the stakeholders,

an NGO like CGNet Swara,

and academic institutions,
like IIIT Naya Raipur,

a not-for-profit
children’s book publisher,

like Pratham Books,

and most importantly,
the speakers of the community.

The Gond tribals themselves
participated in this activity

and for the first time edited
and translated children’s books in Gondi.

We were able to put out 200 books
for the very first time in Gondi,

so that the children had access to stories
and books in their own language.

Another extension of this
was Adivasi Radio,

which was like an app that we built
and developed in Microsoft Research,

and then put out there,
along with our stakeholders,

which takes a Hindi text-to-speech system

and allows it to read out news
and articles provided by CGNet Swara

in Gondi language.

Users can now use this app to read,

watch news and access any information

through text and voice
in their own language.

A very interesting thing is that this app
is now being used to translate –

by the community to translate text
from Hindi to Gondi.

Now, what that will result in
is a lot of parallel data,

that we call parallel data,

that will allow us to build
machine translation systems for Gondi,

which will truly open up a window
for the Gond community to the world.
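
As a rough illustration of what "parallel data" means, the sketch below pairs each source sentence with its translation and writes the pairs out as tab-separated lines, a common input shape for training translation systems. The placeholder strings, the function names and the TSV layout are assumptions made for this example; they do not describe the actual Adivasi Radio pipeline.

```python
import csv
import io

# Placeholder strings stand in for real sentences; no actual Hindi or Gondi
# text is reproduced here.
hindi_sentences = ["<Hindi sentence 1>", "<Hindi sentence 2>", "<Hindi sentence 3>"]
gondi_translations = ["<Gondi translation 1>", "", "<Gondi translation 3>"]


def build_parallel_corpus(sources, targets):
    """Pair each source sentence with its translation, skipping empty ones."""
    if len(sources) != len(targets):
        raise ValueError("source and target lists must be the same length")
    return [
        (src.strip(), tgt.strip())
        for src, tgt in zip(sources, targets)
        if src.strip() and tgt.strip()
    ]


def to_tsv(pairs):
    """Serialize sentence pairs as tab-separated text, one pair per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(pairs)
    return buffer.getvalue()


# The second sentence has no translation yet, so only two pairs survive.
pairs = build_parallel_corpus(hindi_sentences, gondi_translations)
print(to_tsv(pairs))
```

Every sentence the community translates adds one more aligned pair of this kind, which is why the translation activity itself produces training data.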

And what is even more important
is now we know how to do this.

We have the entire pipeline
and we can replicate this for any language

and any language community

which is in a similar situation
as the Gond tribals.

Also education – yes, you know,
information access – yes,

but what about earning a living?

Right? What about – how can we make
these people earn a living

through the digital tools that all of us
just take for granted these days?

Vivek Seshadri,
who’s another researcher at MSR,

and his collaborator, Manu Chopra,

they’ve designed a platform called Karya

for providing digital microtasks
to the underserved communities.

Their aim was basically to find a way
to provide a means of dignified labor

to the populations, the rural populations

and the urban poor populations
of this country.

They don’t have access
to all the knowledge

to use the digital platforms

that all of us use every day
without even thinking, right?

But …

Here is a large

literate population
that wants to work, right,

and how can we make this
possible for them?

So Karya is one such way

through which this population
can get on to the digital world

and, you know,

through that find work and do tasks
that can then earn them money.

So we saw this and we thought,
oh, this is wonderful.

We could probably use this
for data collection as well.

So we went to Amale,

which is a small village of 200 people

in the Wada district of Maharashtra

and decided to use Karya
to collect Marathi data.

Now, I know what you are thinking –

I’m sure a lot of Marathi speakers
also in the audience –

that Marathi is not
a low-resource language.

Marathi is definitely
a mainstream language of the country.

But as far as language
technology is concerned,

Marathi is a low-resource language.

So we went to this village

and we had a very successful
data-collection trip.

And, you know,
this village is very remote.

They have no TV, they have no electricity,

they have no mobile signal.

You have to climb a hill
and wave your phone around

if you want to, you know,
use your mobile to call anyone.

So they gave us all this data.

But more than that, they gave us
very valuable lessons in life.

One is this pride in one’s own language.

The people of Amale
were thrilled to be doing this

because they were advancing
their own language by doing this.

The second was the value of community.

Very quickly, this became
a village community effort.

People would gather together in tasks
and do this together as a group.

And the third is
the importance of storytelling.

People of Amale were so starved of content
that in the morning, during the daytime,

they would do recordings
of stories in Karya

and then in the evening
they would gather the entire village

and retell and recount
these stories to the village.

So as scientists, we get so caught up

in the science and technology
part of what we are doing, you know –

which is the next best model to have,

how can we increase
the accuracy of my system,

how can I build
the next best system there is –

that we forget the reason
why we are doing this: the people.

And any successful technology is the one
that keeps the people and the users

up front and center.

And when we start doing that,

we also realize that technology
is probably a very small part of this

and there are other things in the story.

Maybe there are social, cultural
and policy interventions

that are required, as much as technology.

So some time back,
I worked on a project called VideoKheti

that allowed Hindi-speaking
farmers in Central India

to search for agricultural videos
by speaking into a phone-based app.

So we went to Madhya Pradesh
to collect data for this,

and we came back
and we were training our models

and we discovered
we’re getting very bad results.

This is not working.

So we were very confused.
Why is this happening?

So we looked deeper
and deeper into the data

and discovered that, yes,
we had collected data

from what we thought was a very silent,
quiet village in the evening.

But what we hadn’t heard
while we were doing this

was that there was this
constant buzz of night insects, you know?

So throughout the recordings,
we had this “bzz” of the insects,

which was actually distorting our speech.

The second thing was
that when we went there

to kind of test our app in the village,

I and my colleague Indrani Medhi,

who is a very well-regarded
design researcher,

we found that the women
couldn’t pronounce the Sanskritized words

that we had for some of the search terms.

So, like …

(speaks Hindi)

Which is like the term
for chemical pesticides, right?

Because we got these terms
from the agricultural extension center

and the women,
even though they are farming,

do not interact with that center at all.

The men do, the women probably
use something much simpler, like …

(speaks Hindi)

Which basically means
killing pests with medicine.

So what I have learned through my journey

and what I would like
to put across to you –

by now, I hope you’ve understood me –

is that the majority
of the world’s languages

require intensive investment
for resource creation

if they are to benefit
from language technology.

And this is unlikely to happen
in a very fast and efficient manner.

So it is extremely important
for us to ensure

that the community derives maximum benefit

from whatever that we are doing
in the language tech area.

And to do this and deliver
a positive social impact

on these communities,

we follow what we call the modified
4-D design thinking methodology.

So the 4-D means:
discover, design, develop and deploy.

So discover the problem
that language technology can solve

for a particular language community.

This observation-led approach
can help allocate resources

where they are most needed.

Design for the users and their language,

understand the diversity
in the linguistic properties

and the languages of the world.

And don’t think,
“Oh, this is made for English,

now how can we just adapt it
for Marathi or for Gondi?”

Develop rapidly and deploy frequently.

It’s an iterative process
that will help you fail fast

and early failures
will eventually lead to success.

The important thing is to persevere.

Do not give up.

And I remember the story
of these two Aboriginal Australian women,

Patricia O’Connor and Ysola Best.

In the mid-90s, they went
to the University of Queensland

and they wanted to learn
their own language, called Yugambeh,

and they were told very bluntly,
“Your language is dead.

It’s been dead for three decades.

You cannot work on this.
Find something else to work on.”

They did not give up.

They went to the community,

they dug up oral memories,
oral traditions, oral literature,

and founded the Yugambeh Museum,

which became the most important cultural
and linguistic center for the language

and its community.

They did not have technology.
They only had their willpower.

Now, with the power of technology,

we can ensure that the next page
is written in Sami from Finland,

Lillooet from Canada
or Mundari from India.

Thank you.
