How bad data keeps us from good AI
Mainak Mazumdar

Transcriber: Leslie Gauthier
Reviewer: Joanna Pietrulewicz

AI could add 16 trillion dollars
to the global economy

in the next 10 years.

This economy is not going
to be built by billions of people

or millions of factories,

but by computers and algorithms.

We have already seen
amazing benefits of AI

in simplifying tasks,

bringing efficiencies

and improving our lives.

However, when it comes to fair
and equitable policy decision-making,

AI has not lived up to its promise.

AI is becoming a gatekeeper
to the economy,

deciding who gets a job

and who gets access to a loan.

AI is only reinforcing
and accelerating our bias

at speed and scale

with societal implications.

So, is AI failing us?

Are we designing these algorithms
to deliver biased and wrong decisions?

As a data scientist, I’m here to tell you,

it’s not the algorithm,

but the biased data

that’s responsible for these decisions.

To make AI possible
for humanity and society,

we need an urgent reset.

Instead of algorithms,

we need to focus on the data.

We’re spending time and money to scale AI

at the expense of designing and collecting
high-quality and contextual data.

We need to stop relying on the data,
or rather the biased data, that we already have,

and focus on three things:

data infrastructure,

data quality

and data literacy.

In June of this year,

we saw embarrassing bias
in the Duke University AI model

called PULSE,

which enhanced a blurry image

into a recognizable
photograph of a person.

This algorithm incorrectly enhanced the image
of a nonwhite person into a Caucasian image.

African-American images
were underrepresented in the training set,

leading to wrong decisions
and predictions.

This is probably not the first time

you have seen an AI misidentify
a Black person’s image.

Despite an improved AI methodology,

the underrepresentation
of racial and ethnic populations

still left us with biased results.
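To see why underrepresentation matters, here is a minimal sketch with invented numbers (not figures from the PULSE study): when one group dominates the data, the overall error rate can look acceptable while the error rate for the underrepresented group is far worse.

```python
# Sketch with invented counts and error rates: an overall average
# can hide a much higher error rate for a small group.

n_majority, n_minority = 9000, 1000        # imbalanced data: 90% vs 10%
err_majority, err_minority = 0.05, 0.40    # assumed per-group error rates

overall_error = (n_majority * err_majority + n_minority * err_minority) / (n_majority + n_minority)

print(f"Overall error:  {overall_error:.1%}")   # 8.5% -- looks acceptable
print(f"Majority error: {err_majority:.1%}")    # 5.0%
print(f"Minority error: {err_minority:.1%}")    # 40.0% -- the bias the average hides
```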

This research is academic.

However, not all data biases are academic.

Biases have real consequences.

Take the 2020 US Census.

The census is the foundation

for many social
and economic policy decisions,

therefore the census is required
to count 100 percent of the population

in the United States.

However, with the pandemic

and the politics
of the citizenship question,

undercounting of minorities
is a real possibility.

I expect significant undercounting
of minority groups

who are hard to locate, contact, persuade
and interview for the census.

Undercounting will introduce bias

and erode the quality
of our data infrastructure.

Let’s look at undercounts
in the 2010 census.

16 million people were omitted
from the final counts.

This is as large as the total population

of Arizona, Arkansas, Oklahoma
and Iowa put together for that year.

We have also seen about a million kids
under the age of five undercounted

in the 2010 Census.

Now, undercounting of minorities

is common in other national censuses,

as minorities can be harder to reach,

they may be mistrustful of the government,

or they may live in an area
of political unrest.

For example,

the Australian Census in 2016

undercounted Aboriginal
and Torres Strait Islander populations

by about 17.5 percent.

We estimate undercounting in 2020

to be much higher than in 2010,

and the implications
of this bias can be massive.

Let’s look at the implications
of the census data.

The census is the most trusted, open
and publicly available rich data

on population composition
and characteristics.

While businesses
have proprietary information

on consumers,

the Census Bureau
reports definitive, public counts

on age, gender, ethnicity,

race, employment, family status,

as well as geographic distribution,

which are the foundation
of the population data infrastructure.

When minorities are undercounted,

AI models supporting
public transportation,

housing, health care

and insurance

are likely to overlook the communities
that require these services the most.

The first step to improving results

is to make that database representative

of age, gender, ethnicity and race

per census data.
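One simplified way to picture that step, sketched here with invented group shares (real post-stratification uses many cross-classified cells from published census tables): compare a dataset's demographic shares with the census shares and reweight records toward the census benchmark.

```python
# Hypothetical post-stratification sketch: reweight a dataset toward census benchmarks.
# Group labels and shares are invented for illustration.

dataset_share = {"group_a": 0.72, "group_b": 0.18, "group_c": 0.10}  # shares in our data
census_share  = {"group_a": 0.60, "group_b": 0.25, "group_c": 0.15}  # shares per the census

# Records from underrepresented groups get a weight above 1, overrepresented below 1.
weights = {g: census_share[g] / dataset_share[g] for g in census_share}
print(weights)  # {'group_a': ~0.83, 'group_b': ~1.39, 'group_c': 1.5}
```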

Since the census is so important,

we have to make every effort
to count 100 percent.

Investing in this data's
quality and accuracy

is essential to making AI possible,

not only for the few and privileged,

but for everyone in society.

Most AI systems use the data
that’s already available

or collected for some other purposes

because it’s convenient and cheap.

Yet data quality is a discipline
that requires commitment –

real commitment.

This attention to the definition,

data collection
and measurement of bias

is not only underappreciated –

in the world of speed,
scale and convenience,

it’s often ignored.

As part of the Nielsen data science team,

I went on field visits to collect data,

visiting retail stores
outside Shanghai and Bangalore.

The goal of that visit was to measure
retail sales from those stores.

We drove miles outside the city,

found these small stores –

informal, hard to reach.

And you may be wondering –

why are we interested
in these specific stores?

We could have selected a store in the city

where the electronic data could be
easily integrated into a data pipeline –

cheap, convenient and easy.

Why are we so obsessed with the quality

and accuracy of the data
from these stores?

The answer is simple:

because the data
from these rural stores matter.

According to the International
Labour Organization,

40 percent of Chinese

and 65 percent of Indians
live in rural areas.

Imagine the bias in decisions

when 65 percent of consumption
in India is excluded from the models,

meaning the decision will favor
the urban over the rural.
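To make that concrete, here is a minimal sketch with made-up spending figures (not real Nielsen or ILO numbers) showing how dropping rural consumers skews a national estimate toward urban behavior.

```python
# Illustration with invented figures: excluding rural consumers biases a national estimate.

rural_share, urban_share = 0.65, 0.35           # ILO: ~65% of Indians live in rural areas
avg_rural_spend, avg_urban_spend = 40.0, 120.0  # assumed average spend per person

true_national_avg = rural_share * avg_rural_spend + urban_share * avg_urban_spend
urban_only_avg = avg_urban_spend                # what a model sees if rural stores are excluded

print(f"True national average: {true_national_avg:.1f}")  # 68.0
print(f"Urban-only estimate:   {urban_only_avg:.1f}")     # 120.0
print(f"Overstatement:         {urban_only_avg / true_national_avg - 1:.0%}")  # ~76%
```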

Without this rural-urban context

and signals on livelihood,
lifestyle, economy and values,

retail brands will make wrong investments
on pricing, advertising and marketing.

Or the urban bias will lead
to wrong rural policy decisions

with regard to health
and other investments.

Wrong decisions are not the problem
with the AI algorithm.

It’s a problem of the data

that excludes the very areas intended
to be measured in the first place.

Data in context is the priority,

not the algorithms.

Let’s look at another example.

I visited these remote
trailer park homes in the state of Oregon

and New York City apartments

to invite these homes
to participate in Nielsen panels.

Panels are statistically
representative samples of homes

that we invite to participate
in the measurement

over a period of time.
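As a rough illustration of the idea behind such panels, here is a minimal sketch of proportionate stratified sampling over an invented household table (the strata and counts are assumptions; real panel recruitment is far more involved).

```python
# Sketch of proportionate stratified sampling, the idea behind a representative panel.
# The household table and strata below are invented for illustration.
import pandas as pd

households = pd.DataFrame({
    "household_id": range(1, 10001),
    "stratum": (["urban_cable"] * 5500 +
                ["urban_over_the_air"] * 1000 +
                ["rural_cable"] * 2000 +
                ["rural_over_the_air"] * 1500),
})

panel_size = 500

# Sample each stratum in proportion to its share of all households,
# so hard-to-reach groups are represented instead of dropped.
panel = (households
         .groupby("stratum", group_keys=False)
         .apply(lambda g: g.sample(n=round(panel_size * len(g) / len(households)),
                                   random_state=42)))

print(panel["stratum"].value_counts(normalize=True))  # mirrors the population shares
```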

Our mission to include everybody
in the measurement

led us to collect data
from these Hispanic and African-American homes

that use an antenna
for over-the-air TV reception.

Per Nielsen data,

these homes constitute
15 percent of US households,

which is about 45 million people.

Commitment and focus on quality
means we made every effort

to collect information

from this hard-to-reach 15 percent
of households.

Why does it matter?

This is a sizeable group

that’s very, very important
to the marketers, brands,

as well as the media companies.

Without the data,

the marketers and brands and their models

would not be able to reach these folks

or show ads to these
very, very important minority populations.

And without the ad revenue,

broadcasters such as
Telemundo or Univision

would not be able to deliver free content,

including news media,

which is so foundational to our democracy.

This data is essential
for businesses and society.

Our once-in-a-lifetime opportunity
to reduce human bias in AI

starts with the data.

Instead of racing to build new algorithms,

my mission is to build
a better data infrastructure

that makes ethical AI possible.

I hope you will join me
in my mission as well.

Thank you.
