How bad data keeps us from good AI
Mainak Mazumdar

Transcriber: Leslie Gauthier
Reviewer: Joanna Pietrulewicz

AI could add 16 trillion dollars
to the global economy

in the next 10 years.

This economy is not going
to be built by billions of people

or millions of factories,

but by computers and algorithms.

We have already seen
amazing benefits of AI

in simplifying tasks,

bringing efficiencies

and improving our lives.

However, when it comes to fair
and equitable policy decision-making,

AI has not lived up to its promise.

AI is becoming a gatekeeper
to the economy,

deciding who gets a job

and who gets access to a loan.

AI is only reinforcing
and accelerating our bias

at speed and scale

with societal implications.

So, is AI failing us?

Are we designing these algorithms
to deliver biased and wrong decisions?

As a data scientist, I’m here to tell you,

it’s not the algorithm,

but the biased data

that’s responsible for these decisions.

To make AI possible
for humanity and society,

we need an urgent reset.

Instead of algorithms,

we need to focus on the data.

We’re spending time and money to scale AI

at the expense of designing and collecting
high-quality and contextual data.

We need to stop relying on the biased data
that we already have,

and focus on three things:

data infrastructure,

data quality

and data literacy.

In June of this year,

we saw embarrassing bias
in the Duke University AI model

called PULSE,

which enhanced a blurry image

into a recognizable
photograph of a person.

This algorithm incorrectly enhanced
a nonwhite face into a Caucasian one.

African-American images
were underrepresented in the training set,

leading to wrong decisions
and predictions.

This is probably not the first time

you have seen an AI misidentify
a Black person’s image.

Despite an improved AI methodology,

the underrepresentation
of racial and ethnic populations

still left us with biased results.
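
One practical way to catch this kind of underrepresentation is to audit the composition of the training set against a reference population before training. Here is a minimal sketch in Python; the labels, counts and reference shares are hypothetical, made up for illustration, and not the actual PULSE training data.

    # Audit a labeled training set against reference population shares.
    # Labels, counts and reference shares are hypothetical, for illustration only.
    from collections import Counter

    training_labels = ["white"] * 800 + ["asian"] * 100 + ["black"] * 50 + ["hispanic"] * 50

    # Hypothetical reference shares, e.g. drawn from census-style benchmarks.
    reference_share = {"white": 0.60, "black": 0.13, "asian": 0.06, "hispanic": 0.19}

    counts = Counter(training_labels)
    n = len(training_labels)

    for group, ref in reference_share.items():
        observed = counts.get(group, 0) / n
        flag = "UNDERREPRESENTED" if observed < 0.5 * ref else "ok"
        print(f"{group}: {observed:.1%} of training data vs {ref:.0%} of population -> {flag}")

A check like this does not remove bias by itself, but it makes the gap visible before the model bakes it in.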

This research is academic;

however, not all data biases are academic.

Biases have real consequences.

Take the 2020 US Census.

The census is the foundation

for many social
and economic policy decisions,

therefore the census is required
to count 100 percent of the population

in the United States.

However, with the pandemic

and the politics
of the citizenship question,

undercounting of minorities
is a real possibility.

I expect significant undercounting
of minority groups

who are hard to locate, contact, persuade
and interview for the census.

Undercounting will introduce bias

and erode the quality
of our data infrastructure.

Let’s look at undercounts
in the 2010 census.

16 million people were omitted
in the final counts.

This is as large as the total population

of Arizona, Arkansas, Oklahoma
and Iowa put together for that year.

We have also seen about a million kids
under the age of five undercounted

in the 2010 Census.

Now, undercounting of minorities

is common in other national censuses,

as minorities can be harder to reach,

mistrustful of the government

or living in areas of political unrest.

For example,

the Australian Census in 2016

undercounted Aboriginal
and Torres Strait Islander populations

by about 17.5 percent.

We estimate undercounting in 2020

to be much higher than in 2010,

and the implications
of this bias can be massive.

Let’s look at the implications
of the census data.

The census is the most trusted, open
and publicly available source of rich data

on population composition
and characteristics.

While businesses
have proprietary information

on consumers,

the Census Bureau
reports definitive, public counts

on age, gender, ethnicity,

race, employment, family status,

as well as geographic distribution,

which are the foundation
of the population data infrastructure.

When minorities are undercounted,

AI models supporting
public transportation,

housing, health care

and insurance

are likely to overlook the communities
that require these services the most.

The first step to improving results

is to make that database representative

of age, gender, ethnicity and race,

per census data.
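
One common way to do this is post-stratification weighting: each record is weighted by the ratio of its group's census share to its share in the dataset, so underrepresented groups count proportionally more. Below is a minimal sketch, with hypothetical demographic cells and made-up shares rather than real census figures.

    # Minimal post-stratification sketch; cells and shares are hypothetical.
    from collections import Counter

    sample = [
        {"id": 1, "cell": "18-34_female_hispanic"},
        {"id": 2, "cell": "18-34_female_hispanic"},
        {"id": 3, "cell": "35-64_male_white"},
        {"id": 4, "cell": "35-64_male_white"},
        {"id": 5, "cell": "65+_female_black"},
    ]

    # Hypothetical population shares per cell, as census tables would provide.
    census_share = {
        "18-34_female_hispanic": 0.30,
        "35-64_male_white": 0.45,
        "65+_female_black": 0.25,
    }

    counts = Counter(record["cell"] for record in sample)
    n = len(sample)

    # Weight = population share / sample share, so undercounted cells weigh more.
    for record in sample:
        sample_share = counts[record["cell"]] / n
        record["weight"] = census_share[record["cell"]] / sample_share
        print(record["id"], record["cell"], round(record["weight"], 2))

Weighting can only stretch the data we have; it cannot recover groups the census missed entirely.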

Since the census is so important,

we have to make every effort
to count 100 percent.

Investing in this data
quality and accuracy

is essential to making AI possible,

not only for the few and privileged,

but for everyone in society.

Most AI systems use the data
that’s already available

or collected for some other purposes

because it’s convenient and cheap.

Yet data quality is a discipline
that requires commitment –

real commitment.

This attention to definitions,

data collection
and the measurement of bias

is not only underappreciated –

in the world of speed,
scale and convenience,

it’s often ignored.

As part of the Nielsen data science team,

I went on field visits to collect data,

visiting retail stores
outside Shanghai and Bangalore.

The goal of that visit was to measure
retail sales from those stores.

We drove miles outside the city,

found these small stores –

informal, hard to reach.

And you may be wondering –

why are we interested
in these specific stores?

We could have selected a store in the city

where the electronic data could be
easily integrated into a data pipeline –

cheap, convenient and easy.

Why are we so obsessed with the quality

and accuracy of the data
from these stores?

The answer is simple:

because the data
from these rural stores matter.

According to the International
Labour Organization,

40 percent of Chinese

and 65 percent of Indians
live in rural areas.

Imagine the bias in decisions

when 65 percent of consumption
in India is excluded from the models,

meaning the decisions will favor
the urban over the rural.

Without this rural-urban context

and signals on livelihood,
lifestyle, economy and values,

retail brands will make wrong investments
in pricing, advertising and marketing.

Or the urban bias will lead
to wrong rural policy decisions

with regard to health
and other investments.

These wrong decisions are not a problem
with the AI algorithm.

They're a problem of data

that excludes the very areas
it was meant to measure in the first place.

Data in context is the priority,

not the algorithms.

Let’s look at another example.

I visited remote trailer park homes
in the state of Oregon

and apartments in New York City

to invite these households
to participate in Nielsen panels.

Panels are statistically
representative samples of homes

that we invite to participate
in the measurement

over a period of time.
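
Making a panel representative usually means stratified random sampling: the frame of homes is split into strata, and each stratum gets panel slots in proportion to its size. A minimal sketch, with hypothetical strata and household counts rather than Nielsen's actual design:

    # Minimal stratified sampling sketch; strata and counts are hypothetical.
    import random

    random.seed(0)

    # Hypothetical sampling frame: household IDs grouped by stratum.
    strata = {
        "urban_cable": list(range(0, 7000)),
        "urban_over_the_air": list(range(7000, 8500)),
        "rural_over_the_air": list(range(8500, 10000)),
    }

    panel_size = 100
    total_homes = sum(len(homes) for homes in strata.values())

    panel = []
    for name, homes in strata.items():
        # Allocate panel slots proportionally to the stratum's share of homes.
        slots = round(panel_size * len(homes) / total_homes)
        panel.extend(random.sample(homes, slots))

    print(len(panel), "homes selected across", len(strata), "strata")

Proportional allocation keeps hard-to-reach strata from being crowded out by the easy-to-reach majority.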

Our mission to include everybody
in the measurement

led us to collect data
from these Hispanic and African-American homes

that use over-the-air
TV reception with an antenna.

Per Nielsen data,

these homes constitute
15 percent of US households,

which is about 45 million people.

Commitment and focus on quality
meant we made every effort

to collect information

from this 15 percent –
these hard-to-reach groups.

Why does it matter?

This is a sizeable group

that’s very, very important
to the marketers, brands,

as well as the media companies.

Without the data,

marketers, brands and their models

would not be able to reach these folks

or show ads to these
very important minority populations.

And without the ad revenue,

broadcasters such as
Telemundo or Univision

would not be able to deliver free content,

including news media,

which is so foundational to our democracy.

This data is essential
for businesses and society.

Our once-in-a-lifetime opportunity
to reduce human bias in AI

starts with the data.

Instead of racing to build new algorithms,

my mission is to build
a better data infrastructure

that makes ethical AI possible.

I hope you will join me
in my mission as well.

Thank you.