Mining the Datasphere for Social Good

Good morning, TEDx Tulane. I'm thrilled to be part of this exciting exchange of ideas, and I am especially grateful that you want to learn more about what I believe is one of the keys to progress as we look to the future: computer code. Applying a computer programming language to data can spark important conversations around some of our biggest problems, and machine learning can help us harness big data to learn more about our society.

Without a doubt, we live in a world that is increasingly digitized and globalized, and the amount of raw data accessible to the general public is growing rapidly. In fact, the quantity is quite staggering. To understand it, let's use this analogy.

Consider the numbers 1 million, 1 billion, and 1 trillion and their relation to time. We know that 1 million seconds is equal to about 12 days. But how long do you think 1 billion seconds is? The answer is about 31.7 years. If we continue this example, we can also ask ourselves: how long is 1 trillion seconds? One trillion seconds equates to about 31,688 years.
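
To make that arithmetic concrete, here is a minimal sketch in R (the language discussed later in this talk) that reproduces the conversions using the round numbers quoted above.

```r
# Convert one million, one billion, and one trillion seconds into days and years.
seconds_per_day  <- 60 * 60 * 24
seconds_per_year <- seconds_per_day * 365.25

1e6  / seconds_per_day    # ~11.6 days, i.e. "about 12 days"
1e9  / seconds_per_year   # ~31.7 years
1e12 / seconds_per_year   # ~31,688 years
```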

But we hear these numbers all the time when we talk about economics, world population, and space. It is only when we think about them closely and in comparison to each other that their meaning and magnitude start to come into focus.

The same can be said for understanding how much raw data is available. The quantity is so immense that it can be hard to conceptualize, but we can use our understanding of one million, one billion, and one trillion from our recent example to think about how much data exists. The average photo you take with your smartphone is anywhere from one to five megabytes, and a thousand megabytes is one gigabyte.

We've all become accustomed to gigabytes. Before mobile plans became unlimited, you could get a plan of 10 gigabytes, or perhaps a family plan of 20 gigabytes, before being charged for additional data. Our phones usually have storage capacities of 32, 64, or 128 gigabytes, and so on. Recall that 1 trillion seconds equals about 32 years. Well, 1 trillion gigabytes is a zettabyte, and there are nearly 60 zettabytes of data in our world today.
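
The same ladder of units can be sketched in R; the three-megabyte photo below is simply a round number inside the one-to-five-megabyte range quoted above.

```r
# Walk the unit ladder from a single smartphone photo up to the global datasphere.
photo_mb      <- 3        # an average photo, somewhere in the 1-5 megabyte range
mb_per_gb     <- 1e3      # a thousand megabytes is one gigabyte
gb_per_zb     <- 1e12     # a trillion gigabytes is one zettabyte
datasphere_zb <- 59       # the figure cited in a moment

# Roughly how many average photos would it take to fill the global datasphere?
datasphere_zb * gb_per_zb * mb_per_gb / photo_mb   # on the order of 2e16 photos
```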

The market research company International Data Corporation recently published that 59 zettabytes exist in the global datasphere. The global datasphere is defined as the amount of data created, captured, and replicated in any given year across the world. That is an incredible amount of information that can be mined and analyzed for important connections and patterns.

But in the face of that sheer volume of information, how can we dig through the seemingly unassociated variables to find the connections and patterns that could be the foundations for positive change? Data mining tools applied through code are an effective way to analyze this data.

Today we will consider how we can capitalize on the information in the global datasphere for social good. My personal experience with data mining began when I had the opportunity to assist Dr. Cecilia Alcala in her research. Her work analyzes information gathered by the Caribbean Consortium for Environmental and Occupational Health for possible correlations between environmental factors and human health. Her sample population was from the Republic of Suriname, a country on the northeastern Atlantic coast of South America. The research included analyzing hundreds of urine samples from pregnant Surinamese women for concentrations of pesticide metabolites.

Metabolites are the trace compounds left behind after our bodies process substances. When we are exposed to pesticides, the compounds produced contain chemicals we can detect and identify.

These concentrations were coded, graphed, and visualized, and they gave us possible insight into whether pesticides were being used for agricultural purposes or residential use. They also told us the level of pesticides being used in general, and pointed to the primary reasons Surinamese women were being exposed to these harmful chemicals during their pregnancies.

These sorts of facts can be foundations for change. When data can be cleaned, mined, and visualized, it can catalyze change in public policy, environmental regulations, and workers' rights. Applying code to raw data gives us the ability to gauge the effectiveness of public policy.

There are a multitude of coding languages that can be used for such work. My personal experience is primarily in R, but languages such as Python, Java, and C++ also have high utility in the field of data mining.

In addition to revealing hidden connections between variables, two of the biggest benefits of applying code to raw data are visualization and access. In the area of visualization, code can help turn volumes of raw data into graphs and diagrams that make the information much easier to interpret. For instance, R has a variety of packages that can accomplish this. Packages are collections of functions and data sets that can be imported into the coding environment. Here are a few ways in which R has been used to create dynamic visuals that make data easier to understand.

All three of these graphs were created with the package ggplot2. It only took a few lines of code to generate them; graph 1 was generated with just three lines of code, since R is equipped with open-source packages that allow us to quickly and efficiently visualize data sets.
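
The graphs shown on stage aren't reproduced here, but a representative plot of the same flavor takes only a few lines of ggplot2. This minimal sketch uses R's built-in mtcars data set rather than the talk's data.

```r
library(ggplot2)                              # load the plotting package

ggplot(mtcars, aes(x = wt, y = mpg)) +        # map car weight and fuel economy to axes
  geom_point() +                              # draw one point per car
  labs(x = "Weight (1,000 lbs)", y = "Miles per gallon")
```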

Not only can packages be downloaded to create visuals, they can also be used to access information and easily import large data sets. For example, the package RNHANES allows the importation of all of the National Health and Nutrition Examination Survey data collected by the Centers for Disease Control, starting with just a single line of code that reads install.packages("RNHANES").

Take a moment to contemplate that: with just a few lines of code, a programmer can access studies that assess the health and nutritional status of thousands of adults and children in the United States, including demographic, socioeconomic, dietary, and health-related information. The potential for analysis is staggering.
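
As a sketch of that workflow, installing RNHANES and pulling a single NHANES table might look like the following. The file name and survey cycle are illustrative choices, not ones named in the talk, and the loading call follows the package's documented interface.

```r
install.packages("RNHANES")   # the single line that fetches the package
library(RNHANES)

# Load one NHANES table (blood pressure exams, 2013-2014 cycle) joined to demographics.
# The file name and cycle here are examples; see the package documentation for others.
bp <- nhanes_load_data("BPX_H", "2013-2014", demographics = TRUE)
head(bp)
```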

Now, let's circle back to the analysis of the pregnant women in Suriname. It's important to remember that those connections discovered in Suriname represent real people and real health impacts. But we don't need to look so far away for an example of how data can be used as a tool for change. It has also been put to use in our own backyard, so to speak, this time to impact public safety.

Certain populations are more susceptible to injury and death when there is a fire. Children, the elderly, and those with limited mobility are the most at risk during a fire in the home. Right here in New Orleans, data has been examined and visualized to help mitigate home fire casualties.

The City of New Orleans wondered how data could be mined to assist its most vulnerable citizens. In 2016, the New Orleans Office of Performance and Accountability developed a model to do just that, using data from the U.S. Census Bureau's American Housing Survey and American Community Survey.

The city identified variables that might correlate with whether a resident was likely to have a smoke alarm. These included the age of the structure, the length of tenancy, and whether there were additions to the property. These variables were then applied to census data and other historical records to identify high-risk factors and the likelihood of a home having no smoke detector.

The analysis of these variables resulted in a map of the city that highlighted where residents were at greater risk for home fire death.
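
The city's actual model isn't detailed in this talk, but the general idea, predicting the probability that a home lacks a smoke alarm from housing variables and then ranking areas by that risk, can be sketched with a logistic regression in R. Everything below, from the column names to the simulated values, is illustrative only.

```r
set.seed(1)
# Hypothetical housing records standing in for survey and census data.
homes <- data.frame(
  census_tract     = sample(paste0("tract_", 1:20), 500, replace = TRUE),
  structure_age    = sample(5:120, 500, replace = TRUE),   # age of the structure
  years_of_tenancy = sample(0:40, 500, replace = TRUE),    # length of tenancy
  has_addition     = rbinom(500, 1, 0.3)                   # additions to the property
)
# Simulate the outcome so the sketch runs end to end.
p <- plogis(-2 + 0.02 * homes$structure_age - 0.03 * homes$years_of_tenancy)
homes$no_smoke_alarm <- rbinom(500, 1, p)

# Logistic regression: probability that a home lacks a working smoke alarm.
fit <- glm(no_smoke_alarm ~ structure_age + years_of_tenancy + has_addition,
           data = homes, family = binomial())

# Rank tracts by average predicted risk to prioritize door-to-door outreach.
homes$risk <- predict(fit, type = "response")
head(sort(tapply(homes$risk, homes$census_tract, mean), decreasing = TRUE))
```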

Using that data, the city went door-to-door with free smoke alarms. That free service had always been available, but it was up to the resident to seek it out.

The combination of data, code, and civic duty allowed officials to bring potentially life-saving devices to those who may have needed them most. This geospatial model was so successful that it was adapted at the national level and replicated in cities such as New York and Chicago.

So far we've talked about how code and data can be combined to discover connections, visualize findings, and integrate data. This combination can also be used for machine learning. Machine learning is a branch of artificial intelligence that uses algorithms and statistical models to analyze data and draw inferences from the patterns within it. These systems improve with experience, essentially learning while doing.

There are many analogies that help clarify what machine learning accomplishes and how it works. The one that makes the most sense to me is to think of machine learning algorithms as math students. Imagine that they've been given hundreds of homework problems, but instead of just being asked to solve each one and move on, they are also asked to find the patterns in the solutions so that they can do the work more efficiently.

The intersection of big data, code, and machine learning has proved to be useful in the hospital setting. In-hospital cardiac arrests typically trigger what is called a code blue, an urgent medical emergency usually involving respiratory distress. Currently, code blues have about a 20 percent survival rate. In an attempt to increase positive outcomes, researchers were able to write code for a machine learning system to predict a code blue hours before it occurs. The model was trained with data from electronic medical records and vital signs recorded during the course of a hospital stay.

The data used to train the algorithm included 29 different vitals and lab results. Some examples of vital signs used were respiratory rate, systolic and diastolic blood pressures, pulse oximetry, and temperature, and the lab tests included hemoglobin, platelet count, hematocrit, creatinine, and sodium levels. The machine learning algorithm was able to predict a potential code blue as many as four hours before a potential occurrence. The essence of this model is that we can use our data on vital signs in combination with machine learning to improve code blue outcomes and ultimately improve patient outcomes.
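
One common way to frame that kind of early warning is to label each observation by whether a code blue occurs within the next four hours and fit a classifier to the vitals. The sketch below uses made-up column names, simulated numbers, and a plain logistic regression, none of which come from the published model.

```r
set.seed(2)
# Hypothetical hourly observations; hours_to_event is NA when no code blue ever occurs.
obs <- data.frame(
  hours_to_event   = sample(c(1:48, rep(NA, 30)), 1000, replace = TRUE),
  respiratory_rate = rnorm(1000, 18, 4),
  systolic_bp      = rnorm(1000, 120, 20),
  pulse_ox         = rnorm(1000, 96, 3)
)

# Binary label: does a code blue occur within the next four hours of this observation?
obs$code_blue_soon <- !is.na(obs$hours_to_event) & obs$hours_to_event <= 4

# A simple stand-in classifier on a few vitals (the real model used 29 inputs).
fit <- glm(code_blue_soon ~ respiratory_rate + systolic_bp + pulse_ox,
           data = obs, family = binomial())
summary(fit)$coefficients
```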

Machine learning has also been used to expose potential corruption and inequity. Consider a recent study using machine learning to help vulnerable tenants in New York City. To help keep housing affordable, the City of New York uses rent stabilization policies that restrict the rate at which the rent of a unit may be increased annually, but there are cases in which landlords will attempt to evade these laws.

A machine learning model was implemented to identify units that landlords might target for harassment. The study framed tenant harassment risk prediction as a binary classification problem: trying to predict whether there would be any cases of harassment within a given time frame of one month. Data was combined from different sources in an effort to identify variables that signal possible landlord harassment. Some of the variables chosen included whether or not a knock at the door was answered, the income of the tenants, the age distributions of the tenants, the renovation history of the building, and the building's code violations.
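
Framed that way, the setup can be sketched as a table of unit-month records with a yes/no outcome and a simple classifier. The column names, the simulated values, and the decision tree below are all illustrative, not the study's actual pipeline.

```r
library(rpart)   # a basic decision tree, standing in for the study's model
set.seed(3)

# Hypothetical unit-month records; every field and value is illustrative only.
units <- data.frame(
  door_answered     = rbinom(800, 1, 0.6),      # was a knock at the door answered?
  median_income     = rnorm(800, 45000, 15000), # income of the tenants
  share_over_65     = runif(800),               # age distribution of the tenants
  recent_renovation = rbinom(800, 1, 0.2),      # renovation history of the building
  open_violations   = rpois(800, 2)             # code violations of the building
)
# Binary target: any harassment case reported for this unit in the next month.
units$harassment_next_month <- factor(rbinom(800, 1, 0.1))

tree <- rpart(harassment_next_month ~ ., data = units, method = "class")
printcp(tree)   # inspect the fitted splits and cross-validated error
```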

Using this model, the city's Tenant Support Unit was able to identify at-risk units with 59 percent more accuracy. Like our subjects in Suriname, making these connections results in real impacts on the human experience: those occupants are potentially more housing secure because of computer code and the public servants who employed it.

Part of the reason such an enormous amount of data exists is something we call data exhaust. Data exhaust can be thought of as the trail of data we leave behind as we use the internet. Every time you like an Instagram picture, download a PDF, or reply to a tweet, you're creating a piece of data. This data exhaust can be harnessed for social good in the fight against cyberbullying and cyber aggression.

In 2018, a machine learning system was trained with millions of tweets with the goal of distinguishing users who are bullies and aggressors. Tweets were analyzed on a variety of factors, such as crowdsourced identification of abusive keywords, the use of curse words, and user- and account-based attributes such as number of tweets, followers, and account age.

The use of hate-related words seems like easy evidence of cyberbullying behavior; however, the machine learning algorithm also gave insight into the semantic patterns of cyberbullies. It was discovered that users classified as cyberbullies use a lower number of adjectives and adverbs compared to normal users, and that aggressive users post tweets with a higher number of words per tweet compared to normal users.
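
A small sketch of the kind of per-tweet features described here, computed with base R on two made-up tweets and an illustrative keyword list (not the study's data), looks like this. Features like these, alongside account attributes, are what feed such a classifier.

```r
# Two illustrative tweets and an illustrative list of abusive keywords.
tweets <- c("you are all pathetic losers honestly",
            "lovely quiet morning for a long run")
abusive_keywords <- c("pathetic", "losers")

words <- strsplit(tweets, "\\s+")                                    # tokenize on whitespace
words_per_tweet <- sapply(words, length)                             # tweet length in words
abusive_hits    <- sapply(words, function(w) sum(w %in% abusive_keywords))

# Account-level attributes (tweet count, followers, account age) would be joined on.
data.frame(tweet = tweets, words_per_tweet, abusive_hits)
```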

The algorithm was ultimately able to identify cyberbullies and cyber aggressors with over 90 percent accuracy. Machine learning may prove to be a critical component in the effort to make the cybersphere a more respectful and safer space. While protecting free speech should always be a core value in our democratic society, machine learning may help create a system that can identify users who violate important terms of use and potentially suspend the accounts of frequent offenders.

Often, when we hear references to data, it is in the context of concern: our rights to privacy being overlooked or, worse, purposefully disregarded. Are companies using available data to manipulate consumers or twist systems in their favor? These are certainly valid concerns, but we need to make sure we also recognize that the datasphere can be mined for good.

Computer code and machine learning can be used to find connections that help us identify problems and think critically about solutions. Public safety, health care, the environment, internet safety: these and so many more areas are all on the table and ready for analysis.

So the key is this: as the global datasphere continues to grow exponentially, let's think about ways we can harness this data to learn more about our society and, ultimately, code solutions to some of humanity's greatest challenges. The connections are there, waiting to be discovered, and through code we have the power to find them. Thank you.