Mining the Datasphere for Social Good
Good morning, TEDx Tulane. I'm thrilled to be part of this exciting exchange of ideas, and I am especially grateful that you want to learn more about what I believe is one of the keys to progress as we look to the future: computer code. Applying computer programming languages to data can spark important conversations around some of our biggest problems, and machine learning can help us harness big data to learn more about our society.
Without a doubt, we live in a world that is increasingly digitized and globalized, and the amount of raw data that is accessible to the general public is rapidly growing. In fact, the quantity is quite staggering. To understand it, let's use an analogy: consider the numbers one million, one billion, and one trillion and their relation to time. We know that one million seconds is equal to about 12 days. But how long do you think one billion seconds is? The answer is about 31.7 years. If we continue this example, we can also ask ourselves: how long is one trillion seconds? One trillion seconds equates to about 31,688 years.
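These conversions are easy to check; a few lines of R reproduce them:

```r
seconds_per_day  <- 60 * 60 * 24
seconds_per_year <- seconds_per_day * 365.25

1e6  / seconds_per_day    # one million seconds  is about 11.6 days
1e9  / seconds_per_year   # one billion seconds  is about 31.7 years
1e12 / seconds_per_year   # one trillion seconds is about 31,688 years
```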
But we hear these numbers all the time when we talk about economics, world population, and space. It is only when we think about them closely, and in comparison to each other, that their meaning and magnitude start to come into focus. The same can be said for understanding how much raw data is available: the quantity is so immense that it can be hard to conceptualize. We can use our understanding of one million, one billion, and one trillion from our recent example to think about how much data exists. The average photo you take with your smartphone is anywhere from one to five megabytes, and a thousand megabytes is one gigabyte.
We've all become accustomed to gigabytes. Before mobile plans became unlimited, you could get a plan of 10 gigabytes, or perhaps a family plan of 20 gigabytes, before being charged for additional data. Our phones usually have storage capacities of 32, 64, or 128 gigabytes, and so on. Recall that one trillion seconds is nearly 32,000 years. Well, one trillion gigabytes is a zettabyte, and there are nearly 60 zettabytes of data in our world today.
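To connect those units back to everyday terms, a few more lines of R do the conversion, using the rough five-megabyte photo size from a moment ago:

```r
gb_per_zettabyte <- 1e12                # a zettabyte is a trillion gigabytes
mb_per_photo     <- 5                   # rough size of one smartphone photo
photos_per_gb    <- 1000 / mb_per_photo

59 * gb_per_zettabyte * photos_per_gb   # rough photo count 59 zettabytes could hold
```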
The market research company International Data Corporation recently published that 59 zettabytes exist in the global datasphere. The global datasphere is defined as the amount of data created, captured, and replicated in any given year across the world. That is an incredible amount of information that can be mined and analyzed for important connections and patterns. But in the face of that sheer volume of information, how can we dig through the seemingly unassociated variables to find the connections and patterns that could be the foundations for positive change? Data mining tools applied through code are an effective way to analyze this data. Today we will consider how we can capitalize on the information in the global datasphere for social good.
My personal experience with data mining began when I had the opportunity to assist Dr. Cecilia Alcala in her research. Her work analyzes information gathered by the Caribbean Consortium for Environmental and Occupational Health for possible correlations between environmental factors and human health. Her sample population was from the Republic of Suriname, a country on the northeastern Atlantic coast of South America. The research included analyzing hundreds of urine samples from pregnant Surinamese women for concentrations of pesticide metabolites. Metabolites are the trace compounds left behind after our bodies process substances; when we are exposed to pesticides, the compounds produced contain chemicals we can detect and identify.
These concentrations were coded, graphed, and visualized, and they gave us insight into whether pesticides were being used for agricultural or residential purposes. They also told us about the overall level of pesticide use and pointed to the primary reasons Surinamese women were being exposed to these harmful chemicals during their pregnancies. These sorts of facts can be foundations for change. When data can be cleaned, mined, and visualized, it can catalyze change in public policy, environmental regulations, and workers' rights. Applying code to raw data gives us the ability to gauge the effectiveness of public policy.
There are a multitude of coding languages that can be used for such work. My personal experience is primarily in R, but languages such as Python, Java, and C++ also have high utility in the field of data mining. In addition to revealing hidden connections between variables, two of the biggest benefits of applying code to raw data are visualization and access. In the area of visualization, code can help turn volumes of raw data into graphs and diagrams that make the information much easier to interpret. For instance, R has a variety of packages that can accomplish this.
Packages are collections of functions and data sets that can be imported into the coding environment. Here are a few ways in which R has been used to create dynamic visuals that make data easier to understand. All three of these graphs were created with the package ggplot2, and it only took a few lines of code to generate each one; graph 1 was generated with just three lines of code, since R is equipped with open-source packages that allow us to quickly and efficiently visualize data sets.
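As a minimal sketch of what a three-line ggplot2 graph can look like (using R's built-in mtcars data set rather than the data behind the graphs shown on screen):

```r
library(ggplot2)

# A complete scatter plot in three lines of plotting code
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight (1,000 lbs)", y = "Miles per gallon")
```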
Not only can packages be downloaded to create visuals, they can also be used to access information and easily import large data sets. For example, the package RNHANES allows the importation of National Health and Nutrition Examination Survey data, collected by the Centers for Disease Control and Prevention, starting with just a single line of code that reads install.packages("RNHANES").
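Installing the package is the first step; pulling down a survey file then takes one more call. A minimal sketch, using the package's nhanes_load_data() helper with the 2011-2012 demographics file DEMO_G as one example:

```r
install.packages("RNHANES")   # one-time install from CRAN
library(RNHANES)

# Load one NHANES file for one survey cycle
# ("DEMO_G" is the 2011-2012 demographics file; other files work the same way)
demo <- nhanes_load_data("DEMO_G", "2011-2012")
head(demo)
```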
Take a moment to contemplate that: with just a few lines of code, a programmer can access studies that assess the health and nutritional status of thousands of adults and children in the United States. The data includes demographic, socioeconomic, dietary, and health-related information, and the potential for analysis is staggering. Now let's circle back to the analysis of the pregnant women in Suriname. It's important to remember that the connections discovered in Suriname represent real people and real health impacts. But we don't need to look so far away for an example of how data can be used as a tool for change. It's also been put to use in our own backyard, so to speak, this time to impact public safety.
Certain populations are more susceptible to injury and death when there is a fire; children, the elderly, and those with limited mobility are the most at risk during a fire in the home. Right here in New Orleans, data has been examined and visualized to help mitigate home fire casualties. The city of New Orleans wondered how data could be mined to assist its most vulnerable citizens. In 2016, the New Orleans Office of Performance and Accountability developed a model using data from the U.S. Census Bureau's American Housing Survey and American Community Survey to do just that. The city identified variables that might correlate with whether a resident was likely to have a smoke alarm. These included the age of the structure, the length of tenancy, and whether there were additions to the property. These variables were then applied to census data and other historical records to identify high-risk factors and the likelihood that a home had no smoke detector. The analysis of these variables resulted in a map of the city that highlighted where residents were at greater risk of home fire death.
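The city's real model was more sophisticated than this, but the core idea, turning a few housing variables into a ranked risk estimate for each area, can be sketched in R on made-up data; every column and value here is invented for illustration:

```r
# Hypothetical block-level data; none of these values come from the city's model
blocks <- data.frame(
  block_id      = 1:5,
  structure_age = c(95, 15, 60, 40, 110),        # years since construction
  long_tenancy  = c(TRUE, FALSE, TRUE, TRUE, FALSE)
)

# Naive score: older structures and long tenancies weigh toward a missing alarm
blocks$risk <- as.numeric(scale(blocks$structure_age)) +
  ifelse(blocks$long_tenancy, 0.5, 0)

blocks[order(-blocks$risk), ]   # highest-risk blocks first
```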
Using that data, the city went door-to-door with free smoke alarms. That free service had always been available, but it was up to the resident to seek it out. The combination of data, code, and civic duty allowed officials to bring potentially life-saving devices to those who may have needed them most. This geospatial model was so successful it was adopted at the national level and replicated in cities such as New York and Chicago.
So far we've talked about how code and data can be combined to discover connections, visualize findings, and integrate data. This combination can also be used for machine learning. Machine learning is a branch of artificial intelligence that uses algorithms and statistical models to analyze data and draw inferences from the patterns within it. These systems improve with experience, essentially learning while doing. There are many analogies that help clarify what machine learning accomplishes and how it works. The one that makes the most sense to me is to think of machine learning algorithms as math students. Imagine that they've been given hundreds of homework problems, but instead of just being asked to solve for the answer to each and move on, they are also asked to find the patterns in the solutions so that they can do the work more efficiently.
The intersection of big data, code, and machine learning has proved to be useful in the hospital setting. In-hospital cardiac arrests typically trigger what is called a code blue, an urgent medical emergency usually involving respiratory distress. Currently, code blues have a 20 percent survival rate. In an attempt to increase positive outcomes, researchers were able to write code for a machine learning system to predict a code blue hours before it occurs. The model was trained with data from electronic medical records and vital signs gathered during the course of a hospital stay. The data used to train the algorithm included 29 different vitals and lab results. Some examples of vital signs used were respiratory rate, systolic and diastolic blood pressures, pulse oximetry, and temperature, and the lab tests included hemoglobin, platelet count, hematocrit, creatinine, and sodium levels. The machine learning algorithm was able to predict a potential code blue as many as four hours before a potential occurrence. The essence of this model is that we can use our data on vital signs, in combination with machine learning, to improve code blue outcomes and ultimately improve patient outcomes.
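The study's exact model isn't reproduced here, but the shape of the approach, a classifier that maps vitals to the probability of an event, can be sketched in R on simulated data. Every column name and coefficient below is invented, and logistic regression is just one simple stand-in for the researchers' algorithm:

```r
set.seed(1)
n <- 500

# Simulated patient-hours; three illustrative columns, not the study's 29 features
vitals <- data.frame(
  resp_rate = rnorm(n, mean = 18, sd = 4),    # breaths per minute
  spo2      = rnorm(n, mean = 96, sd = 3),    # pulse oximetry, percent
  sbp       = rnorm(n, mean = 120, sd = 15)   # systolic blood pressure
)

# Simulated outcome: faster breathing and lower oxygen raise the risk
vitals$code_blue <- rbinom(n, 1, plogis(-2 + 0.2 * (vitals$resp_rate - 18) -
                                          0.3 * (vitals$spo2 - 96)))

# Fit the classifier and score each patient-hour with a risk probability
model <- glm(code_blue ~ resp_rate + spo2 + sbp, data = vitals, family = binomial)
risk  <- predict(model, type = "response")
summary(risk)
```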
Machine learning has also been used to expose potential corruption and inequity. Consider a recent study using machine learning to help vulnerable tenants in New York City. To help keep housing affordable, the city of New York uses rent stabilization policies to restrict the rate at which the rent of units may be increased annually. There are cases in which landlords will attempt to evade these laws, so a machine learning model was implemented to identify units that landlords might target for harassment. The study framed tenant harassment risk prediction as a binary classification problem: trying to predict whether there would be any cases of harassment within a given one-month time frame. Data was combined from different sources in an effort to identify variables that signal possible landlord harassment.
Some of the variables chosen included whether or not a knock was answered at the door, the income of the tenants, the age distributions of the tenants, the renovation history of the building, and the building's code violations. Using this model, the city's Tenant Support Unit was able to identify at-risk units with 59 percent greater accuracy.
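The binary classification framing can likewise be sketched in R, here with a simple decision tree on invented building-level data; none of these columns or values come from the actual study:

```r
library(rpart)   # decision trees, shipped with standard R distributions
set.seed(42)
n <- 400

# Hypothetical monthly observations of rent-stabilized units
units <- data.frame(
  door_answered   = rbinom(n, 1, 0.6),      # was a knock answered this month?
  code_violations = rpois(n, lambda = 2),   # open building code violations
  renovations     = rbinom(n, 1, 0.3)       # recent renovation filed?
)

# Simulated binary label: any harassment case reported within the month
p <- plogis(-2 + 0.4 * units$code_violations - 0.5 * units$door_answered)
units$harassment <- factor(rbinom(n, 1, p))

# Classify each unit-month as at risk or not
tree <- rpart(harassment ~ ., data = units, method = "class")
print(tree)
```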
As with our subjects in Suriname, making these connections results in real impacts on the human experience. Those occupants are potentially more housing-secure because of computer code and the public servants who employed it.
Part of the reason such an enormous amount of data exists is something we call data exhaust. Data exhaust can be thought of as the trail of data we leave behind as we use the internet. Every time you like an Instagram picture, download a PDF, or reply to a tweet, you're creating a piece of data. This data exhaust can be harnessed for social good in the fight against cyberbullying and cyber aggression.
In 2018, a machine learning system was trained with millions of tweets with the goal of distinguishing users who are bullies and aggressors. Tweets were analyzed on a variety of factors, such as crowdsourced identification of abusive keywords, the use of curse words, and user- and account-based attributes such as number of tweets, followers, and account age.
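To make "analyzed on a variety of factors" concrete, here is a toy feature-extraction sketch in R; the tweets and the keyword list are invented:

```r
# Invented example tweets and an illustrative abusive-keyword list
tweets        <- c("you are the worst, just quit",
                   "lovely sunny afternoon in new orleans")
abusive_words <- c("worst", "quit", "hate")

words <- strsplit(tolower(tweets), "[^a-z]+")   # tokenize each tweet

features <- data.frame(
  n_words   = lengths(words),
  n_abusive = vapply(words, function(w) sum(w %in% abusive_words), integer(1))
)
features   # one row of features per tweet, ready for a classifier
```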
The use of hate-related words seems like easy evidence of cyberbullying behavior; however, the machine learning algorithm also gave insight into the semantic patterns of cyberbullies. It was discovered that users classified as cyberbullies use a lower number of adjectives and adverbs compared to normal users, and that aggressive users post tweets with a higher number of words per tweet compared to normal users. The algorithm was ultimately able to identify cyberbullies and cyber aggressors with over 90 percent accuracy. Machine learning may prove to be a critical component in the effort to make the cybersphere a more respectful and safer space. While protecting free speech should always be a core value in our democratic society, machine learning may help create a system that can identify users who violate important terms of use and potentially suspend the accounts of frequent offenders.
Often, when we hear references to data, it is in the context of concern: our rights to privacy being overlooked or, worse, purposefully disregarded. Are companies using available data to manipulate consumers or twist systems in their favor? These are certainly valid concerns, but we need to make sure that we also recognize that the datasphere can be mined for good. Computer code and machine learning can be used to find connections that help us identify problems and think critically about solutions. Public safety, health care, the environment, internet safety: these and so many more areas are all on the table and ready for analysis. So the key is this: as the global datasphere continues to grow exponentially, let's think about ways we can harness this data to learn more about our society and ultimately code solutions to some of humanity's greatest challenges. The connections are there, waiting to be discovered, and through code we have the power to find them. Thank you.