Mining the Datasphere for Social Good

[Music]

Good morning, TEDx Tulane! I'm thrilled to be part of this exciting exchange of ideas, and I am especially grateful that you want to learn more about what I believe is one of the keys to progress as we look to the future: computer code. Applying computer programming languages to data can spark important conversations around some of our biggest problems, and machine learning can help us harness big data to learn more about our society.

Without a doubt, we live in a world that is increasingly digitized and globalized, and the amount of raw data that is accessible to the general public is rapidly growing. In fact, the quantity is quite staggering.

To understand it, let's use this analogy. Consider the numbers 1 million, 1 billion, and 1 trillion and their relation to time. We know that 1 million seconds is equal to about 12 days. But how long do you think 1 billion seconds is? The answer is about 31.7 years. If we continue this example, we can also ask ourselves: how long is 1 trillion seconds? One trillion seconds equates to about 31,688 years.
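
For anyone who wants to check the arithmetic, here is a quick back-of-the-envelope sketch in R, the language I'll return to later in this talk:

    seconds_per_day  <- 60 * 60 * 24             # 86,400 seconds in a day
    seconds_per_year <- seconds_per_day * 365.25
    1e6  / seconds_per_day                       # ~11.6 days, i.e. "about 12 days"
    1e9  / seconds_per_year                      # ~31.7 years
    1e12 / seconds_per_year                      # ~31,688 years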

But we hear these numbers all the time when we talk about economics, world population, space. It is only when we think about them closely and in comparison to each other that their meaning and magnitude start to come into focus.

The same can be said for understanding how much raw data is available. The quantity is so immense that it can be hard to conceptualize. We can use our understanding of 1 million, 1 billion, and 1 trillion from our recent example to think about how much data exists. The average photo you take with your smartphone is anywhere from one to five megabytes, and a thousand megabytes is one gigabyte. We've all become accustomed to gigabytes: before mobile plans became unlimited, you could get a plan of 10 gigabytes, or perhaps a family plan of 20 gigabytes, before being charged for additional data. Our phones usually have storage capacities of 32, 64, or 128 gigabytes, and so on. Recall that 1 trillion seconds equals nearly 32,000 years. Well, 1 trillion gigabytes is a zettabyte, and there are nearly 60 zettabytes of data in our world today.
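
To see the jump from photos to the global datasphere in the same back-of-the-envelope style (assuming a rough three megabytes per photo):

    mb_per_photo     <- 3       # rough size of a smartphone photo, in megabytes
    mb_per_gigabyte  <- 1000
    gb_per_zettabyte <- 1e12    # a zettabyte is one trillion gigabytes
    # smartphone photos that would fit in a ~59 ZB global datasphere
    59 * gb_per_zettabyte * mb_per_gigabyte / mb_per_photo   # ~2e16 photos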

The market research company International Data Corporation recently published that 59 zettabytes exist in the global datasphere. The global datasphere is defined as the amount of data created, captured, and replicated in any given year across the world. That is an incredible amount of information that can be mined and analyzed for important connections and patterns.

But in the face of that sheer volume of information, how can we dig through the seemingly unassociated variables to find the connections and patterns that could be the foundations for positive change? Data mining tools applied through code are an effective way to analyze this data.

Today we will consider how we can capitalize on the information in the global datasphere for social good. My personal experience with data mining began when I had the opportunity to assist Dr. Cecilia Alcala in her research. Her work analyzes information gathered by the Caribbean Consortium for Environmental and Occupational Health for possible correlations between environmental factors and human health. Her sample population was from the Republic of Suriname, a country on the northeastern Atlantic coast of South America. The research included analyzing hundreds of urine samples from pregnant Surinamese women for concentrations of pesticide metabolites. Metabolites are the trace compounds left behind after our bodies process substances; when we are exposed to pesticides, the compounds produced contain chemicals we can detect and identify.

These concentrations were coded, graphed, and visualized, and they gave us possible insight as to whether pesticides were being used for agricultural or residential purposes. They also gave us intel as to the level of pesticides being used in general, and pointed to the primary reasons Surinamese women were being exposed to these harmful chemicals during their pregnancies.

These sorts of facts can be foundations for change. When data can be cleaned, mined, and visualized, it can catalyze change in public policy, environmental regulations, and workers' rights. Applying code to raw data gives us the ability to gauge the effectiveness of public policy.

There are a multitude of programming languages that can be used for such work. My personal experience is primarily in R, but languages such as Python, Java, and C++ also have high utility in the field of data mining.

In addition to revealing hidden connections between variables, two of the biggest benefits of applying code to raw data are visualization and access. In the area of visualization, code can help turn volumes of raw data into graphs and diagrams that make the information much easier to interpret. For instance, R has a variety of packages that can accomplish this.

Packages are collections of functions and data sets that can be imported into the coding environment. Here are a few ways in which R has been used to create dynamic visuals that make data easier to understand. All three of these graphs were created with the package ggplot2, and it only took a few lines of code to generate them; graph 1 was generated with just three lines, since R is equipped with open-source packages that allow us to quickly and efficiently visualize data sets.
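
As a minimal sketch of what three such lines can look like, using R's built-in mtcars data set as a stand-in (the graphs from the talk aren't reproduced here):

    library(ggplot2)                        # open-source visualization package
    ggplot(mtcars, aes(x = wt, y = mpg)) +  # map car weight to x, fuel economy to y
      geom_point()                          # render the scatter plot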

Not only can packages be downloaded to create visuals, they can also be used to access information and easily import large data sets. For example, the package RNHANES allows the importation of all the National Health and Nutrition Examination Survey data collected by the Centers for Disease Control, starting with just a single line of code that reads install.packages("RNHANES").
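
Here is a brief sketch of how that import might look in practice; the file name "EPH" and the 2011-2012 survey cycle are illustrative choices, and RNHANES fetches the requested files directly from the CDC:

    install.packages("RNHANES")   # the single line from the talk
    library(RNHANES)
    # load one NHANES laboratory file (urinary phenols, 2011-2012 cycle)
    # with demographic variables merged in
    eph <- nhanes_load_data("EPH", "2011-2012", demographics = TRUE)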

Take a moment to contemplate that: with just a few lines of code, a programmer can access studies that assess the health and nutritional status of thousands of adults and children in the United States. The data includes demographic, socioeconomic, dietary, and health-related information, and the potential for analysis is staggering.

Now, let's circle back to the analysis of the pregnant women in Suriname. It's important to remember that those connections discovered in Suriname represent real people and real health impacts. But we don't need to look so far away for an example of how data can be used as a tool for change; it's also been put to use in our own backyard, so to speak, this time to impact public safety.

Certain populations are more susceptible to injury and death when there is a fire: children, the elderly, and those with limited mobility are the most at risk during a fire in the home. Right here in New Orleans, data has been examined and visualized to help mitigate home fire casualties. The City of New Orleans wondered how data could be mined to assist its most vulnerable citizens. In 2016, the New Orleans Office of Performance and Accountability developed a model to do just that, using data from the U.S. Census Bureau's American Housing Survey and American Community Survey.

The city identified variables that might correlate with whether a resident was likely to have a smoke alarm. These included the age of the structure, the length of tenancy, and whether there were additions to the property. These variables were then applied to census data and other historical records to identify high-risk factors and the likelihood of there being no smoke detector. The analysis of these variables resulted in a map of the city that highlighted where residents were at greater risk of home fire death.
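
The talk doesn't show the city's actual code, so this is only a hypothetical sketch of that kind of model, with invented variable names and toy data:

    # hypothetical toy data standing in for the survey and census records
    housing <- data.frame(
      structure_age   = c(80, 15, 60, 5, 45, 90, 25, 70),   # years
      tenancy_years   = c(2, 10, 1, 8, 3, 1, 12, 2),
      has_additions   = c(1, 0, 1, 0, 1, 1, 0, 0),
      has_smoke_alarm = c(0, 1, 1, 1, 0, 0, 1, 0)
    )
    # logistic regression: which variables predict having a smoke alarm?
    fit <- glm(has_smoke_alarm ~ structure_age + tenancy_years + has_additions,
               data = housing, family = binomial)
    # estimated probability that each household lacks an alarm;
    # mapped across the city, these scores highlight the highest-risk blocks
    housing$risk_no_alarm <- 1 - predict(fit, type = "response")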

Using that data, the city went door-to-door with free smoke alarms. That free service had always been available, but it was up to the resident to seek it out. The combination of data, code, and civic duty allowed officials to bring potentially life-saving devices to those who may have needed them most. This geospatial model was so successful that it was adapted at the national level and replicated in cities such as New York and Chicago.

So far, we've talked about how code and data can be combined to discover connections, visualize findings, and integrate data sets. They can also be used for machine learning. Machine learning is a branch of artificial intelligence that uses algorithms and statistical models to analyze and draw inferences from patterns in data. These systems improve with experience, essentially learning while doing.

There are many analogies that help clarify what machine learning accomplishes and how it works. The one that makes the most sense to me is to think of machine learning algorithms as math students. Imagine that they've been given hundreds of homework problems, but instead of just being asked to solve for the answer to each and move on, they are also asked to find the patterns in the solutions, so that they can do the work more efficiently.

The intersection of big data, code, and machine learning has proved to be useful in the hospital setting. In-hospital cardiac arrests typically trigger what is called a code blue, an urgent medical emergency usually involving respiratory distress. Currently, code blues have a 20 percent survival rate. In an attempt to increase positive outcomes, researchers were able to write code for a machine learning system that predicts a code blue hours before it occurs. The model was trained with data from electronic medical records and vital signs collected during the course of a hospital stay.

The data that was used to train the algorithm included 29 different vitals and lab results. Some examples of vital signs used were respiratory rate, systolic and diastolic blood pressures, pulse oximetry, and temperature, and the lab tests included hemoglobin, platelet count, hematocrit, creatinine, and sodium levels. The machine learning algorithm was able to predict a potential code blue as many as four hours before its occurrence. The essence of this model is that we can use our data on vital signs, in combination with machine learning, to improve code blue outcomes and ultimately improve patient outcomes.
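
None of the study's code appears in the talk; this is a toy sketch of the training step, on simulated data with a handful of stand-ins for the 29 features:

    set.seed(1)
    n <- 200
    # simulated patient-hours of vitals and labs (not real clinical data)
    vitals <- data.frame(
      resp_rate  = rnorm(n, 18, 4),    # breaths per minute
      sys_bp     = rnorm(n, 120, 15),  # systolic blood pressure, mmHg
      spo2       = rnorm(n, 96, 3),    # pulse oximetry, percent
      hemoglobin = rnorm(n, 13, 2)     # g/dL
    )
    # simulated label: code blue within the next four hours
    vitals$code_blue <- rbinom(n, 1,
      plogis(-3 + 0.2 * (vitals$resp_rate - 18) - 0.3 * (vitals$spo2 - 96)))
    # classifier relating current vitals to near-term code blue risk
    fit <- glm(code_blue ~ ., data = vitals, family = binomial)
    risk <- predict(fit, type = "response")  # scores that could drive an alert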

Machine learning has also been used to expose potential corruption and inequity. Consider a recent study using machine learning to help vulnerable tenants in New York City. To help keep housing affordable, the City of New York uses rent stabilization policies to restrict the rate at which the rent of a unit may be increased annually. There are cases in which landlords will attempt to evade these laws, so a machine learning model was implemented to identify units that landlords might target for harassment.

The study framed tenant harassment risk prediction as a binary classification problem, trying to predict whether there would be any cases of harassment within a given time frame of one month. Data was combined from different sources and efforts to identify variables that signal possible landlord harassment. Some of the variables chosen included whether or not a knock at the door was answered, the income of the tenants, the age distribution of the tenants, the renovation history of the building, and the building's code violations. Using this model, the city's tenant support unit was able to identify at-risk units with 59 percent greater accuracy.
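
As a small sketch of how that binary target might be constructed, with invented unit IDs and dates:

    # hypothetical sketch: build the one-month binary label from case records
    cases <- data.frame(
      unit_id   = c(101, 101, 202, 303),
      case_date = as.Date(c("2019-03-04", "2019-03-20", "2019-02-11", "2019-03-28"))
    )
    window_start <- as.Date("2019-03-01")
    window_end   <- as.Date("2019-03-31")
    in_window <- cases$case_date >= window_start & cases$case_date <= window_end
    # the target: 1 if a unit had any harassment case in the month, else 0
    all_units <- data.frame(unit_id = c(101, 202, 303, 404))
    all_units$label <- as.integer(all_units$unit_id %in% cases$unit_id[in_window])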

Like our subjects in Suriname, making these connections results in real impacts on the human experience. Those occupants are potentially more housing-secure because of computer code and the public servants who employed it.

Part of the reason such an enormous amount of data exists is something we call data exhaust. Data exhaust can be thought of as the trail of data we leave behind as we use the internet: every time you like an Instagram picture, download a PDF, or reply to a tweet, you're creating a piece of data. This data exhaust can be harnessed for social good in the fight against cyberbullying and cyber aggression.

In 2018, a machine learning system was trained on millions of tweets with the goal of distinguishing users who are bullies and aggressors. Tweets were analyzed on a variety of factors, such as crowdsourced identification of abusive keywords, the use of curse words, and user- and account-based attributes such as number of tweets, followers, and account age.
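
A toy sketch of that kind of per-user feature extraction, with invented tweets and a stand-in keyword list:

    # invented example tweets; a real system would ingest millions
    tweets <- data.frame(
      user = c("a", "a", "b"),
      text = c("you are awful", "nobody likes you", "great game last night"),
      stringsAsFactors = FALSE
    )
    abusive_keywords <- c("awful", "stupid")   # stand-in for a crowdsourced list
    words <- strsplit(tolower(tweets$text), "\\s+")
    tweets$n_words   <- lengths(words)         # words per tweet
    tweets$n_abusive <- sapply(words, function(w) sum(w %in% abusive_keywords))
    # roll up to user-level attributes that would feed the classifier
    aggregate(cbind(n_words, n_abusive) ~ user, data = tweets, FUN = mean)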

The use of hate-related words seems like easy evidence of cyberbullying behavior; however, the machine learning algorithm also gave insight into the semantic patterns of cyberbullies. It was discovered that users classified as cyberbullies use a lower number of adjectives and adverbs than normal users, and that aggressive users post tweets with a higher number of words per tweet than normal users. The algorithm was ultimately able to identify cyberbullies and cyberaggressors with over 90 percent accuracy. Machine learning may prove to be a critical component in the effort to make the cybersphere a more respectful and safer space. While protecting free speech should always be a core value in our democratic society, machine learning may help create a system that can identify users who violate important terms of use and potentially suspend the accounts of frequent offenders.

Often, when we hear references to data, it is in the context of concern: our rights to privacy being overlooked or, worse, purposefully disregarded. Are companies using available data to manipulate consumers or twist systems in their favor? These are certainly valid concerns, but we need to make sure that we also recognize that the datasphere can be mined for good. Computer code and machine learning can be used to find connections, to help us identify problems, and to think critically about solutions.

Public safety, health care, the environment, internet safety: these and so many more areas are all on the table and ready for analysis. So the key is this: as the global datasphere continues to grow exponentially, let's think about ways we can harness this data to learn more about our society, and ultimately code solutions to some of humanity's greatest challenges. The connections are there, waiting to be discovered, and through code we have the power to find them. Thank you.
