Four Ingredients for K12 Data Science

i like to think of this as sort of a

cooking presentation right we’re going

to be talking about what the ingredients

need to be to teach data science in k-12

classes i’ve worn a lot of different

hats in my life i’ve been a

computer scientist and professional

programmer as john told you i’ve been a

math teacher right here in boston

i’ve had the incredible privilege to

work alongside giants in the field like

shriram krishnamurthi and kathi fisler on

a research project called bootstrap

based at brown university in the field

of computer science education and most

recently i’ve donned a hat as the father

to the coolest girl in the world i

promised maya she’d be in here and while

i would love to spend the next nine and

a half minutes giving you a ted talk

focused on her instead we’re going to

focus on something slightly less

interesting which is what’s going on in

the cutting edge of computer science

research let me take you back a ways to

about 10 or 15 years remember when

everybody was saying cs for all cs

for all we got to get coding into

schools right

at the time

we made a very controversial bet at

bootstrap first we said you know we

don’t think siloed classes are the only

way to do this in fact they might not

even be the best way

second we gambled on the idea that we

could fuse computing and mathematics

authentically so that instead of

undermining the math the computing

actually reinforced it and third we bet

there was a way to do this so it worked

equitably for all students so

fast forward a little bit

this curriculum sort of busted out of

the lab and became one of the most

widely used computing curricula

nationwide and while we’re thrilled with

our scale we’re proud of our diversity

and the reason that we have those

numbers is because we’re working with

the teachers that already reach every

child not the computer science teachers

but the mainstream math teachers who

have no computing background at all for

them it was just a powerful way to teach

mathematics now we didn’t want to be one

hit wonders right so we rinsed and

repeated the formula and extended this

for things like algebra physics and

beyond

and about a half decade ago we started

getting really excited about something

that nobody was terribly excited about

which was what if you could teach data

science in k-12

fast forward to today it’s not cs for

all anymore it’s data science for

everyone and they’re asking the same

questions that we asked 10 years ago

what should these classes look like and

where do they fit curriculum design is

essentially a recipe and every recipe

has room for flexibility your cupcake

might involve cream cheese frosting and

your cupcake might involve you know

coconut shreds or something

maybe not coconut shreds i wouldn’t put

that on you but there’s room for flexing

with these ingredients but one thing we

can all agree on is that if you leave an

ingredient out completely well it might

be delicious but you haven’t baked what

you set out to bake so the question

becomes what are the must-have

ingredients for a responsible k-12 data

science class

now the prevailing wisdom is that we can

all agree on at least two ingredients

mathematics and computing and when i say

computing i mean programming algorithms

structured data and for you data

scientists out there who may not be

familiar with the k-12 math standards

those are the standards that cover the

statistics content right the concepts

that are necessary for rigor those are

the standards that cover the data

visualization right it’s those standards

that talk about histograms and lines of

best fit what we’re hearing and this is

sort of like the loudest voices in the

room is that the solution here we go the

solution is we’re going to take stats

classes right thank you r.a. fisher from

100 years ago we’re going to add some

coding and boom we’ve baked ourselves a

data science class

and therefore we should elevate

statistics to be just as important as

calculus now as a former math teacher i

am all about elevating statistics to be

just as important as calculus i think

it’s great but if our goal is to bring

data science to k-12 i’m here to tell you

that this formula is dangerously flawed

imagine an amazing cs class kids are

building virtual worlds and 3d games and

at the end of the class they spend like

two weeks being given a calculator and

they’re taught which stats buttons

to press

to do some statistics is that a data

science class

obviously not

now let’s flip that suppose you have an

amazing statistics class totally awesome

and then at the very end we’re going to

have two weeks where they learn what

commands to type into python

also not a data science class and as a

team that’s been working on this for the

last 15 years who knows something about

combining math and computing we know

it’s not that simple what you need to do

if you want to mix these ingredients is

find the computational concepts that

bridge these worlds there’s a lot of

them i’ll just give you three quick

examples

how do you take a complex problem break

it down into simpler pieces and know

that when you’ve solved those pieces

you’ve actually solved the original

problem you set out to answer

how do you trust a computation that’s

been performed on a data set with 10,000

rows that nobody could possibly check by

hand and how do you ensure that your

results are reproducible that anyone

else could take your data and your code

and see the same results that you did

these concepts were critical to our

success over a decade ago and they’re

just as critical now
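
Two of those questions, trusting a computation on 10,000 rows and making results reproducible, can be sketched in a few lines. This is only an illustration in Python, not the curriculum's own tooling: the function `mean_by_group`, the helper `make_rows`, and the animal data are all hypothetical. The idea is to check the function on a table small enough to verify by hand, then run the same function on the big table, with a fixed seed so anyone can regenerate the same rows.

```python
import random
import statistics

def mean_by_group(rows, group_key, value_key):
    """Average the value_key column separately for each group_key value."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row[value_key])
    return {group: statistics.mean(values) for group, values in groups.items()}

# Step 1: a table small enough to check by hand (cats average 5, dogs 20).
tiny = [
    {"species": "cat", "weight": 4},
    {"species": "cat", "weight": 6},
    {"species": "dog", "weight": 20},
]
assert mean_by_group(tiny, "species", "weight") == {"cat": 5, "dog": 20}

# Step 2: a made-up 10,000-row table. The fixed seed means the same
# rows come back every time the code is run.
def make_rows(n, seed):
    rng = random.Random(seed)
    return [
        {"species": rng.choice(["cat", "dog"]), "weight": rng.randint(2, 40)}
        for _ in range(n)
    ]

big = make_rows(10_000, seed=42)
result_a = mean_by_group(big, "species", "weight")

# Step 3: rerun from scratch with the same data and code.
result_b = mean_by_group(make_rows(10_000, seed=42), "species", "weight")
assert result_a == result_b
```

Nobody checks the 10,000 rows by hand. The trust comes from the small case that was checked, and the reproducibility comes from sharing both the data recipe and the code.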

and recognize that if you’re still

thinking that what’s necessary is just

teaching some coding

it doesn’t touch any of them

but it actually gets worse because there

are two other ingredients that are often

left out of the conversation

with disastrous results so i call this

when data science goes bad

this may come as a surprise to many of

you but we live in a society that’s kind

of racist

and when you do data analysis on that

data guess what models and algorithms

come out

kind of racist ones and this isn’t just

an isolated headline right this has

become essentially an epidemic where the

darkest and deepest divides in our

society are being institutionalized in

code affecting everything from medical

care to sentencing guidelines and racism

is not where it stops

political consultants are mining voter

data and everything else to build

tactically precise gerrymandered

districts that serve to further deepen

the polarization in our democracy

and of course we all talk about how

important it is for students to learn

about

cybersecurity right we gotta teach them

what a good password is teach them not

to hand out that password

and yet what we really need to be doing

is teaching them enough data science to

understand why they should not be

filling out that survey that tells them

which harry potter character they are

most like because it turns out that when

you mine that freely available data on

social media it can be weaponized to

shift public opinion about issues as

major as the fracturing of the european

union brexit itself

so why are these being left out well

because just teaching math and computing

doesn’t get the job done there’s two

more ingredients that need to be part of

the conversation that are always left

out

the first is civic responsibility so

let’s talk about civic responsibility if

you’re viewing this as math and code

great i’m sure you’ll tell students the

dangers of taking a biased sample but

what we need to be doing is teaching

students the dangers of a good random

sample taken from a society filled with

bias
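
A small simulation can make that point concrete. The sketch below is an assumption-laden illustration, not real data: the population, the groups "A" and "B", the scores, and the cutoffs are all invented. Both groups have identical scores, but the recorded past decisions held group B to a higher bar. A perfectly fair random sample of those records still carries the gap.

```python
import random

def approval_rate(records, group):
    """Share of records in `group` whose past decision was 'approved'."""
    members = [r for r in records if r["group"] == group]
    return sum(r["approved"] for r in members) / len(members)

def build_population():
    """Invented history: identical scores in both groups, but group B
    needed a score of 70 to be approved while group A needed only 50."""
    population = []
    for group, cutoff in (("A", 50), ("B", 70)):
        for i in range(5000):
            score = i % 100  # every score from 0 to 99 appears equally often
            population.append(
                {"group": group, "score": score, "approved": score >= cutoff}
            )
    return population

population = build_population()

# An unbiased sampling method: every record is equally likely to be drawn.
rng = random.Random(0)
sample = rng.sample(population, 2000)

gap_population = approval_rate(population, "A") - approval_rate(population, "B")
gap_sample = approval_rate(sample, "A") - approval_rate(sample, "B")

print(f"approval gap in the full records: {gap_population:.2f}")
print(f"approval gap in the random sample: {gap_sample:.2f}")
```

The sampling was done correctly, and the gap is still there, because the bias lives in the decisions that produced the records and not in how the sample was drawn. A model trained on that sample would learn the gap as if it were a fact about the two groups.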

if you’re thinking of this as just math

and code well great i’ll teach you the

algorithms to help you aggregate data to

predict human behavior and find out

which of you in the crowd are most

likely to commit a crime

but what we need is civic

responsibility that says whether it is

ethical to ask that question or gather

that data in the first place

now again if the strategy is we’re going

to put it all on math teachers

are they ready to have this conversation

and if they are is it fair to demand

that it falls solely on them i don’t

think so when you teach medicine without

civic responsibility you get the

tuskegee experiments when you teach data

science without this ingredient you get

racially biased algorithms and weaponized

social media

the next ingredient that we need to

consider is domain investment because i

could be the most incredible programmer

and statistician you’ve ever met but if

i don’t know anything about baseball i

cannot go down to yawkey way and analyze

sports statistics for the red sox so

imagine if a teacher decides that her

kids are going to analyze a data set

about the best vineyards in tuscany

which students are engaged

which students feel included which

students feel left out

it turns out that the choice of data the

actual investment in the domain is a

critical component not just of

engagement and relevance but also of

diversity equity and inclusion we’ve got

a paper coming out of this research

group that talks specifically about

this in a couple of weeks so what we

need is to have teachers who can speak

to the content areas that matter to kids

and meet them where they are

again is it fair to put all of that on

the math teachers

disrespecting the domain expertise of

humanities folks has been standard

operating procedure for the stem world

for too long we cannot afford to repeat

that mistake

so i’m excited to share with you some of

the research results that we’ve had here

currently we’ve got a curriculum that is

in use around the country right now in

the nation’s largest school district new

york city we’ve got social studies

teachers having kids analyze the stop

and frisk data set teaching social

studies in a revolutionary new way

out in arizona we’ve got physics

teachers who already had their kids

gather experimental data but now their

kids can analyze the data and try to

figure out what kind of equation models

what they’re seeing and they can figure it

out before they even see the equation in

the book
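
As a rough sketch of what "figuring out what kind of equation models the data" might look like, the Python below compares two candidate models for a falling-object experiment. The measurements are made up for illustration, and the helper names `fit_line` and `sum_squared_error` are hypothetical. It fits a least-squares line to distance against time, then to distance against time squared, and compares how far off each one is.

```python
def fit_line(xs, ys):
    """Least-squares line through the points; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def sum_squared_error(xs, ys, slope, intercept):
    """Total squared distance between the data and the fitted line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Made-up measurements: seconds since release, meters fallen.
times = [0.1, 0.2, 0.3, 0.4, 0.5]
distances = [0.05, 0.20, 0.44, 0.78, 1.23]

# Candidate 1: distance grows in proportion to time.
slope_t, intercept_t = fit_line(times, distances)
error_linear = sum_squared_error(times, distances, slope_t, intercept_t)

# Candidate 2: distance grows in proportion to time squared.
times_squared = [t ** 2 for t in times]
slope_t2, intercept_t2 = fit_line(times_squared, distances)
error_quadratic = sum_squared_error(
    times_squared, distances, slope_t2, intercept_t2
)

print(f"error for distance vs. time:   {error_linear:.5f}")
print(f"error for distance vs. time^2: {error_quadratic:.5f}")
print(f"fitted coefficient on time^2:  {slope_t2:.2f}")
```

With these invented numbers the time-squared model fits far better, and its coefficient lands close to half the acceleration due to gravity, which is the relationship a student would later meet in the textbook.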

students in california are looking at

climate data you can have students in a

phys ed class analyzing their free throw

percentages or in a nutrition class

looking at their snacking habits

this can be a full court press and it’s

happening now

where i want to leave this talk is by

saying this notion that mixing math and

coding is easy is flawed but even if you

do it right leaving it at math and

coding is fundamentally dangerous

for those of us who care about data

science if the headline becomes it’s the

new math 2.0 we are sunk

this needs to be an interdisciplinary

solution a full court press that engages

teachers across grade levels and across

disciplines we need to make sure these

ingredients are part of the conversation

we need to make sure that we’re not just

picking tools because they’re free or

because they’re popular but that we’re

choosing a tool that is appropriate for

the learning goals of the subject and

for the cognitive demands of the

students we need to make sure that we’re

not just dumping kids with more data

sets we need to make sure they’re

actually better data sets

are they engaging do they meet kids

where they need to be are the columns of

your data set actually

accessible because if it takes a student

a week to learn what a data set is even

about

we’ve lost

and finally because we believe in this

so thoroughly we think it’s important to

make it free all of our curricular

materials we’re giving away in the hopes

that all of you out there will join us

and engage teachers from across the

disciplines to make data science real but

also make it responsible

i’m fortunate enough to work with an

incredible team

and i want to thank all of you for your

time