The method that can prove almost anything

James A. Smith

In 2011, a group of researchers conducted
a scientific study

to find an impossible result:

that listening to certain songs
can make you younger.

Their study involved real people,
truthfully reported data,

and commonplace statistical analyses.

So how did they do it?

The answer lies in a statistical method
scientists often use

to try to figure out whether their results
mean something or if they’re random noise.

In fact, the whole point
of the music study

was to point out ways
this method can be misused.

A famous thought experiment
explains the method:

there are eight cups of tea,

four with the milk added first,
and four with the tea added first.

A participant must determine
which are which according to taste.

There are 70 different ways the cups
can be sorted into two groups of four,

and only one is correct.

So, can she taste the difference?

That’s our research question.

To analyze her choices, we define
what’s called a null hypothesis:

that she can’t distinguish the teas.

If she can’t distinguish the teas,

she’ll still get the right answer
1 time in 70 by chance.

1 in 70 is roughly .014.
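That arithmetic is easy to verify: choosing which 4 of the 8 cups had milk added first is a binomial coefficient, and only one choice is correct. A quick Python sketch (not part of the experiment itself):

```python
from math import comb

# Number of ways to split 8 cups into "milk first" vs. "tea first"
ways = comb(8, 4)

# Under the null hypothesis (pure guessing), exactly 1 of those is correct
p_value = 1 / ways

print(ways, round(p_value, 3))  # 70 0.014
```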

That single number is called a p-value.

In many fields, a p-value of .05 or below
is considered statistically significant,

meaning there’s enough evidence to reject
the null hypothesis.

Based on a p-value of .014,

they’d rule out the null hypothesis
that she can’t distinguish the teas.

Though p-values are commonly
used by both researchers and journals

to evaluate scientific results,

they’re really confusing,
even for many scientists.

That’s partly because all a p-value
actually tells us

is the probability of getting
a result at least that extreme,

assuming the null hypothesis is true.

So if she correctly sorts the teas,

the p-value is the probability
of her doing so

assuming she can’t tell the difference.
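That conditional probability can also be checked by simulation: imagine a taster who guesses completely at random. A minimal sketch, assuming the “result” in question is a perfect sort of all eight cups:

```python
import random

# The four cups that truly had milk added first (labels are arbitrary)
truth = frozenset({0, 1, 2, 3})

rng = random.Random(42)
trials = 200_000

# A guesser who can't taste the difference picks 4 cups at random
hits = sum(frozenset(rng.sample(range(8), 4)) == truth
           for _ in range(trials))

print(round(hits / trials, 3))  # close to 1/70, i.e. about .014
```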

But the reverse isn’t true:

the p-value doesn’t tell us
the probability

that she can taste the difference,

which is what we’re trying to find out.

So if a p-value doesn’t answer
the research question,

why does the scientific community use it?

Well, because even though a p-value
doesn’t directly state the probability

that the results are due to random chance,

it usually gives a pretty
reliable indication.

At least, it does when used correctly.

And that’s where many researchers,
and even whole fields,

have run into trouble.

Most real studies are more complex
than the tea experiment.

Scientists can test their research
question in multiple ways,

and some of these tests might produce
a statistically significant result,

while others don’t.

It might seem like a good idea
to test every possibility.

But it’s not,
because with each additional test,

the chance of a false positive increases.
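That inflation is easy to quantify. As a sketch, assuming each test is independent and run at the usual .05 threshold:

```python
alpha = 0.05  # conventional significance threshold

# Probability that at least one of k independent tests comes up
# "significant" purely by chance: 1 - (1 - alpha)^k
chance_of_false_positive = {k: 1 - (1 - alpha) ** k for k in (1, 5, 10, 20)}

for k, p in chance_of_false_positive.items():
    print(f"{k:2d} tests -> {p:.0%} chance of at least one false positive")
```

With 20 looks at the data, the odds of a spurious “significant” result are closer to a coin flip than to 5%.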

Searching for a low p-value,
and then presenting only that analysis,

is often called p-hacking.

It’s like throwing darts
until you hit a bullseye

and then saying you only threw the dart
that hit the bullseye.

This is exactly what the
music researchers did.

They played three groups of participants
each a different song

and collected lots of information
about them.

The analysis they published included only
two out of the three groups.

Of all the information they collected,

their analysis only used
participants’ fathers’ age—

to “control for variation in baseline
age across participants.”

They also paused their experiment
after every ten participants,

and continued if the p-value
was above .05,

but stopped when it dipped
below .05.

They found that participants who heard
one song were 1.5 years younger

than those who heard the other song,
with a p-value of .04.

Usually it’s much tougher to spot
p-hacking,

because we don’t know the results
are impossible:

the whole point of doing experiments
is to learn something new.

Fortunately, there’s a simple way
to make p-values more reliable:

pre-registering a detailed plan
for the experiment and analysis

that others can check,

so researchers can’t keep trying
different analyses

until they find a significant result.

And, in the true spirit
of scientific inquiry,

there’s even a new field that’s basically
science doing science on itself:

studying scientific practices
in order to improve them.