The method that can prove almost anything
James A. Smith

In 2011, a group of researchers conducted
a scientific study

to find an impossible result:

that listening to certain songs
can make you younger.

Their study involved real people,
truthfully reported data,

and commonplace statistical analyses.

So how did they do it?

The answer lies in a statistical method
scientists often use

to try to figure out whether their results
mean something or if they’re random noise.

In fact, the whole point
of the music study

was to demonstrate ways
this method can be misused.

A famous thought experiment
explains the method:

there are eight cups of tea,

four with the milk added first,
and four with the tea added first.

A participant must determine
which are which according to taste.

There are 70 different ways the cups
can be sorted into two groups of four,

and only one is correct.
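
As an aside, that 70 is just "eight choose four": the number of ways to pick which four of the eight cups get labeled "milk first." A quick Python check, purely for illustration:

```python
import math

# Number of ways to choose which 4 of the 8 cups are "milk first"
print(math.comb(8, 4))  # 70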

So, can she taste the difference?

That’s our research question.

To analyze her choices, we define
what’s called a null hypothesis:

that she can’t distinguish the teas.

If she can’t distinguish the teas,

she’ll still get the right answer
1 in 70 times by chance.

1 in 70 is roughly .014.

That single number is called a p-value.

In many fields, a p-value of .05 or below
is considered statistically significant,

meaning there’s enough evidence to reject
the null hypothesis.

Based on a p-value of .014,

they’d rule out the null hypothesis
that she can’t distinguish the teas.
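
For readers who want to see this as a standard analysis: the tea experiment corresponds to Fisher's exact test on a two-by-two table. Here's a minimal sketch using scipy, assuming she sorts all eight cups perfectly:

```python
from scipy.stats import fisher_exact

# Rows: true preparation (milk first / tea first).
# Columns: her guesses. A perfect sort looks like this:
table = [[4, 0],
         [0, 4]]

# One-sided test: probability of a sort at least this good
# if she's guessing at random (the null hypothesis).
_, p = fisher_exact(table, alternative="greater")
print(round(p, 3))  # 0.014, i.e. 1 in 70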

Though p-values are commonly
used by both researchers and journals

to evaluate scientific results,

they’re really confusing,
even for many scientists.

That’s partly because all a p-value
actually tells us

is the probability of getting a result
at least as extreme as the one observed,

assuming the null hypothesis is true.

So if she correctly sorts the teas,

the p-value is the probability
of her doing so

assuming she can’t tell the difference.
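
One way to make that definition concrete is to simulate the null hypothesis directly: let a taster who truly can't tell the difference sort the cups at random, many times, and count how often she gets lucky. A rough sketch:

```python
import random

# The true arrangement: 4 milk-first cups, then 4 tea-first cups.
truth = ["milk"] * 4 + ["tea"] * 4

trials = 100_000
perfect = 0
for _ in range(trials):
    guess = random.sample(truth, len(truth))  # a random sorting
    if guess == truth:
        perfect += 1

# The fraction of perfect sorts approximates the p-value under the null.
print(perfect / trials)  # roughly 0.014, i.e. 1 in 70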

But the reverse isn’t true:

the p-value doesn’t tell us
the probability

that she can taste the difference,

which is what we’re trying to find out.

So if a p-value doesn’t answer
the research question,

why does the scientific community use it?

Well, because even though a p-value
doesn’t directly state the probability

that the results are due to random chance,

it usually gives a pretty
reliable indication.

At least, it does when used correctly.

And that’s where many researchers,
and even whole fields,

have run into trouble.

Most real studies are more complex
than the tea experiment.

Scientists can test their research
question in multiple ways,

and some of these tests might produce
a statistically significant result,

while others don’t.

It might seem like a good idea
to test every possibility.

But it’s not,
because with each additional test,

the chance of a false positive increases.
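
To see how quickly that chance compounds, assume each test is independent and run at the .05 level while every null hypothesis is actually true. A back-of-the-envelope sketch:

```python
# Chance of at least one false positive across k independent tests,
# each at the .05 level, when there's no real effect anywhere.
for k in (1, 5, 10, 20):
    p_any = 1 - (1 - 0.05) ** k
    print(f"{k:2d} tests -> {p_any:.0%} chance of a false positive")
```

With 20 tests, there's already about a 64% chance of at least one spurious "significant" result.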

Searching for a low p-value,
and then presenting only that analysis,

is often called p-hacking.

It’s like throwing darts
until you hit a bullseye

and then saying you only threw the dart
that hit the bullseye.

This is exactly what the
music researchers did.

They played each of three groups
of participants a different song

and collected lots of information
about them.

The analysis they published included only
two out of the three groups.

Of all the information they collected,

their analysis only used
participants’ fathers’ age—

to “control for variation in baseline
age across participants.”

They also paused their experiment
after every ten participants,

and continued if the p-value
was above .05,

but stopped when it dipped
below .05.

They found that participants who heard
one song were 1.5 years younger

than those who heard the other song,
with a p-value of .04.
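
That stopping rule by itself inflates false positives. Below is a rough simulation of a similar "peek and stop" design under a true null, where both groups come from the same distribution; the t-test here is just a stand-in analysis, not the study's actual one:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
runs, false_positives = 2_000, 0

for _ in range(runs):
    a, b = [], []
    for _ in range(10):  # up to 10 "peeks", 10 new participants per group
        a.extend(rng.normal(size=10))
        b.extend(rng.normal(size=10))
        _, p = ttest_ind(a, b)
        if p < 0.05:     # stop and "publish" the significant result
            false_positives += 1
            break

# There is no real effect, so every stop is a false positive.
print(false_positives / runs)  # well above the nominal 0.05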

Usually it’s much tougher to spot
p-hacking,

because we don’t know the results
are impossible:

the whole point of doing experiments
is to learn something new.

Fortunately, there’s a simple way
to make p-values more reliable:

pre-registering a detailed plan
for the experiment and analysis

that others can check,

so researchers can’t keep trying
different analyses

until they find a significant result.

And, in the true spirit
of scientific inquiry,

there’s even a new field that’s basically
science doing science on itself:

studying scientific practices
in order to improve them.
