How statistics can be misleading - Mark Liddell

Statistics are persuasive.

So much so that people, organizations,
and whole countries

base some of their most important
decisions on organized data.

But there’s a problem with that.

Any set of statistics might have something
lurking inside it,

something that can turn the results
completely upside down.

For example, imagine you need to choose
between two hospitals

for an elderly relative’s surgery.

Out of each hospital’s
last 1000 patients,

900 survived at Hospital A,

while only 800 survived at Hospital B.

So it looks like Hospital A
is the better choice.

But before you make your decision,

remember that not all patients
arrive at the hospital

with the same level of health.

And if we divide each hospital’s
last 1000 patients

into those who arrived in good health
and those who arrived in poor health,

the picture starts to look very different.

Hospital A had only 100 patients
who arrived in poor health,

of whom 30 survived.

But Hospital B had 400,
and they were able to save 210.

So Hospital B is the better choice

for patients who arrive
at hospital in poor health,

with a survival rate of 52.5%.

And what if your relative’s health
is good when she arrives at the hospital?

Strangely enough, Hospital B is still
the better choice,

with a survival rate of over 98%.
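The arithmetic behind these figures can be checked with a short sketch. The poor-health counts come straight from the example; the good-health counts (870 of 900 at Hospital A, 590 of 600 at Hospital B) are not stated outright and are derived here by subtracting from each hospital's totals. The names `hospitals`, `survival_rate`, and `overall_rate` are made up for illustration.

```python
# (survived, total) per hospital, split by health on arrival.
# Poor-health counts are stated in the example; good-health counts are
# derived by subtracting them from the 1000-patient totals (900 and 800 survivors).
hospitals = {
    "A": {"poor": (30, 100), "good": (870, 900)},
    "B": {"poor": (210, 400), "good": (590, 600)},
}


def survival_rate(survived, total):
    """Fraction of patients in one group who survived."""
    return survived / total


def overall_rate(groups):
    """Survival rate across all groups of one hospital combined."""
    survived = sum(s for s, _ in groups.values())
    total = sum(t for _, t in groups.values())
    return survived / total


for name, groups in hospitals.items():
    poor = survival_rate(*groups["poor"])
    good = survival_rate(*groups["good"])
    print(f"Hospital {name}: poor {poor:.1%}, good {good:.1%}, "
          f"overall {overall_rate(groups):.1%}")
```

Under these numbers, Hospital B comes out ahead in both groups (52.5% against 30% for poor health, roughly 98.3% against 96.7% for good health), while Hospital A still has the higher overall rate (90% against 80%).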

So how can Hospital A have a better
overall survival rate

if Hospital B has better survival rates
for patients in each of the two groups?

What we’ve stumbled upon is a case
of Simpson’s paradox,

where the same set of data can appear
to show opposite trends

depending on how it’s grouped.

This often occurs when aggregated data
hides a conditional variable,

sometimes known as a lurking variable,

which is a hidden additional factor
that significantly influences results.

Here, the hidden factor is the relative
proportion of patients

who arrive in good or poor health.
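One way to see the effect of that proportion is to treat each hospital's overall rate as a weighted average of its two group rates, with the weights being its patient mix. The sketch below does that, then reweights both hospitals to a shared mix (a simple form of direct standardization). The 50/50 mix is an assumption chosen only for illustration, and the good-health counts are derived from the example's totals rather than quoted from it.

```python
# Group survival rates from the hospital example (survived / total).
# Good-health counts are derived from the stated totals.
rates = {
    "A": {"poor": 30 / 100, "good": 870 / 900},
    "B": {"poor": 210 / 400, "good": 590 / 600},
}

# Share of each hospital's 1000 patients arriving in each condition.
mix = {
    "A": {"poor": 100 / 1000, "good": 900 / 1000},
    "B": {"poor": 400 / 1000, "good": 600 / 1000},
}


def weighted_rate(group_rates, group_mix):
    """Overall survival rate as a mix-weighted average of group rates."""
    return sum(group_rates[g] * group_mix[g] for g in group_rates)


# With each hospital's own mix, the original overall figures reappear.
own_a = weighted_rate(rates["A"], mix["A"])
own_b = weighted_rate(rates["B"], mix["B"])

# Hypothetical shared mix (an assumption): half poor health, half good health.
common_mix = {"poor": 0.5, "good": 0.5}
std_a = weighted_rate(rates["A"], common_mix)
std_b = weighted_rate(rates["B"], common_mix)

print(f"Own mix:    A {own_a:.1%}, B {own_b:.1%}")
print(f"Shared mix: A {std_a:.1%}, B {std_b:.1%}")
```

With their own patient mixes the hospitals score 90% and 80%, but once both are given the same mix, Hospital B's rate is the higher one. Hospital A's apparent lead comes from treating far fewer patients in poor health.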

Simpson’s paradox isn’t just
a hypothetical scenario.

It pops up from time
to time in the real world,

sometimes in important contexts.

One study in the UK appeared to show

that smokers had a higher survival rate
than nonsmokers

over a twenty-year time period.

That is, until dividing the participants
by age group

showed that the nonsmokers
were significantly older on average,

and thus, more likely
to die during the trial period,

precisely because they were living longer
in general.

Here, age is
the lurking variable,

and it is vital to correctly
interpreting the data.

In another example,

an analysis of Florida’s
death penalty cases

seemed to reveal
no racial disparity in sentencing

between black and white defendants
convicted of murder.

But dividing the cases by the race
of the victim told a different story.

In either situation,

black defendants were more likely
to be sentenced to death.

The slightly higher overall sentencing
rate for white defendants

was due to the fact
that cases with white victims

were more likely
to elicit a death sentence

than cases where the victim was black,

and most murders occurred between
people of the same race.

So how do we avoid
falling for the paradox?

Unfortunately,
there’s no one-size-fits-all answer.

Data can be grouped and divided
in any number of ways,

and overall numbers may sometimes
give a more accurate picture

than data divided into misleading
or arbitrary categories.

All we can do is carefully study the
actual situations the statistics describe

and consider whether lurking variables
may be present.

Otherwise, we leave ourselves
vulnerable to those who would use data

to manipulate others
and promote their own agendas.
