How we're building the world's largest family tree | Yaniv Erlich

People use the internet
for various reasons.

It turns out that one of the most
popular categories of website

is something that people
typically consume in private.

It involves curiosity,

non-insignificant levels
of self-indulgence

and is centered around recording
the reproductive activities

of other people.

(Laughter)

Of course, I’m talking about genealogy –

(Laughter)

the study of family history.

When it comes to detailing family history,

in every family, we have this person
that is obsessed with genealogy.

Let’s call him Uncle Bernie.

Uncle Bernie is exactly the last person
you want to sit next to

in Thanksgiving dinner,

because he will bore you to death
with peculiar details

about some ancient relatives.

But as you know,

there is a scientific side for everything,

and we found that Uncle Bernie’s stories

have immense potential
for biomedical research.

We let Uncle Bernie
and his fellow genealogists

document their family trees through
a genealogy website called geni.com.

When users upload
their trees to the website,

it scans their relatives,

and if it finds matches to existing trees,

it merges the existing
and the new tree together.

The result is that large
family trees are created,

beyond the individual level
of each genealogist.
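
[Editor's sketch] The merging step described above can be illustrated with a union-find structure. This is a minimal sketch under stated assumptions, not Geni's actual method: the talk does not say how profiles are matched, so the rule used here (two profiles match when name and birth year are identical) and the names `DisjointSet` and `merge_trees` are hypothetical.

```python
from collections import defaultdict


class DisjointSet:
    """Union-find over tree ids; each set is one merged family tree."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def merge_trees(trees):
    """trees maps a tree id to a list of (name, birth_year) profiles.

    ASSUMPTION: identical (name, birth_year) means the same person.
    Returns the groups of tree ids that end up merged together.
    """
    ds = DisjointSet()
    first_tree_with = {}  # profile -> first tree id that contained it
    for tree_id, profiles in trees.items():
        ds.find(tree_id)  # register the tree even if nothing matches
        for profile in profiles:
            if profile in first_tree_with:
                ds.union(first_tree_with[profile], tree_id)
            else:
                first_tree_with[profile] = tree_id
    groups = defaultdict(set)
    for tree_id in trees:
        groups[ds.find(tree_id)].add(tree_id)
    return sorted(groups.values(), key=lambda g: sorted(g))


uploads = {
    "a": [("Ann", 1800), ("Bob", 1825)],
    "b": [("Bob", 1825), ("Cy", 1850)],
    "c": [("Dee", 1900)],
}
merged = merge_trees(uploads)
```

Here trees "a" and "b" share one profile and collapse into a single tree, while "c" stays on its own. A real service would need far fuzzier matching than exact equality.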

Now, by repeating this process
with millions of people

all over the world,

we can crowdsource the construction
of a family tree of all humankind.

Using this website,

we were able to connect 125 million people

into a single family tree.

I cannot draw the tree
on the screens over here

because they have less pixels

than the number of people in this tree.

But here is an example of a subset
of 6,000 individuals.

Each green node is a person.

The red nodes represent marriages,

and the connections represent parenthood.

In the middle of this tree,
you see the ancestors.

And as we go to the periphery,
you see the descendants.

This tree has seven
generations, approximately.

Now, this is what happens
when we increase the number of individuals

to 70,000 people –

still a tiny subset
of all the data that we have.

Despite that, you can already see
the formation of gigantic family trees

with many very distant relatives.

Thanks to the hard work
of our genealogists,

we can go back in time
hundreds of years ago.

For example, here is Alexander Hamilton,

who was born in 1755.

Alexander was the first
US Secretary of the Treasury,

but mostly known today
due to a popular Broadway musical.

We found that Alexander has deeper
connections in the showbiz industry.

In fact, he’s a blood relative of …

Kevin Bacon!

(Laughter)

Both of them are descendants
of a lady from Scotland

who lived in the 13th century.

So you can say that Alexander Hamilton

is 35 degrees of Kevin Bacon genealogy.

(Laughter)

And our tree has millions
of stories like that.

We invested significant efforts
to validate the quality of our data.

Using DNA, we found that 0.3 percent of
the mother-child connections in our data

are wrong,

which could match the adoption rate
in the US pre-Second World War.

For the father’s side,

the news is not as good:

1.9 percent of the father-child
connections in our data are wrong.

And I see some people smirk over here.

It is what you think –

there are many milkmen out there.

(Laughter)

However, this 1.9 percent error rate
in patrilineal connections

is not unique to our data.

Previous studies found
a similar error rate

using clinical-grade pedigrees.

So the quality of our data is good,

and that should not be a surprise.

Our genealogists have
a profound, vested interest

in correctly documenting
their family history.

We can leverage this data to learn
quantitative information about humanity,

for example, questions about demography.

Here is a look at all our profiles
on the map of the world.

Each pixel is a person
that lived at some point.

And since we have so much data,

you can see the contours
of many countries,

especially in the Western world.

In this clip, we stratified
the map that I’ve showed you

based on the year of births of individuals
from 1400 to 1900,

and we compared it
to known migration events.

The clip is going to show you
that the deepest lineages in our data

go all the way back to the UK,

where they had better record keeping,

and then they spread along
the routes of Western colonialism.

Let’s watch this.

(Music)

[Year of birth: ]

[1492 - Columbus sails the ocean blue]

[1620 - Mayflower lands in Massachusetts]

[1652 - Dutch settle in South Africa]

[1788 - Great Britain penal
transportation to Australia starts]

[1836 - First migrants use Oregon Trail]

[all activity]

I love this movie.

Now, since these migration events
are giving the context of families,

we can ask questions such as:

What is the typical distance
between the birth locations

of husbands and wives?

This distance plays
a pivotal role in demography,

because the patterns in which
people migrate to form families

determine how genes spread
in geographical areas.
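
[Editor's sketch] One way to measure the distance between spouses' birth locations is the great-circle (haversine) distance. The talk does not state which distance measure or summary statistic was used, so the haversine formula, the median, the Earth radius constant, and the names `haversine_km` and `median_spouse_distance_km` are all assumptions for illustration.

```python
import math
import statistics

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp against floating-point overshoot before taking asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def median_spouse_distance_km(couples):
    """couples: list of ((lat, lon), (lat, lon)) birthplace pairs."""
    return statistics.median(
        haversine_km(h[0], h[1], w[0], w[1]) for h, w in couples
    )


# Hypothetical birthplace pairs, made up for this example only.
couples = [
    ((0.0, 0.0), (0.0, 0.0)),    # born in the same place
    ((0.0, 0.0), (0.0, 90.0)),   # a quarter of the equator apart
    ((0.0, 0.0), (0.0, 180.0)),  # opposite sides of the globe
]
typical = median_spouse_distance_km(couples)
```

A median is used here rather than a mean so that a few very long migrations do not dominate the "typical" figure.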

We analyzed this distance using our data,

and we found that in the old days,

people had it easy.

They just married someone
in the village nearby.

But the Industrial Revolution
really complicated our love life.

And today, with affordable flights
and online social media,

people typically migrate more than
100 kilometers from their place of birth

to find their soul mate.

So now you might ask:

OK, but who does the hard work
of migrating from places to places

to form families?

Are these the males or the females?

We used our data to address this question,

and at least in the last 300 years,

we found that the ladies do the hard work

of migrating from places
to places to form families.

Now, these results
are statistically significant,

so you can take it as scientific fact
that males are lazy.

(Laughter)

We can move from questions
about demography

and ask questions about human health.

For example, we can ask

to what extent genetic variations
account for differences in life span

between individuals.

Previous studies analyzed the correlation
of longevity between twins

to address this question.

They estimated that the genetic
variations account for

about a quarter of the differences
in life span between individuals.

But twins can be correlated
due to so many reasons,

including various environmental effects

or a shared household.

Large family trees give us the opportunity
to analyze both close relatives,

such as twins,

all the way to distant relatives,
even fourth cousins.

This way we can build robust models

that can tease apart the contribution
of genetic variations

from environmental factors.

We conducted this analysis using our data,

and we found that genetic variations
explain only 15 percent

of the differences in life span
between individuals.

That is five years, on average.

So genes matter less than
what we thought before to life span.

And I find it great news,

because it means that
our actions can matter more.

Smoking, for example, determines
10 years of our life expectancy –

twice as much as what genetics determines.

We can even have more surprising findings

as we move from family trees

and we let our genealogists
document and crowdsource DNA information.

And the results can be amazing.

It might be hard to imagine,
but Uncle Bernie and his friends

can create DNA forensic capabilities

that even exceed
what the FBI currently has.

When you place the DNA
on a large family tree,

you effectively create a beacon

that illuminates the hundreds
of distant relatives

that are all connected to the person
that originated the DNA.

By placing multiple beacons
on a large family tree,

you can now triangulate the DNA
of an unknown person,

the same way that the GPS system
uses multiple satellites

to find a location.

The prime example
of the power of this technique

is capturing the Golden State Killer,

one of the most notorious criminals
in the history of the US.

The FBI had been searching
for this person for over 40 years.

They had his DNA,

but he never showed up
in any police database.

About a year ago, the FBI
consulted a genetic genealogist,

and she suggested that they submit
his DNA to a genealogy service

that can locate distant relatives.

They did that,

and they found a third cousin
of the Golden State Killer.

They built a large family tree,

scanned the different
branches of that tree,

until they found a profile
that exactly matched

what they knew about
the Golden State Killer.

They obtained DNA from this person
and found a perfect match

to the DNA they had in hand.

They arrested him
and brought him to justice

after all these years.

Since then, genetic genealogists
have started working with

local US law enforcement agencies

to use this technique
in order to capture criminals.

And only in the past six months,

they were able to solve
over 20 cold cases with this technique.

Luckily, we have people like Uncle
Bernie and his fellow genealogists.

These are not amateurs
with a self-serving hobby.

These are citizen scientists
with a deep passion to tell us who we are.

And they know that the past
can hold a key to the future.

Thank you very much.

(Applause)
