Big Data - Tim Smith

Translator: Andrea McDonough
Reviewer: Jessica Ruby

Big data is an elusive concept.

It represents an amount of digital information,

which is uncomfortable to store,

transport,

or analyze.

Big data is so voluminous

that it overwhelms the technologies of the day

and challenges us to create the next generation

of data storage tools and techniques.

So, big data isn’t new.

In fact, physicists at CERN have been wrangling

with the challenge of their ever-expanding big data for decades.

Fifty years ago, CERN’s data could be stored

in a single computer.

OK, so it wasn’t your usual computer,

this was a mainframe computer

that filled an entire building.

To analyze the data,

physicists from around the world traveled to CERN

to connect to the enormous machine.

In the 1970s, our ever-growing big data

was distributed across different sets of computers,

which mushroomed at CERN.

Each set was joined together

in dedicated, homegrown networks.

But physicists collaborated without regard

for the boundaries between sets,

and hence needed to access data on all of them.

So, we bridged the independent networks together

in our own CERNET.

In the 1980s, islands of similar networks

speaking different dialects

sprang up all over Europe and the States,

making remote access possible but torturous.

To make it easy for our physicists across the world

to access the ever-expanding big data

stored at CERN without traveling,

the networks needed to speak

the same language.

We adopted the fledgling internetworking standard from the States,

followed by the rest of Europe,

and we established the principal link at CERN

between Europe and the States in 1989,

and the truly global internet took off!

Physicists could easily then access

the terabytes of big data

remotely from around the world,

generate results,

and write papers in their home institutes.

Then, they wanted to share their findings

with all their colleagues.

To make this information sharing easy,

we created the web in the early 1990s.

Physicists no longer needed to know

where the information was stored

in order to find it and access it on the web,

an idea which caught on across the world

and has transformed the way we communicate

in our daily lives.

During the early 2000s,

the continued growth of our big data

outstripped our capability to analyze it at CERN,

despite having buildings full of computers.

We had to start distributing the petabytes of data

to our collaborating partners

in order to employ local computing and storage

at hundreds of different institutes.

In order to orchestrate these interconnected resources

with their diverse technologies,

we developed a computing grid,

enabling the seamless sharing

of computing resources around the globe.

This relies on trust relationships and mutual exchange.

But this grid model could not be transferred

out of our community so easily,

where not everyone has resources to share

nor could companies be expected

to have the same level of trust.

Instead, an alternative, more business-like approach

for accessing on-demand resources

has been flourishing recently,

called cloud computing,

which other communities are now exploiting

to analyze their big data.

It might seem paradoxical for a place like CERN,

a lab focused on the study

of the unimaginably small building blocks of matter,

to be the source of something as big as big data.

But the way we study the fundamental particles,

as well as the forces by which they interact,

involves creating them fleetingly,

colliding protons in our accelerators

and capturing a trace of them

as they zoom off near light speed.

To see those traces,

our detector, with 150 million sensors,

acts like a really massive 3-D camera,

taking a picture of each collision event -

that’s up to 14 million times per second.

That makes a lot of data.
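
As a rough back-of-envelope, using only the figures above and assuming, purely for illustration, one byte per sensor per snapshot (not a figure from the talk):

150,000,000 sensors × 14,000,000 snapshots per second × 1 byte ≈ 2 × 10^15 bytes per second, on the order of petabytes every second.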

But if big data has been around for so long,

why do we suddenly keep hearing about it now?

Well, as the old metaphor explains,

the whole is greater than the sum of its parts,

and it’s no longer just science that is exploiting this.

The fact that we can derive more knowledge

by joining related information together

and spotting correlations

can inform and enrich numerous aspects of everyday life,

either in real time,

such as traffic or financial conditions,

in short-term evolutions,

such as medical or meteorological,

or in predictive situations,

such as business, crime, or disease trends.

Virtually every field is turning to gathering big data,

with mobile sensor networks spanning the globe,

cameras on the ground and in the air,

archives storing information published on the web,

and loggers capturing the activities

of Internet citizens the world over.

The challenge is on to invent new tools and techniques

to mine these vast stores,

to inform decision making,

to improve medical diagnosis,

and otherwise to answer needs and desires

of tomorrow’s society in ways that are unimagined today.
