How computers translate human language - Ioannis Papachimonas

How is it that so many
intergalactic species in movies and TV

just happen to speak perfect English?

The short answer is that no one
wants to watch a starship crew

spend years compiling an alien dictionary.

But to keep things consistent,

the creators of Star Trek
and other science-fiction worlds

have introduced the concept
of a universal translator,

a portable device that can instantly
translate between any languages.

So is a universal translator
possible in real life?

We already have many programs
that claim to do just that,

taking a word, sentence,
or entire book in one language

and translating it into almost any other,

whether it’s modern English
or Ancient Sanskrit.

And if translation were just a matter
of looking up words in a dictionary,

these programs would run circles
around humans.

The reality, however,
is a bit more complicated.

A rule-based translation program
uses a lexical database,

which includes all the words
you’d find in a dictionary

and all grammatical forms they can take,

and a set of rules to recognize the basic
linguistic elements in the input language.

For a seemingly simple sentence like,
“The children eat the muffins,”

the program first parses its syntax,
or grammatical structure,

by identifying “the children”
as the subject,

and the rest of the sentence
as the predicate

consisting of a verb “eat,”

and a direct object “the muffins.”

It then needs to recognize
English morphology,

or how the language can be broken down
into its smallest meaningful units,

such as the word “muffin”

and the suffix “s,”
used to indicate plural.

Finally, it needs to understand
the semantics,

what the different parts of the sentence
actually mean.

To translate this sentence properly,

the program would refer to a different set
of vocabulary and rules

for each element of the target language.
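The three steps just described (syntax, morphology, then lookup in target-language tables) can be sketched in a few lines of Python. This is only a toy illustration, not how any real translation system is built: the tiny lexicon, the Italian form tables, and the function names `analyze_noun`, `parse`, and `translate` are all assumptions made up for this example, and the parser only accepts sentences shaped like "the NOUN VERB the NOUN".

```python
# Toy lexical database: English lemma -> part of speech (illustrative only).
LEXICON = {"child": "noun", "muffin": "noun", "eat": "verb"}
IRREGULAR_PLURALS = {"children": "child"}

# Toy target-language (Italian) forms, keyed by (lemma, number).
ITALIAN_NOUNS = {
    ("child", "singular"): "il bambino",
    ("child", "plural"): "i bambini",
    ("muffin", "singular"): "il muffin",
    ("muffin", "plural"): "i muffin",
}
ITALIAN_VERBS = {("eat", "singular"): "mangia", ("eat", "plural"): "mangiano"}


def analyze_noun(word):
    """Morphology: split a noun into (lemma, number)."""
    word = word.lower()
    if word in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[word], "plural"
    # Regular plural: strip the suffix "s" if what remains is a known lemma.
    if word.endswith("s") and word[:-1] in LEXICON:
        return word[:-1], "plural"
    return word, "singular"


def parse(sentence):
    """Syntax: assume the fixed pattern 'the NOUN VERB the NOUN'."""
    _, subject, verb, _, obj = sentence.rstrip(".").split()
    return {
        "subject": analyze_noun(subject),
        "verb": verb.lower(),
        "object": analyze_noun(obj),
    }


def translate(sentence):
    """Transfer: look up target forms, keep subject-verb-object order."""
    tree = parse(sentence)
    subject_number = tree["subject"][1]  # the verb agrees with the subject
    words = [
        ITALIAN_NOUNS[tree["subject"]],
        ITALIAN_VERBS[(tree["verb"], subject_number)],
        ITALIAN_NOUNS[tree["object"]],
    ]
    text = " ".join(words)
    return text[0].upper() + text[1:] + "."


print(translate("The children eat the muffins."))
# prints: I bambini mangiano i muffin.
```

Anything outside the toy tables fails with a lookup error (the inflected verb "eats", for instance), which hints at how much vocabulary and how many rules a real rule-based system needs.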

But this is where it gets tricky.

The syntax of some languages
allows words to be arranged in any order,

while in others, doing so could make
the muffin eat the child.
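A small Python sketch can make the word-order problem concrete. It contrasts a language where position decides who does what with one where a marker on each word does. The `-NOM` and `-ACC` tags are hypothetical stand-ins for real case endings, and both function names are invented for this illustration.

```python
def roles_by_position(words):
    """Fixed word order (as in English): position decides who does what."""
    subject, verb, obj = words
    return {"subject": subject, "verb": verb, "object": obj}


def roles_by_case(words):
    """Free word order: hypothetical '-NOM'/'-ACC' tags decide the roles."""
    roles = {}
    for word in words:
        if word.endswith("-NOM"):
            roles["subject"] = word[:-4]  # drop the 4-character tag
        elif word.endswith("-ACC"):
            roles["object"] = word[:-4]
        else:
            roles["verb"] = word
    return roles


# Reordering an English-style sentence swaps eater and eaten.
print(roles_by_position(["children", "eat", "muffins"])["subject"])
print(roles_by_position(["muffins", "eat", "children"])["subject"])

# With case tags, the same reordering leaves the meaning untouched.
print(roles_by_case(["children-NOM", "eat", "muffins-ACC"]))
print(roles_by_case(["muffins-ACC", "eat", "children-NOM"]))
```

A translator that copies words across one by one, without tracking which system each language uses, is the one that ends up with the muffin doing the eating.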

Morphology can also pose a problem.

Slovene distinguishes between
two children and three or more

using a dual suffix absent
in many other languages,

while Russian’s lack of definite articles
might leave you wondering

whether the children are eating
some particular muffins,

or just eat muffins in general.

Finally, even when the semantics
are technically correct,

the program might miss their finer points,

such as whether the children
“mangiano” the muffins,

or “divorano” them.

Another method is
statistical machine translation,

which analyzes a database
of books, articles, and documents

that have already
been translated by humans.

By finding matches between source
and translated text

that are unlikely to occur by chance,

the program can identify corresponding
phrases and patterns,

and use them for future translations.
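The idea of finding matches that are unlikely to occur by chance can be sketched as follows. This is a deliberately simplified stand-in, not the method of any particular system: it scores word pairs with the Dice coefficient over a four-sentence made-up English-Italian corpus, and the names `PARALLEL_CORPUS`, `dice_scores`, and `best_match` are assumptions for this example. Real statistical systems use far larger corpora and more elaborate alignment models.

```python
from collections import Counter
from itertools import product

# Made-up parallel corpus: (English sentence, human Italian translation).
PARALLEL_CORPUS = [
    ("the children eat", "i bambini mangiano"),
    ("the children sleep", "i bambini dormono"),
    ("the dogs eat", "i cani mangiano"),
    ("the dogs sleep", "i cani dormono"),
]


def dice_scores(corpus):
    """Score each (source word, target word) pair by how often they co-occur."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in corpus:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        # Count every source/target word pair seen in the same sentence pair.
        pair_count.update(product(src_words, tgt_words))
    # Dice coefficient: 1.0 means the two words always appear together.
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in pair_count.items()
    }


def best_match(word, scores):
    """Return the target word most strongly associated with `word`, if any."""
    candidates = {t: v for (s, t), v in scores.items() if s == word}
    if not candidates:
        return None  # the word never appeared in the corpus
    return max(candidates, key=candidates.get)


scores = dice_scores(PARALLEL_CORPUS)
for word in ["the", "children", "dogs", "eat", "sleep"]:
    print(word, "->", best_match(word, scores))
```

On this toy corpus, "children" pairs with "bambini" because the two always appear together, while "children" and "mangiano" share only one sentence pair out of two. A word the corpus never contained gets no match at all, so the result is only as good as the data behind it.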

However, the quality
of this type of translation

depends on the size
of the initial database

and the availability of samples
for certain languages

or styles of writing.

The difficulty that computers have
with the exceptions, irregularities

and shades of meaning
that seem to come instinctively to humans

has led some researchers to believe
that our understanding of language

is a unique product
of our biological brain structure.

In fact, one of the most famous
fictional universal translators,

the Babel fish from
“The Hitchhiker’s Guide to the Galaxy”,

is not a machine at all
but a small creature

that translates the brain waves
and nerve signals of sentient species

through a form of telepathy.

For now, learning a language
the old-fashioned way

will still give you better results than
any currently available computer program.

But this is no easy task,

and the sheer number
of languages in the world,

as well as the increasing interaction
between the people who speak them,

will only continue to spur greater
advances in automatic translation.

Perhaps by the time we encounter
intergalactic life forms,

we’ll be able to communicate with them
through a tiny gizmo,

or we might have to start compiling
that dictionary, after all.
