Should we get rid of standardized testing? - Arlo Kempf

The first standardized tests
that we know of

were administered in China
over 2,000 years ago

during the Han dynasty.

Chinese officials used them to determine
aptitude for various government posts.

The subject matter included philosophy,

farming,

and even military tactics.

Standardized tests continued to be used
around the world for the next two millennia,

and today, they’re used for everything

from evaluating stair climbs
for firefighters in France

to language examinations
for diplomats in Canada

to assessments of students in schools.

Some standardized tests measure scores

only in relation to the results
of other test takers.

Others measure performance by how well
test takers meet predetermined criteria.

So the stair climb for the firefighter

could be measured by comparing
the time of the climb

to that of all other firefighters.

This might be expressed in what
many call a bell curve.

Or it could be evaluated with reference
to set criteria,

such as carrying a certain amount
of weight a certain distance

up a certain number of stairs.

Similarly, the diplomat might be measured
against other test-taking diplomats,

or against a set of fixed criteria,

which demonstrate different levels
of language proficiency.
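
To make the two scoring approaches concrete, here is a minimal Python sketch of the firefighter example. The climb times, the 80-second cutoff, and the function names are all made up for illustration; they are not taken from any real test.

```python
from statistics import mean, stdev

# Hypothetical stair-climb times in seconds (invented data).
climb_times = [62.0, 70.0, 75.0, 81.0, 90.0]


def norm_referenced_score(time_s, all_times):
    """Score relative to the group, as a z-score.

    A negative value means the climb was faster than the group average.
    """
    return (time_s - mean(all_times)) / stdev(all_times)


def criterion_referenced_pass(time_s, cutoff_s=80.0):
    """Pass or fail against a fixed standard.

    The 80-second cutoff is an assumption chosen for this example.
    """
    return time_s <= cutoff_s


# The same 75-second climb, scored both ways.
print(norm_referenced_score(75.0, climb_times))
print(criterion_referenced_pass(75.0))
```

Under the first function, a firefighter's result shifts whenever the group changes. Under the second, it depends only on the fixed standard.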

And all of these results can be expressed
using something called a percentile.

If a diplomat is in the 70th percentile,
70% of test takers scored below her.

If she scored in the 30th percentile,
70% of test takers scored above her.
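
As a rough illustration, a percentile can be computed as the share of test takers who scored below a given score. The scores below are invented, and `percentile_rank` is a hypothetical helper. Real testing programs may define percentiles slightly differently, for example in how they treat tied scores.

```python
def percentile_rank(score, all_scores):
    """Percentage of test takers who scored strictly below `score`."""
    below = sum(1 for s in all_scores if s < score)
    return 100.0 * below / len(all_scores)


# Hypothetical language-exam scores for ten diplomats (invented data).
scores = [48, 52, 55, 60, 61, 67, 70, 74, 80, 91]

# Seven of the ten scores are below 74, so this prints 70.0.
print(percentile_rank(74, scores))
```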

Although standardized tests
are sometimes controversial,

they’re simply a tool.

As a thought experiment,
think of a standardized test as a ruler.

A ruler’s usefulness
depends on two things.

First, the job we ask it to do.

Our ruler can’t measure
the temperature outside

or how loud someone is singing.

Second, the ruler’s usefulness depends
on its design.

Say you need to measure the circumference
of an orange.

Our ruler measures length,
which is the right quantity,

but it hasn’t been designed with the
flexibility required for the task at hand.

So, if standardized tests are given
the wrong job,

or aren’t designed properly,

they may end up measuring
the wrong things.

In the case of schools,

students with test anxiety may have
trouble performing their best

on a standardized test,

not because they don’t know the answers,

but because they’re feeling too nervous
to share what they’ve learned.

Students with reading challenges

may struggle with the wording
of a math problem,

so their test results may reflect
their literacy

rather than their numeracy skills.

And students who are confused by examples

on tests that contain
unfamiliar cultural references

may do poorly,

telling us more about the test taker’s
cultural familiarity

than their academic learning.

In these cases, the tests may need
to be designed differently.

Standardized tests can also
have a hard time

measuring abstract
characteristics or skills,

such as creativity, critical thinking,
and collaboration.

If we design a test poorly,

or ask it to do the wrong job,

or a job it’s not very good at,

the results may not be reliable or valid.

Reliability and validity
are two critical ideas

for understanding standardized tests.

To understand the difference between them,

we can use the metaphor
of two broken thermometers.

An unreliable thermometer

gives you a different reading
each time you take your temperature,

and the reliable but invalid thermometer
is consistently ten degrees too hot.
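
The two thermometers can be sketched in Python with invented readings. The true temperature of 37.0 and the use of degrees Celsius are assumptions for this example, and `spread` and `bias` are hypothetical helper names. Spread across repeated readings stands in for reliability, and the gap between the average reading and the true value stands in for validity.

```python
from statistics import mean, pstdev

TRUE_TEMP = 37.0  # assumed true body temperature (degrees Celsius)

# Invented readings from five repeated measurements of the same person.
unreliable_readings = [35.1, 38.9, 36.2, 39.4, 35.4]  # jumps around
invalid_readings = [47.0, 47.1, 46.9, 47.0, 47.0]     # steady but ten degrees high


def spread(readings):
    """Standard deviation of repeated readings; small means consistent."""
    return pstdev(readings)


def bias(readings, true_value):
    """Average reading minus the true value; near zero means on target."""
    return mean(readings) - true_value


for name, readings in [("unreliable", unreliable_readings),
                       ("reliable but invalid", invalid_readings)]:
    print(name, round(spread(readings), 2), round(bias(readings, TRUE_TEMP), 2))
```

In this made-up data the unreliable thermometer happens to average out near the true value, but no single reading from it can be trusted. The second thermometer agrees with itself every time and is still wrong.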

Validity also depends on accurate
interpretations of results.

If people say results of a test
mean something they don’t,

that test may have a validity problem.

Just as we wouldn’t expect a ruler
to tell us how much an elephant weighs,

or what it had for breakfast,

we can’t expect standardized tests alone
to reliably tell us how smart someone is,

how diplomats will handle
a tough situation,

or how brave a firefighter
might turn out to be.

So standardized tests may help us learn
a little about a lot of people

in a short time,

but they usually can’t tell us a lot
about a single person.

Many social scientists worry about
test scores resulting in sweeping

and often negative changes
for test takers,

sometimes with long-term
life consequences.

We can’t blame the tests, though.

It’s up to us to use the right tests
for the right jobs,

and to interpret results appropriately.
