Sentiment Analysis: Extracting Emotion Through Machine Learning

[Music]

[Laughter]

[Applause]

reach into your pockets

hopefully your phone’s still there

our phones do a lot for us they check

the weather

they remind us to turn on our alarm just

in case we don’t

wake up the next morning but there’s one

thing

our phones can’t do yet tell us how we

are

hey siri how am i doing today

okay google how are my emotions today

see these seem like ridiculous questions

but with advancements in sentiment

analysis and machine learning

our machines are becoming closer to

answering

these very questions let me give you a

sentence

i love that movie and i ask you to

rate it out of 10

with 0 being negative and 10 being

positive

now we’d all agree that this is a pretty

positive sentence

and give it around a 10 let’s change

the verb a bit

i liked that movie here

still pretty positive but clearly lower

on the scale

now let’s go to the other end of the

spectrum

i hated that movie now whoever said this

clearly feels negatively about the

subject and so we’d probably give this

around a zero now sentiment analysis

is simply using machine learning to

teach computers

to do just this extract

the sentiments out of our sentences

now how does this work what is machine

learning

machine learning is simply like a

function in math

you give it one or many numbers and it

spits out another

in machine learning these functions are

called

models now these models are often neural

networks

that simulate the structures of our

brains

to take in inputs and their

associations

to build models that make predictions about future inputs

now here’s joe who’s joe you might ask

joe’s our friendly neighborhood machine

learning model of course

now we want joe to tell us whether or

not this

image is a tiger see

joe’s pretty sad right now because he

has no clue what to do

here is where we train our model

in order for joe to tell us what a tiger

is and what a tiger isn’t

we as humans first need to tell him

what a tiger looks like and what it

doesn’t

there’s a slight problem here however

joe doesn’t see this image like we do

the one thing joe can do however is

interpret numbers

so what we can do is give these images

to joe

as a list of numbers of rgb vectors

for each pixel now let me break that

down

rgb vectors are red green blue vectors

for each pixel

denoting the color of each pixel in an

image

now what that effectively allows us to

do is convert these images

into numbers that’s great joe can now

understand what we’re trying to give him

he knows what he’s trying to do
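
The pixel-to-number step described here can be sketched in a few lines of Python. This is only an illustration: the 2x2 image and the helper name `flatten_image` are made up for the example, and a real pipeline would read pixels from an image file instead.

```python
# A hypothetical 2x2 image: each pixel is an (R, G, B) triple in 0..255.
image = [
    [(255, 140, 0), (0, 0, 0)],
    [(255, 255, 255), (255, 140, 0)],
]


def flatten_image(pixels):
    """Turn rows of (R, G, B) pixels into one flat list of numbers."""
    return [channel for row in pixels for pixel in row for channel in pixel]


numbers = flatten_image(image)
print(len(numbers))  # 2 rows * 2 pixels * 3 channels = 12
print(numbers[:3])   # the first pixel: [255, 140, 0]
```

A list like `numbers` is the kind of input a model such as joe can work with.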

now what we can do when given an

unfamiliar image

is turn this into rgb vectors

give it to joe and hey what do you know

he thinks it’s a tiger he’s 97 percent sure

actually

so this was the case with pictures with

images

but we’re talking about sentences the

thing is

it’s exactly the same how

do we turn words into numbers

now some of you might be thinking let’s

just slap a number on each word and

call it a day but the thing is if we

train

our models using those vector inputs

we’d run into a problem this method

struggles to recognize the semantic

similarity

between words for example this method

fails to recognize

the high similarity between words such as loved

and liked

as opposed to the low similarity between

loved

and hated here is where we run

into one of the most fundamental

concepts in sentiment analysis

word vectors now what are word vectors

well they’re exactly as they would seem

they’re vectors corresponding to each

word

much like the rgb vectors for each pixel

now unlike the rgb vectors however

these word vectors can span from 25 up

to a thousand components

now conveniently as these vectors are

still simply a list of numbers

they can be plotted in an n-dimensional

space

but for the sake of visualization and

your brains let’s reduce that down

to two

on this coordinate plane what word

vectors allow

us to do is to demonstrate and evaluate

the relationships between words as

distances

between points now somewhere on this

coordinate plane

lion and cat would be near each other

related by their felinity

while somewhere else on the plane honda

and

ford would be clustered together related

by their car manufacturing status
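
The idea of relationships as distances between points can be tried out with a small sketch. The coordinates below are invented for illustration and are not taken from any real word vector set; only the comparison between distances matters.

```python
import math

# Made-up 2-D coordinates, for illustration only.
points = {
    "lion": (1.0, 4.0),
    "cat": (1.5, 3.5),
    "honda": (5.0, 1.0),
    "ford": (5.5, 1.5),
}


def distance(a, b):
    """Straight-line (Euclidean) distance between two words on the plane."""
    return math.dist(points[a], points[b])


# Related words sit closer together than unrelated ones.
print(distance("lion", "cat") < distance("lion", "ford"))    # True
print(distance("honda", "ford") < distance("honda", "cat"))  # True
```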

now this seems to be working great

what’s the problem

well the problem comes in when we add in

more words

what would we do about the word jaguar

now it is a feline so does it

go somewhere near cat and lion but it

is also a car manufacturer

so does it go somewhere in the middle we have no clue

we’ve run into the dilemma which makes

it necessary for word vectors to be

multi-dimensional

by adding more

dimensions to these vectors

we’re able to express the relationships

between

words in the english language with more

nuance
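
One way to picture what an extra dimension buys is the sketch below. It assumes two hand-labeled components, "feline-ness" and "car-ness", which are hypothetical: learned vectors are not labeled this neatly, and the numbers are made up. Cosine similarity is used here as one common way to compare vectors.

```python
import math

# Hypothetical components: (feline-ness, car-ness). Values are invented.
vectors = {
    "cat": (0.9, 0.0),
    "ford": (0.1, 0.9),
    "jaguar": (0.7, 0.7),
}


def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


jaguar_cat = cosine(vectors["jaguar"], vectors["cat"])
jaguar_ford = cosine(vectors["jaguar"], vectors["ford"])
cat_ford = cosine(vectors["cat"], vectors["ford"])

# jaguar can be similar to both groups while cat and ford stay far apart.
print(jaguar_cat > cat_ford and jaguar_ford > cat_ford)  # True
```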

great now we have these word vectors and

we can associate them

to the words in our sentence converting

them to numbers

that joe loves theoretically now

we can feed these numbers to joe and joe

will now be able to predict the

sentiment of any sentence we give him

so naturally i decided to put that to

the test

where would i get my data to train this

model

well after some searching i decided to

go with kaggle’s twitter sentiment data

set

consisting of 1.5 million tweets

manually categorized as either zero for

negative

or one for positive now you might be

thinking hold up 1.5

million tweets that’s a lot of tweets and

that is true it is a lot

of tweets but just as you and i

will be better at identifying something

the more examples we get of it

joe can benefit from as much data as we

can give him as for our word vectors

however

i went with stanford university’s glove

which stands for global

vectors now this word vector set

was pre-created which means that these

researchers had to go through

thousands and thousands of sentences

look at instances of each word

and evaluate their context to create

word vectors

for each and every word
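
Reading a pre-built word vector set into memory could look roughly like this. The sketch assumes the plain-text layout used by the published GloVe files, where each line holds a word followed by its components separated by spaces. The two sample lines and the name `load_vectors` are made up; a real run would open the downloaded file instead of the in-memory sample.

```python
import io

# Two made-up lines in the GloVe text layout: a word, then its components.
sample = io.StringIO("loved 0.21 -0.40 0.77\nhated -0.35 0.52 -0.60\n")


def load_vectors(lines):
    """Build a {word: [components]} table from GloVe-style text lines."""
    table = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        word, *values = line.split()
        table[word] = [float(v) for v in values]
    return table


word_vectors = load_vectors(sample)
print(word_vectors["loved"])  # [0.21, -0.4, 0.77]
```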

there was one more step i had to take

before i could train

joe however let’s look at this tweet

stopped at mcdonald’s for lunch i’m

excited

nuggets now if we fed this

right to our model we’d see a problem

see we as humans can see through the

twitter clutter can

see through the various distractions

in this tweet but for joe joe needs a

bit

of help and that’s why we need to clean

this data set let me show you what i mean

the first things to go were punctuation

along with punctuation went twitter

artifacts such as mentions hashtags and

links

second to go were what are called in natural

language processing

stop words words such as as

if i and that which don’t necessarily add to

the meaning of the sentence
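
The first two cleaning steps can be sketched with regular expressions. Everything specific here is an assumption for illustration: the example tweet is invented, the stop-word list is a tiny stand-in for a real one, and the patterns are a simple approximation of what a mention, hashtag, or link looks like.

```python
import re

# A small, hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"as", "if", "i", "and", "that", "at", "for", "the", "a", "so"}


def clean_tweet(text):
    """Lowercase a tweet and strip links, mentions, hashtags, punctuation, stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # links
    text = re.sub(r"[@#]\w+", " ", text)       # mentions and hashtags
    text = re.sub(r"[^a-z\s]", " ", text)      # anything that is not a letter or whitespace
    return [word for word in text.split() if word not in STOP_WORDS]


tweet = "@sam Stopped at the diner for lunch, so excited!!! #yum http://example.com/x"
print(clean_tweet(tweet))  # ['stopped', 'diner', 'lunch', 'excited']
```

Slang and misspellings are not handled by this sketch; those need the extra care described next in the talk.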

now finally and arguably the most

tricky part of cleaning this data set

was how to deal with internet

slang now it’s impossible to go anywhere

on the internet without encountering

some sort of abbreviation some sort of

slang

the tough part about dealing with this

is that there is no set

way of evaluating these words

now to be fair common words such as

lol or lmao all have their individual

entries in word vector sets such as

glove

now misspellings such as the one we see

in this tweet here

can be caught with a spell check but

some

words and phrases do end up slipping

through our fingers

and that does make or break some

sentiment analysis models

now regardless we’ve caught that

and now we are able to condense that

original tweet

into the four words that you see on the

bottom there

now that we’ve cleaned our tweets we can

associate the word vectors

in glove with each word and now

again we have our numbers to word

association

and can now train our model
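
A sketch of the word-to-number step follows. The talk does not say how the per-word vectors were combined before training, so averaging them into one sentence vector is an assumption made here for illustration. The 3-component vectors are invented stand-ins for real GloVe entries, and `sentence_vector` is a hypothetical helper name.

```python
# Made-up 3-component vectors standing in for real GloVe entries.
word_vectors = {
    "stopped": [0.1, 0.0, -0.2],
    "lunch": [0.3, 0.2, 0.1],
    "excited": [0.8, 0.6, 0.4],
}


def sentence_vector(words, table):
    """Average the vectors of the known words; return None if none are known."""
    known = [table[word] for word in words if word in table]
    if not known:
        return None
    return [sum(column) / len(known) for column in zip(*known)]


# "nuggetz" has no entry, so it is skipped rather than crashing the lookup.
features = sentence_vector(["stopped", "lunch", "excited", "nuggetz"], word_vectors)
print(features)
```

Skipping unknown words is one simple design choice; it is also how misspellings and slang can silently drop out of the input.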

that’s exactly what i did now

how was my model you might be asking how

good was it

well luckily for the safety of the

internet world as we know it

i wasn’t that successful my model

reached around 60 percent

accuracy which meant that it was able to

correctly identify the sentiments of

around 60 percent of the sentences that

i gave it
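
Accuracy in this sense is simply the share of predictions that match the labels. The ten labels and predictions below are invented to show the arithmetic and are not results from the talk.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that equal the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical data: 1 is positive, 0 is negative.
labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
predictions = [1, 0, 0, 1, 1, 0, 1, 1, 0, 0]

print(accuracy(predictions, labels))  # 0.6, six of ten correct
```

For context, with two classes a model that guessed at random on an evenly split data set would score about 50 percent.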

however considering that this is a

problem that has yet to be solved

this number is a sign of hope for things

to come

now throughout this talk you might have

been asking yourself why do we care who

asked andy what’s next

and i’m here to tell you this it’s true

this technology is bringing us ever

closer

to our inevitable robot overlord

world but i still believe that this

technology

is imperative and essential to our

technological

development for the benefits that it can

provide

now currently the applications of

sentiment analysis

are purely commercial we see movie

producers using sentiment analysis

to evaluate audience feedback on their

recent projects

we see corporations using this

technology

to assess how consumers are reacting to

their products

but in the future as this technology

gets better we can see

that this technology can be applied to a

myriad

of problems for example sentiment

analysis could be used

to provide help for people with mental

health issues

many people with these issues find

refuge

in the internet and so with this

technology we’ll be able to provide

help for people that might have been

reluctant to seek it

furthermore this technology could be

used to gauge radicalism on the internet

as the internet has become a hub for

radicalization

we can see that this technology can be

used by governments

to make the internet safer for us all

and hey if that didn’t reach all of you

then maybe our phones can become a

therapist one day

hey siri how am i doing thank you

[Applause]

