We scrape Chinese reviews of mobile phones from an e-commerce platform and segment the review text into words. To separate positive reviews from negative ones, we model the data with a support vector machine (SVM) and with word2vec, and analyze the models' output.

In every industry, what users say about a product matters enormously. User reviews can be used to determine sentiment polarity. Take online shopping, now the most common scenario: for buyers, reviews support better purchase decisions; for sellers, classifying reviews by sentiment and clustering the text to surface frequently mentioned strengths and weaknesses points the way to product improvements. This case study focuses on determining the sentiment polarity of product reviews. The figure below shows reviews of one phone model on an e-commerce platform:

Because the dataset contains Chinese reviews, we need to set the default encoding to utf-8 (the reload(sys) idiom below is specific to Python 2):

In [1]:
import sys
# Set the default encoding to utf-8 while keeping stdin, stdout and stderr intact.
stdi, stdo, stde = sys.stdin, sys.stdout, sys.stderr
reload(sys)
sys.setdefaultencoding('utf-8')
sys.stdin, sys.stdout, sys.stderr = stdi, stdo, stde

Versions of the packages used in this case study:

In [2]:
import jieba, numpy, pandas, sklearn, gensim, wordcloud, matplotlib, logging

print 'jieba %s' % jieba.__version__
print 'gensim %s' % gensim.__version__
print 'numpy %s' % numpy.__version__
print 'pandas %s' % pandas.__version__
print 'sklearn %s' % sklearn.__version__
print 'wordcloud %s' % wordcloud.__version__
print 'matplotlib %s' % matplotlib.__version__
print 'logging %s' % logging.__version__
jieba 0.38
gensim 0.12.4
numpy 1.14.2
pandas 0.20.1
sklearn 0.19.1
wordcloud 1.3.1
matplotlib 2.1.0
logging 0.5.1.2

1 Data Source

This dataset of reviews for one phone model contains 2 attributes and 8,186 samples (see data.shape below).

Column  | Description                                                                | Type   | Example
Comment | Review text about the phone                                                | String | 客服特别不负责,明明备注了也不看,发错了东西。
Class   | Sentiment polarity of the review: -1 = negative, 0 = neutral, 1 = positive | Int    | -1

Use Pandas' read_excel function to read the xls dataset file; note that the file's encoding must be set to gb18030.

In [3]:
import pandas as pd

# Read in the dataset
data = pd.read_excel("./input/data.xls", encoding='gb18030')
data.head()
Out[3]:
Comment Class
0 快就是手感满意也好喜欢也流畅很服务态度实用超快挺快用着速度礼品也不错非常好挺好感觉才来还行好... 1
1 差评,说好的返现返现都是骗人。东西很差 很垃圾 -1
2 售后真是差 买了不到15天锁屏键出现故障,申请换货过了审核说上门取件了 等了几天没人来 ... -1
3 郁闷啊多等2天多无线充 充电宝和贴膜 和京东沟通没补发 心里那个郁闷啊不摆了 失败 ... -1
4 今天去贴膜时才看到在卡槽右面有一处很小的刻痕,很是郁闷 -1

Inspect the dataset: its dimensions, its column names, and the number of samples in each class.

In [4]:
# Size of the dataset
data.shape
Out[4]:
(8186, 2)
In [5]:
# Column names of the dataset
data.columns.values
Out[5]:
array([u'Comment', u'Class'], dtype=object)
In [6]:
# Number of records per class
data['Class'].value_counts()
Out[6]:
 1    3042
-1    2657
 0    2487
Name: Class, dtype: int64

2 Data Preprocessing

We now need to turn the text in the Comment column into a numeric matrix, i.e. map the text into a feature space. The first step is Chinese word segmentation with jieba, which uses an HMM to recognize out-of-vocabulary words.
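
As a quick, hedged illustration (the sentence below is made up, not taken from the dataset), jieba.cut returns a generator of tokens, and HMM=True (the default) enables HMM-based discovery of unseen words:

import jieba
# Segment one hypothetical review into space-separated tokens.
print " ".join(jieba.cut(u"手机很好用,物流也快", HMM=True))
# e.g.: 手机 很 好用 , 物流 也 快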

In [7]:
# Import the Chinese word segmentation library jieba
import jieba
import numpy as np
In [8]:
# Segment the text of every sample; when a missing value is encountered, fill it with "还行 一般吧" ("okay, about average")

cutted = []
for row in data.values:
    try:
        raw_words = (" ".join(jieba.cut(row[0])))
        cutted.append(raw_words)
    except AttributeError:
        print row[0]
        cutted.append(u"还行 一般吧")

cutted_array = np.array(cutted)

# Build a new data frame whose Comment field holds the segmented text
data_cutted = pd.DataFrame({
    'Comment': cutted_array,
    'Class': data['Class']
})
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.288 seconds.
Prefix dict has been built succesfully.
nan
In [9]:
data_cutted.head()
Out[9]:
Class Comment
0 1 快 就是 手感 满意 也好 喜欢 也 流畅 很 服务态度 实用 超快 挺快 用 着 速度 礼...
1 -1 差评 , 说好 的 返现 返现 都 是 骗人 。 东西 很差 很 垃圾
2 -1 售后 真是 差 买 了 不到 15 天锁 屏键 出现 故障 , 申请 换货 过 ...
3 -1 郁闷 啊 多 等 2 天多 无线 充 充电 宝 和 贴膜 和 京东 沟通...
4 -1 今天 去 贴膜 时才 看到 在 卡槽 右面 有 一处 很小 的 刻痕 , 很 是 郁闷

To get a more intuitive view of the high-frequency words, we visualize the text with the third-party library wordcloud.

In [10]:
# Import the third-party library wordcloud

from wordcloud import WordCloud
import matplotlib.pyplot as plt

Build a WordCloud object for the positive, neutral and negative review texts respectively, and draw the word clouds.

In [11]:
# Positive reviews
wc = WordCloud(font_path='./input/KaiTi_GB2312.ttf')
wc.generate(''.join(data_cutted['Comment'][data_cutted['Class'] == 1]))
fig = plt.figure(figsize = (10, 10))
plt.axis('off')
plt.imshow(wc)
plt.show()
In [32]:
# Neutral reviews

wc = WordCloud(font_path='./input/KaiTi_GB2312.ttf')
wc.generate(''.join(data_cutted['Comment'][data_cutted['Class'] == 0]))
fig = plt.figure(figsize = (10, 10))
plt.axis('off')
plt.imshow(wc)
plt.show()
In [12]:
# Negative reviews

wc = WordCloud(font_path='./input/KaiTi_GB2312.ttf')
wc.generate(''.join(data_cutted['Comment'][data_cutted['Class'] == -1]))
fig = plt.figure(figsize = (10, 10))
plt.axis('off')
plt.imshow(wc)
plt.show()

Judging from the word frequencies shown in the clouds, words such as "手机" (phone), "就是" (just/it is), "屏幕" (screen) and "收到" (received) contribute nothing to class separation and may even introduce bias. We therefore collect such uninformative words into the stopword file stopwords.txt.

In [13]:
# Read in the stopword file
import codecs

with codecs.open('./input/stopwords.txt', 'r', encoding='utf-8') as f:
    stopwords = [item.strip() for item in f]
    
for item in stopwords[0:200]:
    print item,
, ? 、 。 “ ” 《 》 ! , : ; ? a b c d e f g h i j k l m n o p q r s t u v w x y z Q W E R T Y U I O P A S D F G H J K L Z X C V B N M 手机 京东 屏幕 客服 系统 苹果 三星 自己 联系 人民 末##末 啊 阿 哎 哎呀 哎哟 唉 俺 俺们 按 按照 吧 吧哒 把 罢了 被 本 本着 比 比方 比如 鄙人 彼 彼此 边 别 别的 别说 并 并且 不比 不成 不单 不但 不独 不管 不光 不过 不仅 不拘 不论 不怕 不然 不如 不特 不惟 不问 不只 朝 朝着 趁 趁着 乘 冲 除 除此之外 除非 除了 此 此间 此外 从 从而 打 待 但 但是 当 当着 到 得 的 的话 等 等等 地 第 叮咚 对 对于 多 多少 而 而况 而且 而是 而外 而言 而已 尔后 反过来 反过来说 反之 非但 非徒 否则 嘎 嘎登 该 赶 个 各 各个 各位 各种 各自 给 根据 跟 故 故此 固然 关于 管 归 果然 果真 过 哈 哈哈 呵 和 何 何处 何况

Use jieba's extract_tags function to compute the top 20 keywords of the positive, neutral and negative review texts.

In [14]:
# Register the stopword file so that keyword extraction filters out stopwords
import jieba.analyse

jieba.analyse.set_stop_words('./input/stopwords.txt') 
In [15]:
# Positive-review keywords
keywords_pos = jieba.analyse.extract_tags(''.join(data_cutted['Comment'][data_cutted['Class'] == 1]), topK=20)
for item in keywords_pos:
    print item,
不错 正品 赠品 五分 发货 东西 满意 机子 喜欢 收到 很漂亮 充电 好评 很快 卖家 速度 评价 流畅 快递 物流
In [16]:
# Neutral-review keywords
keywords_med = jieba.analyse.extract_tags(''.join(data_cutted['Comment'][data_cutted['Class'] == 0]), topK=20)
for item in keywords_med:
    print item,
充电 不错 发热 外观 感觉 电池 机子 问题 赠品 有点 无线 换货 软件 发烫 快递 退货 内存 知道 售后 死机
In [17]:
# Negative-review keywords
keywords_neg = jieba.analyse.extract_tags(''.join(data_cutted['Comment'][data_cutted['Class'] == -1]), topK=20)

for item in keywords_neg:
    print item,
差评 售后 垃圾 赠品 退货 问题 换货 充电 降价 发票 充电器 东西 发热 机子 无线 死机 收到 质量 15 失望

With these steps, the preprocessing of the dataset comes to an end. In Chinese text and sentiment analysis, preprocessing consists mainly of word segmentation: only segmented text can be vectorized in the next step and meet the models' input requirements.

3 SVM-Based Sentiment Classification Model

The segmented text must be vectorized before it can be fed into a classification model.

We use sklearn for vectorization, removing stopwords and mapping the text into feature space via tf-idf:

$\text{tf-idf} = (1 + \log \text{tf})\cdot \log \dfrac{\text{N}}{\text{df}}$

where $\text{tf}$ is the term frequency, i.e. the number of times the term occurs in the review; $\text{df}$ is the number of reviews containing the term; and $\text{N}$ is the total number of reviews. Logarithms are used to damp the influence of large $\text{tf}$ and $\text{df}$ values.
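
A toy check of the formula with made-up numbers (note that sklearn's TfidfVectorizer defaults differ slightly: it uses a smoothed idf and raw term counts unless sublinear_tf=True):

import math
# Hypothetical term: occurs tf = 3 times in one review and appears in
# df = 100 of the N = 8186 reviews.
tf, df, N = 3, 100, 8186
print (1 + math.log(tf)) * math.log(float(N) / df)   # ≈ 9.24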

Vectorization method | 0/1 model | TF model | TF-IDF model
Numeric code         | 0         | 1        | 2

We implement the SVM algorithm directly with functions from sklearn, choosing the following SVM variants:

Classification model | SVC | LinearSVC | SGDClassifier
Numeric code         | 1   | 2         | 3

For convenience, we define a text sentiment analysis class, CommentClassifier, that implements the modeling process:

  • __init__ is the class initializer; its parameters classifier_type and vector_type select the type of classification model and the type of vectorization method, respectively.

  • fit() carries out the vectorization and builds the model.

In [18]:
# Vectorization methods
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# The three SVM models (SVC, LinearSVC, SGDClassifier)
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier


# Train/test split and cross-validation
from sklearn.cross_validation import train_test_split
from sklearn.cross_validation import cross_val_score

# Evaluation metrics
from sklearn import metrics
/explorer/pyenv/jupyter/lib/python2.7/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
In [19]:
# Text sentiment classification class: CommentClassifier
class CommentClassifier:
    def __init__(self, classifier_type, vector_type):
        self.classifier_type = classifier_type # classifier: 1 - SVC, 2 - LinearSVC, 3 - SGDClassifier
        self.vector_type = vector_type         # vectorization: 0 - 0/1 model, 1 - TF model, 2 - TF-IDF model

    def fit(self, train_x, train_y, max_df):
        list_text = list(train_x)
        
        # Vectorization method: 0 - 0/1, 1 - TF, 2 - TF-IDF
        # max_df must be passed as a keyword argument; positionally it would land
        # on the vectorizers' first parameter (input) and silently be ignored.
        if self.vector_type == 0:
            # Note: CountVectorizer produces raw counts; add binary=True for a strict 0/1 model.
            self.vectorizer = CountVectorizer(max_df=max_df, stop_words=stopwords, ngram_range=(1, 3)).fit(list_text)
        elif self.vector_type == 1:
            self.vectorizer = TfidfVectorizer(max_df=max_df, stop_words=stopwords, ngram_range=(1, 3), use_idf=False).fit(list_text)
        else:
            self.vectorizer = TfidfVectorizer(max_df=max_df, stop_words=stopwords, ngram_range=(1, 3)).fit(list_text)

        self.array_trainx = self.vectorizer.transform(list_text)
        self.array_trainy = train_y

        # Model choice: 1 - SVC, 2 - LinearSVC, 3 - SGDClassifier (three SVM variants)
        if self.classifier_type == 1:
            self.model = SVC(kernel='linear', gamma=10 ** -5, C=1).fit(self.array_trainx, self.array_trainy)
        elif self.classifier_type == 2:
            self.model = LinearSVC().fit(self.array_trainx, self.array_trainy)
        else:
            self.model = SGDClassifier().fit(self.array_trainx, self.array_trainy)
        
    def predict_value(self, test_x):
        list_text = list(test_x)
        self.array_testx = self.vectorizer.transform(list_text)
        array_predict = self.model.predict(self.array_testx)
        return array_predict

    def predict_proba(self, test_x):
        list_text = list(test_x)
        self.array_testx = self.vectorizer.transform(list_text)
        array_score = self.model.predict_proba(self.array_testx)
        return array_score 

  • Use the train_test_split() function to split the data into a training set (80%) and a test set (20%).

  • Build value lists for the two parameters classifier_type and vector_type to enumerate the chosen vectorization methods and classification models.

  • For every combination of vectorization method and classification model, print the evaluation results: the confusion matrix and a report containing the Precision, Recall and F1-score metrics.

In [20]:
# Split into training and test sets
train_x, test_x, train_y, test_y = train_test_split(data_cutted['Comment'].ravel().astype('U'), data_cutted['Class'].ravel(),
                                                        test_size=0.2, random_state=4)

classifier_list = [1,2,3]
vector_list = [0,1,2]

for classifier_type in classifier_list:
    for vector_type in vector_list:
        commentCls = CommentClassifier(classifier_type, vector_type)
        # max_df is set to 0.98
        commentCls.fit(train_x, train_y, 0.98)
        if classifier_type == 0:
            value_result = commentCls.predict_value(test_x)
            proba_result = commentCls.predict_proba(test_x)
            print classifier_type,vector_type
            print 'classification report'
            print metrics.classification_report(test_y, value_result, labels=[-1, 0, 1])
            print 'confusion matrix'
            print metrics.confusion_matrix(test_y, value_result, labels=[-1, 0, 1])
        else:
            value_result = commentCls.predict_value(test_x)
            print classifier_type,vector_type
            print 'classification report'
            print metrics.classification_report(test_y, value_result, labels=[-1, 0, 1])
            print 'confusion matrix'
            print metrics.confusion_matrix(test_y, value_result, labels=[-1, 0, 1])
1 0
classification report
             precision    recall  f1-score   support

         -1       0.67      0.63      0.65       519
          0       0.58      0.49      0.53       485
          1       0.74      0.87      0.80       634

avg / total       0.67      0.68      0.68      1638

confusion matrix
[[329 115  75]
 [132 239 114]
 [ 28  55 551]]
1 1
classification report
             precision    recall  f1-score   support

         -1       0.71      0.74      0.72       519
          0       0.57      0.54      0.55       485
          1       0.83      0.84      0.83       634

avg / total       0.71      0.72      0.72      1638

confusion matrix
[[383 104  32]
 [146 260  79]
 [ 12  89 533]]
1 2
classification report
             precision    recall  f1-score   support

         -1       0.69      0.74      0.72       519
          0       0.58      0.53      0.55       485
          1       0.84      0.84      0.84       634

avg / total       0.71      0.72      0.72      1638

confusion matrix
[[386 104  29]
 [154 255  76]
 [ 16  83 535]]
2 0
classification report
             precision    recall  f1-score   support

         -1       0.66      0.63      0.64       519
          0       0.58      0.47      0.52       485
          1       0.75      0.89      0.81       634

avg / total       0.67      0.68      0.67      1638

confusion matrix
[[325 120  74]
 [143 230 112]
 [ 25  47 562]]
2 1
classification report
             precision    recall  f1-score   support

         -1       0.69      0.75      0.71       519
          0       0.62      0.48      0.54       485
          1       0.82      0.90      0.86       634

avg / total       0.72      0.73      0.72      1638

confusion matrix
[[387  95  37]
 [163 232  90]
 [ 14  48 572]]
2 2
classification report
             precision    recall  f1-score   support

         -1       0.68      0.75      0.71       519
          0       0.64      0.49      0.55       485
          1       0.83      0.91      0.87       634

avg / total       0.73      0.73      0.73      1638

confusion matrix
[[389  92  38]
 [166 237  82]
 [ 15  43 576]]
/explorer/pyenv/jupyter/lib/python2.7/site-packages/sklearn/linear_model/stochastic_gradient.py:128: FutureWarning: max_iter and tol parameters have been added in <class 'sklearn.linear_model.stochastic_gradient.SGDClassifier'> in 0.19. If both are left unset, they default to max_iter=5 and tol=None. If tol is not None, max_iter defaults to max_iter=1000. From 0.21, default max_iter will be 1000, and default tol will be 1e-3.
  "and default tol will be 1e-3." % type(self), FutureWarning)
3 0
classification report
             precision    recall  f1-score   support

         -1       0.69      0.72      0.70       519
          0       0.62      0.48      0.54       485
          1       0.80      0.90      0.85       634

avg / total       0.71      0.72      0.71      1638

confusion matrix
[[374  95  50]
 [154 234  97]
 [ 14  47 573]]
3 1
classification report
             precision    recall  f1-score   support

         -1       0.70      0.73      0.71       519
          0       0.58      0.52      0.55       485
          1       0.82      0.85      0.83       634

avg / total       0.71      0.72      0.71      1638

confusion matrix
[[378  99  42]
 [153 254  78]
 [ 12  82 540]]
3 2
classification report
             precision    recall  f1-score   support

         -1       0.69      0.76      0.72       519
          0       0.59      0.50      0.54       485
          1       0.83      0.86      0.84       634

avg / total       0.71      0.72      0.72      1638

confusion matrix
[[392  93  34]
 [164 244  77]
 [ 14  76 544]]

Judging from the results, TF-IDF vectorization combined with the LinearSVC model works best, with an average f1-score of 0.73.
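
cross_val_score was imported above but never used; as a hedged sketch (not one of the original cells, and the exact scores depend on the run), the best combination could be cross-validated like this:

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.cross_validation import cross_val_score

# 5-fold cross-validated macro F1 of the TF-IDF + LinearSVC combination,
# reusing the segmented comments and stopword list from the cells above.
pipe = make_pipeline(TfidfVectorizer(max_df=0.98, stop_words=stopwords, ngram_range=(1, 3)),
                     LinearSVC())
scores = cross_val_score(pipe, data_cutted['Comment'].ravel().astype('U'),
                         data_cutted['Class'].ravel(), cv=5, scoring='f1_macro')
print scores.mean()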

Looking at the confusion matrices, most misclassifications occur between the neutral and negative classes. We can drop the neutral reviews from the original dataset.

In [21]:
data_bi = data_cutted[data_cutted['Class'] != 0]
data_bi['Class'].value_counts()
Out[21]:
 1    3042
-1    2657
Name: Class, dtype: int64

Run the classification models again and inspect the results.

In [22]:
train_x, test_x, train_y, test_y = train_test_split(data_bi['Comment'].ravel().astype('U'), data_bi['Class'].ravel(),
                                                        test_size=0.2, random_state=4)

classifier_list = [1,2,3]
vector_list = [0,1,2]
for classifier_type in classifier_list:
    for vector_type in vector_list:
        commentCls = CommentClassifier(classifier_type, vector_type)
        commentCls.fit(train_x, train_y,0.98)
        if classifier_type == 0:
            value_result = commentCls.predict_value(test_x)
            proba_result = commentCls.predict_proba(test_x)
            print classifier_type,vector_type
            print 'classification report'
            print metrics.classification_report(test_y, value_result, labels=[-1, 1])
            print 'confusion matrix'
            print metrics.confusion_matrix(test_y, value_result, labels=[-1, 1])
        else:
            value_result = commentCls.predict_value(test_x)
            print classifier_type,vector_type
            print 'classification report'
            print metrics.classification_report(test_y, value_result, labels=[-1, 1])
            print 'confusion matrix'
            print metrics.confusion_matrix(test_y, value_result, labels=[-1, 1])
1 0
classification report
             precision    recall  f1-score   support

         -1       0.88      0.79      0.83       550
          1       0.82      0.90      0.86       590

avg / total       0.85      0.85      0.85      1140

confusion matrix
[[436 114]
 [ 59 531]]
1 1
classification report
             precision    recall  f1-score   support

         -1       0.87      0.91      0.89       550
          1       0.91      0.88      0.89       590

avg / total       0.89      0.89      0.89      1140

confusion matrix
[[500  50]
 [ 73 517]]
1 2
classification report
             precision    recall  f1-score   support

         -1       0.88      0.92      0.90       550
          1       0.92      0.88      0.90       590

avg / total       0.90      0.90      0.90      1140

confusion matrix
[[505  45]
 [ 70 520]]
2 0
classification report
             precision    recall  f1-score   support

         -1       0.88      0.81      0.84       550
          1       0.83      0.90      0.87       590

avg / total       0.86      0.86      0.85      1140

confusion matrix
[[444 106]
 [ 59 531]]
2 1
classification report
             precision    recall  f1-score   support

         -1       0.91      0.89      0.90       550
          1       0.90      0.92      0.91       590

avg / total       0.91      0.91      0.91      1140

confusion matrix
[[488  62]
 [ 46 544]]
2 2
classification report
             precision    recall  f1-score   support

         -1       0.91      0.89      0.90       550
          1       0.90      0.92      0.91       590

avg / total       0.91      0.91      0.91      1140

confusion matrix
[[491  59]
 [ 46 544]]
3 0
classification report
             precision    recall  f1-score   support

         -1       0.93      0.83      0.88       550
          1       0.86      0.94      0.90       590

avg / total       0.89      0.89      0.89      1140

confusion matrix
[[459  91]
 [ 34 556]]
3 1
classification report
             precision    recall  f1-score   support

         -1       0.86      0.89      0.87       550
          1       0.89      0.87      0.88       590

avg / total       0.88      0.88      0.88      1140

confusion matrix
[[489  61]
 [ 79 511]]
3 2
classification report
             precision    recall  f1-score   support

         -1       0.87      0.90      0.89       550
          1       0.91      0.88      0.89       590

avg / total       0.89      0.89      0.89      1140

confusion matrix
[[496  54]
 [ 73 517]]

After removing the neutral reviews, every combination of vectorization method and classifier improves markedly. This also shows that the classification models can pick out positive reviews effectively.

The dataset suffers from inaccurate labels, concentrated in the neutral class. People generally leave a positive rating unless something went wrong, so a neutral rating already signals dissatisfaction and the wording tends toward negative sentiment. Reviews are also highly subjective: many reviews labeled neutral in the dataset would reasonably be called negative. Splitting reviews into positive/neutral/negative is therefore not entirely objective; the boundary between neutral and negative is blurred, and the recognition rate is hard to push higher.

4 Unsupervised Classification Model Based on doc2vec (from word2vec)

word2vec, an open-source text vectorization tool, can learn deeper feature representations of text. The resulting word vectors support arithmetic:

w2v(woman) - w2v(man) + w2v(king) ≈ w2v(queen)

doc2vec, built on word2vec, represents each document as a single vector, and the cosine similarity of two such vectors measures how alike the documents are. We can therefore compute, for each review, its similarity to one extremely positive review and to one extremely negative review.
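
A minimal sketch of the cosine similarity used below (the two vectors here are hypothetical, not taken from the model):

import numpy as np

def cos_sim(a, b):
    # Cosine similarity: dot product normalized by both vector lengths.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print cos_sim(np.array([1.0, 0.0]), np.array([1.0, 1.0]))   # ≈ 0.707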

In this case study's dataset:

  • Positive review: 快 就是 手感 满意 也好 喜欢 也 流畅 很 服务态度 实用 超快 挺快 用着 速度 礼品 也不错 非常好 挺好 感觉 才来 还行 好看 也快 不错的 送了 非常不错 超级 赞 好多东西 很实用 各方面 挺好的 很多 漂亮 配件 还不错 也多 特意 慢 满分 好用 非常漂亮......

  • Negative review: 不多说 上当 差差 刚用 服务差 一点也不 不要 简直 还是去 实体店 大家 保证 不肯 生气 开发票 磨损 后悔 印记 网 什么破 烂烂 左边 失效 太 骗 掉价 走下坡路 不说了 彻底 三星手机 自营 几次 真心 别的 看完 简单说 机会 这是 生气了 触动 缝隙 冲动了 失望......

We use the third-party library gensim to implement the doc2vec model.

In [23]:
import pandas as pd
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
import logging
In [24]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

train_x = data_bi['Comment'].ravel()
train_y = data_bi['Class'].ravel()

# Tag each review in train_x with a "TRAIN_i" label
def labelizeReviews(reviews, label_type):
    labelized = []
    for i, v in enumerate(reviews):
        label = '%s_%s' % (label_type, i)
        labelized.append(TaggedDocument(v.split(" "), [label]))
    return labelized


train_x = labelizeReviews(train_x, "TRAIN")

# Build the Doc2Vec model
size = 300
all_data = []
all_data.extend(train_x)

model = Doc2Vec(min_count=1, window=8, size=size, sample=1e-4, negative=5, hs=0, iter=5, workers=8)
model.build_vocab(all_data)

# Train for 10 passes
for epoch in range(10):
    model.train(train_x)
    
# Build empty lists pos and neg to store the similarity results: compute each review's
# cosine similarity to the extremely positive review (TRAIN_0) and store it in pos,
# and its cosine similarity to the extremely negative review (TRAIN_1) and store it in neg
pos = []
neg = []

for i in range(0,len(train_x)):
    pos.append(model.docvecs.similarity("TRAIN_0","TRAIN_{}".format(i)))
    neg.append(model.docvecs.similarity("TRAIN_1","TRAIN_{}".format(i)))
    
# Write the pos and neg lists back into the data as the PosSim and NegSim fields
data_bi[u'PosSim'] = pos
data_bi[u'NegSim'] = neg
2018-08-14 21:52:07,934 : INFO : collecting all words and their counts
2018-08-14 21:52:07,936 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2018-08-14 21:52:08,006 : INFO : collected 11238 word types and 5699 unique tags from a corpus of 5699 examples and 193439 words
2018-08-14 21:52:08,037 : INFO : min_count=1 retains 11238 unique words (drops 0)
2018-08-14 21:52:08,039 : INFO : min_count leaves 193439 word corpus (100% of original 193439)
2018-08-14 21:52:08,093 : INFO : deleting the raw counts dictionary of 11238 items
2018-08-14 21:52:08,094 : INFO : sample=0.0001 downsamples 466 most-common words
2018-08-14 21:52:08,095 : INFO : downsampling leaves estimated 82494 word corpus (42.6% of prior 193439)
2018-08-14 21:52:08,096 : INFO : estimated required memory for 11238 words and 300 dimensions: 40568800 bytes
2018-08-14 21:52:08,127 : INFO : resetting layer weights
2018-08-14 21:52:08,446 : INFO : training model with 8 workers on 11238 vocabulary and 300 features, using sg=0 hs=0 sample=0.0001 negative=5
2018-08-14 21:52:08,447 : INFO : expecting 5699 sentences, matching count from corpus used for vocabulary survey
2018-08-14 21:52:09,469 : INFO : PROGRESS: at 22.67% examples, 103678 words/s, in_qsize 15, out_qsize 0
2018-08-14 21:52:10,476 : INFO : PROGRESS: at 48.61% examples, 112940 words/s, in_qsize 15, out_qsize 0
2018-08-14 21:52:11,502 : INFO : PROGRESS: at 71.55% examples, 107711 words/s, in_qsize 15, out_qsize 0
2018-08-14 21:52:12,114 : INFO : worker thread finished; awaiting finish of 7 more threads
2018-08-14 21:52:12,273 : INFO : worker thread finished; awaiting finish of 6 more threads
2018-08-14 21:52:12,279 : INFO : worker thread finished; awaiting finish of 5 more threads
2018-08-14 21:52:12,296 : INFO : worker thread finished; awaiting finish of 4 more threads
2018-08-14 21:52:12,306 : INFO : worker thread finished; awaiting finish of 3 more threads
2018-08-14 21:52:12,310 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-08-14 21:52:12,318 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-08-14 21:52:12,323 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-08-14 21:52:12,324 : INFO : training on 967195 raw words (441274 effective words) took 3.9s, 113982 effective words/s
[... logs for the remaining nine training passes omitted; each repeats the same PROGRESS and worker-finish lines as the first pass ...]
/explorer/pyenv/jupyter/lib/python2.7/site-packages/ipykernel_launcher.py:39: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
/explorer/pyenv/jupyter/lib/python2.7/site-packages/ipykernel_launcher.py:40: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
In [25]:
data_bi.head()
Out[25]:
Class Comment PosSim NegSim
0 1 快 就是 手感 满意 也好 喜欢 也 流畅 很 服务态度 实用 超快 挺快 用 着 速度 礼... 1.000000 0.603461
1 -1 差评 , 说好 的 返现 返现 都 是 骗人 。 东西 很差 很 垃圾 0.603461 1.000000
2 -1 售后 真是 差 买 了 不到 15 天锁 屏键 出现 故障 , 申请 换货 过 ... 0.604366 0.484621
3 -1 郁闷 啊 多 等 2 天多 无线 充 充电 宝 和 贴膜 和 京东 沟通... 0.648596 0.684616
4 -1 今天 去 贴膜 时才 看到 在 卡槽 右面 有 一处 很小 的 刻痕 , 很 是 郁闷 0.724642 0.641944
In [26]:
from matplotlib import pyplot as plt

label= data_bi['Class'].ravel()
values = data_bi[['PosSim' , 'NegSim']].values
In [28]:
plt.scatter(values[:,0], values[:,1], c=label, alpha=0.4)
plt.show()

The scatter plot above shows that positive and negative reviews can roughly be separated by a straight line (purple: negative; yellow: positive).
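
To put a number on that separability, one can fit a linear classifier on just these two features. A hedged sketch (not one of the original cells; the score depends on the run):

from sklearn.linear_model import LogisticRegression

# Fit a linear decision boundary on the 2-D (PosSim, NegSim) features;
# `values` and `label` come from the previous cell.
clf = LogisticRegression()
clf.fit(values, label)
print clf.score(values, label)   # training accuracy of the linear split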

This approach departs completely from the traditional pipeline and uses neither word frequencies nor sentiment-lexicon features. Its advantages:

  • It maps the dataset into a space of very low dimension, here only two.
  • It is an unsupervised learning method: the original training data need no labels.
  • It is broadly applicable: in any other domain, one only needs to pick an extremely positive and an extremely negative example, convert them and all texts to be classified into vectors with doc2vec, and compute the distances (a sketch follows below).
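
A hedged sketch of that decision rule on this dataset, assuming the PosSim and NegSim columns computed above:

import numpy as np

# Label each review by whichever anchor it is more similar to: the extremely
# positive review (PosSim) or the extremely negative one (NegSim).
pred = np.where(data_bi['PosSim'].values >= data_bi['NegSim'].values, 1, -1)
print (pred == data_bi['Class'].values).mean()   # agreement with the labels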