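[Python Artificial Intelligence] 22. Sentiment Analysis and Emotion Computation Based on the Dalian University of Technology Sentiment Lexicon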








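Contents
I. The Dalian University of Technology Chinese sentiment lexicon
II. Computing the seven emotions
III. Word-cloud visualization of the seven emotions
1. Basic usage
2. Counting feature words for the seven emotions
3. Word-cloud analysis
IV. Sentiment analysis with a custom dictionary
V. Sentiment analysis with SnowNLP
VI. Summary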




Sentiment analysis and emotion classification are both important text-mining techniques. The basic workflow of sentiment analysis is shown in the figure below and usually includes:



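Well-established Chinese sentiment lexicons currently include the Dalian University of Technology (DUT) Affective Lexicon Ontology, the HowNet sentiment dictionary, and National Taiwan University's Chinese sentiment polarity dictionary. This article uses the DUT ontology as its base lexicon. It divides emotions into 7 major classes, 乐 (happy), 好 (good), 怒 (anger), 哀 (sad), 惧 (fear), 恶 (disgust) and 惊 (surprise), and 21 subclasses. Each word's initial intensity takes one of five levels, 1, 3, 5, 7 or 9, a finer grading than other lexicons offer. Polarity is neutral, positive or negative, coded 0, 1 and 2 respectively. To simplify computation, the negative polarity code 2 is changed to -1 here. The sentiment value of a word is given by: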
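A minimal sketch of this value computation, assuming the sentiment value is simply the intensity multiplied by the remapped polarity (this product rule is an assumption, chosen to agree with the scoring code later in the article). The names `POLARITY_SIGN` and `word_sentiment_value` are hypothetical.

```python
# Polarity codes as stored in the lexicon: 0 neutral, 1 positive, 2 negative.
# The text remaps the negative code 2 to -1 before computing.
POLARITY_SIGN = {0: 0, 1: 1, 2: -1}

def word_sentiment_value(intensity, polarity):
    """Return a signed value: intensity (1, 3, 5, 7, 9) times remapped polarity.

    Assumption: the value is the plain product of the two quantities.
    Other polarity codes are rejected rather than guessed at.
    """
    if intensity not in (1, 3, 5, 7, 9):
        raise ValueError("intensity must be one of 1, 3, 5, 7, 9")
    if polarity not in POLARITY_SIGN:
        raise ValueError("polarity must be 0, 1 or 2")
    return intensity * POLARITY_SIGN[polarity]

print(word_sentiment_value(7, 1))  # positive word with intensity 7
print(word_sentiment_value(5, 2))  # negative word with intensity 5
print(word_sentiment_value(9, 0))  # neutral word
```

Keeping the remapping in one dictionary makes it easy to swap in a different convention later.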


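The Chinese Affective Lexicon Ontology is a Chinese ontology resource compiled and annotated by the members of the Information Retrieval Laboratory at Dalian University of Technology under the guidance of Professor Lin Hongfei. It describes a Chinese word or phrase from several angles, including part of speech, emotion category, emotion intensity and polarity. Its emotion classification scheme builds on Ekman's influential six-category scheme: the ontology adds the category 好 (good) to divide positive emotions more finely, giving 7 major classes and 21 subclasses in total.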

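The resource aims to provide a convenient, reliable aid for Chinese text sentiment analysis and orientation analysis in affective computing. It can be used for multi-class emotion classification as well as for general polarity analysis. As the figure below shows, the lexicon contains 27,466 words, and each entry records the word (词语), part of speech (词性种类), number of senses (词义数), sense index (词义序号), emotion category (情感分类), intensity (强度) and polarity (极性), plus an auxiliary emotion category with its own intensity and polarity.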




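Following the paper 《情感词汇本体的构造》 (Construction of the Affective Lexicon Ontology), emotions are divided into 7 major classes and 21 subclasses. Intensity has five levels, 1, 3, 5, 7 and 9, where 9 is the strongest and 1 the weakest. The emotion classes are listed in the table below: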


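The ontology uses 7 part-of-speech types: noun (noun), verb (verb), adjective (adj), adverb (adv), network word (nw), idiom (idiom) and prepositional phrase (prep). Each word also carries a polarity under each of its emotion categories: 0 is neutral, 1 positive, 2 negative, and 3 both positive and negative. Finally, negation words and degree adverbs are provided: a negation word multiplies the emotion intensity by -1, and degree adverbs mark different levels of emotional strength.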
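The effect of negation words and degree adverbs can be sketched as follows. This is an illustrative simplification, not the article's scoring routine: the weights in `DEGREE_WEIGHTS`, the short `NEGATIONS` set and the toy lexicon are all assumptions, and modifiers are applied only to the next sentiment word.

```python
# Hypothetical modifier tables for illustration only.
NEGATIONS = {"不", "没", "无", "非", "别"}
DEGREE_WEIGHTS = {"极其": 2.0, "很": 1.5, "较": 1.2, "稍": 0.8}

def score_tokens(tokens, lexicon):
    """Sum signed word values over a tokenized sentence.

    A negation word flips the sign of the next sentiment word, a degree
    adverb scales it; the pending weight resets after each sentiment word.
    """
    total = 0.0
    weight = 1.0
    for tok in tokens:
        if tok in NEGATIONS:
            weight *= -1
        elif tok in DEGREE_WEIGHTS:
            weight *= DEGREE_WEIGHTS[tok]
        elif tok in lexicon:
            total += weight * lexicon[tok]
            weight = 1.0
    return total

lexicon = {"喜欢": 5, "尴尬": -3}  # toy signed values (intensity times polarity)
print(score_tokens(["很", "喜欢"], lexicon))
print(score_tokens(["不", "喜欢"], lexicon))
```

Resetting the weight after each sentiment word keeps a negation at the start of a sentence from flipping every later word.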







```python
# coding: utf-8
import pandas as pd

# Load the dataset
with open('庆余年220.csv', encoding='utf8') as f:
    weibo_df = pd.read_csv(f)
print(weibo_df.head())
```




```python
# coding: utf-8
import pandas as pd

# ------------------------- Load the dataset -------------------------
with open('庆余年220.csv', encoding='utf8') as f:
    weibo_df = pd.read_csv(f)
print(weibo_df.head())

# ------------------------- Read the sentiment lexicon -------------------------
# Notes:
# 1. The label for anger (NA) is read as a missing value, so every NA in the
#    emotion-category column has been replaced by NAU.
# 2. The lexicon also has an auxiliary emotion-category column (which contains
#    NA as well), so fix the main category column first and write it back.
# Lexicon before extension
df = pd.read_excel('大连理工大学中文情感词汇本体NAU.xlsx')
print(df.head(10))

df = df[['词语', '词性种类', '词义数', '词义序号', '情感分类', '强度', '极性']]
print(df.head())
```




```python
# coding: utf-8
import pandas as pd

# ------------------------- Load the dataset -------------------------
with open('庆余年220.csv', encoding='utf8') as f:
    weibo_df = pd.read_csv(f)
print(weibo_df.head())

# ------------------------- Read the sentiment lexicon -------------------------
# Notes:
# 1. The label for anger (NA) is read as a missing value, so every NA in the
#    emotion-category column has been replaced by NAU.
# 2. The lexicon also has an auxiliary emotion-category column (which contains
#    NA as well), so fix the main category column first and write it back.
# Lexicon before extension
df = pd.read_excel('大连理工大学中文情感词汇本体NAU.xlsx')
print(df.head(10))

df = df[['词语', '词性种类', '词义数', '词义序号', '情感分类', '强度', '极性']]
print(df.head())

# ------------------------- Build the seven emotion word lists -------------------------
Happy = []
Good = []
Surprise = []
Anger = []
Sad = []
Fear = []
Disgust = []

# df.iterrows() iterates over the rows one by one
for idx, row in df.iterrows():
    if row['情感分类'] in ['PA', 'PE']:
        Happy.append(row['词语'])
    if row['情感分类'] in ['PD', 'PH', 'PG', 'PB', 'PK']:
        Good.append(row['词语'])
    if row['情感分类'] in ['PC']:
        Surprise.append(row['词语'])
    if row['情感分类'] in ['NB', 'NJ', 'NH', 'PF']:
        Sad.append(row['词语'])
    if row['情感分类'] in ['NI', 'NC', 'NG']:
        Fear.append(row['词语'])
    if row['情感分类'] in ['NE', 'ND', 'NN', 'NK', 'NL']:
        Disgust.append(row['词语'])
    if row['情感分类'] in ['NAU']:  # changed: the original NA label gave no matches
        Anger.append(row['词语'])

# This positive/negative split is rough; define your own rules if needed
Positive = Happy + Good + Surprise
Negative = Anger + Sad + Fear + Disgust
print('情绪词语列表整理完成')
print(Anger)
```
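The same grouping can be expressed as a single lookup table from subclass code to major class. The sketch below uses the code lists from the program above; the `(word, code)` sample rows are made up for illustration, and `group_words` is a hypothetical helper rather than part of the original program.

```python
# Subclass codes per major class, taken from the lists in the code above
# (NAU is the renamed anger label).
CATEGORY_CODES = {
    "happy": ["PA", "PE"],
    "good": ["PD", "PH", "PG", "PB", "PK"],
    "surprise": ["PC"],
    "anger": ["NAU"],
    "sad": ["NB", "NJ", "NH", "PF"],
    "fear": ["NI", "NC", "NG"],
    "disgust": ["NE", "ND", "NN", "NK", "NL"],
}

# Invert the table once: subclass code -> major class.
CODE_TO_CATEGORY = {
    code: category
    for category, codes in CATEGORY_CODES.items()
    for code in codes
}

def group_words(rows):
    """Group (word, code) pairs into a dict of category -> set of words.

    Rows whose code is not one of the 21 known subclasses are skipped.
    """
    groups = {category: set() for category in CATEGORY_CODES}
    for word, code in rows:
        category = CODE_TO_CATEGORY.get(code)
        if category is not None:
            groups[category].add(word)
    return groups

# Made-up sample rows, not taken from the lexicon file.
sample = [("开心", "PA"), ("气愤", "NAU"), ("可怕", "NI"), ("未知", "XX")]
groups = group_words(sample)
print(groups["happy"], groups["anger"], groups["fear"])
```

Storing each class as a set also makes the later `word in ...` membership tests faster than scanning a list.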






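```python
# ------------------------- Chinese word segmentation -------------------------
import jieba
import time

# Load the user dictionary and the stop-word list
jieba.load_userdict("user_dict.txt")  # custom dictionary

# One stop word per line; reading the file directly avoids passing "\n"
# as a delimiter to pandas, which newer versions reject
with open('stop_words.txt', encoding='utf-8') as fh:
    stop_list = [line.strip() for line in fh if line.strip()]

def txt_cut(juzi):
    # Segment a sentence and drop stop words (a len(w) > 1 filter can be added)
    return [w for w in jieba.lcut(juzi) if w not in stop_list]
```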



Step 5: compute how often the feature words of the seven emotions occur. The emotions are anger, disgust, fear, sadness, surprise, good and happy.

```python
# ------------------------- Chinese word segmentation -------------------------
import jieba
import time

# Load the custom dictionary and the stop-word list
#jieba.load_userdict("user_dict.txt")
# One stop word per line; reading the file directly avoids passing "\n"
# as a delimiter to pandas, which newer versions reject
with open('stop_words.txt', encoding='utf-8') as fh:
    stop_list = [line.strip() for line in fh if line.strip()]

def txt_cut(juzi):
    # Segment a sentence and drop stop words (a len(w) > 1 filter can be added)
    return [w for w in jieba.lcut(juzi) if w not in stop_list]

# ------------------------- Emotion computation -------------------------
def emotion_caculate(text):
    positive = 0
    negative = 0
    anger = 0
    disgust = 0
    fear = 0
    sad = 0
    surprise = 0
    good = 0
    happy = 0

    wordlist = txt_cut(text)
    #wordlist = jieba.lcut(text)
    wordset = set(wordlist)
    for word in wordset:
        freq = wordlist.count(word)
        if word in Positive:
            positive += freq
        if word in Negative:
            negative += freq
        if word in Anger:
            anger += freq
        if word in Disgust:
            disgust += freq
        if word in Fear:
            fear += freq
        if word in Sad:
            sad += freq
        if word in Surprise:
            surprise += freq
        if word in Good:
            good += freq
        if word in Happy:
            happy += freq

    emotion_info = {
        'length': len(wordlist),
        'positive': positive,
        'negative': negative,
        'anger': anger,
        'disgust': disgust,
        'fear': fear,
        'good': good,
        'sadness': sad,
        'surprise': surprise,
        'happy': happy,
    }
    indexs = ['length', 'positive', 'negative', 'anger', 'disgust',
              'fear', 'sadness', 'surprise', 'good', 'happy']
    return pd.Series(emotion_info, index=indexs)

# Test
text = """
原著的确更吸引编剧读下去,所以跟《诛仙》系列明显感觉到编剧只看过故事大纲比,这个剧的编剧完整阅读过小说。
配乐活泼俏皮,除了强硬穿越的台词轻微尴尬,最应该尴尬的感情戏反而入戏,
故意模糊了陈萍萍的太监身份、太子跟长公主的暧昧关系,
整体观影感受极好,很期待第二季拍大东山之役。玩弄人心的阴谋阳谋都不狗血,架空的设定能摆脱历史背景,
服装道具能有更自由的发挥空间,特别喜欢庆帝的闺房。以后还是少看国产剧,太长了,
还是精短美剧更适合休闲,追这个太累。王启年真是太可爱了。
"""
res = emotion_caculate(text)
print(res)
```


```
length      83
positive     7
negative     6
anger        0
disgust      6
fear         0
sadness      0
surprise     0
good         6
happy        1
dtype: int64
```
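The counting step can also be written with `collections.Counter` and sets, which avoids calling `list.count` once per distinct word. This is a sketch on a toy lexicon, not a replacement for the function above: `count_emotions` and the word sets are assumptions made for illustration, and the input is already tokenized.

```python
from collections import Counter

def count_emotions(tokens, emotion_sets):
    """Count occurrences of each emotion's words in a token list.

    tokens: list of already-segmented words.
    emotion_sets: dict mapping an emotion name to a set of words.
    """
    freq = Counter(tokens)  # word -> number of occurrences, built in one pass
    return {
        name: sum(n for word, n in freq.items() if word in words)
        for name, words in emotion_sets.items()
    }

# Toy lexicon and tokens, chosen only to show the mechanics.
emotion_sets = {"good": {"极好", "可爱"}, "disgust": {"尴尬"}}
tokens = ["台词", "尴尬", "尴尬", "感受", "极好", "可爱"]
print(count_emotions(tokens, emotion_sets))
```

With the full lexicon, turning the seven word lists into sets once up front matters more than the Counter itself, because each `word in list` test scans thousands of entries.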




```python
# coding: utf-8
import pandas as pd
import jieba
import time

# ------------------------- Load the dataset -------------------------
with open('庆余年220.csv', encoding='utf8') as f:
    weibo_df = pd.read_csv(f)
print(weibo_df.head())

# ------------------------- Read the sentiment lexicon -------------------------
# Notes:
# 1. The label for anger (NA) is read as a missing value, so every NA in the
#    emotion-category column has been replaced by NAU.
# 2. The lexicon also has an auxiliary emotion-category column (which contains
#    NA as well), so fix the main category column first and write it back.
# Lexicon before extension
df = pd.read_excel('大连理工大学中文情感词汇本体NAU.xlsx')
print(df.head(10))

df = df[['词语', '词性种类', '词义数', '词义序号', '情感分类', '强度', '极性']]
print(df.head())

# ------------------------- Build the seven emotion word lists -------------------------
Happy = []
Good = []
Surprise = []
Anger = []
Sad = []
Fear = []
Disgust = []

# df.iterrows() iterates over the rows one by one
for idx, row in df.iterrows():
    if row['情感分类'] in ['PA', 'PE']:
        Happy.append(row['词语'])
    if row['情感分类'] in ['PD', 'PH', 'PG', 'PB', 'PK']:
        Good.append(row['词语'])
    if row['情感分类'] in ['PC']:
        Surprise.append(row['词语'])
    if row['情感分类'] in ['NB', 'NJ', 'NH', 'PF']:
        Sad.append(row['词语'])
    if row['情感分类'] in ['NI', 'NC', 'NG']:
        Fear.append(row['词语'])
    if row['情感分类'] in ['NE', 'ND', 'NN', 'NK', 'NL']:
        Disgust.append(row['词语'])
    if row['情感分类'] in ['NAU']:  # changed: the original NA label gave no matches
        Anger.append(row['词语'])

# This positive/negative split is rough; define your own rules if needed
Positive = Happy + Good + Surprise
Negative = Anger + Sad + Fear + Disgust
print('情绪词语列表整理完成')
print(Anger)

# ------------------------- Chinese word segmentation -------------------------
# Load the custom dictionary and the stop-word list
#jieba.load_userdict("user_dict.txt")
# One stop word per line; reading the file directly avoids passing "\n"
# as a delimiter to pandas, which newer versions reject
with open('stop_words.txt', encoding='utf-8') as fh:
    stop_list = [line.strip() for line in fh if line.strip()]

def txt_cut(juzi):
    # Segment a sentence and drop stop words (a len(w) > 1 filter can be added)
    return [w for w in jieba.lcut(juzi) if w not in stop_list]

# ------------------------- Emotion computation -------------------------
def emotion_caculate(text):
    positive = 0
    negative = 0
    anger = 0
    disgust = 0
    fear = 0
    sad = 0
    surprise = 0
    good = 0
    happy = 0

    anger_list = []
    disgust_list = []
    fear_list = []
    sad_list = []
    surprise_list = []
    good_list = []
    happy_list = []

    wordlist = txt_cut(text)
    #wordlist = jieba.lcut(text)
    wordset = set(wordlist)
    for word in wordset:
        freq = wordlist.count(word)
        if word in Positive:
            positive += freq
        if word in Negative:
            negative += freq
        if word in Anger:
            anger += freq
            anger_list.append(word)
        if word in Disgust:
            disgust += freq
            disgust_list.append(word)
        if word in Fear:
            fear += freq
            fear_list.append(word)
        if word in Sad:
            sad += freq
            sad_list.append(word)
        if word in Surprise:
            surprise += freq
            surprise_list.append(word)
        if word in Good:
            good += freq
            good_list.append(word)
        if word in Happy:
            happy += freq
            happy_list.append(word)

    emotion_info = {
        'length': len(wordlist),
        'positive': positive,
        'negative': negative,
        'anger': anger,
        'disgust': disgust,
        'fear': fear,
        'good': good,
        'sadness': sad,
        'surprise': surprise,
        'happy': happy,
    }
    indexs = ['length', 'positive', 'negative', 'anger', 'disgust',
              'fear', 'sadness', 'surprise', 'good', 'happy']
    #return pd.Series(emotion_info, index=indexs), anger_list, disgust_list, fear_list, sad_list, surprise_list, good_list, happy_list
    return pd.Series(emotion_info, index=indexs)

# Test (res, anger_list, disgust_list, fear_list, sad_list, surprise_list, good_list, happy_list)
text = """
原著的确更吸引编剧读下去,所以跟《诛仙》系列明显感觉到编剧只看过故事大纲比,这个剧的编剧完整阅读过小说。
配乐活泼俏皮,除了强硬穿越的台词轻微尴尬,最应该尴尬的感情戏反而入戏,
故意模糊了陈萍萍的太监身份、太子跟长公主的暧昧关系,
整体观影感受极好,很期待第二季拍大东山之役。玩弄人心的阴谋阳谋都不狗血,
架空的设定能摆脱历史背景,服装道具能有更自由的发挥空间,
特别喜欢庆帝的闺房。以后还是少看国产剧,太长了,还是精短美剧更适合休闲,追这个太累。王启年真是太可爱了。
"""
#res, anger, disgust, fear, sad, surprise, good, happy = emotion_caculate(text)
res = emotion_caculate(text)
print(res)

# ------------------------- Emotion computation on the dataset -------------------------
start = time.time()
emotion_df = weibo_df['review'].apply(emotion_caculate)
end = time.time()
print(end - start)
print(emotion_df.head())

# Write the results
output_df = pd.concat([weibo_df, emotion_df], axis=1)
output_df.to_csv('庆余年220_emotion.csv', encoding='utf_8_sig', index=False)
print(output_df.head())
```






```python
# Show the reviews ranked by their fear and negative counts
fear_content = output_df.sort_values(by='fear', ascending=False)
print(fear_content)
print(fear_content.iloc[0:5]['review'])

negative_content = output_df.sort_values(by='negative', ascending=False)
print(negative_content)
print(negative_content.iloc[0:5]['review'])
```









```python
# coding=utf-8
from pyecharts import options as opts
from pyecharts.charts import WordCloud
from pyecharts.globals import SymbolType

# Data: (word, weight) pairs
words = [
    ('背包问题', 10000),
    ('大整数', 6181),
    ('Karatsuba乘法算法', 4386),
    ('穷举搜索', 4055),
    ('傅里叶变换', 2467),
    ('状态树遍历', 2244),
    ('剪枝', 1868),
    ('Gale-shapley', 1484),
    ('最大匹配与匈牙利算法', 1112),
    ('线索模型', 865),
    ('关键路径算法', 847),
    ('最小二乘法曲线拟合', 582),
    ('二分逼近法', 555),
    ('牛顿迭代法', 550),
    ('Bresenham算法', 462),
    ('粒子群优化', 366),
    ('Dijkstra', 360),
    ('A*算法', 282),
    ('负极大极搜索算法', 273),
    ('估值函数', 265)
]

# Build the chart
def wordcloud_base() -> WordCloud:
    c = (
        WordCloud()
        .add("", words, word_size_range=[20, 100], shape='diamond')  # SymbolType.ROUND_RECT
        .set_global_opts(title_opts=opts.TitleOpts(title='WordCloud词云'))
    )
    return c

# Render to an HTML file
wordcloud_base().render('词云图.html')
```



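The core call is: add(name, attr, value, shape="circle", word_gap=20, word_size_range=None, rotate_step=45). This is the form used by older pyecharts releases; the example above instead passes the data as a list of (word, value) pairs.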

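name -> str: legend name
attr -> list: attribute names
value -> list: values corresponding to the attributes
shape -> str: outline of the word cloud; one of 'circle', 'cardioid', 'diamond', 'triangle-forward', 'triangle', 'pentagon', 'star'
word_gap -> int: spacing between words, default 20
word_size_range -> list: font size range for words, default [12, 60]
rotate_step -> int: rotation step for words, default 45

2. Counting feature words for the seven emotions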


```python
# coding: utf-8
import pandas as pd
import jieba
import time
import csv

# ------------------------- Load the dataset -------------------------
with open('庆余年220.csv', encoding='utf8') as f:
    weibo_df = pd.read_csv(f)
print(weibo_df.head())

# ------------------------- Read the sentiment lexicon -------------------------
# Notes:
# 1. The label for anger (NA) is read as a missing value, so every NA in the
#    emotion-category column has been replaced by NAU.
# 2. The lexicon also has an auxiliary emotion-category column (which contains
#    NA as well), so fix the main category column first and write it back.
# Lexicon before extension
df = pd.read_excel('大连理工大学中文情感词汇本体NAU.xlsx')
print(df.head(10))

df = df[['词语', '词性种类', '词义数', '词义序号', '情感分类', '强度', '极性']]
print(df.head())

# ------------------------- Build the seven emotion word lists -------------------------
Happy = []
Good = []
Surprise = []
Anger = []
Sad = []
Fear = []
Disgust = []

# df.iterrows() iterates over the rows one by one
for idx, row in df.iterrows():
    if row['情感分类'] in ['PA', 'PE']:
        Happy.append(row['词语'])
    if row['情感分类'] in ['PD', 'PH', 'PG', 'PB', 'PK']:
        Good.append(row['词语'])
    if row['情感分类'] in ['PC']:
        Surprise.append(row['词语'])
    if row['情感分类'] in ['NB', 'NJ', 'NH', 'PF']:
        Sad.append(row['词语'])
    if row['情感分类'] in ['NI', 'NC', 'NG']:
        Fear.append(row['词语'])
    if row['情感分类'] in ['NE', 'ND', 'NN', 'NK', 'NL']:
        Disgust.append(row['词语'])
    if row['情感分类'] in ['NAU']:  # changed: the original NA label gave no matches
        Anger.append(row['词语'])

# This positive/negative split is rough; define your own rules if needed
Positive = Happy + Good + Surprise
Negative = Anger + Sad + Fear + Disgust
print('情绪词语列表整理完成')
print(Anger)

# ------------------------- Chinese word segmentation -------------------------
# Load the custom dictionary and the stop-word list
#jieba.load_userdict("user_dict.txt")
# One stop word per line; reading the file directly avoids passing "\n"
# as a delimiter to pandas, which newer versions reject
with open('stop_words.txt', encoding='utf-8') as fh:
    stop_list = [line.strip() for line in fh if line.strip()]

def txt_cut(juzi):
    # Segment a sentence and drop stop words (a len(w) > 1 filter can be added)
    return [w for w in jieba.lcut(juzi) if w not in stop_list]

# ------------------------- Emotion computation -------------------------
# Output file; opened in "w" mode so that a rerun does not append a second header
c = open("Emotion_features.csv", "w", newline='', encoding='gb18030')
writer = csv.writer(c)
writer.writerow(["Emotion", "Word", "Num"])

# Emotion statistics
def emotion_caculate(text):
    positive = 0
    negative = 0
    anger = 0
    disgust = 0
    fear = 0
    sad = 0
    surprise = 0
    good = 0
    happy = 0

    anger_list = []
    disgust_list = []
    fear_list = []
    sad_list = []
    surprise_list = []
    good_list = []
    happy_list = []

    wordlist = txt_cut(text)
    #wordlist = jieba.lcut(text)
    wordset = set(wordlist)
    for word in wordset:
        freq = wordlist.count(word)
        if word in Positive:
            positive += freq
        if word in Negative:
            negative += freq
        # Each match writes its own fresh row, so a word that belongs to
        # several classes never produces a row with more than three fields
        if word in Anger:
            anger += freq
            anger_list.append(word)
            writer.writerow(["anger", word, freq])
        if word in Disgust:
            disgust += freq
            disgust_list.append(word)
            writer.writerow(["disgust", word, freq])
        if word in Fear:
            fear += freq
            fear_list.append(word)
            writer.writerow(["fear", word, freq])
        if word in Sad:
            sad += freq
            sad_list.append(word)
            writer.writerow(["sad", word, freq])
        if word in Surprise:
            surprise += freq
            surprise_list.append(word)
            writer.writerow(["surprise", word, freq])
        if word in Good:
            good += freq
            good_list.append(word)
            writer.writerow(["good", word, freq])
        if word in Happy:
            happy += freq
            happy_list.append(word)
            writer.writerow(["happy", word, freq])

    emotion_info = {
        'length': len(wordlist),
        'positive': positive,
        'negative': negative,
        'anger': anger,
        'disgust': disgust,
        'fear': fear,
        'good': good,
        'sadness': sad,
        'surprise': surprise,
        'happy': happy,
    }
    indexs = ['length', 'positive', 'negative', 'anger', 'disgust',
              'fear', 'sadness', 'surprise', 'good', 'happy']
    #return pd.Series(emotion_info, index=indexs), anger_list, disgust_list, fear_list, sad_list, surprise_list, good_list, happy_list
    return pd.Series(emotion_info, index=indexs)

# ------------------------- Emotion computation on the dataset -------------------------
start = time.time()
emotion_df = weibo_df['review'].apply(emotion_caculate)
end = time.time()
print(end - start)
print(emotion_df.head())

# Write the results
output_df = pd.concat([weibo_df, emotion_df], axis=1)
output_df.to_csv('庆余年220_emotion.csv', encoding='utf_8_sig', index=False)
print(output_df.head())

# Finish the statistics file
c.close()
```





```python
# coding: utf-8
import csv
import pandas as pd

# Read the data (the file was written with gb18030 encoding)
with open('Emotion_features.csv', encoding='gb18030') as f:
    data = pd.read_csv(f)
print(data.head())

# Number of rows per emotion
groupnum = data.groupby(['Emotion']).size()
print(groupnum)
print("")

# Per-group listing
for groupname, grouplist in data.groupby('Emotion'):
    print(groupname)
    print(grouplist)
```


```
   Emotion Word  Num
0     good   人心    1
1     good   极好    1
2     good   活泼    1
3  disgust   强硬    1
4  disgust   尴尬    2
Emotion
anger         2
disgust     208
fear          9
good        254
happy        39
sad          42
surprise     11
dtype: int64

anger
    Emotion Word  Num
133   anger   气愤    1
382   anger   报仇    3
disgust
     Emotion Word  Num
3    disgust   强硬    1
4    disgust   尴尬    2
8    disgust   模糊    1
..       ...  ...  ...
558  disgust   紧张    1
560  disgust   紧张    1
561  disgust   刺激    1

[208 rows x 3 columns]
fear
    Emotion  Word  Num
93     fear   鸿门宴    1
111    fear    吓人    1
148    fear    可怕    1
170    fear  没头苍蝇    1
211    fear    厉害    1
290    fear  刀光剑影    1
292    fear    忌惮    1
342    fear  无时无刻    1
559    fear    紧张    1
good
    Emotion Word  Num
0      good   人心    1
1      good   极好    1
..      ...  ...  ...
```


```python
# coding: utf-8
import csv
import pandas as pd
import operator

# ------------------------- Summary statistics -------------------------
# Read the data (the file was written with gb18030 encoding)
with open('Emotion_features.csv', encoding='gb18030') as f:
    data = pd.read_csv(f)
print(data.head())

# Number of rows per emotion
groupnum = data.groupby(['Emotion']).size()
print(groupnum)
print("")

# Per-group listing
for groupname, grouplist in data.groupby('Emotion'):
    print(groupname)
    print(grouplist)

# Build the data as a list of tuples: word = [('A',10), ('B',9), ('C',8)]
i = 0
words = []
counts = []
# while i ...
```

... accumulate the sentiment scores
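The list-of-tuples input that the word cloud expects can be built from rows shaped like those in Emotion_features.csv. The sketch below is one possible way to do it with the standard library; it is not the author's code. The helper name `wordcloud_data`, its `top_n` parameter and the inline sample rows are assumptions for illustration.

```python
import csv
import io
from collections import Counter, defaultdict

def wordcloud_data(rows, top_n=50):
    """Aggregate (Emotion, Word, Num) rows into per-emotion (word, total) lists.

    The same word can appear in many rows (once per review), so counts are
    summed before ranking. Returns dict: emotion -> [(word, total), ...].
    """
    totals = defaultdict(Counter)
    for row in rows:
        totals[row["Emotion"]][row["Word"]] += int(row["Num"])
    return {emotion: counter.most_common(top_n)
            for emotion, counter in totals.items()}

# Inline sample with the same three columns as the feature file.
sample_csv = """Emotion,Word,Num
good,极好,1
disgust,尴尬,2
disgust,紧张,1
disgust,尴尬,1
"""
rows = csv.DictReader(io.StringIO(sample_csv))
data = wordcloud_data(rows)
print(data["disgust"])
```

Each per-emotion list can then be passed as the data argument of a word-cloud chart, one chart per emotion.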


```python
# coding: utf-8
import sys
import gzip
from collections import defaultdict
from itertools import product
import jieba
import csv
import pandas as pd

class Struct(object):
    def __init__(self, word, sentiment, pos, value, class_value):
        self.word = word
        self.sentiment = sentiment
        self.pos = pos
        self.value = value
        self.class_value = class_value

class Result(object):
    def __init__(self, score, score_words, not_word, degree_word):
        self.score = score
        self.score_words = score_words
        self.not_word = not_word
        self.degree_word = degree_word

class Score(object):
    # Short codes of the subclasses (e.g. respect) under each of the seven main classes
    score_class = {'乐': ['PA', 'PE'],
                   '好': ['PD', 'PH', 'PG', 'PB', 'PK'],
                   '怒': ['NA'],
                   '哀': ['NB', 'NJ', 'NH', 'PF'],
                   '惧': ['NI', 'NC', 'NG'],
                   '恶': ['NE', 'ND', 'NN', 'NK', 'NL'],
                   '惊': ['PC']
                   }
    # Dalian University of Technology tags -> ICTPOS 3.0
    POS_MAP = {
        'noun': 'n',
        'verb': 'v',
        'adj': 'a',
        'adv': 'd',
        'nw': 'al',  # network words
        'idiom': 'al',
        'prep': 'p',
    }

    # Negation words
    NOT_DICT = set(['不', '不是', '不大', '没', '无', '非', '莫', '弗', '毋',
                    '勿', '未', '否', '别', '無', '休'])

    def __init__(self, sentiment_dict_path, degree_dict_path, stop_dict_path):
        self.sentiment_struct, self.sentiment_dict = self.load_sentiment_dict(sentiment_dict_path)
        self.degree_dict = self.load_degree_dict(degree_dict_path)
        self.stop_words = self.load_stop_words(stop_dict_path)

    def load_stop_words(self, stop_dict_path):
        # Strip the trailing newline, otherwise no token ever matches a stop word
        with open(stop_dict_path, encoding='UTF-8') as f:
            stop_words = [w.strip() for w in f]
        #print (stop_words[:100])
        return stop_words

    def remove_stopword(self, words):
        words = [w for w in words if w not in self.stop_words]
        return words

    def load_degree_dict(self, dict_path):
        """Read the degree-adverb dictionary.
        Args:
            dict_path: path of the degree-adverb dictionary. Format: word\tdegree
                       The words fall into 6 levels: 极其, 很, 较, 稍, 欠, 超
        Returns:
            dict = {word: degree}
        """
        degree_dict = {}
        with open(dict_path, 'r', encoding='UTF-8') as f:
            for line in f:
                line = line.strip()
                word, degree = line.split('\t')
                degree = float(degree)
                degree_dict[word] = degree
        return degree_dict

    def load_sentiment_dict(self, dict_path):
        """Read the sentiment-word dictionary.
        Args:
            dict_path: path of the sentiment-word dictionary. See README.md for the format
        Returns:
            dict = {(word, postag): polarity}
        """
        sentiment_dict = {}
        sentiment_struct = []

        with open(dict_path, 'r', encoding='UTF-8') as f:
        #with gzip.open(dict_path) as f:
            for index, line in enumerate(f):
                if index == 0:  # header row
                    continue
                items = line.split('\t')
                word = items[0]
                pos = items[1]
                sentiment = items[4]
                intensity = items[5]  # five levels 1, 3, 5, 7, 9; 9 is strongest, 1 weakest
                polar = items[6]      # polarity

                # Convert the part of speech to the ICTPOS tag set
                pos = self.__class__.POS_MAP[pos]
                intensity = int(intensity)
                polar = int(polar)

                # Re-express polarity: negative numbers are negative, 0 neutral, positive numbers positive
                # The absolute value is the strength: polarity positive (+1), neutral (0), negative (-1); intensity is the weight
                value = None
                if polar == 0:    # neutral
                    value = 0
                elif polar == 1:  # positive
                    value = intensity
                elif polar == 2:  # negative
                    value = -1 * intensity
                else:             # invalid
                    continue

                #key = (word, pos, sentiment )
                key = word
                sentiment_dict[key] = value

                # Find the matching main class
                class_value = None  # stays None if the subclass code is unknown
                for item in self.score_class.items():
                    key = item[0]
                    values = item[1]
                    #print(key)
                    #print(value)
                    for x in values:
                        if (sentiment == x):
                            class_value = key  # subclass found: take the main class
                sentiment_struct.append(Struct(word, sentiment, pos, value, class_value))
        return sentiment_struct, sentiment_dict

    def findword(self, text):  # find which sentiment words the text contains
        word_list = []
        for item in self.sentiment_struct:
            if item.word in text:
                word_list.append(item)
        return word_list

    def classify_words(self, words):
        # The keys of these 3 dictionaries are word positions (indices)
        sen_word = {}
        not_word = {}
        degree_word = {}
        # Sort each token into sentiment, negation or degree; words is the tokenized list
        for index, word in enumerate(words):
            if word in self.sentiment_dict and word not in self.__class__.NOT_DICT and word not in self.degree_dict:
                sen_word[index] = self.sentiment_dict[word]
            elif word in self.__class__.NOT_DICT and word not in self.degree_dict:
                not_word[index] = -1
            elif word in self.degree_dict:
                degree_word[index] = self.degree_dict[word]
        return sen_word, not_word, degree_word

    def get2score_position(self, words):
        sen_word, not_word, degree_word = self.classify_words(words)  # dictionaries

        score = 0
        start = 0
        # Positions (indices) of all sentiment, negation and degree words
        sen_locs = sen_word.keys()
        not_locs = not_word.keys()
        degree_locs = degree_word.keys()
        senloc = -1
        # Walk through every word; i is the absolute position of the word
        for i in range(0, len(words)):
            if i in sen_locs:
                W = 1  # reset the weight between sentiment words
                not_locs_index = 0
                degree_locs_index = 0

                # senloc counts the sentiment words seen; sen_locs holds their positions in the token list
                senloc += 1
                #score += W * float(sen_word[i])
                if (senloc == 0):  # first sentiment word: any negation or degree words before it?
                    start = 0
                # elif senloc ...
                # ... 0) and (degree_locs_index>0 )):
                #     if (not_locs_index ...

    # ...
    #         ... 0):
    #             pos_score = pos_score + word.value
    #             pos_word.append(word.word)
    #         else:
    #             neg_score = neg_score+word.value
    #             neg_word.append(word.word)
    #     print ("pos_score=%d; neg_score=%d" %(pos_score, neg_score))
    #     #print('pos_word',pos_word)
    #     #print('neg_word',neg_word)

    def getscore(self, text):
        word_list = self.findword(text)  # find which sentiment words the text contains
        # Add degree adverbs and negation words
        not_w = 1
        not_word = []
        for notword in self.__class__.NOT_DICT:  # negation words
            if notword in text:
                not_w = not_w * -1
                not_word.append(notword)
        degree_word = []
        for degreeword in self.degree_dict.keys():
            if degreeword in text:
                degree = self.degree_dict[degreeword]
                #polar = polar + degree if polar > 0 else polar - degree
                degree_word.append(degreeword)
        # For each of the 7 main classes, collect its words; score = word polarity * word weight
        result = []
        for key in self.score_class.keys():  # the 7 main classes
            score = 0
            score_words = []
            for word in word_list:
                if (key == word.class_value):
                    score = score + word.value
                    score_words.append(word.word)
            if score > 0:
                score = score + degree
            # elif score ...

# ...
# ... 0, the degree is stronger; score: are there negation/degree words between sentiment words, and in what order -> accumulate the scores
#     result = score.get2score_position(words_)
#     print(result)
#     tlist.append(str(n))
#     tlist.append(words)
#     tlist.append(str(result))
#     writer.writerow(tlist)
#     n = n + 1
#     # sentence -> check negation/degree words over the whole sentence -> split positive and negative words
#     #score.get2score(temp)
#     #score.getscore(text)
# c.close()
```









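Chinese word segmentation (Character-Based Generative Model)
Part-of-speech tagging (TnT, 3-gram hidden Markov model)
Sentiment analysis
Text classification (Naive Bayes)
Pinyin conversion, Traditional-to-Simplified Chinese conversion
Keyword extraction (TextRank)
Summary extraction (TextRank), sentence splitting
Text similarity (BM25)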




```python
# -*- coding: utf-8 -*-
from snownlp import SnowNLP

s1 = SnowNLP(u"我今天很开心")
print(u"s1情感分数:")
print(s1.sentiments)

s2 = SnowNLP(u"我今天很沮丧")
print(u"s2情感分数:")
print(s2.sentiments)

s3 = SnowNLP(u"大傻瓜,你脾气真差,动不动就打人")
print(u"s3情感分数:")
print(s3.sentiments)
```


```
s1情感分数:
0.842040189791
s2情感分数:
0.648537121839
s3情感分数:
0.049546727538
```


```python
from snownlp import sentiment

sentiment.train('./neg.txt', './pos.txt')
sentiment.save('sentiment.marshal')
```

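The code below runs sentiment analysis on a sample of reviews of the TV series 《庆余年》 (Joy of Life). In sentiment analysis, many papers shift the score range from [0, 1.0] to [-0.5, 0.5], which makes the curve easier to read: scores above 0 are positive reviews, and the rest are negative. The final code is as follows: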

```python
# -*- coding: utf-8 -*-
from snownlp import SnowNLP
import codecs
import os
import pandas as pd

# Compute the sentiment scores
with open('庆余年220.csv', encoding='utf8') as f:
    data = pd.read_csv(f)
sentimentslist = []
for i in data['review']:
    s = SnowNLP(i)
    print(s.sentiments)
    sentimentslist.append(s.sentiments)

# Shift the range to [-0.5, 0.5]
result = []
i = 0
while i < len(sentimentslist):
    result.append(sentimentslist[i] - 0.5)
    i = i + 1
```
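The range shift described above can be isolated in a few lines. This is a sketch that assumes the scores are already available as floats in [0, 1]; `center_scores` and `label` are hypothetical helpers, and treating a shifted score of exactly 0 as negative is an assumption that follows the "above 0 is positive" rule.

```python
def center_scores(scores):
    """Shift scores from [0, 1] to [-0.5, 0.5] by subtracting 0.5."""
    return [s - 0.5 for s in scores]

def label(shifted):
    """Scores above 0 count as positive, everything else as negative."""
    return "positive" if shifted > 0 else "negative"

raw = [0.75, 0.5, 0.25]  # made-up scores for illustration
shifted = center_scores(raw)
for r, s in zip(raw, shifted):
    print(r, s, label(s))
```

Plotting the shifted values against the review index then places positive reviews above the horizontal axis and negative ones below it.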




