Worked Example: Counting Character Appearances in Romance of the Three Kingdoms (《三国演义》)


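Having just covered word-frequency counting for English text, we now turn to counting character appearances in a Chinese text.

We use Romance of the Three Kingdoms (《三国演义》) as the example.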

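I. Approach

1. Using the jieba library

jieba is a widely used third-party Chinese word segmentation library for Python. With it we can split Chinese text, which has no spaces between words, into individual words.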
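As a quick illustration of what segmentation produces, the sketch below calls `jieba.lcut()` on one short sentence. The variable names `segment`, `sentence` and `words` are made up for this example, and so is the sample sentence. So that the snippet still runs where jieba is not installed, it falls back to one token per character; that fallback is an assumption for illustration only and is not real word segmentation.

```python
# Sketch: segment a short Chinese sentence into a list of words.
try:
    import jieba  # third-party package: pip install jieba
    segment = jieba.lcut  # precise mode by default, returns a list of str
except ImportError:
    # Illustration-only fallback (NOT real segmentation): one character per token.
    segment = list

sentence = "却说曹操引兵追赶刘备"  # hypothetical sample text
words = segment(sentence)
print(words)

# In precise mode the tokens partition the text,
# so joining them gives back the original sentence.
print("".join(words) == sentence)
```

The exact word boundaries depend on jieba's dictionary, which is why the segmented output has to be inspected before it is counted.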

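2. Filtering words

The goal is to count how often each character appears in Romance of the Three Kingdoms, so the segmented words have to be filtered:

- Drop single-character words, since a single character cannot be a person's name.
- Inspect the output, exclude the words that are not names, and repeat this step until the output matches what we expect.
- Some characters are referred to by several different names or titles; these need to be merged into one entry.

3. Sorting by appearance count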

Sort the entries by their dictionary values (the counts) and print the 20 characters who appear most often.
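The three steps above can be sketched on a small, already segmented token list, without reading the novel. The names `ALIASES`, `EXCLUDES` and `count_names`, and the sample tokens, are hypothetical and chosen only for this illustration; the full scripts in the next section use an if/elif chain instead of a lookup table.

```python
# Sketch: filter, merge aliases, count, and sort on a tiny token list.

# Hypothetical alias table: alternative form -> canonical name.
ALIASES = {"诸葛亮": "孔明", "孔明曰": "孔明", "玄德": "刘备", "玄德曰": "刘备"}
# Hypothetical exclusion set: frequent words that are not names.
EXCLUDES = {"将军", "却说"}


def count_names(words, aliases=ALIASES, excludes=EXCLUDES):
    """Return (name, count) pairs sorted by count, highest first."""
    counts = {}
    for word in words:
        if len(word) == 1:  # single characters are never counted
            continue
        name = aliases.get(word, word)  # merge alternative forms
        if name in excludes:  # skip known non-names
            continue
        counts[name] = counts.get(name, 0) + 1
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)


tokens = ["却说", "玄德", "与", "孔明", "商议", "，",
          "孔明曰", "将军", "诸葛亮", "玄德曰"]
ranking = count_names(tokens)
print(ranking)  # [('孔明', 3), ('刘备', 2), ('商议', 1)]
```

Note that 商议 still shows up in the ranking because it is not in the exclusion set yet, which is exactly why the filtering step has to be repeated.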

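II. Implementation

1. CalThreeKingdomsV1

Code:

    # CalThreeKingdomsV1.py
    import jieba

    # Read the whole novel; the file is expected to be UTF-8 encoded.
    with open("threekingdoms.txt", "r", encoding="utf-8") as f:
        txt = f.read()

    words = jieba.lcut(txt)  # precise-mode segmentation, returns a list

    counts = {}
    for word in words:
        if len(word) == 1:  # skip single-character words
            continue
        counts[word] = counts.get(word, 0) + 1

    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)  # highest count first

    for i in range(15):
        word, count = items[i]
        # Print both the word and its count (left- and right-aligned).
        print("{0:<10}{1:>5}".format(word, count))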

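Notes:

- The Chinese text is opened with encoding="utf-8" set explicitly. If the encoding does not match the file, reading fails or produces garbled text.
- jieba.lcut() segments the text in precise mode, so the result contains no redundant words.
- A dictionary accumulates the count for each word, and the list's sort() method orders the entries by count.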
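For comparison, the counting and sorting can also be written with `collections.Counter` from the standard library. This is an alternative sketch, not the method used in this article; the word list is made up for the example. It also shows the difference between `sorted()`, which returns a new list, and `list.sort()`, which sorts in place.

```python
from collections import Counter

# Hypothetical token list standing in for the segmented novel.
words = ["曹操", "孔明", "曹操", "曰", "刘备", "孔明", "曹操"]

# Count only words longer than one character.
counts = Counter(w for w in words if len(w) > 1)
top2 = counts.most_common(2)
print(top2)  # [('曹操', 3), ('孔明', 2)]

# Equivalent manual sorting of the same data:
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)  # new list

items = list(counts.items())
items.sort(key=lambda kv: kv[1], reverse=True)  # in place, returns None

print(ranked == items == counts.most_common())  # True
```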

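The output of this version is not yet what we want:

- 将军 (general), 却说 ("now it is said"), 二人 (the two of them), 不可 (must not), 不能 (cannot), 如此 (thus) and 荆州 (Jingzhou, a place) are not names of people.
- 曹操 and 丞相 (chancellor) refer to the same person, as do 孔明 and 孔明曰 ("Kongming said").

2. CalThreeKingdomsV2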

Remove the unwanted words from the dictionary, and merge the different names and titles that refer to the same character.

Code:

    # CalThreeKingdomsV2.py
    import jieba

    # Frequent words that are not names of people.
    excludes = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此"}

    with open("threekingdoms.txt", "r", encoding="utf-8") as f:
        txt = f.read()

    words = jieba.lcut(txt)

    counts = {}
    for word in words:
        if len(word) == 1:
            continue
        # Merge the different forms of address into one name.
        elif word == "诸葛亮" or word == "孔明曰":
            rword = "孔明"
        elif word == "关公" or word == "云长":
            rword = "关羽"
        elif word == "玄德" or word == "玄德曰":
            rword = "刘备"
        elif word == "孟德" or word == "丞相":
            rword = "曹操"
        else:
            rword = word
        counts[rword] = counts.get(rword, 0) + 1

    for word in excludes:
        counts.pop(word, None)  # no KeyError if the word never occurred

    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)

    for i in range(10):
        word, count = items[i]
        print("{0:<10}{1:>5}".format(word, count))
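A detail to watch when removing excluded words from the dictionary: `del counts[word]` raises `KeyError` if the word never occurred in the text, which can easily happen as the exclusion set grows or the input file changes. `dict.pop(word, None)` removes the key when it is present and does nothing otherwise. The small dictionaries below are made up for the demonstration.

```python
# Hypothetical counts and exclusion set; "荆州" is not in counts.
counts = {"曹操": 3, "将军": 2}
excludes = {"将军", "荆州"}

# Variant 1: del fails on a missing key.
unsafe = dict(counts)
raised = False
try:
    for word in excludes:
        del unsafe[word]
except KeyError:
    raised = True
print(raised)  # True

# Variant 2: pop with a default tolerates missing keys.
safe = dict(counts)
for word in excludes:
    safe.pop(word, None)
print(safe)  # {'曹操': 3}
```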

3. CalThreeKingdomsV3

After repeated rounds of filtering the output, we finally obtain the 20 names that appear most often:

Code:

    # CalThreeKingdomsV3.py
    import jieba

    # Frequent non-name words, collected by inspecting the output repeatedly.
    excludes = {
        "将军", "却说", "荆州", "二人", "不可", "不能", "如此", "商议",
        "如何", "主公", "军士", "左右", "军马", "引兵", "次日", "大喜",
        "天下", "东吴", "于是", "今日", "不敢", "魏兵", "陛下", "一人",
        "都督", "人马", "不知", "汉中", "只见", "众将", "蜀兵", "上马",
        "大叫", "太守", "此人", "夫人", "后人", "背后", "城中", "一面",
        "何不", "大军", "忽报", "先生", "百姓", "何故", "然后", "先锋",
        "不如", "赶来", "原来", "令人", "江东", "下马", "喊声", "正是",
        "徐州", "忽然", "因此", "成都", "不见", "未知", "大败", "大事",
        "之后", "一军", "引军", "起兵", "军中", "接应", "进兵", "大惊",
        "可以",
    }

    with open("threekingdoms.txt", "r", encoding="utf-8") as f:
        txt = f.read()

    words = jieba.lcut(txt)

    counts = {}
    for word in words:
        if len(word) == 1:
            continue
        # Merge the different forms of address into one name.
        elif word == "诸葛亮" or word == "孔明曰":
            rword = "孔明"
        elif word == "关公" or word == "云长":
            rword = "关羽"
        elif word == "玄德" or word == "玄德曰" or word == "先主":
            rword = "刘备"
        elif word == "孟德" or word == "丞相":
            rword = "曹操"
        elif word == "后主":
            rword = "刘禅"
        elif word == "天子":
            rword = "刘协"
        else:
            rword = word
        counts[rword] = counts.get(rword, 0) + 1

    for word in excludes:
        counts.pop(word, None)  # no KeyError if the word never occurred

    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)

    for i in range(20):
        word, count = items[i]
        print("{0:<10}{1:>5}".format(word, count))

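Note: some of the excluded words are ambiguous, for example 先生 (sir) and 夫人 (lady), which can refer to a specific character depending on the context.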

In the final result, the character who appears most often is 曹操 (Cao Cao). Does that surprise you?


