Python: Chinese Word Segmentation, Word Frequency Counting, and Word Cloud Generation


2024-07-03 03:19 | Source: compiled from the web

Chinese word segmentation, word frequency counting, and word cloud generation are common tasks when working with text data. Three modules make them quick to implement.

Chinese word segmentation and word frequency counting

import jieba
from collections import Counter

# 1. Read the text and segment it
with open('demo.txt', mode='r', encoding='gbk') as f:  # change to 'utf-8' if the file is UTF-8 encoded
    report = f.read()
words = jieba.cut(report)

# 2. Keep only words of the required minimum length
report_words = []
for word in words:
    if len(word) >= 4:
        report_words.append(word)
print(report_words)

# 3. Count the 50 most frequent words
result = Counter(report_words).most_common(50)
print(result)

The code above uses the jieba module for segmentation and collections for frequency counting. jieba is an excellent third-party Chinese lexicon library for Chinese word segmentation, i.e. splitting a sequence of Chinese characters into individual words. jieba performs segmentation quickly and efficiently and supports three modes: accurate mode, full mode, and search-engine mode.

collections is a module in the Python standard library that provides additional container types as alternatives to the built-in dict, list, set, and tuple. These include namedtuple, deque, Counter, and others.
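A minimal sketch of how Counter is used in the step above (the word list here is made up for illustration):

```python
from collections import Counter

words = ['数据', '分析', '数据', '词云', '数据', '分析']
counts = Counter(words)

# most_common(n) returns (word, count) pairs, highest count first
print(counts.most_common(2))  # [('数据', 3), ('分析', 2)]

# A Counter behaves like a dict, so dict(counts) can feed WordCloud directly
print(dict(counts)['词云'])  # 1
```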

Simple word cloud

import jieba.posseg as pseg
from collections import Counter
from wordcloud import WordCloud

# 1. Read the text and segment it with part-of-speech tagging
with open('demo.txt', mode='r', encoding='gbk') as f:
    report = f.read()
words = pseg.cut(report)

# 2. Keep words of the required length and part of speech (nouns)
report_words = []
for word, flag in words:
    if (len(word) >= 4) and ('n' in flag):
        report_words.append(word)
# print(report_words)

# 3. Count the 50 most frequent words
result = Counter(report_words).most_common(50)
# print(result)

# 4. Draw the word cloud
content = dict(result)
# print(content)
wc = WordCloud(font_path='PINGFANG MEDIUM.TTF', background_color='white', width=1000, height=600)
wc.generate_from_frequencies(content)
wc.to_file('词云图1.png')

The wordcloud module is used here to generate the word cloud. Note that generate_from_frequencies takes a precomputed word-to-count mapping, which is why the Counter result is converted to a dict first; for Chinese text a CJK-capable font must be supplied via font_path.

Word cloud shaped by an image

import jieba.posseg as pseg
from collections import Counter
from PIL import Image
import numpy as np
from wordcloud import WordCloud

# 1. Read the text and segment it with part-of-speech tagging
with open('demo.txt', mode='r', encoding='gbk') as f:
    report = f.read()
words = pseg.cut(report)

# 2. Keep words of the required length and part of speech (nouns)
report_words = []
for word, flag in words:
    if (len(word) >= 4) and ('n' in flag):
        report_words.append(word)
# print(report_words)

# 3. Count the 300 most frequent words
result = Counter(report_words).most_common(300)
# print(result)

# 4. Draw the word cloud inside the shape of the image
mask_pic = Image.open('map.png')
mask_data = np.array(mask_pic)
print(mask_data)
content = dict(result)
wc = WordCloud(font_path='PINGFANG MEDIUM.TTF', background_color='white', mask=mask_data)
wc.generate_from_frequencies(content)
wc.to_file('词云图2.png')

Here the mask parameter is passed to WordCloud, so words are drawn only inside the non-white area of the image.
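The mask must be a NumPy array in which pure-white (255) pixels are left blank and all other pixels may receive words. A mask can also be built without an image file; a minimal sketch of a circular mask (dimensions are arbitrary):

```python
import numpy as np

h, w = 200, 200
# Start all white: white pixels are excluded from the word cloud
mask = np.full((h, w), 255, dtype=np.uint8)

# Carve out a circle of non-white pixels where words may be placed
y, x = np.ogrid[:h, :w]
inside = (x - w // 2) ** 2 + (y - h // 2) ** 2 <= (w // 3) ** 2
mask[inside] = 0

# Passed as WordCloud(mask=mask, ...) this draws the words in a circle
```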

Image-shaped word cloud with the image's colors

import jieba.posseg as pseg
from collections import Counter
from PIL import Image
import numpy as np
from wordcloud import WordCloud, ImageColorGenerator

# 1. Read the text and segment it with part-of-speech tagging
with open('demo.txt', mode='r', encoding='gbk') as f:
    report = f.read()
words = pseg.cut(report)

# 2. Keep words of the required length and part of speech (nouns)
report_words = []
for word, flag in words:
    if (len(word) >= 4) and ('n' in flag):
        report_words.append(word)
# print(report_words)

# 3. Count the 300 most frequent words
result = Counter(report_words).most_common(300)
# print(result)

# 4. Draw the word cloud and recolor it from the image
mask_pic = Image.open('map.png')
mask_data = np.array(mask_pic)
content = dict(result)
wc = WordCloud(font_path='PINGFANG MEDIUM.TTF', background_color='white', mask=mask_data)
wc.generate_from_frequencies(content)
mask_colors = ImageColorGenerator(mask_data)
wc.recolor(color_func=mask_colors)
wc.to_file('词云图3.png')

Here recolor repaints the words, taking each word's color from the mask image at that word's position.


