NLP: Training Chinese Word Vectors with word2vec in gensim



2024-07-12 18:27 | Source: compiled from the web

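Contents
Preface
1.1 Download the dataset
1.2 Preprocessing
1.2.1 Convert the original XML file to a TXT file
1.2.2 Convert Traditional Chinese to Simplified Chinese
1.2.3 Word segmentation
1.2.4 Remove stop words
1.3 Train the word vectors
1.4 Test the word vectors
1.5 Summary
References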

Preface

This article shows how to train Chinese word vectors with the word2vec implementation in gensim.

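1.1 Download the dataset

This article uses the Chinese Wikipedia dump as training data. It can be downloaded from the Wikimedia dumps site as a compressed file named zhwiki-latest-pages-articles.xml.bz2, which contains a single XML file. There is no need to decompress it by hand: the conversion code in section 1.2.1 passes the .bz2 path directly to gensim.

Download URL: https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2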
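As a quick sanity check after downloading, the first few lines of the dump can be read without decompressing the whole archive. The sketch below uses only Python's standard `bz2` module; the helper name `peek_bz2` and its default line count are illustrative choices, not part of the original workflow.

```python
import bz2
from itertools import islice


def peek_bz2(path, num_lines=5):
    """Return the first `num_lines` lines of a bz2-compressed text file."""
    # Mode 'rt' decompresses on the fly and decodes the bytes as UTF-8 text.
    with bz2.open(path, mode='rt', encoding='utf-8') as f:
        return [line.rstrip('\n') for line in islice(f, num_lines)]
```

Calling `peek_bz2(r'.\zhwiki\zhwiki-latest-pages-articles.xml.bz2')` should show the opening XML of the dump; if it raises an error, the download is probably incomplete or corrupted.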

1.2 Preprocessing

1.2.1 Convert the original XML file to a TXT file

The downloaded data is an XML file full of markup tags, so it has to be cleaned and saved as plain text first. gensim provides gensim.corpora.WikiCorpus for this. Sample code:

    from gensim.corpora import WikiCorpus


    class ZhWikiPreProcessor:
        def convert_xml_to_text(self, input_file, output_file):
            with open(output_file, mode='w', encoding='utf-8') as out_f:
                index = 0
                space = ' '
                # Pass an empty dictionary so WikiCorpus does not build a vocabulary
                wiki_corpus = WikiCorpus(input_file, dictionary={})
                for text in wiki_corpus.get_texts():
                    # Write one article per line, tokens separated by spaces
                    out_f.write(space.join(text) + '\n')
                    index += 1
                    # Print progress
                    if index % 1000 == 0:
                        print("Saved {} articles...".format(index))


    # WikiCorpus uses multiprocessing, so guard the entry point (needed on Windows)
    if __name__ == '__main__':
        zh_wiki_xml_file = r'.\zhwiki\zhwiki-latest-pages-articles.xml.bz2'
        zh_wiki_text_file = r'.\zhwiki\zh_wiki_original_text.txt'

        preprocessor = ZhWikiPreProcessor()
        # Convert the XML content to plain text
        preprocessor.convert_xml_to_text(zh_wiki_xml_file, zh_wiki_text_file)

1.2.2 Convert Traditional Chinese to Simplified Chinese

The Chinese Wikipedia text obtained above contains many Traditional Chinese characters, so they need to be converted to Simplified Chinese. OpenCC can do this. Sample code:

    from opencc import OpenCC


    class ZhWikiPreProcessor:
        def convert_t2s(self, input_file, output_file):
            with open(output_file, mode='w', encoding='utf-8') as output_f:
                with open(input_file, mode='r', encoding='utf-8') as input_f:
                    index = 0
                    cc = OpenCC('t2s')
                    # Iterate over the file object so the whole corpus is not loaded into memory
                    for line in input_f:
                        # Convert Traditional Chinese to Simplified Chinese
                        line = cc.convert(line.strip())
                        # Write the converted line
                        output_f.write(line + '\n')
                        index += 1
                        # Print progress
                        if index % 1000 == 0:
                            print('Converted {} articles...'.format(index))


    zh_wiki_text_file = r'.\zhwiki\zh_wiki_original_text.txt'
    zh_wiki_simple_text_file = r'.\zhwiki\zh_wiki_simple_text.txt'

    preprocessor = ZhWikiPreProcessor()
    # Convert Traditional Chinese to Simplified Chinese
    preprocessor.convert_t2s(zh_wiki_text_file, zh_wiki_simple_text_file)

For more information about OpenCC, see the OpenCC project documentation.
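The conversion step and the segmentation steps that follow share one pattern: read a line, transform it, write it out. A minimal sketch of that pattern as a reusable helper is shown below. The name `transform_lines` and its parameters are hypothetical, and the helper assumes the transform takes a string and returns a string.

```python
def transform_lines(input_file, output_file, transform, report_every=1000):
    """Apply `transform` to every line of input_file and write the results.

    Returns the number of lines processed.
    """
    count = 0
    with open(input_file, mode='r', encoding='utf-8') as input_f, \
            open(output_file, mode='w', encoding='utf-8') as output_f:
        # Iterating over the file object reads one line at a time,
        # so a multi-gigabyte corpus never has to fit in memory.
        for line in input_f:
            output_f.write(transform(line.strip()) + '\n')
            count += 1
            if count % report_every == 0:
                print('Processed {} lines...'.format(count))
    return count
```

With OpenCC installed, the conversion step could then be written as `transform_lines(zh_wiki_text_file, zh_wiki_simple_text_file, OpenCC('t2s').convert)`.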

1.2.3 Word segmentation

Next, the training text has to be segmented, that is, each sentence is split into individual words. Chinese word segmentation tools include NLPIR (Institute of Computing Technology, Chinese Academy of Sciences), LTP (Harbin Institute of Technology), THULAC (Tsinghua University), PKUSeg (Peking University), FoolNLTK, HanLP, and jieba. This article uses jieba because it is simple to use and widely adopted. Sample code:

    import jieba


    class ZhWikiPreProcessor:
        def text_segment(self, input_file, output_file):
            with open(output_file, mode='w', encoding='utf-8') as output_f:
                with open(input_file, mode='r', encoding='utf-8') as input_f:
                    index = 0
                    for line in input_f:
                        # Segment the text (precise mode)
                        words = jieba.cut(line.strip(), cut_all=False)
                        # Join the words with spaces and write them out
                        output_f.write(' '.join(words) + '\n')
                        index += 1
                        # Print progress
                        if index % 1000 == 0:
                            print("Segment text {} lines...".format(index))


    zh_wiki_simple_text_file = r'.\zhwiki\zh_wiki_simple_text.txt'
    zh_wiki_simple_segment_text_file = r'.\zhwiki\zh_wiki_simple_segment_text.txt'

    preprocessor = ZhWikiPreProcessor()
    # Segment the text
    preprocessor.text_segment(zh_wiki_simple_text_file, zh_wiki_simple_segment_text_file)

1.2.4 Remove stop words

Sometimes stop words also need to be removed. This takes two additions to the segmentation code above: (a) load the set of stop words; (b) drop stop words from the segmentation result. Sample code:

    import jieba


    class ZhWikiPreProcessor:
        def text_segment(self, input_file, output_file, stopwords_file=None):
            # Load the stop-word set (addition 1)
            if stopwords_file is not None:
                stopwords_set = self.get_stopwords(stopwords_file)
            else:
                stopwords_set = set()

            with open(output_file, mode='w', encoding='utf-8') as output_f:
                with open(input_file, mode='r', encoding='utf-8') as input_f:
                    index = 0
                    for line in input_f:
                        # Segment the text (precise mode)
                        words = jieba.cut(line.strip(), cut_all=False)
                        # Remove stop words (addition 2)
                        words = [word for word in words if word not in stopwords_set]
                        # Join the words with spaces and write them out
                        output_f.write(' '.join(words) + '\n')
                        index += 1
                        # Print progress
                        if index % 1000 == 0:
                            print("Segment text {} lines...".format(index))

        @staticmethod
        def get_stopwords(stopwords_file):
            # The stop-word file is expected to hold one word per line
            stopwords_set = set()
            with open(stopwords_file, mode='r', encoding='utf-8') as f:
                for stopword in f:
                    stopwords_set.add(stopword.strip())
            return stopwords_set

1.3 Train the word vectors

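This article trains the vectors with the word2vec implementation in the gensim package. Sample code:

    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence


    def train_zhwiki():
        # Training data: the segmented text produced in section 1.2.3
        data_file = r'.\zhwiki\zh_wiki_simple_segment_text.txt'
        # Output files for the model and the vectors
        model_file = r'.\zhwiki\zh_wiki.model'
        vector_file = r'.\zhwiki\zh_wiki.vector'

        # Train the model
        model = Word2Vec(LineSentence(data_file), vector_size=100)

        # Save the full model
        model.save(model_file)
        # Save only the word vectors in the plain-text word2vec format
        model.wv.save_word2vec_format(vector_file, binary=False)


    if __name__ == '__main__':
        train_zhwiki()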
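`LineSentence` reads the format produced by the segmentation step: one text per line, with tokens separated by whitespace. The class below is a simplified stand-in that illustrates what such an iterable looks like. It is not gensim's implementation (it omits details such as the handling of very long lines), and the name `SimpleLineSentence` is hypothetical. The point it illustrates is that the corpus must be restartable, because training passes over it more than once, so a one-shot generator would not do.

```python
class SimpleLineSentence:
    """Iterate over a text file, yielding each line as a list of tokens."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated repeatedly.
        with open(self.path, mode='r', encoding='utf-8') as f:
            for line in f:
                tokens = line.split()
                if tokens:  # skip empty lines
                    yield tokens
```

Because `__iter__` opens the file afresh each time, iterating over the same object twice yields the same sentences both times.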

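The main parameters of gensim.models.Word2Vec are listed below:

Parameter         Description
sentences         The training corpus. Must be an iterable, such as a plain list or the LineSentence object provided by gensim.
corpus_file       Path to a training corpus file; can be used instead of sentences.
vector_size       Dimensionality of the word vectors. Default: 100.
window            Maximum distance between the target word and a context word.
min_count         Words with a frequency below this value get no vector. Default: 5.
workers           Number of threads used for training.
sg                Training algorithm. 1 means skip-gram; otherwise CBOW.
hs                Optimization method. 1 means hierarchical softmax; 0 with a non-zero negative means negative sampling.
negative          Number of negative samples. If set to 0, negative sampling is not used.
cbow_mean         Only applies to CBOW. 0 uses the sum of the context word vectors; 1 uses their mean.
alpha             Initial learning rate.
min_alpha         Minimum learning rate.
max_vocab_size    Maximum size of the vocabulary.
epochs            Number of training iterations over the corpus.
callbacks         Callback functions.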
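To make the effect of `min_count` concrete, the following sketch counts word frequencies in a tokenized corpus and reports which words would be kept at a given threshold. It uses only `collections.Counter` and mirrors the rule stated above (words with a frequency below `min_count` are dropped). The function name `words_kept` and the toy corpus are hypothetical, and gensim's own vocabulary building involves more steps than this.

```python
from collections import Counter


def words_kept(sentences, min_count=5):
    """Return the set of words whose corpus frequency is at least `min_count`."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, freq in counts.items() if freq >= min_count}


# A toy pre-segmented corpus: each sentence is a list of tokens.
corpus = [
    ['清华大学', '位于', '北京'],
    ['北京大学', '位于', '北京'],
    ['北京', '是', '首都'],
]
print(words_kept(corpus, min_count=2))
```

Only the two words that occur at least twice survive here. On a real corpus, running such a count before training gives a rough idea of how large the vocabulary will be for a given threshold.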

Tuning these parameters can improve the quality of the trained model.

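1.4 Test the word vectors

The following code checks the quality of the trained word vectors:

    from gensim.models import Word2Vec


    def test_zhwiki_model():
        # Model file name
        model_file = r'.\zhwiki\zh_wiki.model'

        # Load the model
        model = Word2Vec.load(model_file)

        # Get the words most similar to a given word
        result = model.wv.most_similar('清华大学')
        for word in result:
            print(word)

        print('\n')

        # Get the cosine similarity between two words
        result = model.wv.similarity('男人', '女人')
        print(result)


    if __name__ == '__main__':
        test_zhwiki_model()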
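The `similarity` call in the code above returns the cosine similarity of the two word vectors. A minimal standard-library sketch of that formula follows. The function name and the sample vectors in the note below are illustrative; with a trained model, the vectors would come from `model.wv[word]`.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: (u . v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Vectors pointing in the same direction, such as `[1, 2, 3]` and `[2, 4, 6]`, score 1 (up to floating-point rounding), and orthogonal vectors such as `[1, 0]` and `[0, 1]` score 0, so values close to 1 indicate words used in similar contexts.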

Running test_zhwiki_model() prints output like the following (exact values depend on the dump snapshot and the training run):

    ('北京大学', 0.886064350605011)
    ('武汉大学', 0.8563425540924072)
    ('清华', 0.8015897274017334)
    ('浙江大学', 0.7996682524681091)
    ('中山大学', 0.7982087135314941)
    ('复旦大学', 0.7962114214897156)
    ('中国人民大学', 0.7908720374107361)
    ('南开大学', 0.7769314050674438)
    ('南京大学', 0.775385320186615)
    ('华东师范大学', 0.7644416093826294)


    0.915614

1.5 Summary

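Judging by the test results above, the word vectors are reasonably good. To use them in a real project, consider improving them along the following lines:

- Improve segmentation accuracy.
- Maintain an effective stop-word list.
- Keep tuning the parameters of gensim.models.Word2Vec.
- Other improvements.

References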

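[1] 利用Python构建Wiki中文语料词向量模型试验 (Building a word vector model from the Chinese Wikipedia corpus with Python: an experiment)

[2] 使用 gensim 訓練中文詞向量 (Training Chinese word vectors using gensim)

[3] 以 gensim 訓練中文詞向量 (Training Chinese word vectors with gensim)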
