Train a model on the 20_newsgroup dataset, loading pretrained GloVe weights
Preprocess the 20_newsgroup dataset
Load sample
Preview file folder
Define the path to 20_newsgroup folder
Load data from all the child folder in 20_newsgroup
Preprocess the texts data
Import Library
Tokenizer
Pad_sequences
Preprocess the labels
Import Library
Use the to_categorical function to convert the labels to one-hot vectors
Shuffle and Split to train and dev set
Build the Embedding layer with the GloVe weights
Download the glove.6B.50d.txt file
Load the Embedding Vector
Find the corresponding path
Read each word's word vector
Build the weight matrix and load each word's word vector
Build the Embedding layer
Train a simple ConvNet
Import Library
Build the Model
Train the model
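The weight-matrix step in the outline above (parse glove.6B.50d.txt, then copy each known word's vector into a matrix indexed by the tokenizer's word ids) can be sketched as follows. The two-line GloVe snippet, the 3-dimensional vectors, and the `word_index` dict are toy stand-ins: the real file holds 50-dimensional vectors, and `word_index` would come from the Keras Tokenizer.

```python
import numpy as np

# Toy stand-in for glove.6B.50d.txt: one word followed by its vector per line.
glove_lines = [
    "the 0.1 0.2 0.3",
    "news 0.4 0.5 0.6",
]

# Parse each line into word -> vector, as we will do for the real file.
embeddings_index = {}
for line in glove_lines:
    values = line.split()
    embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

# word_index would normally come from the Keras Tokenizer (ids start at 1).
word_index = {'the': 1, 'news': 2, 'unseen': 3}
embedding_dim = 3

# Row i of the matrix is the GloVe vector of the word with id i.
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:        # words missing from GloVe stay all-zero
        embedding_matrix[i] = vector
```

This matrix is what later gets passed to the Embedding layer as its initial weights (typically with the layer frozen so the pretrained vectors are not updated).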
Preprocess the 20_newsgroup dataset
Load sample
Preview file folder
First download news20.tar.gz and extract it into the directory where the Jupyter notebook runs. A folder named 20_newsgroup will appear; this is our training set. Opening this folder, we find 20 different subfolders, each covering a different topic. Later we will label each text according to the folder it comes from.
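The extraction step can also be done from Python with the standard tarfile module. The sketch below demonstrates it on a tiny stand-in archive that it builds itself (one made-up article in one made-up group); for the real dataset, open 'news20.tar.gz' in the notebook directory instead.

```python
import io
import os
import tarfile
import tempfile

workdir = tempfile.mkdtemp()
# Stand-in archive; for the real data, use 'news20.tar.gz' in the notebook directory.
archive = os.path.join(workdir, 'news20.tar.gz')

# Build a tiny gzip'd tar with one article (made-up group and file name).
with tarfile.open(archive, 'w:gz') as tar:
    body = b'From: someone\n\nAtheism discussion body.'
    info = tarfile.TarInfo('20_newsgroup/alt.atheism/49960')
    info.size = len(body)
    tar.addfile(info, io.BytesIO(body))

# Extract it: this creates the 20_newsgroup folder next to the archive.
with tarfile.open(archive, 'r:gz') as tar:
    tar.extractall(workdir)
```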
Define the path to 20_newsgroup folder
import os
from os.path import join
Base_dir = '.'
Text_dir = join(Base_dir,'20_newsgroup')
print(Text_dir)
Load data from all the child folder in 20_newsgroup
texts, labels, labels_index = [], {}, []
# texts: after the loop below, a list holding the text of every file
# labels: a dict numbering the different folders (folder name -> label id)
# labels_index: the label id of each text; this list has the same length as texts
for name in sorted(os.listdir(Text_dir)):
    # every folder under the root folder gets a unique label number
    labels[name] = len(labels)
    path = join(Text_dir, name)
    for fname in sorted(os.listdir(path)):
        if fname.isdigit():  # the files we want all have numeric names
            fpath = join(path, fname)
            labels_index.append(labels[name])
            f = open(fpath, encoding='latin-1')
            t = f.read()
            # skip the header: drop everything before the first blank line
            i = t.find('\n\n')
            if i > 0:
                t = t[i:]  # remove the useless header of each article
            texts.append(t)
            f.close()
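To sanity-check this loading logic, here is a self-contained run of the same loop on a small synthetic folder tree (the two group names and file contents are made up for illustration):

```python
import os
import tempfile
from os.path import join

# Build a tiny synthetic 20_newsgroup-style tree: two made-up groups,
# numeric file names for articles, one non-numeric file to be skipped.
root = join(tempfile.mkdtemp(), '20_newsgroup')
docs = {
    'alt.atheism': {'49960': 'Header: x\n\nbody one'},
    'sci.space': {'61000': 'Header: y\n\nbody two', 'notes.txt': 'ignored'},
}
for group, files in docs.items():
    os.makedirs(join(root, group))
    for fname, content in files.items():
        with open(join(root, group, fname), 'w') as f:
            f.write(content)

# Same logic as the loading loop above, pointed at the synthetic tree.
texts, labels, labels_index = [], {}, []
for name in sorted(os.listdir(root)):
    labels[name] = len(labels)
    path = join(root, name)
    for fname in sorted(os.listdir(path)):
        if fname.isdigit():              # only numeric file names are articles
            labels_index.append(labels[name])
            with open(join(path, fname), encoding='latin-1') as f:
                t = f.read()
            i = t.find('\n\n')
            if i > 0:
                t = t[i:]                # drop the header block
            texts.append(t)
```

After the run, `texts` holds the two article bodies, `labels` maps each group to an id, and `labels_index` gives each text's label; `notes.txt` is skipped because its name is not numeric.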