Train a model on the 20_newsgroup dataset, loading pretrained GloVe weights
Preprocess the 20_newsgroup dataset
Load sample
Preview file folder
Define the path to 20_newsgroup folder
Load data from all the child folder in 20_newsgroup
Preprocess the texts data
Import Library
Tokenizer
Pad_sequences
Preprocess the labels
Import Library
Use the to_categorical function to convert the labels to one-hot vectors
Shuffle and Split to train and dev set
Build the Embedding layer with the GloVe weights
Download the glove.6B.50d.txt file
Load the Embedding Vector
Find the corresponding path
Read each word's word vector
Build the weight matrix and load each word's word vector
Build the Embedding layer
Train a simple ConvNet
Import Library
Build the Model
Train the model
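The weight-matrix step in the outline above (parse glove.6B.50d.txt, then copy each known word's vector into a matrix indexed by the tokenizer's word ids) can be sketched as follows. The two-line GloVe snippet, the 3-dimensional vectors, and the `word_index` dict are toy stand-ins: the real file holds 50-dimensional vectors, and `word_index` would come from the Keras Tokenizer.

```python
import numpy as np

# Toy stand-in for glove.6B.50d.txt: one word followed by its vector per line.
glove_lines = [
    "the 0.1 0.2 0.3",
    "news 0.4 0.5 0.6",
]

# Parse each line into word -> vector, as we will do for the real file.
embeddings_index = {}
for line in glove_lines:
    values = line.split()
    embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

# word_index would normally come from the Keras Tokenizer (ids start at 1).
word_index = {'the': 1, 'news': 2, 'unseen': 3}
embedding_dim = 3

# Row i of the matrix is the GloVe vector of the word with id i.
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:        # words missing from GloVe stay all-zero
        embedding_matrix[i] = vector
```

This matrix is what later gets passed to the Embedding layer as its initial weights (typically with the layer frozen so the pretrained vectors are not updated).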
Preprocess the 20_newsgroup dataset
Load sample
Preview file folder
First download news20.tar.gz and extract it into the directory where the Jupyter notebook runs. A folder named 20_newsgroup will appear; this is our training set. Opening this folder, we find 20 different subfolders, each covering a different topic. Later we will label each text according to the folder it comes from.
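The extraction step can also be done from Python with the standard tarfile module. The sketch below demonstrates it on a tiny stand-in archive that it builds itself (one made-up article in one made-up group); for the real dataset, open 'news20.tar.gz' in the notebook directory instead.

```python
import io
import os
import tarfile
import tempfile

workdir = tempfile.mkdtemp()
# Stand-in archive; for the real data, use 'news20.tar.gz' in the notebook directory.
archive = os.path.join(workdir, 'news20.tar.gz')

# Build a tiny gzip'd tar with one article (made-up group and file name).
with tarfile.open(archive, 'w:gz') as tar:
    body = b'From: someone\n\nAtheism discussion body.'
    info = tarfile.TarInfo('20_newsgroup/alt.atheism/49960')
    info.size = len(body)
    tar.addfile(info, io.BytesIO(body))

# Extract it: this creates the 20_newsgroup folder next to the archive.
with tarfile.open(archive, 'r:gz') as tar:
    tar.extractall(workdir)
```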
Define the path to 20_newsgroup folder
import os
from os.path import join
Base_dir = '.'
Text_dir = join(Base_dir,'20_newsgroup')
print(Text_dir)
Load data from all the child folder in 20_newsgroup
texts, labels, labels_index = [], {}, []
# texts: after the loop below, a list holding the text of every file
# labels: a dict numbering the different folders (folder name -> label id)
# labels_index: the label id of each text; this list has the same length as texts
for name in sorted(os.listdir(Text_dir)):
    # every folder under the root folder gets a unique label number
    labels[name] = len(labels)
    path = join(Text_dir, name)
    for fname in sorted(os.listdir(path)):
        if fname.isdigit():  # the files we want all have numeric names
            fpath = join(path, fname)
            labels_index.append(labels[name])
            f = open(fpath, encoding='latin-1')
            t = f.read()
            # skip the header: drop everything before the first blank line
            i = t.find('\n\n')
            if i > 0:
                t = t[i:]  # remove the useless header of each article
            texts.append(t)
            f.close()
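To sanity-check this loading logic, here is a self-contained run of the same loop on a small synthetic folder tree (the two group names and file contents are made up for illustration):

```python
import os
import tempfile
from os.path import join

# Build a tiny synthetic 20_newsgroup-style tree: two made-up groups,
# numeric file names for articles, one non-numeric file to be skipped.
root = join(tempfile.mkdtemp(), '20_newsgroup')
docs = {
    'alt.atheism': {'49960': 'Header: x\n\nbody one'},
    'sci.space': {'61000': 'Header: y\n\nbody two', 'notes.txt': 'ignored'},
}
for group, files in docs.items():
    os.makedirs(join(root, group))
    for fname, content in files.items():
        with open(join(root, group, fname), 'w') as f:
            f.write(content)

# Same logic as the loading loop above, pointed at the synthetic tree.
texts, labels, labels_index = [], {}, []
for name in sorted(os.listdir(root)):
    labels[name] = len(labels)
    path = join(root, name)
    for fname in sorted(os.listdir(path)):
        if fname.isdigit():              # only numeric file names are articles
            labels_index.append(labels[name])
            with open(join(path, fname), encoding='latin-1') as f:
                t = f.read()
            i = t.find('\n\n')
            if i > 0:
                t = t[i:]                # drop the header block
            texts.append(t)
```

After the run, `texts` holds the two article bodies, `labels` maps each group to an id, and `labels_index` gives each text's label; `notes.txt` is skipped because its name is not numeric.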