理解语言的 Transformer 模型

您所在的位置：网站首页 › transflow教程 › 理解语言的 Transformer 模型

理解语言的 Transformer 模型

2024-07-13 12:51| 来源: 网络整理| 查看: 265

在 tensorflow.google.cn 上查看

在 Google Colab 运行

在 Github 上查看源代码

下载此 notebook Note: 我们的 TensorFlow 社区翻译了这些文档。因为社区翻译是尽力而为，所以无法保证它们是最准确的，并且反映了最新的官方英文文档。如果您有改进此翻译的建议，请提交 pull request 到 tensorflow/docs GitHub 仓库。要志愿地撰写或者审核译文，请加入 [email protected] Google Group

本教程训练了一个 Transformer 模型用于将葡萄牙语翻译成英语。这是一个高级示例，假定您具备文本生成（text generation）和注意力机制（attention）的知识。

Transformer 模型的核心思想是自注意力机制（self-attention）——能注意输入序列的不同位置以计算该序列的表示的能力。Transformer 创建了多层自注意力层（self-attetion layers）组成的堆栈，下文的按比缩放的点积注意力（Scaled dot product attention）和多头注意力（Multi-head attention）部分对此进行了说明。

一个 transformer 模型用自注意力层而非 RNNs 或 CNNs 来处理变长的输入。这种通用架构有一系列的优势：

它不对数据间的时间/空间关系做任何假设。这是处理一组对象（objects）的理想选择（例如，星际争霸单位（StarCraft units））。层输出可以并行计算，而非像 RNN 这样的序列计算。远距离项可以影响彼此的输出，而无需经过许多 RNN 步骤或卷积层（例如，参见场景记忆 Transformer（Scene Memory Transformer））它能学习长距离的依赖。在许多序列任务中，这是一项挑战。

该架构的缺点是：

对于时间序列，一个单位时间的输出是从整个历史记录计算的，而非仅从输入和当前的隐含状态计算得到。这可能效率较低。如果输入确实有时间/空间的关系，像文本，则必须加入一些位置编码，否则模型将有效地看到一堆单词。

在此 notebook 中训练完模型后，您将能输入葡萄牙语句子，得到其英文翻译。

Attention heatmap

import tensorflow_datasets as tfds import tensorflow as tf import time import numpy as np import matplotlib.pyplot as plt 设置输入流水线（input pipeline）

使用 TFDS 来导入葡萄牙语-英语翻译数据集，该数据集来自于 TED 演讲开放翻译项目.

该数据集包含来约 50000 条训练样本，1100 条验证样本，以及 2000 条测试样本。

examples, metadata = tfds.load('ted_hrlr_translate/pt_to_en', with_info=True, as_supervised=True) train_examples, val_examples = examples['train'], examples['validation']

从训练数据集创建自定义子词分词器（subwords tokenizer）。

tokenizer_en = tfds.features.text.SubwordTextEncoder.build_from_corpus( (en.numpy() for pt, en in train_examples), target_vocab_size=2**13) tokenizer_pt = tfds.features.text.SubwordTextEncoder.build_from_corpus( (pt.numpy() for pt, en in train_examples), target_vocab_size=2**13) sample_string = 'Transformer is awesome.' tokenized_string = tokenizer_en.encode(sample_string) print ('Tokenized string is {}'.format(tokenized_string)) original_string = tokenizer_en.decode(tokenized_string) print ('The original string: {}'.format(original_string)) assert original_string == sample_string

如果单词不在词典中，则分词器（tokenizer）通过将单词分解为子词来对字符串进行编码。

for ts in tokenized_string: print ('{} ----> {}'.format(ts, tokenizer_en.decode([ts]))) BUFFER_SIZE = 20000 BATCH_SIZE = 64

将开始和结束标记（token）添加到输入和目标。

def encode(lang1, lang2): lang1 = [tokenizer_pt.vocab_size] + tokenizer_pt.encode( lang1.numpy()) + [tokenizer_pt.vocab_size+1] lang2 = [tokenizer_en.vocab_size] + tokenizer_en.encode( lang2.numpy()) + [tokenizer_en.vocab_size+1] return lang1, lang2

Note：为了使本示例较小且相对较快，删除长度大于40个标记的样本。

MAX_LENGTH = 40 def filter_max_length(x, y, max_length=MAX_LENGTH): return tf.logical_and(tf.size(x) english for (batch, (inp, tar)) in enumerate(train_dataset): train_step(inp, tar) if batch % 50 == 0: print ('Epoch {} Batch {} Loss {:.4f} Accuracy {:.4f}'.format( epoch + 1, batch, train_loss.result(), train_accuracy.result())) if (epoch + 1) % 5 == 0: ckpt_save_path = ckpt_manager.save() print ('Saving checkpoint for epoch {} at {}'.format(epoch+1, ckpt_save_path)) print ('Epoch {} Loss {:.4f} Accuracy {:.4f}'.format(epoch + 1, train_loss.result(), train_accuracy.result())) print ('Time taken for 1 epoch: {} secs\n'.format(time.time() - start)) 评估（Evaluate）

以下步骤用于评估：

用葡萄牙语分词器（tokenizer_pt）编码输入语句。此外，添加开始和结束标记，这样输入就与模型训练的内容相同。这是编码器输入。解码器输入为 start token == tokenizer_en.vocab_size。计算填充遮挡和前瞻遮挡。解码器通过查看编码器输出和它自身的输出（自注意力）给出预测。选择最后一个词并计算它的 argmax。将预测的词连接到解码器输入，然后传递给解码器。在这种方法中，解码器根据它预测的之前的词预测下一个。

Note：这里使用的模型具有较小的能力以保持相对较快，因此预测可能不太正确。要复现论文中的结果，请使用全部数据集，并通过修改上述超参数来使用基础 transformer 模型或者 transformer XL。

def evaluate(inp_sentence): start_token = [tokenizer_pt.vocab_size] end_token = [tokenizer_pt.vocab_size + 1] # 输入语句是葡萄牙语，增加开始和结束标记 inp_sentence = start_token + tokenizer_pt.encode(inp_sentence) + end_token encoder_input = tf.expand_dims(inp_sentence, 0) # 因为目标是英语，输入 transformer 的第一个词应该是 # 英语的开始标记。 decoder_input = [tokenizer_en.vocab_size] output = tf.expand_dims(decoder_input, 0) for i in range(MAX_LENGTH): enc_padding_mask, combined_mask, dec_padding_mask = create_masks( encoder_input, output) # predictions.shape == (batch_size, seq_len, vocab_size) predictions, attention_weights = transformer(encoder_input, output, False, enc_padding_mask, combined_mask, dec_padding_mask) # 从 seq_len 维度选择最后一个词 predictions = predictions[: ,-1:, :] # (batch_size, 1, vocab_size) predicted_id = tf.cast(tf.argmax(predictions, axis=-1), tf.int32) # 如果 predicted_id 等于结束标记，就返回结果 if predicted_id == tokenizer_en.vocab_size+1: return tf.squeeze(output, axis=0), attention_weights # 连接 predicted_id 与输出，作为解码器的输入传递到解码器。 output = tf.concat([output, predicted_id], axis=-1) return tf.squeeze(output, axis=0), attention_weights def plot_attention_weights(attention, sentence, result, layer): fig = plt.figure(figsize=(16, 8)) sentence = tokenizer_pt.encode(sentence) attention = tf.squeeze(attention[layer], axis=0) for head in range(attention.shape[0]): ax = fig.add_subplot(2, 4, head+1) # 画出注意力权重 ax.matshow(attention[head][:-1, :], cmap='viridis') fontdict = {'fontsize': 10} ax.set_xticks(range(len(sentence)+2)) ax.set_yticks(range(len(result))) ax.set_ylim(len(result)-1.5, -0.5) ax.set_xticklabels( ['']+[tokenizer_pt.decode([i]) for i in sentence]+[''], fontdict=fontdict, rotation=90) ax.set_yticklabels([tokenizer_en.decode([i]) for i in result if i < tokenizer_en.vocab_size], fontdict=fontdict) ax.set_xlabel('Head {}'.format(head+1)) plt.tight_layout() plt.show() def translate(sentence, plot=''): result, attention_weights = evaluate(sentence) predicted_sentence = tokenizer_en.decode([i for i in result if i < tokenizer_en.vocab_size]) print('Input: {}'.format(sentence)) print('Predicted translation: {}'.format(predicted_sentence)) if plot: plot_attention_weights(attention_weights, sentence, result, plot) translate("este é um problema que temos que resolver.") print ("Real translation: this is a problem we have to solve .") translate("os meus vizinhos ouviram sobre esta ideia.") print ("Real translation: and my neighboring homes heard about this idea .") translate("vou então muito rapidamente partilhar convosco algumas histórias de algumas coisas mágicas que aconteceram.") print ("Real translation: so i 'll just share with you some stories very quickly of some magical things that have happened .")

您可以为 plot 参数传递不同的层和解码器的注意力模块。

translate("este é o primeiro livro que eu fiz.", plot='decoder_layer4_block2') print ("Real translation: this is the first book i've ever done.") 总结

在本教程中，您已经学习了位置编码，多头注意力，遮挡的重要性以及如何创建一个 transformer。

尝试使用一个不同的数据集来训练 transformer。您可也可以通过修改上述的超参数来创建基础 transformer 或者 transformer XL。您也可以使用这里定义的层来创建 BERT 并训练最先进的模型。此外，您可以实现 beam search 得到更好的预测。

【本文地址】

理解语言的 Transformer 模型

理解语言的 Transformer 模型

今日新闻

推荐新闻