Attention is all you need: understanding with example


We must keep in mind that each row of the above matrices corresponds to one token of the input sequence, as shown in the diagram below.

Q1, K1 & V1 correspond to the Query, Key & Value for the 1st token, and likewise for the other tokens. Observe the similarity between this & the figure above.

But, where did the monstrous formula for Attention go?

Going one term at a time, let us first calculate the term below:

As d_k = 3 (assumed earlier), root(d_k) ≈ 1.73 is approximated to 1 for ease of calculation.

Hence, Q x K_Transpose/1 =

The pictorial representation of what’s going on has been added below. The above matrix can be called the Score matrix. The image below shows the scores for the 1st token.

The score for 1st token (corresponds to 1st row of Score matrix)

Now, we need to apply softmax across each row of the above output.

Hence, softmax(QxK_Transpose/1)=

Softmaxed scores for each token. Here also, 1st row: Raj, 2nd row: is & 3rd row: good
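For readers who prefer to verify these numbers in code, here is a minimal numpy sketch of this step. The Q & K matrices from the figures are not reproduced in text above, so the Score matrix below is assumed purely for illustration; its 1st row [2, 4, 4] reproduces the softmaxed values quoted for ‘Raj’, while the other two rows are arbitrary.

import numpy as np

def softmax(x, axis=-1):
    # subtract the row-wise max before exponentiating, for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d_k = 3
# Assumed Score matrix (Q x K_Transpose); only the 1st row is tied to the
# numbers quoted in this article, the other two rows are illustrative.
scores = np.array([[2., 4., 4.],
                   [4., 4., 2.],
                   [2., 2., 6.]])

# The article approximates root(d_k) = root(3) to 1, so the scaling is a no-op here;
# in general it would be scores / np.sqrt(d_k)
softmaxed_scores = softmax(scores / 1.0, axis=1)
print(softmaxed_scores[0])   # -> [0.06337894 0.46831053 0.46831053]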

The softmaxed scores for a token represent how important every token in the sequence is to that token. For example, the softmaxed score for the 1st token ‘Raj’ = [0.06, 0.46, 0.46] implies that, for ‘Raj’, the importance of

Raj = 0.06

is = 0.46

good = 0.46

The higher the score, the more important that token is to the token under consideration (a token can also attend to itself).

After applying softmax() on the Score matrix, this is what happens in the Attention Layer for 1st token

The softmaxed score for 1st token (corresponds to 1st row of softmaxed Score matrix)

So, we are left with the last part of the equation: multiplying the above softmaxed scores with the Value matrix. But with a twist. We will be calculating 3 attention vectors for each token, one for every token in the sequence including itself. If the total number of tokens were 5, 6 or any other number, that many attention vectors would be calculated.

What we will be doing is calculate the attention for the 1st token, i.e. row 1 below. For this, we need to multiply each value of row 1 in the softmaxed Score matrix with the row at the corresponding index in the Value matrix (observe the Value matrix declared above), i.e.

0.06337894 ([0,0] in softmaxed matrix) * [1. 2. 3.] (1st row, Value matrix) =

[0.06337894, 0.12675788, 0.19013681] i.e. A1

0.46831053 ([0,1] in softmaxed matrix) * [2. 8. 0.] (2nd row, Value matrix) =

[0.93662106, 3.74648425, 0.] i.e. A2

0.46831053 ([0,2] in softmaxed matrix) * [2. 6. 3.] (3rd row, Value matrix) =

[0.93662106, 2.80986319, 1.40493159] i.e. A3

Observe how the 3 attention vectors are calculated for 1st token (values are rounded off for ease of understanding)

Now, we need to add these 3 vectors A1+A2+A3=

[1.93662106, 6.68310531, 1.59506841]

And finally, attention for 1st token is calculated !!
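The same calculation can be checked directly in numpy, using the Value matrix rows & the softmaxed scores quoted above (a minimal sketch of this one step, not of the full layer):

import numpy as np

# Value matrix from the example: rows for 'Raj', 'is' & 'good'
V = np.array([[1., 2., 3.],
              [2., 8., 0.],
              [2., 6., 3.]])

# Softmaxed scores for the 1st token 'Raj'
w = np.array([0.06337894, 0.46831053, 0.46831053])

A1 = w[0] * V[0]   # ~ [0.06337894 0.12675788 0.19013682]
A2 = w[1] * V[1]   # ~ [0.93662106 3.74648425 0.        ]
A3 = w[2] * V[2]   # ~ [0.93662106 2.80986319 1.40493159]

print(A1 + A2 + A3)   # -> ~ [1.93662106 6.68310531 1.59506841], attention for 'Raj'

# For all tokens at once, this last step is just a matrix product:
# attention_matrix = softmaxed_scores @ V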

Similarly, we can calculate the attention for the remaining 2 tokens (considering the 2nd & 3rd rows of the softmaxed matrix respectively) & hence our Attention matrix will be of shape n x d_k, i.e. 3 x 3 in our case.

Now, coming back to the paper, where we have 8 such attention heads. In this case, we concatenate the output matrices from all heads & multiply this concatenated matrix with a weights matrix such that the output is n x d_model, which was the input shape for this Multi-Head Attention layer.

Here, z_i represents the attention matrix output from the i-th head.
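A toy sketch of this concatenation & projection is shown below. The head outputs z_i are random stand-ins here (in the paper, h = 8, d_k = 64 and d_model = 512), and W_O is the weights matrix mentioned above, which is learned in practice.

import numpy as np

n, d_k, h = 3, 3, 8            # sequence length, per-head dimension, number of heads
d_model = h * d_k              # 24 in this toy setup (512 in the paper)

rng = np.random.default_rng(0)
# z_i: attention matrix from the i-th head, each of shape n x d_k (random stand-ins)
heads = [rng.standard_normal((n, d_k)) for _ in range(h)]

concat = np.concatenate(heads, axis=1)          # n x (h * d_k)
W_O = rng.standard_normal((h * d_k, d_model))   # learned in practice, random here

multi_head_output = concat @ W_O                # n x d_model, same shape as the layer input
print(multi_head_output.shape)                  # -> (3, 24)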

Hence, to summarize

Calculate Q, K & V using Q_w, K_w & V_w & Input sequence Embeddings

Calculate Attention Matrix of dimension n x d_k. This involves a few steps shown above

Concatenate Attention matrix from all attention heads & multiply with a weights matrix such that the output = n x d_model.

Just before ending, we must ask why all this mathematics helps in calculating attention values. What is the significance of the Query, Key & Value matrices? To know this, do read this section.

That’s all for attention !!

Post Layer Normalization

With the idea of not losing out on essential information, this normalization uses residual connections to look back at the input of the previous layer & its output simultaneously. This is done in the following steps:

1. Add the Input & Output of the previous layer, whatever it may be. In this case, the Multi-Head Attention Layer is the previous layer, hence the Input & Output of this layer are added.

Let this be V (of the dimension n x d_model)

2. Normalize V using the formula below, applied to each row r of V:

Norm(r) = γ * (r − μ) / σ + β

Here, μ = mean, σ = standard deviation, and γ & β are learnable scaling & shifting parameters. It must be kept in mind that μ & σ are calculated separately for each row ‘r’ of V (the Input + Output matrix of the previous layer).
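A minimal numpy sketch of this Add & Norm step, assuming a small ε is added to σ for numerical stability (a standard implementation detail not spelled out above). γ & β are kept as scalars here for simplicity; in the actual model they are learned per-dimension parameters.

import numpy as np

def add_and_norm(layer_input, layer_output, gamma=1.0, beta=0.0, eps=1e-6):
    # Step 1: residual connection, add the layer's input & output
    v = layer_input + layer_output            # n x d_model
    # Step 2: normalize each row separately, then scale & shift
    mu = v.mean(axis=-1, keepdims=True)       # mean per row
    sigma = v.std(axis=-1, keepdims=True)     # standard deviation per row
    return gamma * (v - mu) / (sigma + eps) + beta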

Next is a Feed-Forward Network (FNN) comprising 2 linear layers with a ReLU activation in between. Also, though the input & output dimensions remain the same (n x d_model), the 1st layer has an output dimension of 2048.
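A minimal sketch of this Feed-Forward Network, with placeholder weights (these are learned in practice); only the shapes matter here, with d_model = 512 and the inner dimension 2048 as in the paper.

import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # 1st layer expands n x d_model to n x 2048, followed by ReLU
    hidden = np.maximum(0.0, x @ W1 + b1)
    # 2nd layer projects back to n x d_model
    return hidden @ W2 + b2

x = np.zeros((3, 512))                          # n x d_model
W1, b1 = np.zeros((512, 2048)), np.zeros(2048)  # placeholder weights
W2, b2 = np.zeros((2048, 512)), np.zeros(512)
print(feed_forward(x, W1, b1, W2, b2).shape)    # -> (3, 512)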

After this FNN, again a Post Layer Normalization is done with the Input (FNN) & Output (FNN), similar to what happened after Multi-Head Attention. Now, the 4 segments:

Multi-Head Attention (8 heads)

Normalization

Feed-Forward Network

Normalization

are repeated for 6 iterations.

The final output from the 2nd Normalization Layer in the 6th iteration is of dimension n x d_model & is passed to the Decoder; it is the attention matrix for the input sequence. We will discuss how & to which segment this goes in the explanation below.

DECODER

Another monster in the house !!

The major aim of using a Decoder is to determine the output sequence’s tokens one at a time by using:

Attention known for all tokens of the input sequence, obtained from the Encoder

All predicted tokens of the output sequence so far.

Once a new token is predicted, it is considered to determine the next token. This chain continues until the ‘End Of Sentence’ token is predicted.

As followed in Encoder, I will go through this network in a bottom-up approach

1. ‘Outputs’ is the numeric representation of the Output Sequence, generated using a tokenizer as done in the Encoder but with a difference: this numeric representation is right-shifted.

But why right-shifted?

The decoder is trained in such a way that it is able to predict the next word of the sequence given the previous tokens & the attention from the Encoder. Now, for the 1st token, this looks troublesome, as there exist no previous tokens; it would have led to a problem every time the 1st token has to be predicted. Hence, the output sequence is shifted & a ‘BOS’ (Beginning of Sentence) token is inserted at the beginning. Now, when we need to predict the 1st token, this BOS becomes the previous token of the output sequence.
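A tiny sketch of this right shift, using the French sentence from the masking example further below; the <BOS> / <EOS> markers are illustrative names for the special tokens:

# Target sequence (as tokens) for 'je vais bien'; <BOS> / <EOS> are illustrative markers
target = ["je", "vais", "bien", "<EOS>"]

# Right-shift: prepend <BOS> so every position has a "previous token" to condition on
decoder_input = ["<BOS>"] + target[:-1]   # ['<BOS>', 'je', 'vais', 'bien']
labels = target                           # ['je',  'vais', 'bien', '<EOS>']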

2. The Output Embedding & Positional Embedding layers have the same role & structure as in Encoder.

3. The core of the decoder changes a bit compared to the Encoder’s core with the addition of a Masked Multi-Head Attention layer, though it is likewise repeated for 6 iterations, similar to the Encoder’s core.

4. In the Masked Multi-Head Attention Layer, attention is applied on the tokens up to the current position (the index till which the transformer has predicted) & not on future tokens (as they aren’t predicted yet). This is in stark contrast to the Encoder, where attention is calculated for the entire sequence at once.

For example: if we wish to translate ‘I am good’ (input for the encoder, where attention is calculated for all tokens at once) into French, i.e. ‘je vais bien’ (input for the decoder), & the translation has reached ‘vais’ (the 2nd token), then ‘bien’ is masked & attention is applied only over the first 2 tokens. This is done by setting the scores for the future tokens (here, ‘bien’) to −∞ before the softmax, so that their softmaxed weights become 0.
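A small numpy sketch of this masking, with an assumed 3 x 3 score matrix for ‘je vais bien’ (the actual scores depend on the learned weights); future positions are set to −∞ before the softmax so that their weights come out as 0:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Assumed scores for 'je vais bien' inside the decoder (illustrative values only)
scores = np.array([[1.0, 0.5, 2.0],
                   [0.3, 1.2, 0.7],
                   [0.9, 0.4, 1.1]])

# Upper-triangular mask: position i may only attend to positions <= i
mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
masked_scores = np.where(mask, -np.inf, scores)

weights = softmax(masked_scores, axis=1)
print(weights)   # row 0 attends only to 'je'; row 1 to 'je' & 'vais'; row 2 to all three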

After this, a Normalization Layer, the same as in the Encoder, follows.

All this is good, but where did the output from the Encoder go? How does the Decoder use it?

After the Normalization layer, a Multi-Head Attention layer follows which

Intakes the output from the Encoder (n x d_model; remember the final output from the Encoder!) & uses it as the Key & Value for the Decoder’s Multi-Head Attention.

Takes the Query matrix from the previous Masked Multi-Head Attention layer. Hence, this attention layer gets its Key & Value from the Encoder and its Query from the previous sub-layer, rather than computing all three from its own input sequence.

This layer uses the information learnt by the Encoder. As the Query vectors are available only for ‘seen’ (already predicted) tokens, even this Multi-Head Attention Layer can’t look beyond what has been predicted in the output sequence, similar to the Masked Multi-Head Attention Layer.
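A simplified single-head sketch of this encoder-decoder attention: the Query comes from the decoder side & the Key and Value come from the Encoder’s output. The projection matrices W_q, W_k & W_v are shown explicitly as placeholders here (they are learned in the full model):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encoder_decoder_attention(decoder_hidden, encoder_output, W_q, W_k, W_v, d_k):
    Q = decoder_hidden @ W_q      # Query: from the masked self-attention sub-layer
    K = encoder_output @ W_k      # Key:   from the Encoder's final output
    V = encoder_output @ W_v      # Value: from the Encoder's final output
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    return weights @ V            # one attention vector per decoder position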

After this layer, the following layers come in:

Normalization

FNN

Normalization

functioning similarly to their counterparts in the Encoder. Hence, the decoder core has:

Masked Multi-Head Attention Layer + Normalization

Multi-Head Attention Layer + Normalization

FNN + Normalization

After these repeated blocks, we have a linear layer followed by a softmax function, giving us the probability of the aptest next token for the predicted sequence. Once the most probable token is predicted, it goes back to the tail of the output sequence (remember the right-shifted sequence).

Hence, if we have 2 tokens in the output sequence till now out of 4 tokens (apart from BOS, i.e. two tokens have been predicted),

The current aim of the decoder is to predict the 3rd token of the output.

Once the 3rd token is predicted, it goes to the tail of the output sequence & a new iteration starts.

In this new iteration, we have 3 tokens in the output sequence (apart from BOS: the previous two tokens & the newly predicted token) & the Decoder now aims to predict the 4th token.

If the predicted token is ‘End Of Sentence’ (EOS), the transformation is done & the output sequence is completely predicted. A sketch of this loop is shown below.
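A sketch of this prediction loop, assuming a hypothetical transformer callable that takes the source tokens & the tokens generated so far and returns next-token probabilities (greedy decoding, i.e. always picking the most probable token):

def greedy_decode(transformer, src_tokens, bos_id, eos_id, max_len=50):
    output = [bos_id]                            # right-shifted sequence starts with BOS
    for _ in range(max_len):
        probs = transformer(src_tokens, output)  # linear layer + softmax over the vocabulary
        next_token = int(probs.argmax())         # most probable next token
        output.append(next_token)                # appended to the tail of the output sequence
        if next_token == eos_id:                 # EOS -> the translation is complete
            break
    return output[1:]                            # drop BOS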

Next, will come up with BERT, one of the breakthrough models that uplifted the entire NLP game.

A big thanks to: https://www.amazon.in/Transformers-Natural-Language-Processing-architectures-ebook/dp/B08S977X8K

With this, it's a sign-off!!


