I am going to talk about Attention, Self-Attention, and Transformer one by one.
Attention is an improvement on the RNN, so let’s quickly review the structure of an RNN used for seq2seq problems. The left RNN, called the Encoder, generates encoding vectors from the input, and the right RNN, called the Decoder, does the translation based on the encoding.
The problem lies in the structure: when the input is long, it is hard for the encoding vectors to contain all the information the decoder needs. Moreover, it makes intuitive sense that one output vector should put emphasis on a certain part of the input rather than on the whole thing. For example, “estamos” is more related to “we are” than to “bread”.
So what’s different in Attention is that every stage in the Decoder has its own context vector c.
This picture shows how the context vector for the first Decoder stage is generated. After the Encoder traverses the input, we have an initial decoder state s0, which records the information of all the input vectors. Every stage in the Encoder also produces an output vector h.
We can use an MLP to combine s0 with each h to get a corresponding score e. The scores e then go through a softmax to give the weights a; we multiply each h by its corresponding a and sum the results up to get the first context vector c1. The procedure to obtain c2 is similar: we just combine the h’s with s1 rather than s0.
a is what we call the Attention Weights. The meaning of a is that, when a is larger, c1 and y1 pay more attention to the corresponding input x, so the corresponding h is scaled up; otherwise, the input x has less influence on the corresponding output y. After training, each output y ends up with a different emphasis on each input.
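To make the procedure concrete, here is a minimal PyTorch sketch of how c1 could be computed from s0 and the encoder outputs h. The shapes and the small alignment “MLP” (a single hidden layer with random weights) are illustrative assumptions, not the exact network from the picture.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes: 4 encoder steps, hidden size 8 (made-up numbers).
T, D = 4, 8
h = torch.randn(T, D)            # encoder outputs h_1 .. h_T
s0 = torch.randn(D)              # initial decoder state s0

# A stand-in for the alignment MLP: one hidden layer, scalar score per step.
W = torch.randn(2 * D, D)
v = torch.randn(D)

# e_t = MLP(s0, h_t): combine s0 with each h_t to get a scalar score e_t.
e = torch.tanh(torch.cat([s0.expand(T, D), h], dim=1) @ W) @ v   # shape (T,)

a = F.softmax(e, dim=0)                 # attention weights, sum to 1
c1 = (a.unsqueeze(1) * h).sum(dim=0)    # context vector for the first decoder stage
```

For c2 we would run the same computation with s1 in place of s0.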
And here’s an example showing how we can use the idea of Attention on a trained CNN backbone to do Image Captioning.
After the idea of Attention came out, people designed a network layer called the Self-Attention Layer, which uses this idea to extract features.
The intuition is that, for each vector in the input, e.g. a pixel in an input image, the layer calculates the similarity between this vector and the other input vectors. If the vector has high similarity with the others, its value is scaled up; otherwise it is scaled down.
Let’s scrutinize the layer.
First, the input vectors are projected into three different spaces, Q, K, and V, using learnable weight matrices. This step is similar to what we do in a CNN or fully connected layer: multiplying the input by learnable weights increases the ability to express more complex functions.
Then we use Q and K to calculate how similar each vector is to every other vector. The calculation multiplies Q by transpose(K) and divides the result by sqrt(Dq). Dq is what we call the “query size”, the size of the second dimension of Q, K, and V. The division matters because we apply a softmax to each row of the score matrix: if the raw dot products are too large, the softmax saturates and its output is close to one-hot, so the gradients flowing back through it become tiny. Scaling by sqrt(Dq) keeps the scores in a moderate range and prevents the gradients from vanishing.
After the softmax, A is what we call the Attention Matrix. Multiplying A by V (which weights each value vector by its attention weight and sums them) gives the final result Y.
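Here is a minimal single-head self-attention sketch in PyTorch that follows these steps. The dimensions are arbitrary, and using plain linear layers for the three projections is an assumption about details the figure does not spell out.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Single-head self-attention: Y = softmax(Q K^T / sqrt(Dq)) V."""
    def __init__(self, d_in, d_q):
        super().__init__()
        self.q_proj = nn.Linear(d_in, d_q)   # learnable projection to Q
        self.k_proj = nn.Linear(d_in, d_q)   # learnable projection to K
        self.v_proj = nn.Linear(d_in, d_q)   # learnable projection to V
        self.d_q = d_q

    def forward(self, x):                        # x: (N, d_in), N input vectors
        Q, K, V = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = Q @ K.T / math.sqrt(self.d_q)   # (N, N) similarity scores
        A = F.softmax(scores, dim=-1)            # attention matrix, each row sums to 1
        return A @ V                             # weighted sum of value vectors -> (N, d_q)

x = torch.randn(6, 32)         # 6 input vectors of size 32 (made-up sizes)
y = SelfAttention(32, 16)(x)   # -> (6, 16)
```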
There’s a catch: Self-Attention doesn’t care about the order of the inputs. If we apply a permutation to the input X, the result Y is permuted in the same way. But sometimes we want to keep the position information of each vector, e.g. the pixel location. To do so, we can use an MLP, or just a fixed function, to create a positional encoding for each input X and add it to X.
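As one example of the “fixed function” option, the sketch below builds the sinusoidal positional encoding from the original Transformer paper and adds it to X; the sizes are placeholders, and a learned encoding would work as well.

```python
import torch

def sinusoidal_positional_encoding(n_positions, d_model):
    """Fixed (non-learned) positional encoding from 'Attention Is All You Need'."""
    pos = torch.arange(n_positions).unsqueeze(1).float()   # (N, 1) positions
    i = torch.arange(0, d_model, 2).float()                # even feature indices
    angles = pos / (10000 ** (i / d_model))                # (N, d_model/2)
    pe = torch.zeros(n_positions, d_model)
    pe[:, 0::2] = torch.sin(angles)                        # sine on even dims
    pe[:, 1::2] = torch.cos(angles)                        # cosine on odd dims
    return pe

x = torch.randn(6, 32)                          # 6 inputs of size 32 (made-up sizes)
x = x + sinusoidal_positional_encoding(6, 32)   # inject order information into X
```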
The idea of Multi-Head Self-Attention is to run this Self-Attention Layer multiple times on the same input X concurrently. Each Self-Attention, which we call a Head, has its own set of Q, K, and V weight matrices. At the end, we concatenate the outputs of all the heads and pass the result through a Linear layer. In this way, the network can learn richer relationships hidden in the input. Here’s the structure of it. If you want to learn more details about it, you can check this excellent post: https://towardsdatascience.com/transformers-explained-visually-part-3-multi-head-attention-deep-dive-1c1ff1024853
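Rather than re-implementing the heads by hand, here is how PyTorch’s built-in layer can be used as a quick illustration (the sizes are arbitrary). Passing the same tensor as query, key, and value is what makes it *self*-attention.

```python
import torch
import torch.nn as nn

# 8 heads over an embedding of size 64 (arbitrary sizes for illustration).
mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)

x = torch.randn(1, 10, 64)       # (batch, sequence length, embedding)
y, attn_weights = mha(x, x, x)   # query = key = value = x -> multi-head self-attention
print(y.shape)                   # torch.Size([1, 10, 64])
```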
After understanding Self-Attention, we come to the Transformer. The structure of a Transformer Block is illustrated below. A whole vanilla Transformer is just a stack of multiple Transformer Blocks.
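The sketch below is one common way to write such a block: multi-head self-attention and an MLP, each wrapped in a residual connection followed by layer normalization (the post-norm ordering from the original paper). The exact sizes, and whether normalization comes before or after the residual, are assumptions; implementations differ in these details.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Minimal post-norm Transformer block: self-attention + MLP, each with a residual."""
    def __init__(self, d_model=64, num_heads=8, d_mlp=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_mlp), nn.GELU(), nn.Linear(d_mlp, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                           # x: (batch, tokens, d_model)
        x = self.norm1(x + self.attn(x, x, x)[0])   # self-attention + residual + norm
        x = self.norm2(x + self.mlp(x))             # MLP + residual + norm
        return x

# A "vanilla" Transformer is just a stack of such blocks.
blocks = nn.Sequential(*[TransformerBlock() for _ in range(6)])
y = blocks(torch.randn(1, 10, 64))
```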
To use the Transformer on Computer Vision problems, directly feeding the whole image into it is a big problem because it takes a huge amount of memory. To solve this, the paper “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” proposes the Vision Transformer (ViT).
ViT takes a 224x224 image, divides it into 16x16-pixel patches (giving a 14x14 grid of patches), and feeds them into the following network.
Each image patch is linearly projected to a white vector. The red vectors are the positional encodings of the corresponding patches. We add up white and red and feed the sums into the Transformer. Additionally, a special extra input called the Classification Token is also fed into the Transformer. In the end, we take the output corresponding to the Classification Token and linearly project it to get the Classification Scores.
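A rough sketch of this input pipeline (patchify, linear projection, positional encoding, classification token) might look like the following. The sizes are ViT-Base-style assumptions, the positional encoding is shown as a learned parameter, and the Transformer blocks themselves are omitted.

```python
import torch
import torch.nn as nn

patch, d_model = 16, 768                  # assumed ViT-Base-style sizes
img = torch.randn(1, 3, 224, 224)         # one RGB image

# Split into a 14x14 grid of 16x16 patches and flatten each patch.
patches = img.unfold(2, patch, patch).unfold(3, patch, patch)        # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 196, 3 * patch * patch)

proj = nn.Linear(3 * patch * patch, d_model)      # produces the "white vectors"
pos = nn.Parameter(torch.zeros(1, 197, d_model))  # the "red vectors" (learned here, an assumption)
cls = nn.Parameter(torch.zeros(1, 1, d_model))    # the Classification Token

tokens = torch.cat([cls, proj(patches)], dim=1) + pos   # (1, 197, d_model), fed to the Transformer

# ... the tokens would now pass through the Transformer blocks (omitted) ...
scores = nn.Linear(d_model, 1000)(tokens[:, 0])         # classification scores from the CLS position
```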
In this ViT, the Transformer has 48 layers with 16 heads per layer, and the attention matrices over the 14x14 grid of patches take about 112 MB of memory. If we instead fed a 128x128 image in pixel by pixel, the attention matrices alone would take 768 GB of memory.
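These numbers come from a back-of-the-envelope count of the attention matrices alone, assuming each one is stored in 4-byte floats:

```python
layers, heads, bytes_per_float = 48, 16, 4

def attention_matrix_memory(n_tokens):
    # one (n_tokens x n_tokens) attention matrix per head per layer
    return n_tokens ** 2 * bytes_per_float * heads * layers

print(attention_matrix_memory(14 * 14) / 2**20)     # 196 patch tokens   -> ~112 MB
print(attention_matrix_memory(128 * 128) / 2**30)   # 16,384 pixel tokens -> 768 GB
```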