The computational complexity of self-attention layers as a function of sequence length.
The primary bottleneck in Transformer models is the self-attention mechanism. In a Transformer, each token's representation is updated by attending to all other tokens from the previous layer.
This is what lets the model retain long-range information, and it is why Transformers have the edge over recurrent models on long sequences. But attending to all tokens at every layer is expensive: for a sequence of n tokens, the cost grows quadratically, i.e., O(n²) in sequence length.
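To make the quadratic cost concrete, here is a minimal NumPy sketch (not code from this post, just an illustrative single-head scaled dot-product attention): the (n × n) score matrix, where every token is compared against every other token, is exactly where the O(n²) cost comes from.

```python
# Minimal sketch of single-head scaled dot-product self-attention.
# The n x n score matrix is the source of the quadratic cost.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (n, d) token representations; Wq, Wk, Wv: (d, d) projections (illustrative shapes)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # each (n, d)
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # (n, n) -- every token attends to every token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the n keys
    return weights @ V                          # (n, d) updated representations

# Doubling n quadruples the number of entries in `scores`.
n, d = 8, 4
X = np.random.randn(n, d)
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (8, 4)
```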
Watch the video below for a bit more detail… :)
What is Self Attention Bottleneck for Transformers Deep Learning Models
Happy Learning !!