The computational complexity of self-attention layers as a function of sequence length.
The primary bottleneck in Transformer models is the self-attention mechanism. In a Transformer, each token's representation is updated by attending to all other tokens from the previous layer.
This is what lets the model retain long-range information, and it is why Transformers have the edge over recurrent models on long sequences. But attending to all tokens at every layer is expensive: for a sequence of n tokens, the cost grows quadratically, i.e., O(n²) in sequence length.
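To make the quadratic cost concrete, here is a minimal NumPy sketch (not code from this post, just an illustrative single-head scaled dot-product attention): the (n × n) score matrix, where every token is compared against every other token, is exactly where the O(n²) cost comes from.

```python
# Minimal sketch of single-head scaled dot-product self-attention.
# The n x n score matrix is the source of the quadratic cost.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (n, d) token representations; Wq, Wk, Wv: (d, d) projections (illustrative shapes)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # each (n, d)
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # (n, n) -- every token attends to every token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the n keys
    return weights @ V                          # (n, d) updated representations

# Doubling n quadruples the number of entries in `scores`.
n, d = 8, 4
X = np.random.randn(n, d)
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (8, 4)
```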
Watch the video below for a bit more detail… :)
What is Self Attention Bottleneck for Transformers Deep Learning Models
Happy Learning !!