Fast Transformers

Transformers are very successful models that achieve state-of-the-art performance in many natural language tasks. However, they are difficult to scale to long sequences because the cost of self-attention grows quadratically with the sequence length.
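
To make the quadratic cost concrete, here is a rough back-of-the-envelope sketch in plain Python. It counts the attention scores that a single layer of full softmax attention has to hold: one `seq_len x seq_len` matrix per head and per sample. The helper names, the batch size of 10, the 12 heads and the float32 assumption are ours and chosen for illustration only; the estimate ignores everything else the model stores.

```python
def attention_score_entries(batch_size, n_heads, seq_len):
    """Number of attention scores held by one layer of full attention."""
    # Every query is compared with every key, so each head of each
    # sample needs a seq_len x seq_len score matrix.
    return batch_size * n_heads * seq_len * seq_len


def attention_score_mib(batch_size, n_heads, seq_len, bytes_per_value=4):
    """Memory for those scores in MiB (float32 values assumed by default)."""
    entries = attention_score_entries(batch_size, n_heads, seq_len)
    return entries * bytes_per_value / 2**20


# Illustrative sizes only (assumed, not measured on a real model).
for seq_len in (512, 1024, 2048, 4096):
    mib = attention_score_mib(batch_size=10, n_heads=12, seq_len=seq_len)
    print(f"seq_len={seq_len:5d}  scores per layer ~ {mib:8.1f} MiB")
```

Doubling the sequence length quadruples the estimate, which is the scaling that fast attention implementations aim to avoid.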

This library was developed for our research on fast attention for transformers. You can find a list of our papers below as well as related papers and papers that we have implemented.


The main interface of the library for using the implemented fast transformers is the builder interface. It lets you experiment with different attention implementations with minimal code changes. For instance, building a BERT-like transformer encoder is as simple as the following code:

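import torch
from fast_transformers.builders import TransformerEncoderBuilder

# Build a transformer encoder (BERT-base-like sizes, so that
# n_heads * value_dimensions matches the 64*12 input features)
bert = TransformerEncoderBuilder.from_kwargs(
    n_layers=12,
    n_heads=12,
    query_dimensions=64,
    value_dimensions=64,
    feed_forward_dimensions=3072,
    attention_type="full", # change this to use another
                           # attention implementation
).get()  # get() constructs the encoder module

y = bert(torch.rand(
    10,    # batch_size
    512,   # sequence length
    64*12  # features
))
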

The fast transformers library has the following dependencies:

  • PyTorch
  • C++ toolchain
  • CUDA toolchain (if you want to compile for GPUs)

On most machines, installation should be as simple as:

pip install --user pytorch-fast-transformers



To read about the theory behind some of the attention implementations in this library, we encourage you to follow our research.

  • Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention (arxiv, video)
  • Fast Transformers with Clustered Attention (arxiv, blog)

If you found our research helpful or influential, please consider citing:

@inproceedings{katharopoulos_et_al_2020,
    author = {Katharopoulos, A. and Vyas, A. and Pappas, N. and Fleuret, F.},
    title = {Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention},
    booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
    year = {2020}
}

@article{vyas_et_al_2020,
    author = {Vyas, A. and Katharopoulos, A. and Fleuret, F.},
    title = {Fast Transformers with Clustered Attention},
    journal = {arXiv preprint arXiv:2007.04825},
    year = {2020}
}

By others

  • Efficient Attention: Attention with Linear Complexities (arxiv)
  • Linformer: Self-Attention with Linear Complexity (arxiv)
  • Reformer: The Efficient Transformer (arxiv)

This software is distributed under the MIT license, which pretty much means that you can use it however you want and for whatever reason you want. All the information regarding support, copyright and the license can be found in the LICENSE file in the repository.