The builders module simplifies the construction of transformer networks. The following example shows how to create a transformer encoder with and without the TransformerEncoderBuilder.

import torch

# Building without a builder
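from fast_transformers.transformers import TransformerEncoder, \
    TransformerEncoderLayer
from fast_transformers.attention import AttentionLayer, FullAttention

bert = TransformerEncoder(
    [
        TransformerEncoderLayer(
            AttentionLayer(FullAttention(), 768, 12),  # d_model=768, 12 heads
            768, 768*4, activation="gelu"  # d_model, feed-forward size (BERT-base values assumed)
        ) for l in range(12)
    ],
    norm_layer=torch.nn.LayerNorm(768)
)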

# Building with a builder
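from fast_transformers.builders import TransformerEncoderBuilder
bert = TransformerEncoderBuilder.from_kwargs(
    attention_type="full", n_layers=12, n_heads=12,
    query_dimensions=64, value_dimensions=64,  # per head: 12 x 64 = 768 (assumed)
    feed_forward_dimensions=768*4, activation="gelu"
).get()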

Although creating a transformer looks equally simple with and without the builder, changing the creation logic is significantly easier with the builder. For instance, the attention_type can be read from a configuration file or from command line arguments. The rest of this page describes the API of the builders.

Builder API

All builders share a simple interface: a get() method that takes no arguments and returns a PyTorch module implementing a transformer.

All builder parameters are plain Python properties that can be set after the builder object has been created.

builder = ...                          # create a builder

builder.parameter = value              # set a parameter
builder.other_parameter = other_value  # and another parameter
transformer = builder.get()            # construct the transformer

builder.parameter = changed_value      # change a parameter
other_transformer = builder.get()      # construct another transformer
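
To make the property-based interface concrete, here is a minimal, dependency-free sketch of the pattern. ToyBuilder and its parameters are hypothetical stand-ins chosen for illustration, and get() returns a plain dictionary instead of a PyTorch module; this is not the library's implementation.

```python
class ToyBuilder:
    """Hypothetical stand-in for a builder; not part of the library."""

    def __init__(self):
        self._n_layers = 4
        self._n_heads = 4

    @property
    def n_layers(self):
        return self._n_layers

    @n_layers.setter
    def n_layers(self, value):
        # Validating in the setter reports a bad value where it is assigned
        if int(value) < 1:
            raise ValueError("n_layers must be a positive integer")
        self._n_layers = int(value)

    @property
    def n_heads(self):
        return self._n_heads

    @n_heads.setter
    def n_heads(self, value):
        self._n_heads = int(value)

    def get(self):
        # A real builder would return an nn.Module here; a dict snapshot of
        # the current parameters keeps the sketch dependency-free
        return {"n_layers": self._n_layers, "n_heads": self._n_heads}


builder = ToyBuilder()
builder.n_layers = 12
first = builder.get()    # built with n_layers=12

builder.n_layers = 6
second = builder.get()   # built with n_layers=6; first is unaffected
```

Because every call to get() reads the current property values, one builder object can produce several differently configured results.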

The BaseBuilder provides helper static methods that make it simpler to set multiple builder arguments at once from configuration files or command line arguments.

from_dictionary(dictionary, strict=True)

Construct a builder and set all the parameters in the dictionary. If strict is set to True, raise a ValueError when a dictionary key does not correspond to a builder parameter.

from_kwargs(**kwargs)

Construct a builder and set all the keyword arguments as builder parameters.

from_namespace(args, strict=False)

Construct a builder from the argument namespace returned by Python's argparse module. If strict is set to True, raise a ValueError when an argument does not correspond to a builder parameter.
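
The three helpers can be pictured as thin wrappers around a single dictionary-based constructor. The sketch below assumes a hypothetical ToyBuilder with two illustrative parameters and writes the helpers as classmethods for convenience; the library's actual implementation may differ.

```python
import argparse


class ToyBuilder:
    """Hypothetical builder used only to illustrate the helper constructors."""

    # Names that count as builder parameters in this sketch
    PARAMETERS = ("attention_type", "n_layers")

    def __init__(self):
        self.attention_type = "full"
        self.n_layers = 4

    @classmethod
    def from_dictionary(cls, dictionary, strict=True):
        builder = cls()
        for key, value in dictionary.items():
            if key not in cls.PARAMETERS:
                if strict:
                    raise ValueError(f"The builder has no parameter {key!r}")
                continue  # non-strict mode skips unknown keys
            setattr(builder, key, value)
        return builder

    @classmethod
    def from_kwargs(cls, **kwargs):
        return cls.from_dictionary(kwargs)

    @classmethod
    def from_namespace(cls, args, strict=False):
        # vars() turns an argparse.Namespace into a plain dictionary
        return cls.from_dictionary(vars(args), strict=strict)


parser = argparse.ArgumentParser()
parser.add_argument("--attention_type", default="full")
parser.add_argument("--n_layers", type=int, default=4)
parser.add_argument("--batch_size", type=int, default=32)  # not a builder parameter

# An explicit argument list keeps the example independent of sys.argv
args = parser.parse_args(["--attention_type", "linear", "--n_layers", "12"])
builder = ToyBuilder.from_namespace(args)  # batch_size is skipped because strict=False
```

The non-strict default of from_namespace is convenient here because a parsed namespace usually also holds arguments, such as batch_size above, that have nothing to do with the builder.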

Transformer Builders

The library provides transformer builders for creating encoder and decoder architectures for inference and training, among them the TransformerEncoderBuilder introduced at the top of this page.

Attention Builders

Attention builders simplify the construction of the various attention modules and allow for plugin-like extension mechanisms when creating new attention implementations.

Their API mirrors that of the transformer builders: parameters are set as attributes, and calling get(attention_type: str) constructs an nn.Module that implements an attention layer.

from fast_transformers.builders import AttentionBuilder

builder = AttentionBuilder.from_kwargs(
    attention_dropout=0.1,                   # used by softmax attention
    softmax_temp=1.,                         # used by softmax attention
    feature_map=lambda x: (x>0).float() * x  # used by linear
)
softmax = builder.get("full")
linear = builder.get("linear")

The library provides the following attention builders that create the correspondingly named attention modules.

  • AttentionBuilder
  • RecurrentAttentionBuilder
  • RecurrentCrossAttentionBuilder

Attention composition

The attention builders support attention composition through a simple convention for the attention_type parameter. Attention composition creates an attention layer that accepts one or more other attention layers as parameters. An example of this pattern is the ConditionalFullAttention, which performs full softmax attention when the sequence length is small and delegates to another attention type when the sequence length becomes large.

The following example code creates an attention layer that uses improved clustered attention for sequences larger than 512 elements and full softmax attention otherwise.

builder = AttentionBuilder.from_kwargs(
    attention_dropout=0.1,  # used by all
    topk=32,                # used by improved clustered
    length_limit=512        # used by conditional attention
)
attention = builder.get("conditional-full:improved-clustered")


Attention layers that are designed for composition cannot be used standalone. For instance, conditional-full is not a valid attention type by itself.
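
One way to picture the convention is to split the attention_type string on ":" and build the layers from right to left, handing each composing layer the layer to its right. The sketch below is a toy illustration under that assumption; the STANDALONE and COMPOSITE tables, build_attention, and the stand-in functions are all hypothetical and do not reproduce the library's parsing code.

```python
def full_attention(seq_len):
    # Stand-in for full softmax attention; returns a label instead of tensors
    return f"full({seq_len})"


def improved_clustered_attention(seq_len):
    # Stand-in for improved clustered attention
    return f"improved-clustered({seq_len})"


def conditional_full(other, length_limit):
    # Composing layer: full attention for short inputs, `other` for long ones
    def attention(seq_len):
        if seq_len <= length_limit:
            return full_attention(seq_len)
        return other(seq_len)
    return attention


# Hypothetical lookup tables for this sketch
STANDALONE = {
    "full": full_attention,
    "improved-clustered": improved_clustered_attention,
}
COMPOSITE = {"conditional-full": conditional_full}


def build_attention(attention_type, **params):
    # The last name must be usable on its own; the names before it wrap it
    *wrappers, innermost = attention_type.split(":")
    if innermost not in STANDALONE:
        raise ValueError(f"{innermost!r} is not a standalone attention type")
    layer = STANDALONE[innermost]
    for name in reversed(wrappers):
        if name not in COMPOSITE:
            raise ValueError(f"{name!r} cannot compose other attention layers")
        layer = COMPOSITE[name](layer, **params)
    return layer


attention = build_attention("conditional-full:improved-clustered", length_limit=512)
short = attention(128)    # handled by the full attention stand-in
long = attention(2048)    # delegated to the improved clustered stand-in
```

In this sketch, asking for "conditional-full" alone raises a ValueError, which matches the rule that composing layers cannot be used standalone.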

Attention Registry

The attention builders allow the dynamic registering of attention implementations through an attention registry. There are three registries, one for each available builder. You can find plenty of usage examples in the provided attention implementations (e.g. FullAttention).

This is only relevant to developers of new attention implementations; a simple example can be found in the custom attention layer section of the docs.
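
As a rough mental model, a registry maps an attention name to a constructor plus the list of parameters that constructor understands, so that a builder can hand each implementation only the parameters it uses. The sketch below is an assumption-laden toy: ToyRegistry, its register and build methods, and the two toy attention classes are hypothetical and are not the library's registry API.

```python
class ToyRegistry:
    """Hypothetical registry; the library's own registry API may differ."""

    def __init__(self):
        self._entries = {}

    def register(self, name, factory, parameters):
        if name in self._entries:
            raise ValueError(f"{name!r} is already registered")
        self._entries[name] = (factory, tuple(parameters))

    def build(self, name, available):
        if name not in self._entries:
            raise ValueError(f"Unknown attention type {name!r}")
        factory, parameters = self._entries[name]
        # Forward only the parameters this implementation declared
        kwargs = {p: available[p] for p in parameters if p in available}
        return factory(**kwargs)


class ToyFullAttention:
    def __init__(self, softmax_temp=1.0, attention_dropout=0.1):
        self.softmax_temp = softmax_temp
        self.attention_dropout = attention_dropout


class ToyLinearAttention:
    def __init__(self, feature_map=None):
        self.feature_map = feature_map


registry = ToyRegistry()
registry.register("full", ToyFullAttention, ["softmax_temp", "attention_dropout"])
registry.register("linear", ToyLinearAttention, ["feature_map"])

# One shared set of parameters, as a builder would hold them
shared = {"softmax_temp": 0.5, "attention_dropout": 0.2, "feature_map": abs}
full = registry.build("full", shared)      # receives the two softmax parameters
linear = registry.build("linear", shared)  # receives only feature_map
```

Declaring the parameter list at registration time is what lets a new implementation plug in without any change to the builder itself.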