# Recurrent Transformers

The transformer layers implemented in the fast_transformers.transformers module process the entire sequence simultaneously. This module, in contrast, implements transformers as recurrent networks, namely as networks that process the sequence one element at a time while updating some state.

The TransformerEncoder and TransformerEncoderLayer give way to RecurrentTransformerEncoder and RecurrentTransformerEncoderLayer, and for the decoders RecurrentTransformerDecoder and RecurrentTransformerDecoderLayer respectively.

## Forward method

**RecurrentTransformerEncoder** or **RecurrentTransformerEncoderLayer**

```
forward(x, state=None)
```

**Arguments**

- **x**: The input features of shape (N, E) where N is the batch size and E is `d_model` passed in the constructor. Note that x corresponds to a specific element in the sequence and not the entire sequence.
- **state**: The state is a python object that varies depending on the attention implementation.
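To make the role of the state concrete, the following is a minimal pure-PyTorch sketch of a single recurrent full-attention step. It is not the library's implementation (which lives behind the attention abstraction); here the state is simply the tuple of keys and values seen so far, and the function name `recurrent_full_attention_step` is hypothetical:

```
import torch

def recurrent_full_attention_step(q, k, v, state=None):
    # q, k, v: (N, E) projections for the current sequence element
    # state: tuple of previously seen keys and values, each (N, t, E)
    if state is None:
        keys = k.unsqueeze(1)
        values = v.unsqueeze(1)
    else:
        prev_k, prev_v = state
        keys = torch.cat([prev_k, k.unsqueeze(1)], dim=1)
        values = torch.cat([prev_v, v.unsqueeze(1)], dim=1)
    # attend over all elements seen so far (the causal mask is implicit)
    scores = torch.einsum("ne,nte->nt", q, keys) / keys.shape[-1] ** 0.5
    weights = torch.softmax(scores, dim=-1)
    out = torch.einsum("nt,nte->ne", weights, values)
    return out, (keys, values)
```

Each call consumes one sequence element and returns the updated state, which must be threaded back into the next call, exactly as with the module's `forward(x, state=state)`.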

**RecurrentTransformerDecoder** or **RecurrentTransformerDecoderLayer**

```
forward(x, memory, memory_length_mask=None, state=None)
```

**Arguments**

- **x**: The input features of shape (N, E) where N is the batch size and E is `d_model` passed in the constructor. Note that x corresponds to a specific element in the sequence and not the entire sequence.
- **memory**: A sequence of features (N, S, E) that the input will attend to. S is the sequence length and E is the same as for x.
- **memory_length_mask**: An implementation of a BaseMask that encodes how many elements each memory sequence in the batch consists of.
- **state**: The state is a python object that varies depending on the attention implementation.
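The interaction between the current decoder element, the memory, and the length mask can be sketched in plain PyTorch. This is a simplification, not the library's code: projections are omitted (the memory serves as both keys and values), and the length mask is represented by a plain tensor of lengths rather than a BaseMask:

```
import torch

def recurrent_cross_attention(q, memory, memory_lengths=None):
    # q: (N, E) the current decoder element; memory: (N, S, E)
    N, S, E = memory.shape
    scores = torch.einsum("ne,nse->ns", q, memory) / E ** 0.5
    if memory_lengths is not None:
        # mask out padded memory positions beyond each sequence's length
        pad = torch.arange(S)[None, :] >= memory_lengths[:, None]
        scores = scores.masked_fill(pad, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return torch.einsum("ns,nse->ne", weights, memory)
```

Note that the memory is attended to in full at every step; only the self attention is recurrent.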

**Note**

The masks are different in the recurrent implementations than in their batch counterparts. Namely, recurrent encoders and decoders enforce a triangular causal mask on self attention. In addition, recurrent decoders enforce a full mask on cross attention.
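This equivalence between the enforced causal mask and step-by-step processing can be checked with plain PyTorch, independently of the library. Applying a triangular mask to batch softmax attention gives the same outputs as attending, at each step, only over the elements seen so far:

```
import torch

torch.manual_seed(0)
N, L, E = 2, 5, 4
q = torch.rand(N, L, E)
k = torch.rand(N, L, E)
v = torch.rand(N, L, E)

# batch computation with a triangular (causal) mask
scores = q @ k.transpose(1, 2) / E ** 0.5                 # (N, L, L)
mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))
batch_out = torch.softmax(scores, dim=-1) @ v             # (N, L, E)

# recurrent computation: step i attends only to elements 0..i
recurrent_out = []
for i in range(L):
    s = q[:, i:i+1] @ k[:, :i+1].transpose(1, 2) / E ** 0.5
    recurrent_out.append(torch.softmax(s, dim=-1) @ v[:, :i+1])
recurrent_out = torch.cat(recurrent_out, dim=1)

assert torch.allclose(batch_out, recurrent_out, atol=1e-6)
```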

## Available Attentions

Not all attention formulations can be written in an autoregressive fashion as a recurrent model. In particular, since the sequence is passed to the transformer one element at a time, the result is the same as passing a causal mask to the batch transformers. The recurrent attention implementations that are currently available are listed in the attention documentation.
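Linear attention is one formulation that does admit such a recurrence: the running sums over φ(k_j)v_jᵀ and φ(k_j) form a fixed-size state, so each step costs O(1) instead of re-attending over the whole prefix. The following pure-PyTorch sketch (not the library's RecurrentLinearAttention; it assumes the feature map φ(x) = elu(x) + 1) checks the recurrent form against the causally masked batch form:

```
import torch

def phi(x):
    return torch.nn.functional.elu(x) + 1

torch.manual_seed(0)
N, L, E = 2, 6, 4
q, k, v = (torch.rand(N, L, E) for _ in range(3))
Q, K = phi(q), phi(k)

# batch causal form: out_i = sum_{j<=i} (Q_i . K_j) v_j, normalized
out_batch = []
for i in range(L):
    w = torch.einsum("ne,nje->nj", Q[:, i], K[:, :i+1])
    out_batch.append(
        torch.einsum("nj,nje->ne", w, v[:, :i+1]) / w.sum(-1, keepdim=True)
    )
out_batch = torch.stack(out_batch, dim=1)

# recurrent form: constant-size state (S, z)
S = torch.zeros(N, E, E)   # running sum of outer products K_j v_j^T
z = torch.zeros(N, E)      # running sum of K_j
out_rec = []
for i in range(L):
    S = S + torch.einsum("ne,nf->nef", K[:, i], v[:, i])
    z = z + K[:, i]
    num = torch.einsum("ne,nef->nf", Q[:, i], S)
    den = torch.einsum("ne,ne->n", Q[:, i], z).unsqueeze(-1)
    out_rec.append(num / den)
out_rec = torch.stack(out_rec, dim=1)

assert torch.allclose(out_batch, out_rec, atol=1e-5)
```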

## Example

The following example builds a random recurrent transformer encoder and feeds its output back as its input 100 times.

```
# for simplicity ignore all the classification
# layers and the embedding layers
import torch

from fast_transformers.builders import RecurrentEncoderBuilder

model = RecurrentEncoderBuilder.from_kwargs(
    attention_type="linear",
    n_layers=8,
    n_heads=12,
    feed_forward_dimensions=1536,
    query_dimensions=32,
    value_dimensions=32
).get()

x0 = torch.rand(
    10,     # batch size
    12*32   # feature size (n_heads * value_dimensions)
)

state = None
x = x0
for i in range(100):
    x, state = model(x, state=state)
```