Transformer Language Model Mathematical Definition

The transformer language model was introduced in the seminal article Attention Is All You Need, coauthored by Lukasz Kaiser.

We aim to explain the computer science theory behind this technology and to present it from basic principles, without any AI jargon, so that any mathematician can understand how it works and any programmer can implement it on their own. I studied the original article and the Trax implementation and produced this mathematical documentation.

This article was written in 2021 by Michal Ryszard Wojcik (PhD in mathematics) under the guidance of Lukasz Kaiser (PhD in computer science), based on two sources: the original article and the Trax implementation.

Later I wrote a Python implementation of this mathematical specification, with some fully working demo examples: github.com/MichalRyszardWojcik/transformer-language-model. It can be executed in Google Colab directly in the browser without any installation.

Rowwise and autoregressive matrix operations

Let $T(X)=Y$ represent an operation which takes a matrix $X$ and returns a matrix $Y$ with the same number of rows as $X$, the number of columns being fixed for both input and output, say $m$ for the input and $m'$ for the output.

We say that $T$ is rowwise iff for each matrix $X$ with $n$ rows and $m$ columns we have $T(X_i)=T(X)_i$ whenever $i\leq n$, where $X_i$ is the i-th row of $X$ and $T(X)_i$ is the i-th row of $T(X)$.

We say that $T$ is autoregressive iff for each matrix $X$ with $n$ rows and $m$ columns we have $T(X|i)=T(X)|i$ whenever $i\leq n$, where the symbol $A|i$ denotes the matrix $A$ reduced to its first $i$ rows (e.g. $A|n=A$ and $A|1$ is the first row).

Note that rowwise operations are autoregressive and compositions of autoregressive operations are autoregressive.
Let $\unicode{0x25FA}A$ denote the matrix $A$ modified so that it has $-\infty$ above the main diagonal and is unchanged otherwise. Note that for a square matrix $A$ the rowwise operation softmax($\unicode{0x25FA}A$) always returns a lower triangular square matrix.
Claim. If $A,B$ are autoregressive operations and $A$ always returns lower triangular square matrices then the matrix multiplication operation $X\mapsto A(X)\times B(X)$ is autoregressive. Indeed, since $A(X)$ is lower triangular, the $i$-th row of $A(X)\times B(X)$ is a linear combination of the first $i$ rows of $B(X)$ only, and by autoregressiveness these coincide with the rows of $B(X|i)$.
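These facts can be checked numerically. Below is a minimal NumPy sketch (independent of the Trax code and of the linked repository; the helper names row_softmax, masked and is_autoregressive are ours) which verifies on a random example that $\mathrm{softmax}(\unicode{0x25FA}A)$ is lower triangular and that an operation of the form $X\mapsto A(X)\times B(X)$ from the Claim is autoregressive.

```python
import numpy as np

def row_softmax(A):
    # softmax applied separately to each row of A (a rowwise operation)
    E = np.exp(A - A.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

def masked(A):
    # the operation ◺: put -inf above the main diagonal, leave the rest unchanged
    return np.where(np.triu(np.ones_like(A), k=1) == 1, -np.inf, A)

def is_autoregressive(T, X):
    # check T(X|i) == T(X)|i for every i
    Y = T(X)
    return all(np.allclose(T(X[:i]), Y[:i]) for i in range(1, len(X) + 1))

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
S = row_softmax(masked(A))
print(np.allclose(np.tril(S), S))  # True: softmax(◺A) is lower triangular

# X ↦ softmax(◺(X W Xᵀ)) × X has the form A(X) × B(X) from the Claim above
X, W = rng.standard_normal((5, 4)), rng.standard_normal((4, 4))
T = lambda Z: row_softmax(masked(Z @ W @ Z.T)) @ Z
print(is_autoregressive(T, X))  # True
```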

Below we are going to define the Transformer Language Model and argue that it is autoregressive by appropriately applying the arguments collected above.

Embedding

The Embedding layer is the first layer acting directly on the input composed of a sequence of tokens.

Each token is an integer from $\{0,1,\ldots,\mathrm{vocab\_size}-1\}$, where $\mathrm{vocab\_size}$ is one of the transformer language model's parameters.

This layer translates tokens into "vectorwords" which are vectors in $\mathbb R^{d_{model}}$, where $d_{model}$ is a parameter.

There is an embedded array of weights of shape $\mathrm{vocab\_size}\times d_{model}$. The layer simply assigns a row to each token.

The output is a matrix of dimension $n\times d_{model}$, where $n$ is the number of tokens in the input.
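As an illustration, here is a minimal NumPy sketch of the Embedding layer as a row lookup (the names W_embed and tokens are ours, and the weights are random rather than trained):

```python
import numpy as np

vocab_size, d_model = 10, 4                            # small values for illustration
rng = np.random.default_rng(0)
W_embed = rng.standard_normal((vocab_size, d_model))   # the embedded weight array

tokens = np.array([3, 1, 4, 1, 5])                     # a sequence of n = 5 tokens
X = W_embed[tokens]                                    # assign a row to each token
print(X.shape)                                         # (5, 4), i.e. n x d_model
```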

Positional Encoding

Let us define the Positional Encoding layer in a transformer language model with maximum input length $M$ and vectorword length $d_{model}$.

Let $X$ be an input matrix with rows containing vectorwords of length $d_{model}$ with the number of rows between $1$ and $M$.

Then $PE(X)=X+PE|n$ where $n$ is the number of rows in $X$ and $PE|n$ is the layer's embedded $PE$ weight matrix reduced to the first $n$ rows.

Note that this layer simply adds its weight matrix to its input.

Note that it is a rowwise operation.
The weight matrix can be anything that results from the training process, but it is initialized in the following way, where $M$ is the maximum number of tokens in the input.

$$PE\colon\{1,\ldots,M\}\times\{1,\ldots,d_{model}\}\to[-1,1]$$ $$PE(pos,2i)=\sin\Big(\cfrac{pos}{10000^{2i/d_{model}}}\Big)$$ $$PE(pos,2i+1)=\cos\Big(\cfrac{pos}{10000^{2i/d_{model}}}\Big)$$ $PE$ is defined here as a function of two variables where the first variable is interpreted as the position in the input sequence (e.g. first token, second token, third token) and the second variable is interpreted as one of the $d_{model}$ axes of the vectorwords created from the input tokens. (The indices $2i$ and $2i+1$ follow the 0-based column convention of the original paper, with $i=0,1,\ldots,d_{model}/2-1$; in the 1-indexed matrix below, the sine values occupy the odd-numbered columns and the cosine values the even-numbered ones.)

But we will also treat $PE$ as an $M\times d_{model}$ matrix of weights embedded in the Positional Encoding layer.
The following row represents a single vectorword of $d_{model}=6$ real numbers at position $pos$:

$\Large[$$\small{\mathrm{sin}\Big(\cfrac{pos}{10000^{0/d_{model}}}\Big)}$, $\small{\mathrm{cos}\Big(\cfrac{pos}{10000^{0/d_{model}}}\Big)}$, $\small{\mathrm{sin}\Big(\cfrac{pos}{10000^{2/d_{model}}}\Big)}$, $\small{\mathrm{cos}\Big(\cfrac{pos}{10000^{2/d_{model}}}\Big)}$, $\small{\mathrm{sin}\Big(\cfrac{pos}{10000^{4/d_{model}}}\Big)}$, $\small{\mathrm{cos}\Big(\cfrac{pos}{10000^{4/d_{model}}}\Big)}$$\Large]$
Let us assume maximum input length $M=3$ and $d_{model}=6$ for simplicity of presentation.

The following $M\times d_{model}$ matrix is populated with numbers from the $PE$ function.
$$PE=\begin{bmatrix}
\sin\big(\frac{1}{10000^{0/d_{model}}}\big) & \cos\big(\frac{1}{10000^{0/d_{model}}}\big) & \sin\big(\frac{1}{10000^{2/d_{model}}}\big) & \cos\big(\frac{1}{10000^{2/d_{model}}}\big) & \sin\big(\frac{1}{10000^{4/d_{model}}}\big) & \cos\big(\frac{1}{10000^{4/d_{model}}}\big)\\[1ex]
\sin\big(\frac{2}{10000^{0/d_{model}}}\big) & \cos\big(\frac{2}{10000^{0/d_{model}}}\big) & \sin\big(\frac{2}{10000^{2/d_{model}}}\big) & \cos\big(\frac{2}{10000^{2/d_{model}}}\big) & \sin\big(\frac{2}{10000^{4/d_{model}}}\big) & \cos\big(\frac{2}{10000^{4/d_{model}}}\big)\\[1ex]
\sin\big(\frac{3}{10000^{0/d_{model}}}\big) & \cos\big(\frac{3}{10000^{0/d_{model}}}\big) & \sin\big(\frac{3}{10000^{2/d_{model}}}\big) & \cos\big(\frac{3}{10000^{2/d_{model}}}\big) & \sin\big(\frac{3}{10000^{4/d_{model}}}\big) & \cos\big(\frac{3}{10000^{4/d_{model}}}\big)
\end{bmatrix}$$
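The following NumPy sketch (ours, not taken from the linked repository) builds this initial $PE$ weight matrix and applies the layer as $PE(X)=X+PE|n$, assuming an even $d_{model}$ and positions numbered from $1$ as in the text:

```python
import numpy as np

def initial_pe_matrix(M, d_model):
    # PE[pos-1, 2i]   = sin(pos / 10000**(2i/d_model))   (0-based column index 2i)
    # PE[pos-1, 2i+1] = cos(pos / 10000**(2i/d_model))
    pos = np.arange(1, M + 1)[:, None]            # positions 1..M as in the text
    i = np.arange(0, d_model, 2)[None, :]         # exponents 0, 2, 4, ...
    angles = pos / 10000 ** (i / d_model)
    PE = np.zeros((M, d_model))
    PE[:, 0::2] = np.sin(angles)
    PE[:, 1::2] = np.cos(angles)
    return PE

def positional_encoding_layer(X, PE):
    n = X.shape[0]
    return X + PE[:n]                             # PE(X) = X + PE|n

PE = initial_pe_matrix(M=3, d_model=6)            # the matrix written out above
print(PE.round(3))
X = np.zeros((2, 6))                              # two vectorwords
print(positional_encoding_layer(X, PE))           # adds the first 2 rows of PE
```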
From the paper Attention Is All You Need:
We chose this function because we hypothesized it would allow the model to easily attend by relative position, since for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.

The linear representation claim can be stated precisely as follows. For any fixed offset $k$, there is a linear map $T_k\colon\mathbb R^{d_{model}}\to\mathbb R^{d_{model}}$ such that $$T_k(PE_{pos})=PE_{pos+k}$$ for any $pos\in\{1,2,\ldots,M-k\}$, where $PE_{pos}$ denotes the $pos$-th row of the $PE$ matrix viewed as a vector in $\mathbb R^{d_{model}}$.

Equivalently, in coordinates: for any fixed offset $k$, there is a linear map $T_k\colon\mathbb R^{d_{model}}\to\mathbb R^{d_{model}}$ such that $$T_k(PE_{pos})_i=PE(pos+k,i)$$ for any $pos\in\{1,2,\ldots,M-k\}$ and any $i\in\{1,\ldots,d_{model}\}$.
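One explicit choice of $T_k$ (our addition, consistent with the sine/cosine convention above) is a block-diagonal matrix that rotates each $(\sin,\cos)$ pair with frequency $\omega_i=10000^{-2i/d_{model}}$ by the angle $k\omega_i$. The sketch below verifies this numerically, using the row-vector convention $PE_{pos}\,T_k=PE_{pos+k}$:

```python
import numpy as np

M, d_model, k = 50, 6, 7
pos = np.arange(1, M + 1)[:, None]
omega = 1.0 / 10000 ** (np.arange(0, d_model, 2) / d_model)   # frequencies ω_i
PE = np.zeros((M, d_model))
PE[:, 0::2], PE[:, 1::2] = np.sin(pos * omega), np.cos(pos * omega)

# T_k is block-diagonal: on each (sin, cos) pair it rotates by the angle k·ω_i, since
# [sin((pos+k)ω), cos((pos+k)ω)] = [sin(posω), cos(posω)] @ [[cos(kω), -sin(kω)], [sin(kω), cos(kω)]]
T_k = np.zeros((d_model, d_model))
for j, w in enumerate(omega):
    c, s = np.cos(k * w), np.sin(k * w)
    T_k[2*j:2*j+2, 2*j:2*j+2] = [[c, -s], [s, c]]

print(np.allclose(PE[:M-k] @ T_k, PE[k:]))   # True: PE_pos T_k == PE_{pos+k} for all pos
```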

Normalization layers

Let us say that to standardize a vectorword $x\in\mathbb R^{d_{model}}$ means to subtract its arithmetic mean $\overline{x}$ from each coordinate and divide each coordinate by $\sigma(x)+\epsilon$, where $\sigma(x)$ is its standard deviation and $\epsilon=10^{-6}$ is a constant embedded in the code.

Let us say that to normalize a vectorword $x\in\mathbb R^{d_{model}}$ means to standardize it and then multiply it coordinatewise by a weight vector of the same length $a\in\mathbb R^{d_{model}}$ and then add a weight vector $b\in\mathbb R^{d_{model}}$.

Each normalization layer has two weight vectors $a,b\in\mathbb R^{d_{model}}$.

The input is as usual a matrix $X$ with $n$ rows, where $n$ is the number of words in the input sentence (the number of tokens in the input). Each row of $X$ is a vectorword from $\mathbb R^{d_{model}}$.

A normalization layer is a rowwise operation that normalizes each row using the same weights $a,b$.
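For concreteness, here is a minimal NumPy sketch of such a normalization layer (the function name normalize is ours; $\epsilon=10^{-6}$ as above):

```python
import numpy as np

def normalize(X, a, b, eps=1e-6):
    """Normalization layer: standardize each row, then scale by a and shift by b."""
    mean = X.mean(axis=-1, keepdims=True)
    std = X.std(axis=-1, keepdims=True)
    return (X - mean) / (std + eps) * a + b

rng = np.random.default_rng(0)
d_model = 4
X = rng.standard_normal((3, d_model))          # three vectorwords
a, b = np.ones(d_model), np.zeros(d_model)     # the layer's weight vectors
print(normalize(X, a, b))
```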

DecoderBlock

input

$X$ is a matrix with $n$ rows, where $n$ is the number of words in the input sentence (the number of tokens in the input).

Each row of $X$ is a vectorword from $\mathbb R^{d_{model}}$ – a vector of $d_{model}$ real numbers.

Let us call this the usual shape for a matrix, because such are the tensors accepted and returned by decoder blocks and their sublayers – simplifying the exposition by assuming one sentence per batch and disregarding the batch-size axis.

Causal Attention with one head

This section is not strictly necessary, because the Multi-Head Causal Attention section is self-contained and the definition of the Decoder Block refers to the Multi-Head Causal Attention section. Its purpose is to serve as a gentle introduction to the multi-head version.

weights

$W^Q,W^K,W^V$ are square matrices of size ${d_{model}}\times{d_{model}}$ (e.g. $d_{model}=512$)

They are the weights embedded in the Causal Attention layer CA.
definition

Let

$Q(X)=X\times W^Q$,

$K(X)=X\times W^K$,

$V(X)=X\times W^V$.

Then the Causal Attention layer is defined as $$CA(X)=\mathrm{softmax}\Big(\cfrac{\unicode{0x25FA}\big(Q(X)\times K(X)^T\big)}{\sqrt{d_{model}}}\Big)\times V(X)$$
Note that $CA$ is an autoregressive operation: the operation $X\mapsto\mathrm{softmax}\big(\unicode{0x25FA}\big(Q(X)\times K(X)^T\big)/\sqrt{d_{model}}\big)$ always returns lower triangular square matrices and $X\mapsto V(X)$ is rowwise, so the Claim above applies.
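Here is a minimal NumPy sketch of the single-head layer (our helper names; the weights are random rather than trained). Masking before or after dividing by $\sqrt{d_{model}}$ gives the same result, since $-\infty$ is unaffected by the scaling:

```python
import numpy as np

def row_softmax(A):
    E = np.exp(A - A.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

def causal_attention(X, WQ, WK, WV):
    """Single-head causal attention CA(X), following the formula above."""
    d_model = X.shape[1]
    Q, K, V = X @ WQ, X @ WK, X @ WV
    scores = Q @ K.T / np.sqrt(d_model)
    mask = np.triu(np.ones_like(scores), k=1) == 1
    scores = np.where(mask, -np.inf, scores)     # ◺: -inf above the main diagonal
    return row_softmax(scores) @ V

rng = np.random.default_rng(0)
n, d_model = 5, 8
X = rng.standard_normal((n, d_model))
WQ, WK, WV = (rng.standard_normal((d_model, d_model)) for _ in range(3))
print(causal_attention(X, WQ, WK, WV).shape)     # (5, 8)
```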

Multi-Head Causal Attention

parameters

$d_{model}$ is the length of the vectorwords

$h$ is the number of attention heads

$d_{head}=\cfrac{d_{model}}{h}$ is the width of the weight matrices associated with individual heads
weights

$W_i^Q,W_i^K,W_i^V$ are matrices of dimension ${d_{model}}\times{d_{head}}$ for $i=1,\ldots,h$

$W^O$ is a matrix of dimension ${d_{model}}\times{d_{model}}$

$B$ is a matrix of dimension $n\times{d_{model}}$ whose rows are identical
Fixing $i$, let

$Q_i(X)=X\times W_i^Q$,

$K_i(X)=X\times W_i^K$,

$V_i(X)=X\times W_i^V$.

The dimension of these matrices is $n\times d_{head}$.

The calculation for an individual head is given by $$H_i(X)=\mathrm{softmax}\Big(\unicode{0x25FA}\Big(\cfrac{Q_i(X)\times K_i(X)^T}{\sqrt{d_{head}}}\Big)\Big)\times V_i(X)$$
Note that each $H_i$ is an autoregressive operation.

Note that the output matrices $H_i(X)$ have dimension $n\times d_{head}$.

Let us concatenate them to obtain a matrix of dimension $n\times d_{model}$ $$H(X)=\mathrm{Concat}\big(H_1(X),\ldots,H_h(X)\big)$$ Note that $H$ is autoregressive.
Finally, the multi-head causal attention layer is defined by $$CA(X)=H(X)\times W^O+B.$$ Note that $CA$ is autoregressive.
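Here is a minimal NumPy sketch of the multi-head layer (our helper names; random weights), following the formulas above with the bias matrix $B$ represented by a single row b that is broadcast to every row:

```python
import numpy as np

def row_softmax(A):
    E = np.exp(A - A.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

def multi_head_causal_attention(X, WQ, WK, WV, WO, b):
    """CA(X) = Concat(H_1(X), ..., H_h(X)) @ WO + B, with B broadcast from the row b."""
    n, d_model = X.shape
    h = len(WQ)                                   # number of heads
    d_head = d_model // h
    mask = np.triu(np.ones((n, n)), k=1) == 1
    heads = []
    for i in range(h):
        Q, K, V = X @ WQ[i], X @ WK[i], X @ WV[i]            # shape (n, d_head)
        scores = np.where(mask, -np.inf, Q @ K.T / np.sqrt(d_head))
        heads.append(row_softmax(scores) @ V)                # H_i(X), shape (n, d_head)
    H = np.concatenate(heads, axis=1)                        # shape (n, d_model)
    return H @ WO + b                                        # b is broadcast to every row

rng = np.random.default_rng(0)
n, d_model, h = 5, 8, 2
X = rng.standard_normal((n, d_model))
WQ, WK, WV = ([rng.standard_normal((d_model, d_model // h)) for _ in range(h)] for _ in range(3))
WO, b = rng.standard_normal((d_model, d_model)), rng.standard_normal(d_model)
print(multi_head_causal_attention(X, WQ, WK, WV, WO, b).shape)   # (5, 8)
```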

The Feed Forward layer

The feed forward layer is defined by $$FF(X) = \mathrm{ReLU}\big(N_{ff}(X)\times A+K\big)\times B+L$$ where $N_{ff}$ is a normalization layer with its two weight vectors $a,b\in\mathbb R^{d_{model}}$,

$A$ is a weight matrix of $d_{model}$ rows and $d_{ff}$ columns (e.g. $d_{ff} = 2048$),

$K$ is a weight matrix of dimension $n\times d_{ff}$ with identical rows,

$B$ is a weight matrix of $d_{ff}$ rows and $d_{model}$ columns,

$L$ is a weight matrix of dimension $n\times d_{model}$ with identical rows.

Note that $a,b,A,K,B,L$ are weights embedded in $FF$.

Note that the input matrix $X$ and the output matrix $FF(X)$ both have the usual shape of $n$ rows and $d_{model}$ columns.

Moreover, note that $FF$ is a rowwise operation.
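Here is a minimal NumPy sketch of the feed forward layer (our helper names; random weights), with the identical rows of $K$ and $L$ represented by single bias rows:

```python
import numpy as np

def feed_forward(X, a, b, A, K_row, B, L_row, eps=1e-6):
    """FF(X) = ReLU(N_ff(X) A + K) B + L, with K and L broadcast from single rows."""
    N = (X - X.mean(axis=-1, keepdims=True)) / (X.std(axis=-1, keepdims=True) + eps) * a + b
    return np.maximum(N @ A + K_row, 0.0) @ B + L_row

rng = np.random.default_rng(0)
n, d_model, d_ff = 5, 8, 32
X = rng.standard_normal((n, d_model))
a, b = np.ones(d_model), np.zeros(d_model)                              # inner normalization weights
A, B = rng.standard_normal((d_model, d_ff)), rng.standard_normal((d_ff, d_model))
K_row, L_row = rng.standard_normal(d_ff), rng.standard_normal(d_model)  # the identical rows of K and L
print(feed_forward(X, a, b, A, K_row, B, L_row).shape)                  # (5, 8)
```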

DecoderBlock Definition

In order to match the Trax implementation we will write the definition using the residual layer notation $$Res(f)(x) = x+f(x),$$ where $f$ is a layer and $x$ is an input tensor.
The DecoderBlock is a composition of a normalization layer $N_{ca}$ and the layers previously defined: $$DB(X)=Res(FF)\big(Res(CA\circ N_{ca})(X)\big)$$
Without the residual layer notation it is simply $$DB(X)=X+CA(N_{ca}(X))+FF(X+CA(N_{ca}(X)))$$
Note that the input matrix $X$ and the output matrix $DB(X)$ have the same usual shape $n\times d_{model}$.

Moreover, note that $DB$ is an autoregressive operation.
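The residual composition can be illustrated with a few lines of Python (ours). The sub-layers are passed in as callables; with the attention, normalization and feed-forward sketches above plugged in, this gives a decoder block, while the identity stand-ins below only demonstrate the plumbing:

```python
import numpy as np

def residual(f):
    """Res(f)(x) = x + f(x)."""
    return lambda x: x + f(x)

def decoder_block(X, ca, n_ca, ff):
    """DB(X) = Res(FF)(Res(CA o N_ca)(X)); ca, n_ca and ff are the sub-layer callables."""
    return residual(ff)(residual(lambda x: ca(n_ca(x)))(X))

# Identity stand-ins just to show the plumbing; in a real model these would be the
# multi-head causal attention, normalization and feed-forward sketches above.
identity = lambda x: x
X = np.arange(12.0).reshape(3, 4)
print(decoder_block(X, ca=identity, n_ca=identity, ff=identity))   # equals 4 * X here
```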
inspirational idea

Perhaps the decoder block would be a little more efficient if the formula were simplified to $DB(X) = X+FF\big(X+CA(N_{ca}(X))\big)$.

Final Layer

The input to the final layer is the output of a decoder block, which is a matrix $X$ of the usual shape $n\times d_{model}$, where $n$ is the number of tokens in the original input to the transformer language model.

The final layer's weight matrix $Y$ has dimension $d_{model}\times\mathrm{vocab\_size}$.

The final layer is a matrix multiplication defined as $\Phi(X)=X\times Y+B$,

where $B$ is a weight matrix of dimension $n\times\mathrm{vocab\_size}$ with identical rows.

The output is a matrix of dimension $n\times\mathrm{vocab\_size}$.

Note that it is a rowwise operation.

Definition of TransformerLM

parameters

$\mathrm{vocab\_size}$ is the size of the set of input tokens

$d_{model}=512$ is the length of the vectorwords passed from sublayer to sublayer

$d_{ff}=2048$ is the width of certain weight matrices inside the feed forward layer in each decoder block

$n\_layers=6$ is the number of decoder blocks

$n\_heads=8$ is the number of attention heads in each decoder block

$\mathrm{max\_len}=2048$ is the maximum number of tokens in the input sequence
$$T_{LM}=\Phi\circ N\circ DB_{n\_layers}\circ\ldots\circ DB_1\circ PE\circ Em$$ where $Em$ is the Embedding layer, $PE$ is the Positional Encoding layer, $DB_1,\ldots,DB_{n\_layers}$ are decoder blocks, $N$ is a normalization layer and $\Phi$ is the final layer.
The input is a sequence of tokens.

Each token is an integer from $\{0,1,\ldots,\mathrm{vocab\_size}-1\}$.

The output is a matrix of dimension $n\times\mathrm{vocab\_size}$ containing real numbers, where $n$ is the number of tokens in the input sequence.

Note that it is an autoregressive operation if we interpret the input as a matrix of dimension $n\times 1$.
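The composition $T_{LM}=\Phi\circ N\circ DB_{n\_layers}\circ\ldots\circ DB_1\circ PE\circ Em$ can be sketched in a few lines of NumPy (ours, not the linked implementation). For brevity the decoder blocks are identity stand-ins and the normalization layer uses $a=1$, $b=0$; the point is the order of the layers and the shapes:

```python
import numpy as np

# Shape-only stand-ins for the layers defined above; real (trained) weights omitted.
vocab_size, d_model, max_len, n_layers = 10, 8, 16, 2
rng = np.random.default_rng(0)
W_embed = rng.standard_normal((vocab_size, d_model))
PE_w = rng.standard_normal((max_len, d_model))
W_final = rng.standard_normal((d_model, vocab_size))
b_final = rng.standard_normal(vocab_size)

embed = lambda tokens: W_embed[tokens]                 # Em
pos_enc = lambda X: X + PE_w[:X.shape[0]]              # PE
norm = lambda X: (X - X.mean(-1, keepdims=True)) / (X.std(-1, keepdims=True) + 1e-6)  # N with a=1, b=0
final = lambda X: X @ W_final + b_final                # Phi

def transformer_lm(tokens, decoder_blocks):
    X = pos_enc(embed(tokens))
    for db in decoder_blocks:                          # DB_1, ..., DB_{n_layers}
        X = db(X)
    return final(norm(X))

tokens = np.array([3, 1, 4, 1, 5])
out = transformer_lm(tokens, decoder_blocks=[lambda X: X] * n_layers)  # identity stand-ins for DB_i
print(out.shape)                                       # (5, vocab_size)
```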

Evaluation (the loss function)

The essence of the loss function — single sequence input with no loss weights

The input sequences are elements of $\{0,1,\ldots,\rm{vocab\_size-1}\}^n$, where $n$ is the sequence length and $\rm{vocab\_size}$ is the number of possible tokens from which the sequences are built.

We are given an autoregressive operation $T$, which takes an input sequence (interpreted as a matrix of dimension $n\times 1$) and outputs a real-valued matrix of dimension $n\times\rm{vocab\_size}$.

In this context, we are going to define a loss function which returns a positive real number for each input sequence.
Let $(x_1,\ldots,x_n)$ be the input sequence.

In the first step of the definition, let $X=(0,x_1,\ldots,x_n)$ be viewed as the $(n+1)\times 1$ dimensional input matrix for the autoregressive operation $T$. Then $T(X)$ is a real-valued matrix of dimension $(n+1)\times\rm{vocab\_size}$, but we will be interested only in the first $n$ rows $T(X)|n$.
Note that each row of $T(X)|n$ is a sequence of $\rm{vocab\_size}$ many arbitrary real numbers as outputted by the working of the operation $T$. In the second step, we apply the LogSoftmax function to each row separately to obtain $Y=\rm{LogSoftmax}(T(X)|n)$, which is now a matrix of $n$ rows with each row serving as a log probability distribution over the set of tokens.
At this point we have created a function $k\mapsto Y[k]$, which assigns a log probability distribution over the set of tokens for each position in the original input sequence $(x_1,\ldots,x_n)$.

Note that — due to the autoregressive nature of $T$ and the insertion of the $0$ token in the first step — we have guaranteed that the $k$-th row of $T(X)|n$ depends only on the tokens $(0,x_1,\ldots,x_{k-1})$ and not on $x_k$ or any later token. Therefore it makes sense to evaluate the outputted log distribution $Y[k]$ against the actual token $x_k$ in the input.
Let $y_k = Y[k]_i$, where $i=x_k+1$, for each $k=1,\ldots,n$.

Note that in our setup $y_k\lt 0$ is interpreted as the log probability of the token $x_k$ appearing on the $k$th position in the input string after $(x_1,\ldots,x_{k-1})$. The closer to zero the better the evaluation.
Finally, the loss function is defined as $$-\cfrac{1}{n}\sum_{k=1}^n y_k,$$ which is a positive number. The closer to zero the better.
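Here is a minimal NumPy sketch of this loss (our helper names). The dummy $T$ below just returns random scores to exercise the formula; in the real setting $T$ is the TransformerLM:

```python
import numpy as np

def log_softmax(A):
    A = A - A.max(axis=-1, keepdims=True)
    return A - np.log(np.exp(A).sum(axis=-1, keepdims=True))

def loss(T, x):
    """Loss of the sequence x = (x_1, ..., x_n) under the autoregressive operation T."""
    n = len(x)
    X = np.concatenate([[0], x])          # prepend the 0 token
    Y = log_softmax(T(X)[:n])             # keep only the first n rows, shape (n, vocab_size)
    y = Y[np.arange(n), x]                # y_k = Y[k]_i with i = x_k + 1 (0-based columns here)
    return -y.mean()                      # positive; the closer to zero the better

# A dummy T returning random scores, just to exercise the formula (not autoregressive).
vocab_size = 10
rng = np.random.default_rng(0)
T = lambda X: rng.standard_normal((len(X), vocab_size))
print(loss(T, np.array([3, 1, 4, 1, 5])))
```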

Batch evaluation with loss weights

The input sequences are elements of $\{0,1,\ldots,\rm{vocab\_size-1}\}^n$, where $n$ is the sequence length and $\rm{vocab\_size}$ is the number of possible tokens from which the sequences are built.

We will be dealing with a batch of input sequences — formally a matrix of dimension $\rm{batch\_size}\times n$, whose rows are input sequences. Additionally, there's a matrix of the same dimension whose elements are real numbers from the closed unit interval $[0,1]$, which is called the matrix of loss weights for the input batch.

We are given an autoregressive operation $T$, which takes an input sequence (interpreted as a matrix of dimension $n\times 1$) and outputs a real-valued matrix of dimension $n\times\rm{vocab\_size}$. This time the operation $T$ is going to act simultaneously and independently on each input sequence from the batch so that formally it is a function which takes an array of dimension $\rm{batch\_size}\times n$ and returns an array of dimension $\rm{batch\_size}\times n\times\rm{vocab\_size}$.

In this context, we are going to define a loss function which returns a positive real number for each batch of input sequences with loss weights.
Let us fix notation so that the $j$th row in the input batch is denoted as $(x_{j1},\ldots,x_{jn})$ for $j=1,\ldots,\rm{batch\_size}$.

Similarly, let the $j$th row in the loss weights matrix be denoted as $(\omega_{j1},\ldots,\omega_{jn})$ for $j=1,\ldots,\rm{batch\_size}$.
Let $(y_{j1},\ldots,y_{jn})$ be the log probabilities for the $j$th input sequence as computed according to the formula in the previous section, for $j=1,\ldots,\rm{batch\_size}$.
The loss function is defined by $$-\cfrac{\sum_{(j,i)}\omega_{ji}y_{ji}}{\sum_{(j,i)}\omega_{ji}}$$ where the summation index $(j,i)$ ranges over $\{1,\ldots,\rm{batch\_size}\}\times\{1,\ldots,n\}$.

Note that this is a generalization of the loss function from the previous section, which can be seen by setting the weights to zeros on all rows except on a single row where they should be all ones.

not the same as average over rows

Note that this is not the same thing as the average loss function of a single input sequence with loss weights taken over all the rows: $$\cfrac{1}{s}\sum_{j=1}^s\Big(-\cfrac{\sum_{i=1}^n\omega_{ji}y_{ji}}{\sum_{i=1}^n\omega_{ji}}\Big)$$ where $s=\rm{batch\_size}$. The two formulas give the same result if the sums $\sum_{i=1}^n\omega_{ji}$ are the same for each row, but they are slightly different otherwise.
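The difference between the two formulas can be seen numerically with a small NumPy sketch (ours), using hand-picked $0/1$ loss weights whose row sums differ:

```python
import numpy as np

def batch_loss(Y_log, W):
    """Batch loss: -(sum w_ji * y_ji) / (sum w_ji), with the sums over all (j, i)."""
    return -(W * Y_log).sum() / W.sum()

def mean_of_row_losses(Y_log, W):
    """Average over rows of the per-row weighted losses (not the same thing in general)."""
    return np.mean(-(W * Y_log).sum(axis=1) / W.sum(axis=1))

rng = np.random.default_rng(0)
Y_log = -rng.random((4, 6))                    # log probabilities y_ji for a batch of 4 sequences
W = np.array([[1., 1, 1, 1, 0, 0],             # 0/1 loss weights with different row sums
              [1., 1, 0, 0, 0, 0],
              [1., 1, 1, 1, 1, 1],
              [1., 0, 0, 0, 0, 0]])
print(batch_loss(Y_log, W), mean_of_row_losses(Y_log, W))   # generally two different values
```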

Gradient-Descent-based minimization of the loss function

Our loss function is a positive real-valued function but we need to take a closer look at its domain and its differentiability in order to use it within a gradient-descent-based minimization algorithm.
The loss function depends on the batch of input sequences on the one hand and on the collection of weights from the $T$ operation (which is a TransformerLM) on the other. We may think of a batch space and a weight space so that the loss function takes a point from the batch space and a point from the weight space to return the loss value.

Note that the loss function is suitable for a mini-batch (stochastic) gradient descent setup because it works for any point from the batch space.

Claim. Fixing a batch space point, this function is differentiable almost everywhere as a multivariate real-valued function from the weight space.
Proof. Apart from ReLU, it is a composition of the basic arithmetic operations (addition, subtraction, multiplication, division) and the elementary functions $\exp$ and $\log$ coming from softmax and LogSoftmax; ReLU itself is differentiable everywhere except at $0$.

In order to calculate the gradient by automatic differentiation it remains to arbitrarily define the derivative of ReLU at $0$.