\[\gdef\op#1{\operatorname{#1}} \gdef\tuple#1{\left(#1\right)} \gdef\list#1{\left[ \mkern{1pt} #1 \mkern{1pt} \right]} \gdef\fs#1#2#3{#1\mathbin{:}#2\mathbin{ \to }#3}\]
Rainbow array algebra

Rainbow array algebra #

Introduction #

This is the second of a two part series based on a talk I gave at the Hypermatrix Workshop in May 2023.

In the first post, Classical array algebra, I set out a simple framework for thinking about how multidimensional arrays are manipulated in classical array programming.

In this post I will present an alternative paradigm called rainbow array algebra. This simplifies and unifies many of the constructions in this post and offers a better foundation for array programming in the coming decades.

Recap #

In the first post, we identified n-arrays with functions of the form:

A : ⟨P1,P2,…,Pn⟩ -> V

The set ⟨P1,P2,…,Pn was called the key space, and V the value space.

We also thought of arrays as containers of elements from V, which are held within cells.

Each cell of an array is located by a tuple (p1,p2,…,pn) called a key, whose entries pi ∈ Pi are called parts, taking values from the associated part spaces Pi. The entire key is just an element of the key space.

An axis i ∈ 1..n refers to the i‘th component of the key space.

A scalar is a 0-array with no axes. It has key space ⟨⟩ = { () }, and hence exactly one cell and value.

We presented an array using colorful half-frame notation, so a 3-array of shape 2,3,4 could be written:

[0 0 1 0 [0 1 1 1 
[1 0 0 0 [0 0 1 0
[0 0 1 1 [0 1 1 1

We introduced a handful of operations that be applied to either single arrays or combinations of them, listed here below:

operation meaning cellwise definition
lift apply f to matching cells f(A,B)[…] ≡ f(A[…], B[…])
aggregate use f to summarize axis fn(A)[…] ≡ f(A[…,,…] | )
bin collapse parts of an axis together fRn(A)[…,,…] ≡ f{ A[…,,…] | R(,) }
broadcast add constant axis An[…,,…] ≡ A[…]
transpose permute (reorder) axes Aσ[…] ≡ A[σ(…)]
nest form array of arrays An[…][] ≡ A[…,,…]
pick choose cells by key A[P][…] ≡ A[P[…]]

Critique of classical array algebra #

Keys as tuples #

Classical arrays involve cells identified by keys, being tuples of parts. Tuples are simply ordered lists. Crucially, the fact that keys are ordered lists is what gives us an ordering for axes of an array, since axis n is just a “slice” through the n‘th component of the key space.

The natural question is: are tuples the best choice? What happens if we replace tuples with another data structure? What will this do to our notion of the axes of an array?

Certainly, tuples are simple, familiar data structures. In the functional perspective in which arrays are functions from key spaces to value spaces, tuples being keys correspond to the familiar idea of positional function arguments, which are currently the norm in software programming. This inspired the shorthand A[1,2,3] ≡ A( (1,2,3) ) for the cell of a 3-array.

Secondly, the commitment to ordered axes makes it easy to store the content of arrays. In a computer, the contents must be laid out in consecutive positions in linear memory (RAM). Ordered axes let us compute a memory location from key directly as long as we know the overall shape.

For example, the key (3,1,2) ∈ ⟨3,3,3 yields an offset into linear memory:

offset = (3−1) * 9 + (1−1) * 3 + (2−1) = 19

Adopting this kind of addressing scheme is a requirement of executing array programs efficiently, since it permits us to avoid storing the keys themselves – every choice of key uniquely identifies a location in memory where the corresponding value can be found. Similarly, we can efficiently work backwards from a memory location to the corresponding key.

And so ordered axes have been woven into the fabric of classical array programming.

The problem with tuples #

As we have seen, composition of arrays require matching corresponding axes from the composed arrays. For example, tinging a color image with a tinting vector requires matching the tinting vector with the color axis of the image.

Getting this matching right may require fiddly and error-prone transposition and broadcasting. Worse, innocuous semantic changes can require cascading and non-local changes to an array codebase. For example, adding an additional axis to the input of an array program so that it operates on a vector of inputs to compute a vector of results can require rewriting many transpositions, broadcasts, and other operations.

Moreover, representing the axes of an array in this ordered way discards useful semantic information that might have been better kept around. This needless erasure of meaning is the source of countless bugs in array programming. The common solution is meticulous documentation of the source code of an array program to explain the interpretation of each axis at each point in a computation.

An analogy from historical computing #

This situation is eerily similar to one in the early days of computer programming. The registers that store temporary data in a CPU are numbered. Initially, code could only be written in the native machine code of the CPU, and keeping track of what data was stored in which registers at which steps of a computation was a laborious exercise that taxed the memory and attention of the first human programmers.

Eventually, programmers invented higher-level programming languages which employed named variables, using automated compilers to translate the use and manipulation of these named variables into the corresponding manipulations of the CPU’s registers. Working with these names directly made programs much easier to read and write. In fact, these compilers often create more performant machine code, being able to analyze subtle tradeoffs that come from register use in different parts of the program.

A similar situation exists today with array programming, in which current array programming frameworks have forced array programmers into making choices about how axes should be ordered. Axis order can have a major impact on how array programs perform due to subtle effects such as cache locality. Just like in the early history of computing, it is hopefully clear that array programmers should focus instead on what the axes mean and leave the optimal machine implementation to automated compilers.

Rainbow arrays #

Introduction #

As we eluded to, our main move is to replace the role of the tuple in our key spaces with a data structure called a record. These records will encode the same part information but will label the axes with explicit axis names, rather than having axes be implicitly labeled by their linear order in the tuple. But what exactly is a record?

Records #

Let’s take an example key tuple for a 3-array:


And replace it with a record, which is a structure we will write like this:

(a=5 b=3 c=2)

The tuple has components labeled by 1, 2, and 3. The record has fields labeled by a, b, and c.

Relationship to axes #

As we saw, the relationship tuples have to axes is that the n‘th axis is associated with the n‘th component of every key tuple.

With records, we instead have that the axis named a is associated with the a field of every key tuple.

Therefore, record keys are the natural counterpart to tuple keys when we name our axes instead of number them.

Relationship to shapes #

Before, we saw that a shape like ⟨3,2,4⟩ denotes a Cartesian product, which is the set of tuples:

3,2,4⟩ ≡ { (i,j,k) | 1≤i≤3, 1≤j≤2, 1≤k≤4 }

Switching to records, we can create a record shape ⟨a=3 b=2 c=4⟩ that creates a set of records:

⟨a=3 b=2 c=4⟩	≡ { (a=i b=j c=k) | 1≤i≤3, 1≤j≤2, 1≤k≤4 }

Rainbow notation #

However, instead of writing a record as (a=3 b=2 c=4), we color code the fields via a,b,c and write simply (3 2 4):

(3 2 4) ≡ (a=3 b=2 c=4)

Note that in this notation we have dropped commas, because the structure (3 2 4) is not a tuple (where order of components matters), but a rainbow notation for a record (which has no order of fields).

Similarly, the colored shape of a matrix with 3 rows and 4 columns under the color coding row,column is:

3 4⟩ ≡ ⟨4 3⟩ ≡ ⟨row=3 column=4

Notice order here is inconsequential, it is color that becomes the notational carrier of axis label information.

We can further adapt our array algebra notation to reflect this orderless, color-based semantics.

Array lookup #

For array lookup, we replace A[,,] with A[]. Notice again the lack of commas, since:

A[] ≡ A[] ≡ A[] ≡ …

Similar to before, the cell value A[] is still shorthand for function application on a record A( () ).

The full syntactic desugaring under the ambient color coding a,b,c looks like this:

A[] ≡ A( () ) ≡ A( (a= b= c=) )

Shape #

As before, we will indicate the colorful shape of an array by writing a “partial function signature” that describes its key space. Here, a matrix with 3 rows and 4 columns is written:

M : ⟨3 4

When we don’t wish to state explicit values, we will as before use squares:

M : ⟨

Axes #

We will denote the set of axes of an array A with axes(A) or A. An axis is just a color, of course, so to write down such an abstract color we will use a colored symbol:

S : ⟨⟩ -> V         axes(S) = S = {}
V : ⟨5⟩ -> V axes(V) = V = {}
M : ⟨3 4⟩ -> V axes(M) = M = {,}
A : ⟨2 2 2⟩ -> V axes(A) = A = {,,}

Role of color #

To summarize the role of color in our notation systems with a table:

name example role of order role of color
classic notation A[i,j,k] primary none
colorful notation A[i,j,k] primary visual aid
colorful notation A[,,] primary replace symbols
rainbow notation A[ijk] visual aid primary
rainbow notation A[] none primary

Notice that in both rainbow and colorful notation, we can but need not replace the role of symbols i,j,k with tokens ,, to denote unique parts of a key tuple / record.

Examples redux #

Let’s re-examine the case of our color image, whose subpixels are shown here:

Below we’ve highlighted a particular green subpixel given by y = 3 and x = 4:

Here is the classical versus rainbow view of this cell, under coding y,x,c:

paradigm shape of array key of subpixel
classical ⟨5,4,3⟩ (3,4,2)
record ⟨y=5 x=4 c=3⟩ (y=3 x=4 c=2)
rainbow 5 4 3 (3 4 2)

Rainbow circuits #

In the last post, we saw examples of array circuits, which depict graphically how arrays can be combined with one another. For example, a matrix defined by adding a vector to the rows of another matrix is defined cellwise like this:

A[r,c] ≡ M[r,c] + V[c]

This has the corresponding circuit:

Here, the ports that take key parts are ordered from left to right. For rainbow arrays, we eschew order and of course color the ports instead. For example, let us take the following cellwise definition:

A[] ≡ M[] + V[]

This corresponds to the following rainbow circuit:

Unlike with classical array circuits, the order of ports and wires does not matter, only their color.

Reorganizing the API #

How do rainbow arrays reformulate our algebra? Here’s a quick summary:

operation change in rainbow world
lift automatically broadcast over missing colours
, nest
parameterized by color rather than order
transpose meaningless; axes are not ordered
broadcast redundant
pick unchanged
recolor new operation – equivalent of transpose

In short, we mostly eliminate one operation from our API – broadcasting – since it becomes largely redundant owing to the increased flexibility of map operations over their lifted cousins.

Perhaps more importantly, we gain clarity about the intended semantics of our operations, since the axes preserve their “semantic role” across compositions.

Let’s examine in detail how the operations of classical array programming change now that we are under the rainbow.

Rainbow operations #

Aggregate, merge, nest ops #

For rainbow arrays, aggregation, merging, and nesting remain largely as before. The only difference is that they are parameterized by axis color rather than by axis number.

Previously, we wrote an axis-parameterization operation as, for example:


(or sum2 in colourful notation). But if our axes are not ordered, and color is the only distinction between axes, we use a coloured symbol that identifies the relevant axis:


For example, we can -nest a -array H to obtain a -vector of -matrices:

   H : ⟨⟩ -> V
H : ⟨⟩ -> ⟨⟩ -> V

H[][] ≡ H[]

Lift / map op #

We saw that for classical arrays, lift and broadcasting operations are often combined to perform common operations like matrix multiplication.

For rainbow arrays, the broadcasting step is naturally subsumed under the lift step. The mechanism arises from the fact that all arrays that participate in an lift operation must have the same colourful shape. If any of the arrays have shapes that have fewer colours than necessary, there is a unique way they can be broadcasted to have the right colors (= named axes). The resulting array will then have the union of the colors of all the inputs. This means there is a unique array operation for each operation of the underlying value space, as long as the arrays being combined have “compatible shapes”.

We will name this new, more flexible operation map.

Let’s look at some examples to make things clearer.

Vector times vector #

We’ll start by multiplying two vectors. Because each vector has a single axis and therefore a single color, there are two possibilities: the colors match, or they don’t.

First, let’s consider matching colors; for example, the multiplication of two -vectors. The shapes of the inputs and output of the mapped * are:

     U : ⟨
V : ⟨
U * V : ⟨

The cellwise definition can only be one thing, corresponding to the familiar lift multiplication of two classical vectors:

(U * V)[] ≡ U[] * V[]

[1 2 3 * [0 1 2 = [0 2 6

Next, let’s consider non-matching colors, the multiplication of a -vector and a -vector:

     U : ⟨
V : ⟨
U * V : ⟨

According to our rule, we broadcast U to have a blue axis and V to have a red axis to make their shapes compatible, which implies a unique cellwise definition:

(U * V)[] = U[] * V[]
= U[] * V[ ]

Here, we introduce the notational shorthand A to mean that an array is broadcast to have a colored axis with suitable part space. Hence, our cellwise definition of * on these shapes is determined uniquely to be:

(U * V)[] = U[] * V[]

Let’s see a concrete example:

          0    [0 0 0    [0 1 2 
[1 2 3 * 1 = [1 2 3 = [0 2 4
          2 [2 4 6 [0 3 6

Note the special feature of rainbow arrays that we can lay out their axes in any order we choose, as long as the brackets are colored correctly! This is a reflection of the fact that there is no canonical order of axes, only their canonical color. For classical arrays, the opposite was true: only the order of nested brackets mattered, their color was merely a visual aid to communicate how axes between several arrays corresponded.

Array times scalar #

A scalar array S has no colors, and so there is no possibility of sharing. Here we illustrate this for a vector:

     S : ⟨⟩
V : ⟨
S * V : ⟨

(S * V)[] ≡ S[] * V[]

2 * [1 2 3] = [2 4 6]

And for a matrix:

     S : ⟨⟩
M : ⟨
S * M : ⟨

(S * M)[] ≡ S[] * M[]

2 * [1 2 3 = [2 4 6
    [3 2 1 [6 4 2

The rainbow diagram for this kind of situation is depicted below, for scalar-times-vector on the left and scalar-times-matrix on the right:

Matrix times matrix #

For two matrices, there are 3 possibilities for sharing:

z z z
M : ⟨ N : ⟨ 2 shared colors
M : ⟨ N : ⟨ 1 shared colors
M : ⟨ N : ⟨ 0 shared colors

Let’s start with full sharing, where we are multiplying two -matrices:

     M : ⟨
N : ⟨
M * N : ⟨

(M * N)[] ≡ M[] * N[]

[1 2 * [0 1 = [0 2
[3 4 [1 0 [3 0

This is the classical lift-multiplication. The rainbow diagram for this kind of situation is depicted below:

Next we have one shared color, where we are multiplying -matrix and a -matrix:

     M : ⟨
N : ⟨
M * N : ⟨

(M * N)[] ≡ M[] * N[]

[1 2 * 0 1 = 1*0 2*1 = 0 2
[3 4 1 0 1 0 1 0
        3*0 4*1 0 4
        1 0 3 0

The rainbow diagram for this kind of situation is depicted below:

Finally, we have the case of zero shared colors, where we are multiplying a -matrix and a -matrix:

     M : ⟨
N : ⟨
M * N : ⟨

(M * N)[] ≡ M[] * N[]

[1 2 * [0 1 = [0 1 [0 2
[3 4 [1 0 [1 0 [2 0
        [0 3 [0 4
        [3 0 [4 0

This is a kind of “matrix nesting”, where each cell of one matrix is replaced by a scaled copy of the other matrix, with that scaling factor being the corresponding value from in the cell of the first matrix. It doesn’t matter which matrix we choose to nest inside the other – as long as our choice matches the order of axis nesting when we write the result down.

The rainbow diagram for this kind of situation is depicted below:

Matrix multiplication #

We are now in a position to construct matrix multiplication by mapping * over matrices that share a single color, and then aggregating that color:

    M : ⟨
    N : ⟨
M ⋅ N : ⟨

(M ⋅ N)[] ≡ sum{ M[] * N[] }

M ⋅ N ≡ sumMN (M * N)

Here recall that A refers to the set of colors of A, so that MN gives the set of shared colors:

MN = {,} ⋂ {,} = {}

This definition of rainbow dot product gives us exactly the correct definition of the inner product between two same-colored vectors U and V:

    U : ⟨
    V : ⟨
U ⋅ V : ⟨⟩

U ⋅ V ≡ sumUV(U * V)
      = sum(U * V)
(U ⋅ V)[] = sum{ U[] * V[] }
          = U[1] * V[1] + U[2] * V[2] + …

General rainbow dot product #

The general form of this rainbow dot product A ⋅ B applies to pairs of arrays A,B and produces an array that has the symmetric difference of the colors of A and B, meaning the colors that are in one shape but not in both:

    A : J -> V
    B : K -> V
A ⋅ B : (JK) -> V

A ⋅ B ≡ sumAB (A * B)

Here, we are adopting the convention that set operations like JK applied to shapes occurs fieldwise, so that:

5 3⟩ ⋃ ⟨3 6⟩ = ⟨5 3 6
5 3⟩ ⋂ ⟨3 6⟩ = ⟨3
5 3⟩ △ ⟨3 6⟩ = ⟨5 6
5 3⟩ - ⟨3 6⟩ = ⟨5

Example: tinting an image #

As a practical example of how rainbow mapping can streamline array programming, let’s look at a particularly simple example: tinting a color image.

Recall that a color image can be represented as an classical array of shape I:⟨H,W,3 (here using the YXC convention).

Let’s say that we wish to change the color balance of this image, by doubling the intensity of the blue channel, and keeping the other channels the same. This tinting factor can be represented as a 3-vector:

T : ⟨3⟩ = [1 1 2]

The operation tint(I, T) of tinting I by tinting factor T can be defined cellwise as:

tint(I, T)[y,x,c] ≡ I[y,x,c] * T[c]

In terms of our classical array API, this can be achieved by broadcasting T over the first (Y) and second (X) axes followed by an lift-multiplication with I:

tint(I, T) = I * T1➜H,2➜W

In practice, array programming library recognize this form of broadcasting and do not actually actualize the array T1➜,2 in memory. The resulting computation is closer in spirit to our cellwise definition above.

In the rainbow world, we have two important changes: firstly, the image has colorful shape I:⟨H W 3, which does not involve any choice of axis convention, and so any definition of tint has no need to pick a convention.

Secondly, we can achieve the tinting operation simply as mapping I * T:

tint(I, T) ≡ I * T

(I * T)[y,x,c] ≡ I[y,x,c] * T[c]

This is not merely a benefit in simplicity. It also means that the exact same definition of the tint operation will work on an array representing a video, in which an additional time axis T is present that represents a sequence of successive image frames:

V : ⟨T H W 3

tint(V, T)[t,y,x,c] ≡ V[t,y,x,c] * T[c]

Example: computing a conditional probability #

TODO: show how marginalization, conditioning, and independence are handled naturally in rainbow algebra using contingency tables and are much less clear in classical array programming.

General cellwise definition #

Having seen a variety of examples of mapping, we can take a stab at a fomal definition of map f.

To do this, we will introduce a slight notational convenience, which is to write the array type K->V more compactly as VK (this is a kind of “stepped down” version of the way we write the set of functions from A to B as BA).

 f : (V  , V  ,  , V  ) -> V

A1 : VK1
A2 : VK2

An : VKn

f : (VK1, VK2, , VKn) -> VK
K = K1 K2 Kn

The situation is depicted below in a rainbow diagram:

Even this is not quite general enough, since the function f could take arguments of varying type, and issue yet a third type. We will now switch to coloring rather than numerically indexing the arguments, so that f takes arguments of types V,V,…,V and returns a type V, and the arrays A,A,…,A are of type VK=K->V:

 f : ( V,  V,  ,  V) ->  V

A : VK
A : VK

A : VK

f : (VK, VK, , VK) -> VK
K = K K K

The use of color in this example is quite different from the use of color elsewhere in this post – we are simpy using it as an alternative to numbered indexing of the arguments of f and the arrays Ai. But it certainly suggests that we could consider functions that operate on records rather than tuples. In that case, the use of color would accord with that of rainbow notation, where color indentifies the field of a record.

Let us explore this generalization with a rainbow diagram:

The only change here is that the input wires of the mapped function f are also colored, indicating the function takes a record with fields red, blue, and green. We will leave this intriguing possibility for a future section.

Recolor op #

We just stated the transposition is made redundant by moving to rainbow arrows. For classical arrays, transposition reordered axes but preserved arity; recoloring is similar.

As a canonical example, let’s consider the colored version of the familiar operation of transposing a matrix, which interchanges the roles of rows and columns.

Initially, let’s work without any of our syntax sugar, relying on explicit record notation. However, we will still use the ambient color coding a,b,c to make things easier to follow.

Say we have a matrix M:⟨a=2 b=3 but we want a matrix M̂:⟨a=2 c=3. Essentially, we want to rename the axis b to have name c.

However, as is typical of cellwise definitions, we must work backwards, since we must translate cell lookups of the array we want, , to cell lookups of the array we have, M.

Conceptually, we must apply a function 𝜎={aa,cb} to the fields of the key for to obtain keys for M. We will write this with the same superscript notation as classical transpose:

Mcb[(a=i c=j)] ≡ M[(a=i b=j)]

In the superscript we’ve skipped the field a, which isn’t renamed. But it’s clear that what 𝜎 is operating on keys as follows:

𝜎( (a=i c=j) ) = (a=i b=j)

Now let’s mix in the syntactic sweetener, which we avoided because it makes things look almost too trivial!

Our recoloring operation will turn a matrix M:⟨ into a matrix M:⟨. It is defined cellwise as:

M[i j] ≡ M[i j]

In other words, the action of 𝜎 = {} on keys is:

𝜎((i j)) = (i j)

Note that in the discussion above we are using the symbol to indicate purely an axis color, not a particular size () or part ().

Enum op #

Perhaps the most unusual operation we’ll introduce is the enum operation. It is a nullary operation, that is to say it takes no input arrays, but still produces an output array. This output has the curious property that its value space is the same as its key space!

enumK() : K -> K

The array it produces behaves like the identity function, yielding the keys it is passed:

enum : ⟨…⟩ -> ⟨…⟩

enum[…] ≡ (…)

The enum op is parameterized by the shape K what we wish to work with. Let’s see some examples:

Enum vector #

Our first example is the enum for a shape 3:

enum3 = [(1) (2) (3)]

This array has three cells, which contain tuples of length 1. Notice the property that the value at a key is that key itself:

enum3[] ≡ enum3( () ) = ()

Enum matrix #

Our next example is the enum for a shape 3,4:

enum3,4 = [(1 1) (1 2) (1 3) (1 4)
           [(2 1) (2 2) (2 3) (2 4)
           [(3 1) (3 2) (3 3) (3 4)

The keys of a matrix are pairs (= 2-tuples), and so each cell contains a pair that is the location of that cell. We’ve colored the components of the tuple to show which axis they refer to.

Enum 3-array #

This pattern repeats, of course, for higher arity arrays:

enum2,2,2 = [(1 1 1) (1 1 2)
           [(1 2 1) (1 2 2)
           [(2 1 1) (2 1 2)
           [(2 2 1) (2 2 2)

Enum scalar #

The most trivial enum is just the scalar enum array:

enum⟨⟩ = ()

Applications #

TODO: computing a gaussian array

TODO: finding the position of a maximum / minimum

TODO: inverting an array

Multisets #

TODO: introduce type of multisets on a type, recontextualize aggregation.

Transfomers #

A practical example of the utility of rainbow arrays can be seen when describing the Transformer architecture. Transformers are a family of deep neural network architectures that have shown astonishing promise in modelling human language via autoregression: predicting the next token in a sequence of tokens that represents written language.

A transformer represents text as a sequence of tokens. For simplicity we can think of these tokens as being individual Unicode symbols that represent units of human writing (in current implementations the tokens are usually variable length “chunks” of Unicode symbols that are learned via a separate process). In computer science parlance, these sequences of Unicode symbols are called strings. A string of length N is a 1-array of type:

S : ⟨N⟩ -> U

where the value space U is the set of unicode symbols: U = {a,b,c,d,…}. As an example, take the string abcba, which corresponds to the array:

S = [a b c b a

The first step is to prepare this data to be operating on by the transformer. We do this by encoding this string vector into a matrix, where each symbol is embedded as an V-vector of real numbers via a dictionary D that defines which symbol uU should go to which vector V⟩ -> ℝ:

D : ⟨U⟩ -> ⟨V⟩ -> ℝ

If we use the toy alphabet U = {a,b,c}, we might define a corresponding dictionary that “one-hot embeds” these symbols:

      a  b  c
    1 0 0
D = 0 1 0
    0 0 1

We will encode the input string S into a 2-array by using the string vector to pick vectors from the dictionary, and nesting the resulting vector-of-vectors to produce a matrix:

V : ⟨N V⟩ -> ℝ
V = D[S]

Our example string vector abcba encodes to the 53 matrix:

    1 0 0 0 1
V = 0 1 0 1 0
    0 0 1 0 0

So far, this seems simple, and is in fact how strings are encoded for prior architectures. However, there is one additional twist: we will provide the transformer with additional information that encodes the position of each token in the string.

We will use a K-vector to encode each position, taken from a position encoding function P that takes an string position nN and returns the corresponding vector K⟩ -> ℝ.

P : ⟨N⟩ -> ⟨K⟩ -> ℝ

We obtain our position encoding matrix K by nesting P, turning it from a function-that-returns-vectors into a matrix:

K : ⟨N K⟩ -> ℝ
K = P

Commonly, “rotary” encodings are used which represent each position as a 2D vector that rotates as we advance our position along the string, so for our example the K-matrix will be of size 52

K = cos(0/6) cos(1/6) cos(2/6) cos(3/6) cos(4/6) cos(5/6) cos(6/6)
    sin(0/6) sin(1/6) sin(2/6) sin(3/6) sin(4/6) sin(5/6) sin(6/6)

We now have a K matrix and a V matrix that share the N axis:

K : ⟨N K⟩ -> ℝ
V : ⟨N V⟩ -> ℝ

We are now ready to begin the process of iterated self-attention that forms the basis of the transformer architecture. It is founded on an attention step, which produces a weighted sum of values where the weights are based on the “similarity” of a query to some keys. This is a kind of soft database lookup – where the word soft refers to the fact that the result of a query is not a single row in a database but a weighted sum of rows.

The values and keys must share an axis, since the similarity of query to the keys generates scores, and there must be one score for each value. Let’s consider the scoring step first:

          Q : ⟨K⟩ -> ℝ
          K : ⟨K⟩ -> ℝ
score(Q, K) : ⟨⟩ -> ℝ

score(Q, K) ≡ Q ⋅ K

In the simplest case, we can compare a single key with a single query, and we will obtain a scalar score:

score⎛1 0⎞ = 0

If we have a matrix of keys, we will obtain a vector of scores:

score⎛1 0 1 1⎞ = [0 1 1
     ⎝0,1 0 1⎠

If we have a matrix of queries, again we obtain a vector of scores:

score⎛0 1 1 1⎞ = [0 1 1
     ⎝1 0 1,0⎠

We will be scoring every token in the input sequence against a query, so we will be applying score with the following signature:

          Q :   ⟨K⟩ -> ℝ
          K : ⟨N K⟩ -> ℝ
score(Q, K) : ⟨N⟩ -> ℝ

We can write this type signature more compactly as:

score : (K⟩, NK⟩) -> N

Next, we apply the softmax operation to these scores, which applies an exponential function and normalizes them so that they add up to one. Let’s consider this softmax operation separately:

softmax : ⟨■…⟩ -> ⟨■…⟩
softmax(A) ≡ normalize(exp(A))

normalize(A) ≡ A / sum(A)

It is parameterized by the axis to sum and normalize across, and yields an array of the same shape as provided to it. Here is how we use it to compute the attention weights:

weights : (K⟩, NK⟩) -> N
weights(Q, K) ≡ softmax(score(Q, K))

At this point, we can now sum the corresponding values:

attend : (K⟩, NK⟩, NV⟩) -> V

attend(Q, K, V) ≡ weights(Q, K) ⋅ V

Holding K and V fixed, this attend operation defines a transformation from the query K to a result V.

Where does this query come from? In the transformer architecture, each token generates a query. Since a token has a key vector and a value vector, we have a transformation to_query that takes a key vector and a value vector and returns a query vector:

to_query : (K⟩, V⟩) -> K

We have many options to define to_query, and these definitions generally contain learnable parameters. Once we pick one, we can define an entire N-vector of queries Q with:

Q : ⟨N K⟩ -> ℝ

Q[n] ≡ to_query(K[n], V[n])

Notice that this query operation takes an orange axis of size N, but accesses the cells of K and V along the red axis – this is a recoloring operation.

Once we have obtained a query and used it to produce a result, we can transform this result to produce a new key and new value – in other words a new token – using an arbitrary learned transformations:

to_token : V⟩ -> (K⟩, V⟩)

Putting these steps together, we have the following pipeline:

  • for each token:

    • create a query from the corresponding key and value vectors

    • use this query to obtain attention weights

    • use these weights to sum all values and obtain a result

    • transform this result to produce a new key and value for the current token

Let’s define a single operation that does this:

self_attention : (NK⟩, NV⟩) -> (NK⟩, NV⟩)

self_attention(K, V) ≡ to_token(attend(to_query(K, V), K, V))

Notice that because to_query produces a matrix of queries N K, we will obtain a matrix of results N V, and hence a matrices of transformed keys N K and values N V.

This is the first step of the self-attention mechanism, which iterates this process a fixed numer of times. Each time, we obtain a new matrix of keys and a new matrix of values, which is fed into the next round. Schematically:

   (NK⟩, NV⟩)   ┐
┌ self_attention1 ←┘
└→ (NK⟩, NV⟩) ┐
┌ self_attention2 ←┘
└→ (NK⟩, NV⟩) ┐
┌ self_attention3 ←┘
└→ …

In the original Transformer architecture, each round uses different learned parameters in its definition of to_query and to_result.

Takeaways #

Advantages #

Rainbow array algebra keeps semantic meaning (e.g. colour channel, batch number, time) attached to array axes, and abandons axis order. This leads to fewer primitive operations and greater clarity.

Additionally, we can use various notational conveniences like half-framing to make higher-arity arrays easier to read and write.

The compositional properties of this alternative formulation are under-explored (e.g. their foundation in category theory).

In my view, rainbow arrays represent the future of array programming – or at least one incarnation of it. Indeed, various deep learning practitioners (e.g. Paszke, one of the authors of Torch) are pushing for labeled axes to become the standard approach.

Future directions #

Future posts will explore some interesting aspects of rainbow array algebra. Here’s a list of questions I hope to explore, in no particular order:

  • What is the diagrammatic formulation of rainbow array operations in terms of part / key dataflow? for example, broadcasting is deleting a flow, and taking the diagonal (which we didn’t describe in this post) is copying a flow. How do these flows compose?

  • What are the categorical foundations of rainbow array algebra? Can we make a connection to profunctors?

  • What is the right design for a Mathematica library that implements these ideas? Can they replace a lot of traditional, functional programming constructs?

  • What are the connections to hypergraph rewriting? In particular, graphs have adjacency matrices; hypergraphs have adjacency hypermatrices.

References #

There is some great prior work that propose rainbow-like approaches to array programming, going largely under the rubric of “named axes”. I intend to say more about how this post connects to prior work, but for now, here are some key references:

  • Alexander Rush discusses the myriad problems with existing multidimensional array programming APIs and proposes solutions in a series of blog posts “Tensors considered harmful”.

  • Chiang, Rush, and Barak have proposed a “Named Tensor Notation” to clarify the exposition of deep neural network architectures in published literature.

  • Maclaurin, Paszke et al. have made the functional perspective on array programming a core part of their Dex Haskell library.

  • Hoyer et al have pursued these ideas for a long time under the aegis of the XArray project.

  • Lastly, my colleagues and I discuss similar ideas in our paper “Heaps of Fish”.