LLMs, I Get Them Lil’ More Now

This article is the second in my series on Large Language Models. It deals with the elements that I had excluded from the first, titled “Seriously, what the @*&! is an LLM?” I’d started that one with the intention of it acting as a catch-all, covering my studies below the bonnet of AI assistants up to that point.

However, as I developed it, it took on a form of its own – as creative works tend to do.

I’m beginning this piece with the intention of digging a little deeper into the technical aspects of AI model architecture. I continue with the theme of using real-world imagery to illustrate my understanding. It’s nowhere near as detailed as what one would find in papers, videos or textbooks on the topic.

That said, I’d encourage those who are better educated in this field to examine my work and, if so inclined, to unpack it further, correct my mistakes and build upon what I’ve presented.

I’ve written this with the same two aims as outlined in my first article: to compound my learning, and to do so by teaching those who were like me, ignorant of the tech we touched.

Most importantly of all, though: to bolster my understanding in the process.

Now…with all of that out of the way. Here goes nothing!

The Beginning

I’m sure it’s already abundantly clear that at the heart of Artificial Intelligence – and likely all intelligence – is beautiful Mathematics. I mean, is anything beautiful possible without considering it? Specifically, the branch of maths called linear algebra, and its workhorse: matrices. I spent a semester solving problems on these at college, with the usual self-talk throughout: “When would I ever need to do this in real life?”

Well…Who would’ve thought? I’m not solving them (thankfully), but I am understanding their true power and use cases.

A matrix is simply a rectangular array of numbers arranged in rows and columns. When extended into higher dimensions, such as three, they’re termed ‘Tensors.’ But these are currently well outside of my wheelhouse.

What makes matrices so suited to this use case is that we can perform what are called ‘Operations’ on them, such as multiplication, addition and subtraction. We can even transpose them, flipping rows into columns.
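To make those operations a little less abstract, here’s a minimal sketch in plain Python. The matrices A and B and the helper names are my own, made up for illustration. Real ML libraries do the same thing far faster, but the arithmetic is identical.

```python
# A matrix written as a list of rows. Plain Python, nothing to install.
A = [[1, 2],
     [3, 4]]
B = [[5, 6],
     [7, 8]]

def add(X, Y):
    """Add two matrices of the same shape, element by element."""
    return [[x + y for x, y in zip(row_x, row_y)] for row_x, row_y in zip(X, Y)]

def transpose(X):
    """Flip a matrix so that its rows become its columns."""
    return [list(col) for col in zip(*X)]

def matmul(X, Y):
    """Multiply X by Y: each entry is a row of X 'dotted' with a column of Y."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

print(add(A, B))        # [[6, 8], [10, 12]]
print(transpose(A))     # [[1, 3], [2, 4]]
print(matmul(A, B))     # [[19, 22], [43, 50]]
```

Multiplication is the one to keep an eye on. Nearly every step that follows in this piece boils down to it.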

Embedding

The first matrix that’s developed in the production of a model is the embedding matrix. In the Machine Learning (ML) world, its width is referred to as the ‘Embedding Dimension’ or ‘Hidden Dimension’, and the space it describes as the embedding space. I’ll be calling that space the ‘Embedded Dimension’ throughout. When I first learned what this element was all about, I struggled to visualise it. I don’t enjoy experiencing this when attempting to grasp any idea. So, the analogy I’ve since settled on is actually from a scene in Iron Man 3.

This ‘dimension’ is to be understood in the context of geometry by the way, and not astrophysics. Although, there’s a small part of me that would still love to consider it in this sense.

I encourage you to read “Seriously, what the @*&! is an LLM?” if you haven’t already, primarily for the more playful but deeper look into ‘Tokenisation’ and ‘Tokens’ themselves: the language, per se, of Large Language Models.

In summary: When we hit Enter on an AI’s text window, our message is broken down into a string of elements that can be recognised by the model. These are called ‘Tokens’ and they’re predetermined and fixed, prior to the training of the model.
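Here’s a toy sketch of that idea, assuming a tiny hand-written vocabulary that I’ve invented for illustration. Real tokenisers learn their vocabularies from data with methods such as byte-pair encoding, so treat this purely as a picture of “fixed list of pieces in, list of numbers out.”

```python
# Hypothetical, hand-written vocabulary. Real models learn theirs from data
# and fix it before training begins.
VOCAB = {"walk": 0, "ing": 1, "the": 2, "dog": 3, " ": 4, "s": 5, "<unk>": 6}

def tokenise(text):
    """Greedily match the longest vocabulary entry at each position."""
    ids = []
    i = 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):  # try the longest piece first
            piece = text[i:i + length]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                i += length
                break
        else:
            ids.append(VOCAB["<unk>"])  # nothing matched: mark as unknown
            i += 1
    return ids

print(tokenise("walking the dogs"))  # [0, 1, 4, 2, 4, 3, 5]
```

Notice that “walking” becomes two tokens, “walk” and “ing”. The model never sees our letters, only these numbers.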

With tokens fixed, the embedding process can begin. This process will ultimately determine the model’s memory. Swaths of training data are fed to it as input, and the output will be the ‘Embedded Dimension’: an abstract ‘space’ that can be understood through points, vectors and directions. Unfortunately, we mere humans are grounded in 3D space, and don’t fare well with abstract things.

The still above is the scene from Iron Man 3 that I picture when visualising the ‘Embedded Dimension.’ In this scene, the hologram is portraying a live feed of Killian’s brain. Now, recall the diagram of a brain from your school’s biology textbook, and how the various regions were colour-coded. Each region with its own role and purpose.

If we marry these two images – the scene from Iron Man and the diagrams from our textbooks – I feel it’s a great starting platform to jump into and understand the Embedded Dimension. And in turn, the brain of an LLM.

Model Neuron Dimension

The two images each serve a unique purpose in this visualisation. In the film’s scene, the hologram is made up of many dots and the wisps that tether these dots together, with flashes of golden streams of energy to illustrate activity. I like to think of each dot in the hologram as representing a dimension within the embedding space.

In the case of some of the models that have seen release, the number of these dots or points in the Embedded Dimension can range from 768 to over 12 thousand. We’ll come to the wisps that connect them shortly.

The textbook’s diagram serves its own unique purpose in this exercise. Specifically, the colour-coded regions. When a model’s embedding dimension is developed, similar words, ideas, concepts, groups, themes etc. are nestled close together. I see these represented by the coloured regions in the diagram of the brain.

An added perk to this imagery lies in the inherent existence of the left and right hemispheres. In one example that was explained in a video on embedding dimensions, the terms ‘Man’ and ‘Woman’ existed in the same colour-coded region. In the case of our visualisation, this means ‘Man’ is represented in one hemisphere and ‘Woman’ is represented in the other.
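One way to put a number on “nestled close together” is cosine similarity, which measures how closely two vectors point in the same direction. The sketch below assumes three made-up embeddings of only three numbers each. Real embeddings are learned and run to hundreds or thousands of numbers, so the values here are purely illustrative.

```python
import math

# Hypothetical 3-number embeddings, invented for this example.
# The third number plays the part of the 'hemisphere' in my brain analogy.
EMBEDDINGS = {
    "man":   [0.9, 0.8, 0.3],
    "woman": [0.9, 0.8, -0.3],
    "walk":  [-0.7, 0.1, 0.0],
}

def cosine_similarity(a, b):
    """1.0 means same direction, 0.0 means unrelated, -1.0 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

near = cosine_similarity(EMBEDDINGS["man"], EMBEDDINGS["woman"])
far = cosine_similarity(EMBEDDINGS["man"], EMBEDDINGS["walk"])
print(near > far)  # True: 'man' sits closer to 'woman' than to 'walk'
```

In this toy set-up ‘man’ and ‘woman’ share a region and differ only along one direction, which is how I picture the two hemispheres.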

Now, the wisps, and what I like to consider them to be representing. If you’ve seen Iron Man 3, or watched the scene, you’ll know that Killian asks Pepper to pinch his arm. The response can be seen travelling through the sensory systems in the hologram.

In the case of pain registering, that pathway is pretty well established in the brain’s systems. In the case of an LLM being made available to some 8 billion people, an infinite number of unique pathways are possible.

So, when we hit Enter, our text inputs aren’t best represented by Pepper’s pinch registering in the brain. They’re better understood – so I believe – as acting the way Killian navigates the hologram, magnifying and pulling at the air to pinpoint the desired location.

You’re capable of directing the LLM through its Embedded Dimension, in a practically infinite number of directions within this space, by the context you generate with your inputs. Be it a simple question, an entire short story. Quite literally anything.

Now, I think, we’re in an optimal position to move on to the Transformer…

The Transformation

The Transformer – and the wizards behind its making – are to thank for the models that permeate every facet of our lives today. If not every facet yet, it’s likely just a matter of time until this statement holds true.

As I understand it, the issue that model development was facing c. 2017 stemmed from computations being carried out sequentially. This meant that each step in the learning process relied on the previous step completing first.

CPUs, broadly, work in this pattern. GPUs, on the other hand, allow for “Parallelisation”, which was a new term for me, though it made a lot of sense once I learned what it meant. When solving a large computation, a CPU will largely take it on in sequence.

On a GPU, the computation can be broken down into smaller chunks, with these chunks then spread across many cores and solved simultaneously. There is a clarification I must make here that I’ve just learned, as I’m polishing off and adding my finishing touches. Funnily enough – from an LLM!

The clarification is this: in training, and also when a polished model is ingesting our tokenised inputs – such as during the chatting phase – the work is spread over many cores. Parallelisation is possible and employed. However, the response itself is generated sequentially, one token at a time, because each new token depends on the ones before it. That’s the reason for the incremental flow of words we’ve grown accustomed to now, when speaking with our favourite computer companions.
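The sequential part can be sketched as a simple loop. Below, NEXT_TOKEN and predict_next are hypothetical stand-ins of my own for a trained model. A real one weighs the entire context, not just the last token, but the shape of the loop is the point.

```python
# Hypothetical stand-in for a trained model: maps the latest token to the next.
NEXT_TOKEN = {"do": "you", "you": "want", "want": "a", "a": "walk", "walk": "<end>"}

def predict_next(tokens):
    """Pretend model. A real one would consider every token so far."""
    return NEXT_TOKEN.get(tokens[-1], "<end>")

def generate(prompt_tokens, max_new_tokens=10):
    """Produce tokens one at a time, feeding each back in as new context."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        nxt = predict_next(tokens)  # needs the previous step's output first
        if nxt == "<end>":
            break
        tokens.append(nxt)
    return tokens

print(generate(["do", "you"]))  # ['do', 'you', 'want', 'a', 'walk']
```

Each pass through the loop has to wait for the one before it, which is why the words trickle out rather than arriving all at once.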

Sitting here now, thinking about CPUs and GPUs a little deeper, only more questions about their method of work are beginning to form. A rabbit hole I must earmark to jump down another time. Or else I’ll never finish this.

The Transformer is composed of two elements, which I dig into:

  1. The Attention Block and

  2. The Multilayer Perceptron (MLP)

For the actual paper that introduced this architecture to the world, titled “Attention Is All You Need”, you’ll have to source it yourselves, as in my infinite wisdom I’d absolutely link to a pirated version.

Until the publishing of this paper, models were being trained in the sequential manner outlined above. The architecture underlying the Transformer opened the door to deploying the strength of GPUs. Hence the dramatic increase in mentions of these across all media outlets, with news today of xAI bringing a Gigawatt cluster online – the first of its kind.

I needed to expend quite a lot of my own attention resources toward learning about these two elements. It’s truly incredible what these are capable of!

Where was I?

Ah that’s right, Attention Block.

Attention Block

Once our input text has been ingested by the model – acting as Killian navigating the hologram – and mapped within the embedded dimension, it can be passed to the first element of the Transformer: the Attention Block.

This is the first step in predicting the next token. Ah, I fear I’ve not said this yet: the aim of the Transformer is to ingest our text, image, audio or video inputs, understand their context and deliver what we desire, by predicting it to the best of its abilities.

Recall the structure of a Matrix: a rectangular array made up of rows and columns. As our inputs were ingested by the model – one token after another – it was updating the direction in which it travelled and which coloured regions within its brain it hit upon. Developing an understanding.

Each step along the string of tokens, and each change of direction within the embedded dimension, is represented by a row in a Matrix, and it leaves behind its values.

Our token inputs undergo a process that I don’t fully grasp yet. It involves another operation that is performed on the matrix generated by our input. However, I feel that I have a nice hold on the result of this process: Queries and Keys.

Query

The word query in this context is not referring to us, the user, querying the model. It refers to a process that occurs within the attention block. By ‘querying,’ the system forms a vector for each input token. It’s as if you had been asked to perform a task, and you request the information you need to fulfil it as desired.

This is what’s occurring with the generation of a Query. The Query matrix – one vector per token – is asking questions of all prior input tokens. At the end of the day, it’s multiplying matrices. The matrix that’s used to multiply our embedded inputs isn’t written by hand; it’s tuned during the training of a model. This kind of tuning occurs at almost every level of token processing.
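As a rough sketch of that multiplication: X below holds two embedded tokens and W_Q is a hypothetical Query weight matrix. Both are tiny and filled with numbers I’ve made up. In a real model the weights are learned and vastly larger.

```python
def matmul(X, Y):
    """Multiply matrix X by matrix Y (both written as lists of rows)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

# Two tokens, each embedded as 3 numbers (made-up values).
X = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 1.0]]

# Hypothetical Query weights: 3 numbers in, 2 numbers out.
W_Q = [[1.0, 0.0],
       [0.0, 1.0],
       [0.5, 0.5]]

Q = matmul(X, W_Q)  # one Query vector (row) per token
print(Q)  # [[2.0, 1.0], [0.5, 1.5]]
```

Swap W_Q for a different set of weights and the same tokens ask different questions.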

Key

The Keys can be considered the best suitors for our Queries. Our tokens undergo an operation similar to the one that afforded us Queries, but this time to output Keys.

The Queries and their associated Keys undergo an operation that’s called calculating their ‘Dot Products.’ Grokipedia that yourselves, for more information. The results of calculating dot products provide the model with an ‘Attention Score’. And yes – you guessed it – it’s another matrix.
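Here’s a small sketch of that scoring step, using made-up Query and Key vectors. I’ve left out two details that real Transformers add: the original paper also divides each score by the square root of the Key length, and prior-token masking isn’t shown.

```python
def dot(a, b):
    """Dot product: multiply matching entries, then add them up."""
    return sum(x * y for x, y in zip(a, b))

# Made-up Query and Key vectors, one row per token.
Q = [[2.0, 1.0],
     [0.5, 1.5]]
K = [[1.0, 1.0],
     [0.0, 2.0]]

# scores[i][j] answers: how well does token j's Key match token i's Query?
scores = [[dot(q, k) for k in K] for q in Q]
print(scores)  # [[3.0, 2.0], [2.0, 3.0]]
```

A bigger number means a better match, so here each token’s Query gets on best with its own Key.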

Attention Scores

The attention score plays a very important role in selecting which Key will influence which Query. Keys with the largest attention scores are what machine teachers refer to as ‘attending to’ the Query.

There’s a process – a couple in fact – that are repeated at many steps along the way to generating the next token, which is basically the response to our inputs. These are called Softmax and Normalisation. I’m comfortable with the idea of both, but shaky on the mathematical mechanisms. So I’ll avoid getting lost in the weeds.

Normalise Softmaxxing

Softmax is not to be misinterpreted as aiming to appear as the softest person around. By performing the Softmax function on the attention scores that were generated by the dot products, we’re turning these values into a probability distribution. We’re providing the model with the most likely candidates for our Queries.

The Normalisation I’m referring to here is baked into Softmax: each score is exponentiated, then divided by the total of all of them. The result being that the probabilities add up to one.

A quick recap: our input text was broken down into the tokens that the model can understand. As ingested, our tokens updated the direction in which the model travelled within its embedding space, and touched the areas relevant to the context of our input. The tokens put out their Queries to the previous tokens that may be relevant to them. These were attended to by the Keys with the highest attention scores.
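For anyone who does fancy a peek at the weeds, here’s a minimal sketch of Softmax in plain Python. The scores are made up. Subtracting the largest score first is a common safeguard against overflow and doesn’t change the result.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that add up to one."""
    biggest = max(scores)
    exps = [math.exp(s - biggest) for s in scores]  # every value is now positive
    total = sum(exps)
    return [e / total for e in exps]                # divide by the total

weights = softmax([3.0, 2.0, 0.5])
print(weights)
print(sum(weights))
```

The largest score ends up with the largest share, but the smaller ones never quite reach zero.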

There’s one final step that’s performed that will decide what changes are to be made before they’re passed on to the second element of the Transformer: the Multilayer Perceptron.

Values

Following softmax and normalisation of the attention scores, the model has a fair idea, probabilistically, of which Key goes with which Query. The purpose of ‘Values’ is that they’ll be used to calculate a weighted sum, which informs the model of the degree to which the prior input tokens are going to affect the outputs that are passed forward by an Attention Block.
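Putting the pieces together, a single head of attention might be sketched like this. The Q, K and V numbers are invented, and the scaling and masking used in real models are left out, so read it as an illustration of the weighted sum rather than a faithful implementation.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(row):
    exps = [math.exp(s - max(row)) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(Q, K, V):
    """For each Query, blend the Value vectors using softmaxed attention scores."""
    blended = []
    for q in Q:
        weights = softmax([dot(q, k) for k in K])
        blended.append([sum(w * v[d] for w, v in zip(weights, V))
                        for d in range(len(V[0]))])
    return blended

Q = [[0.0, 0.0]]              # a Query with no preference at all
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [3.0, 2.0]]
print(attend(Q, K, V))  # [[2.0, 1.0]]
```

Because this Query scores both Keys equally, it receives an even half-and-half blend of the two Values. Give it a preference and the blend tilts accordingly.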

Multi Level Attention Blocks

The process that’s captured above encompasses what occurs within one Attention Block. It’s a ‘Single Head of Attention.’ If all of the models available today had only one of these, they’d be no better than a human. Because, that’s all we’ve got. We’d be calling them Small Language Models. Which – I suppose – is what one could reduce humans to.

The strength of GPUs, as discussed earlier, is found here. Designers are afforded the capability to design many of these attention heads into their models, and these can all be operated on and computed simultaneously. This is referred to in the ML world as Multi-Head Attention. I mention in part one of this article series that GPT-3 had 96 of these in each of its layers.

The beauty is that for each of the 96, the information that is being asked of prior tokens can be tuned, so as to converge on the most suitable token that is predicted. Pretty. Fucking. Dope. If you ask me, a dummy.
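A rough sketch of several heads working on the same input is below. The two heads and all of their weights are hypothetical, and I run them one after another in a loop, whereas a GPU would compute them side by side. Real models also pass the joined result through one more learned matrix, which I’ve left out.

```python
import math

def matmul(X, Y):
    """Multiply matrix X by matrix Y (both written as lists of rows)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def softmax(row):
    """Turn a row of scores into probabilities that add up to one."""
    exps = [math.exp(s - max(row)) for s in row]
    return [e / sum(exps) for e in exps]

def head(X, W_Q, W_K, W_V):
    """One head of attention with its own weight matrices."""
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    K_T = [list(col) for col in zip(*K)]            # transpose the Keys
    weights = [softmax(row) for row in matmul(Q, K_T)]
    return matmul(weights, V)

def multi_head(X, heads):
    """Run every head on the same input, then join their outputs per token."""
    outputs = [head(X, *weights) for weights in heads]
    return [[v for out in outputs for v in out[i]] for i in range(len(X))]

X = [[1.0, 0.0], [0.0, 1.0]]                        # two embedded tokens
head_a = ([[1.0], [0.0]], [[1.0], [0.0]], [[1.0], [2.0]])
head_b = ([[0.0], [1.0]], [[0.0], [1.0]], [[3.0], [4.0]])
result = multi_head(X, [head_a, head_b])
print(len(result), len(result[0]))  # 2 2
```

Each head has its own weights, so each asks its own kind of question of the same tokens. Two tokens in, two tokens out, with one number contributed by each head.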

Multilayer Perceptron

I’m surprised I’ve made it this far. And if you’re still here reading this, I’m surprised you have too! Firstly, thank you. Secondly, we’re almost there. If you’re thinking “This guy needs to steer clear of anything tech related!” No need to point that out, I’m beating myself with that stick constantly as I write this!

This is the second and final element of the Transformer architecture: the Multilayer Perceptron or MLP. It is also the element that I am less sure of – and understand significantly less than what’s preceded. So, practically zero! But bear with me. I think they’re a very interesting process regardless, so I’ll try to hold this true in my writing about them…

In the case of the holographic brain in Iron Man 3, and indeed our own experience of mind, we tend not to think or remember linearly. We understand that conversations can go every which way, at any point in time. But there are triggers that bring potential tangents to the foreground. This is the goal of the processes contained within an MLP.

The output of the attention heads is passed on – or ‘fed-forward’ as machine teachers like to say – to the MLP, where it becomes the input for the process therein. It’s a non-linear process that – with its inputs coming from the attention heads – determines the likely contextual meanings and understandings of the input tokens.

For unlikely contextual candidates, the process allows for filtering out, or more accurately diminishing, the probability of an unlikely interpretation being kept in and fed-forward to the next stage.
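With my shaky grasp in mind, here’s the simplest sketch of an MLP I could put together: one matrix multiplication, a non-linear step, then another multiplication. I’ve used ReLU for the non-linear step, as the original Transformer paper does, and skipped the bias terms. The weights W_up and W_down are hypothetical numbers of my own.

```python
def matmul(X, Y):
    """Multiply matrix X by matrix Y (both written as lists of rows)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def relu(X):
    """The non-linear step: anything negative is switched off to zero."""
    return [[max(0.0, v) for v in row] for row in X]

def mlp(X, W_up, W_down):
    """Expand, apply the non-linearity, then project back down."""
    hidden = relu(matmul(X, W_up))
    return matmul(hidden, W_down)

X = [[1.0, -2.0]]                   # one token's vector, made-up values
W_up = [[1.0, -1.0],
        [0.0, 1.0]]
W_down = [[2.0],
          [5.0]]
print(mlp(X, W_up, W_down))  # [[2.0]]
```

The second hidden value came out negative and was zeroed by ReLU, so it contributed nothing to the output. That’s the ‘diminishing’ at work, at least as I picture it.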

At this point, you’ll have likely read my first piece and you’ll know who Elm is – our assistant pup. In my first article, I had focussed on “Do you want to go for a…”, with “walk” being the next predicted token, given the added context of wearing walking shoes and standing by the door. However, suppose it was night time, and you’re stood in your pyjamas and slippers when you begin to ask the same question of Elm. Now, the likely next token is “Bed?”

Having passed through Elm’s MLP, “walk” would have tended to zero – it was an unlikely next token. “Bed” would have been afforded a higher probability, and fed-forward. In the case of an LLM, it’d be fed into another Attention Block and MLP layer pair. Then again: Attention Block - MLP - Attention Block…

Over and over. In GPT-3’s case, it’s fed-forward through 96 layers, each with the 96 attention heads I outlined above. Fucking mental honestly. Try Monster Language Models next time. Although, I’ve a soft spot for some alliteration. If not already apparent.

The New Model

It’s been a ride to this point. If you’re still here, I hope that you’ve a better understanding of what it is these Large Language Models are at. I know I’ve enjoyed developing the two parts of this series, and the journey in being capable of putting them together.

My approach to chatting with models has certainly changed after spending the time grasping at their mechanisms of work. I wasn’t one to fall into the trap of treating these chatbots as acquaintances, though I have certainly leaned into it.

Having this information actually widens the gap of human vs. machine for me. I’m less afraid of them taking over the world. I’m hopeful we’re going to get this next revolution right.
