Seriously, what the @*&! is an LLM?
If you’re like me only a little while ago; you’ve not a notion of what it is these chat bots are at behind the curtain.
It’s for this reason that I’ve been compelled to author this piece. Written with two aims in mind. Firstly – and a little selfishly – it’ll hopefully help solidify and compound my recent study of Large Language Models or LLMs.
The second aim is to help those that may be in a similar position to myself, that little while ago: Shamelessly querying the world's most technologically advanced form of intelligence. On how best to spark up a conversation with the cute barista. And then…proceed to fumble the bag.
But alas, some tokens of pity: “You’ll get them next time, Champ!”
I enjoy nothing more than when abstract ideas and concepts can be brought into perspective. Be it with a simple analogy or thought experiment, or even being asked to picture something in my minds eye. So, I’ve attempted to bring this method of grasping to this piece of writing.
I’d consumed an embarrassing amount of content on the topic of Artificial Intelligence since engulfing the zeitgeist. Even with that, I admit that I understood practically nothing of it. “15 Trillion tokens!” and “405 Billion parameters!” I could parrot these facts and phrases. Drop them into conversations.
“What does it even mean?” They might ask. Fuck knows! But look here: it’s just irrevocably proved everything you just said to be wrong. Ha!
I understood that I could leverage these tools for my own gain. That’s been made abundantly clear. They’ve been immensely powerful in directing me with some of my personal projects. Across many DIY disciplines. If I enter question here, I get the most statistically appropriate response there.
It was at this point that my understanding of these bots dropped off a cliff. However, something descended on me in the last couple of days. A desire to know more of them. And I’d like to share these learnings with you.
Picture this: You’ve just adopted a new puppy, let’s name her Elm. (Cat lovers or dog haters: hallucinate for me.) Elm’s going to grow up to be your new assistant. As Elm’s owner, it’s your role to train her on behaviour and what tricks you’d like her to perform. You’d be what OpenAI, Anthropic and xAi is to ChatGPT, Claude and Grok respectively.
Flash forward in time for a moment. Elm is all grown up and well trained at this point. You notice that as you utter the sentence “Do you want to go…” Elm’s body language shifts. Her heads tilting side-to-side, ears perking up. Her eyes widen. She’s predicting what you’re about to say next. “…on a walk?” And delight ensues.
You notice that similar behaviour is repeated across many of the routines you’ve developed with Elm. Of course she doesn’t understand English as a human can. But there’s specific sounds that we make and string together into patterns. These are recognisable. Predictable even.
(My silly dog wasn’t capable of this. Only the mention of “Walk” would excite them. I have seen clips of dogs that are capable though.)
Elm’s case mapped directly onto LLMs would be much too simple. I think it stands though. When you speak the phrase “Do you want to go on a walk?”, you’re essentially feeding Elm’s brain with bits of information. Now, add in the fact that you’ve already put on a raincoat and walking shoes. You’re also standing next to where the leashes are kept.
Elm’s been constructing context, feeding all these bits of information – or tokens as I’ll refer to them from now on – into her brain. Aligning with her relevant memories of these events in the past. I’d predict delight ensued before you could finish the word walk.
In the case of these LLMs, tokens aren’t as simple as whole words. I touch on this later. With Elm, and elements of our speech: when being strung together into recognisable patterns; they begin to trigger elements of memory that can be used to predict the next relevant token. Or behaviour in the case of Elm. A quick chase of the tail, a bark, then waits for her lead to be attached etc.
Large language models are not as complicated as I was once led to believe. I have to thank 3Blue1Brown on YouTube and his “Neural Networks” playlist for helping me realise this. Also, for his recommendations on other creators, namely Andrej Karpathy. His content has helped me build a firm foundation.
I’ve broken it down into three parts here: Input Data, Training and Refinement. Because that’s practically all an LLM is.
Training Data
I think people are pretty accurate when they say that chat bots have been trained on the entire internet. The entire internet can be downloaded by anyone actually. Understandably, with some caveats. Some of it has been removed, such as: Explicit information, identifiable information, duplicates.
In plain text format, the entire web comes to about 55Tb of data. This is the bulk of what is fed to an LLM as it’s training data. There are of course other, more specialist forms of data, that our modern chat bots are fed. Otherwise, it wouldn’t be possible for an X feed to be blessed with the flood of vibe coders.
Recall my mention of tokens. I’m still diving into these a little deeper myself. But on a high level, as I understand: the 55Tb of the internet in plain text is broken down into elements that are organised by the frequency they appear. These unique elements are called Tokens, and they’re assigned a code that acts purely as an identifier.
In Elm’s case, she might hear “S” and “it” and know that the next thing to do is sit. But “S” and “tay” means another behaviour is desired. For Elm: ’S’, ‘it’ and ‘tay’ are unique tokens.
Training
The unique tokens that are output after being organised into their elements are fixed prior to training a model, they’re embedded within its infrastructure and remain unchanged. This is called the ‘Embedding Dimension’. This can range from as low as 768 for GPT-2 (small) and up to 12,288 for GPT-3. I’m told that more is not always better though.
As I understand it, everything we know of AI assistants and chat bots today, can be attributed to one paper from Google published in 2017 titled “Attention is All you Need”. They introduced the Transformer. And it really did just that; transform. The world hasn’t been the same since. Whether it’s been a net positive or negative, is yet to be seen. I’m optimistic though.
Elm, our pup, is capable of predicting the tokens ‘w’ and ‘alk’ will follow the string ‘Do you want to go on a…’. It’s taken a significant amount of time for her to learn this. Hours of listening attentively to the elements of your speech. Many mistakes and misunderstandings were made. But the correct behaviour was reinforced with the admission of treats and/or praise.
Thankfully, training a model doesn’t appear to be as hands on as training a puppy is.
There’s 25.9 billion rows of data within the data set I’ve mentioned, that’s 18.5 Trillion tokens. It’s called FineWeb for those interested. In training a model, they’re given permissions to sample a maximum token length from within these 25.9B rows. These samples are termed the models ‘Context Length’. Conversely to tokens; more context can be better. But it has its trade-offs. Computational cost being one. Probably why Google is capable of offering a model with a 2 million token possible context length!
Models sample their allowance of context from their training data, and this becomes the input for the Transformers architecture. (I apologise in advance to those that understand this as second nature, as I butcher your field.)
The input is consumed so that prior tokens can influence future tokens. As the context length is consumed; a likely contender for the subsequent token will begin to bubble to the top. An input of 100 tokens is weighed up against the embedded tokens probabilities, and the most likely is provided as the 101st. This process is called the models Attention.
I fear losing the casual reader if I dive deeper. Another, more technical article may be in order. Which, when published, I would invite those more knowledgable in the field to scrutinise, pick apart and meet me with better education on the topic. As this is something I’d like to understand deeply.
The Attention process isn’t done just once. It’s repeated many times, simultaneously. Sit Elm in front of you. Then duplicate her left and right. As you’d see in an elevator with mirrors on opposing walls. Now, repeat the question of a walk. Every Elm could be considered a unique head of Attention, each focused on a unique element of your input. There were 96 of these Attention heads in the case of GPT-4.
The outputs of an attention block is not fed to us, the user, just yet. It’s passed forward to another element, a Multilayer Perceptron or MLP. Another one for my more technical article, but at a high level, as I understand currently: it’s within this element that models can map contextual cues to their training data, recall facts or memories perhaps.
Refinement
Would you believe that what comes out of the training is not the final product we receive? The output of what we’ve discussed is termed a ‘Base Model’. There are models like this available online. However, one wouldn’t garner the same quality of response as you would with a refined or ‘Instructed’ model. The idea of an instructed model gets a little squirrelly in my mind. I’m yet to settle on an opinion about it. I don’t fully understand my feelings toward it either. However, I digress.
The base models must be further trained. As they stand, entering a query will produce a string of tokens that appear to us, the users, as words but unlikely to be related to the input. Unless one is quite intelligent in their prompting.
What happens is that training data is developed manually, by professionals across all fields, and to a design specification. This is additional to the 55Tb of the Internet dataset. It involves preparing scripts, written in the format User, Assistant, User, Assistant…, etc. These scripts are highly refined by the creators of the models. As these act as the templates for how the model should respond to us users. Hallucinations are trained out of the models at this stage as they are desired to be as accurate and factual as possible.