When individuals interact with modern large language models (LLMs) such as GPT, Claude, or Gemini, they observe a process fundamentally distinct from how humans formulate sentences. While humans naturally construct thoughts and convert them into words, LLMs operate through a repeated cycle of converting text into numbers, predicting the next token, and converting numbers back into text. Understanding this cycle reveals both the capabilities and the limitations of these powerful systems.
At the core of most contemporary LLMs resides an architecture known as a transformer. Introduced in 2017, transformers are sequence prediction models constructed from neural network layers. The architecture comprises three essential components: an embedding layer that converts tokens into vectors (enriched with positional information), a stack of transformer layers built around the attention mechanism, and an unembedding layer that converts the final vectors back into token probabilities.
A diagram illustrating this process is provided below:

Transformers process all words concurrently rather than sequentially, which allows them to learn from extensive text datasets and capture complex relationships between words. This article examines the transformer architecture’s operation step by step, from raw input text to generated output.
Before any computation can occur, the model must convert text into a workable format. This process begins with tokenization, where text is broken down into fundamental units called tokens. These units are not always complete words; they can be subwords, word fragments, or even individual characters.
Consider the example input: “I love transformers!” The tokenizer might break this into: [“I”, “ love”, “ transform”, “ers”, “!”]. Notably, “transformers” became two separate tokens. Each unique token within the vocabulary is assigned a unique integer ID.
These IDs are arbitrary identifiers lacking inherent relationships. Tokens 150 and 151 are not similar merely because their numerical values are close. The overall vocabulary typically encompasses 50,000 to 100,000 unique tokens that the model acquired during its training phase.
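As a concrete illustration, here is a minimal sketch using the open-source tiktoken tokenizer (one of many possible choices); the exact fragments and IDs depend on the vocabulary and may differ from the split shown above:

```python
# Tokenize a sentence with a real BPE tokenizer (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")    # the vocabulary used by several GPT models
ids = enc.encode("I love transformers!")      # text -> list of integer token IDs
pieces = [enc.decode([i]) for i in ids]       # map each ID back to its text fragment
print(list(zip(pieces, ids)))                 # e.g. [('I', ...), (' love', ...), ('!', ...)]
```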
Neural networks cannot directly process token IDs, as these are merely fixed identifiers. Each token ID is mapped to a vector, which is a list of continuous numbers typically containing hundreds or thousands of dimensions. These are referred to as embeddings.

Here is a simplified example illustrating five dimensions (real models may utilize 768 to 4096):
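The vectors below are invented purely for illustration; real embedding values are learned during training, and individual dimensions have no human-readable meaning:

```python
# Toy 5-dimensional embeddings; every number is made up for illustration.
embeddings = {
    "dog":  [ 0.81, -0.22,  0.47,  0.10,  0.90],
    "wolf": [ 0.77, -0.18,  0.52,  0.06,  0.86],
    "car":  [-0.60,  0.91, -0.43,  0.72, -0.31],
}
```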
It can be observed that “dog” and “wolf” exhibit similar numerical values, whereas “car” is distinctly different. This creates a semantic space where related concepts tend to cluster together.
The necessity for multiple dimensions arises because with only one number per word, contradictions might occur. For example:
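Imagine squeezing every word onto a single made-up axis, say from negative/uncommon to positive/common; the values below are invented to show the problem:

```python
# One number per word: unrelated concepts are forced to collide on the same axis.
one_dimensional = {"wealth": 0.8, "common": 0.7, "debt": -0.6, "rare": -0.7}
# "rare" and "debt" end up as near neighbours even though they are unrelated concepts.
```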
In this scenario, “rare” and “debt” both possess similar negative values, implying a relationship that is nonsensical. Hundreds of dimensions enable the model to represent complex relationships without such contradictions. Within this semantic space, mathematical operations can be performed. For instance, the embedding for “king” minus “man” plus “woman” approximately yields the embedding for “queen.” These relationships emerge during training from patterns observed in text data.
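This arithmetic is easy to reproduce with classic pre-trained word vectors rather than transformer embeddings; the underlying idea is the same. A minimal sketch, assuming the gensim library and its downloadable GloVe vectors:

```python
# Demonstrate "king - man + woman ~= queen" with pre-trained GloVe word vectors.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")   # downloads roughly 130 MB of word vectors
result = wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)                              # typically something like [('queen', 0.77)]
```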
Transformers inherently lack an understanding of word order. Without additional information, sentences such as “The dog chased the cat” and “The cat chased the dog” would appear identical because both contain the same tokens.
The solution involves positional embeddings. Every position is mapped to a position vector, analogous to how tokens are mapped to meaning vectors.
For the token “dog” appearing at position 2, the representation might appear as follows:
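A tiny sketch with invented five-dimensional numbers; in GPT-style models the two vectors are simply added element-wise:

```python
import numpy as np

token_embedding    = np.array([0.81, -0.22, 0.47, 0.10, 0.90])   # invented meaning of "dog"
position_embedding = np.array([0.03,  0.11, -0.05, 0.07, 0.00])  # invented meaning of "position 2"
combined = token_embedding + position_embedding                   # what the layers actually receive
print(combined)
```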
This combined embedding captures both the meaning of the word and where it appears in the sentence. This integrated representation is then fed into the transformer layers.
The transformer layers implement the attention mechanism, which constitutes the key innovation behind these models’ remarkable power. Each transformer layer operates using three components for every token: queries, keys, and values. This can be conceptualized as a fuzzy dictionary lookup: the model compares what it is looking for (the query) against the label on every entry (the keys) and returns a weighted combination of the corresponding contents (the values).

Consider a concrete example with the sentence: “The cat sat on the mat because it was comfortable.”
When the model processes the word “it,” it must determine what “it” refers to. The sequence of events is as follows:
First, the embedding for “it” generates a query vector, essentially asking, “To what noun am I referring?”
Next, this query is compared against the keys from all previous tokens, and each comparison produces a similarity score. In this sentence, the key for “cat” matches the query strongly, “mat” matches moderately, and function words such as “the” or “because” barely match at all.
The raw scores are subsequently converted into attention weights that sum to 1.0 by applying a softmax.
Finally, the model utilizes the value vectors from each token and combines them using these calculated weights. For example, the value from “cat” contributes 75 percent to the output, “mat” contributes 20 percent, and all other elements are nearly disregarded. This weighted combination becomes the new representation for “it,” encapsulating the contextual understanding that “it” most likely refers to “cat.”
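A minimal sketch of this weighting step, using invented similarity scores (real scores come from learned query and key projections):

```python
import numpy as np

tokens = ["The", "cat", "sat", "on", "the", "mat", "because"]
scores = np.array([1.3, 6.0, 1.3, 1.0, 1.3, 4.7, 1.1])   # invented query-key similarities for "it"

weights = np.exp(scores - scores.max())
weights /= weights.sum()                                   # softmax: weights now sum to 1.0
for token, weight in zip(tokens, weights):
    print(f"{token:>8}  {weight:.2f}")
# With these made-up scores, "cat" gets roughly 0.76 and "mat" roughly 0.21; the new
# representation for "it" is the weighted sum of every token's value vector using these weights.
```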
This attention process occurs in every transformer layer, with each layer learning to detect distinct patterns.
Early layers acquire basic patterns such as grammar and common word pairs. When processing “cat,” these layers might heavily attend to “The” because they learn the relationship between articles and their nouns.
Middle layers discern sentence structure and relationships between phrases. They might identify that “cat” is the subject of “sat” and that “on the mat” forms a prepositional phrase indicating location.
Deep layers extract more abstract meaning. They might comprehend that this sentence describes a physical situation and implies the cat is comfortable or resting.
Each layer progressively refines the representation. The output of one layer serves as the input for the next, with each successive layer adding more contextual understanding.
Importantly, only the final transformer layer is responsible for predicting an actual token. All intermediate layers perform the same attention operations but merely transform the representations to be more useful for subsequent layers. A middle layer does not output token predictions; instead, it outputs refined vector representations that flow to the next layer.
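A runnable toy sketch of this flow, built from standard PyTorch components with made-up sizes and untrained weights (so the scores themselves are meaningless); it only shows where in the stack token scores appear:

```python
import torch
import torch.nn as nn

vocab_size, dim, num_layers = 1000, 64, 4                   # toy sizes for illustration
embed = nn.Embedding(vocab_size, dim)                        # token IDs -> vectors
layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
    for _ in range(num_layers)
)
unembed = nn.Linear(dim, vocab_size)                         # vectors -> per-token scores

x = embed(torch.randint(0, vocab_size, (1, 6)))              # one sequence of 6 random tokens
for layer in layers:                                         # intermediate layers only refine
    x = layer(x)                                             # vectors; they never emit tokens
scores = unembed(x)                                          # token scores appear only here
print(scores.shape)                                          # torch.Size([1, 6, 1000])
```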
This stacking of numerous layers, each specializing in different aspects of language understanding, is what enables LLMs to capture complex patterns and generate coherent text.
After traversing all layers, the final vector must be converted back into text. The unembedding layer compares this vector against every token embedding and produces scores.
For example, to complete “I love to eat,” the unembedding process might yield a score for every token in the vocabulary: perhaps 65.2 for “pizza,” 64.8 for “tacos,” and far lower values for implausible continuations such as “42.”
These arbitrary scores are then converted into probabilities using softmax.
Tokens with similar scores (65.2 versus 64.8) receive similar probabilities (28.3 versus 24.1 percent), while low-scoring tokens end up with near-zero probabilities.
The model does not simply select the highest-probability token. Instead, it randomly samples from this distribution. This can be visualized as a roulette wheel where each token is allocated a slice proportional to its probability: “pizza” receives 28.3 percent, “tacos” receives 24.1 percent, and “42” receives a microscopic slice.
The rationale for this randomness is that always selecting the single most likely token, “pizza” in this case, would lead to repetitive, unnatural output. Random sampling weighted by probability allows “tacos,” “sushi,” or “barbeque” to be chosen instead, producing varied and natural responses. Occasionally, a lower-probability token is selected, contributing to more creative outputs.
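A sketch of these last two steps with a handful of invented scores; a real model scores every token in its vocabulary, so the resulting probabilities will not match the exact percentages above:

```python
import numpy as np

tokens = ["pizza", "tacos", "sushi", "barbeque", "42"]
scores = np.array([65.2, 64.8, 64.1, 63.9, 40.0])       # invented unembedding scores

probs = np.exp(scores - scores.max())
probs /= probs.sum()                                     # softmax: probabilities sum to 1.0

# "Roulette wheel" sampling: each token's slice is proportional to its probability.
next_token = np.random.choice(tokens, p=probs)
print(dict(zip(tokens, probs.round(3))), "->", next_token)
```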
The generation process repeats for every token. Consider an example where the initial prompt is “The capital of France.” Here is how successive cycles proceed through the transformer:
Cycle 1: the prompt is tokenized, embedded, and passed through all the layers; the model samples “ is” and appends it to the sequence.
Cycle 2: the extended sequence “The capital of France is” goes through the same steps, and the model samples “ Paris.”
Cycle 3: processing “The capital of France is Paris,” the model samples “.” to close the sentence.
Cycle 4: the model samples the [EoS] token.
Final output: “The capital of France is Paris.”
The [EoS] or end-of-sequence token signals completion. Each cycle processes all previous tokens. This explains why generation can slow down as responses lengthen.
This mechanism is termed autoregressive generation because each output is dependent on all previous outputs. If an unusual token is selected (perhaps “chalk” with 0.01 percent probability in “I love to eat chalk”), all subsequent tokens will be influenced by this choice.
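The whole cycle condenses into a short loop. The sketch below assumes the Hugging Face transformers library and the small GPT-2 checkpoint, which is far weaker than modern LLMs but follows the same autoregressive flow:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The capital of France", return_tensors="pt").input_ids
with torch.no_grad():                                        # inference only: no weight updates
    for _ in range(10):                                      # generate at most 10 new tokens
        logits = model(input_ids).logits                     # scores for every vocabulary token
        probs = torch.softmax(logits[0, -1], dim=-1)         # probabilities for the next token
        next_id = torch.multinomial(probs, num_samples=1)    # weighted random sampling
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)
        if next_id.item() == tokenizer.eos_token_id:         # stop at the end-of-sequence token
            break
print(tokenizer.decode(input_ids[0]))
```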
The transformer flow operates in two distinct contexts: training and inference.
During training, the model acquires language patterns from billions of text examples. It commences with random weights and gradually adjusts them. The training process unfolds as follows:
Training text: “The cat sat on the mat.”
The model receives: “The cat sat on the”
With random initial weights, the model’s predictions are essentially noise; “mat” might receive a tiny probability while unrelated tokens rank higher.
The training process calculates the error (the probability for “mat” should have been higher) and uses backpropagation to adjust every weight in the direction that makes “mat” more likely next time.
Each adjustment is minute (e.g., from 0.245 to 0.247), but it accumulates across billions of examples. After encountering “sat on the” followed by “mat” thousands of times in various contexts, the model learns this pattern. Training typically spans weeks on thousands of GPUs and entails costs amounting to millions of dollars. Once completed, the weights are frozen.
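For intuition, a single, heavily simplified training step in PyTorch might look like the sketch below, with a toy stand-in model and random stand-in data; real training repeats this over billions of tokens:

```python
import torch
import torch.nn as nn

vocab_size, dim = 1000, 64
model = nn.Sequential(nn.Embedding(vocab_size, dim), nn.Linear(dim, vocab_size))  # toy stand-in
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

context = torch.randint(0, vocab_size, (1, 5))       # stands in for "The cat sat on the"
target = torch.randint(0, vocab_size, (1,))          # stands in for the correct next token, "mat"

logits = model(context)[:, -1, :]                    # the model's scores for the next token
loss = nn.functional.cross_entropy(logits, target)   # large when the target got low probability
loss.backward()                                      # backpropagation: a gradient for every weight
optimizer.step()                                     # each weight is nudged by a tiny amount
optimizer.zero_grad()
```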
During inference, the transformer operates with its frozen weights:
User query: “Complete this: The cat sat on the”
The model processes the input with its learned weights and outputs: “mat” (85 percent), “floor” (8 percent), “chair” (3 percent). It samples “mat” and returns it. No weight changes occur.
The model leverages its learned knowledge but does not acquire any new information. Conversations do not update model weights. To impart new information to the model, it would necessitate retraining with new data, a process requiring substantial computational resources.
A diagram below illustrates the various steps in an LLM execution flow:

The transformer architecture offers an elegant solution for understanding and generating human language. By converting text to numerical representations, employing attention mechanisms to capture relationships between words, and stacking numerous layers to learn increasingly abstract patterns, transformers empower modern LLMs to produce coherent and useful text.
This process encompasses seven key steps that repeat for every generated token: tokenization, embedding creation, positional encoding, processing through transformer layers with attention mechanisms, unembedding to scores, sampling from probabilities, and decoding back to text. Each step builds upon the preceding one, transforming raw text into mathematical representations that the model can manipulate, and then back into human-readable output.
Understanding this process reveals both the capabilities and limitations of these systems. Essentially, LLMs are sophisticated pattern-matching machines that predict the most likely next token based on patterns learned from massive datasets, a foundation that underpins their use across a wide range of real-world applications.