So far, every token enters the model as an ID, becomes an embedding, receives position information, and passes through Transformer layers.
At the end of that process, every token has a contextual vector.
But a language model does not output text directly. It outputs a probability for every token in its vocabulary.
The final linear layer
After the Transformer, we take the final hidden vector and pass it through a linear layer:
logits = self.fc(h)
If the vocabulary has 24,000 tokens, then each position gets 24,000 numbers.
Those numbers are not probabilities yet. They are raw scores, called logits.
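As a rough illustration of what a linear layer like `self.fc` computes, here is a minimal sketch in plain Python. The sizes, the random weights, and the helper name `linear` are assumptions made for this example, not the model's actual values; a real model would use a framework layer with learned weights.

```python
import random

def linear(h, weight, bias):
    """Project a hidden vector h (length d_model) to one logit per vocabulary token.

    weight has shape [vocab_size][d_model]; bias has length vocab_size.
    """
    return [sum(w_i * h_i for w_i, h_i in zip(row, h)) + b
            for row, b in zip(weight, bias)]

random.seed(0)
d_model, vocab_size = 4, 6   # toy sizes; the vocabulary in the text has 24,000 tokens
weight = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(vocab_size)]
bias = [0.0] * vocab_size
h = [0.5, -1.0, 0.25, 2.0]   # made-up final hidden vector for one position

logits = linear(h, weight, bias)
print(len(logits))           # one raw score per vocabulary token
```

Each row of `weight` belongs to one vocabulary token, so the logit for a token is the dot product between that row and the hidden vector.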
Logits to probabilities
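To turn logits into probabilities, we use softmax: it exponentiates each logit and divides by the sum, so every value is positive and they all add up to 1.
A higher logit means the model thinks that token is more likely.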
Example:
"the cat sat on the"
might produce high scores for:
mat floor chair sofa
and low scores for unrelated tokens.
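The softmax step can be sketched in a few lines of plain Python. The five-token vocabulary and the logit values below are invented for illustration only.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that are positive and sum to 1."""
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a tiny vocabulary after "the cat sat on the".
vocab = ["mat", "floor", "chair", "sofa", "banana"]
logits = [4.0, 3.0, 2.5, 2.0, -1.0]

probs = softmax(logits)
for token, p in sorted(zip(vocab, probs), key=lambda pair: -pair[1]):
    print(f"{token:>7}: {p:.3f}")
```

Subtracting the maximum before exponentiating does not change the result, but it avoids overflow when logits are large.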
Training: comparing prediction with reality
During training, we already know the real next token, because it comes from the training text itself.
If the input is:
the cat sat on the
and the real next token is:
mat
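then the model is penalized according to how little probability it assigned to “mat”.
That penalty is the cross-entropy loss: the negative log of the probability given to the real token, so more probability on “mat” means a lower loss.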
The model is not told “understand cats” or “learn grammar”.
It only gets one job:
given the previous tokens, make the real next token more likely.
Repeated billions of times, this simple task creates surprisingly rich behavior.
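To make the loss concrete, here is a small sketch of cross-entropy for a single position. The two probability lists are made-up predictions, and `cross_entropy` is a helper written for this example rather than a library call.

```python
import math

def cross_entropy(probs, target_index):
    """Loss for one position: negative log of the probability given to the real token."""
    return -math.log(probs[target_index])

vocab = ["mat", "floor", "chair", "banana"]
target = vocab.index("mat")

confident = [0.70, 0.15, 0.10, 0.05]   # most of the probability is on "mat"
unsure = [0.10, 0.30, 0.30, 0.30]      # little probability is on "mat"

print(cross_entropy(confident, target))  # smaller loss
print(cross_entropy(unsure, target))     # larger loss
```

Training nudges the weights so that this number goes down, averaged over every position in every training sequence.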
Why we shift the targets
The model input might be:
[the, cat, sat]
But the target is shifted by one:
[cat, sat, down]
So each position learns to predict what comes next.
This is the core trick behind language model training.
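The shift itself is just two slices of the same sequence. A minimal sketch, using token strings instead of IDs for readability:

```python
tokens = ["the", "cat", "sat", "down"]

inputs = tokens[:-1]    # everything except the last token
targets = tokens[1:]    # everything except the first token

# At each position the model sees the tokens so far and must predict the next one.
for position, expected in enumerate(targets):
    context = inputs[: position + 1]
    print(f"context {context} -> predict {expected!r}")
```

One sequence of four tokens therefore yields three training examples, one per position.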
Generation: choosing one token
At inference time, there is no known target.
The model gives us a probability distribution, and we choose a token.
There are different ways to choose:
- greedy decoding: always pick the most likely token
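- temperature: divide the logits by a temperature before softmax; values below 1 make choices more predictable, values above 1 make them more random
- top-k: only sample from the k most likely tokens
- top-p: sample from the smallest set of most likely tokens whose combined probability reaches p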
Then we append the chosen token and repeat the process.
That loop is text generation.
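The selection strategies listed above can be sketched with the standard library alone. This is one plausible way to combine them, not the only one; the function names and the order of the filters (temperature, then top-k, then top-p) are choices made for this example, and `temperature` is assumed to be greater than zero.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    """Index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Sample a token index after optional temperature, top-k and top-p filtering."""
    probs = softmax([x / temperature for x in logits])
    # Candidate indices, most likely first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        order = kept
    weights = [probs[i] for i in order]   # random.choices normalizes the weights itself
    return rng.choices(order, weights=weights, k=1)[0]

logits = [2.0, 1.0, 0.5, -3.0]            # made-up scores for a four-token vocabulary
print(greedy(logits))
print(sample(logits, temperature=0.8, top_k=3, top_p=0.9))
```

Greedy decoding is deterministic, while `sample` can return a different token on each call unless the filters leave only one candidate.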
Important realization
The model never writes a sentence in one step.
It writes one token.
Then another.
Then another.
Every answer is just this loop:
encode prompt
run transformer
predict next token
append token
repeat
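Here is the same loop as a runnable sketch. Everything in it is a stand-in: `toy_model` is a hand-written rule table rather than a Transformer, the six-token vocabulary and the `<eos>` stop token are invented for the example, and the next token is picked greedily.

```python
vocab = ["the", "cat", "sat", "on", "mat", "<eos>"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

def encode(text):
    """Split on spaces and map each token to its ID."""
    return [token_to_id[tok] for tok in text.split()]

def toy_model(ids):
    """Stand-in for 'run transformer': return one logit per vocabulary token."""
    last = vocab[ids[-1]]
    if last == "the":
        # After "on the" prefer "mat"; otherwise prefer "cat".
        favorite = "mat" if len(ids) > 1 and vocab[ids[-2]] == "on" else "cat"
    else:
        favorite = {"cat": "sat", "sat": "on", "on": "the",
                    "mat": "<eos>", "<eos>": "<eos>"}[last]
    logits = [0.0] * len(vocab)
    logits[token_to_id[favorite]] = 5.0
    return logits

def generate(prompt, max_new_tokens=10):
    ids = encode(prompt)                              # encode prompt
    for _ in range(max_new_tokens):                   # repeat
        logits = toy_model(ids)                       # run the model
        next_id = max(range(len(logits)), key=lambda i: logits[i])  # predict next token
        if vocab[next_id] == "<eos>":
            break                                     # stop when the model signals the end
        ids.append(next_id)                           # append token
    return " ".join(vocab[i] for i in ids)

print(generate("the cat"))  # prints "the cat sat on the mat"
```

The model is called once per generated token, and each call sees everything produced so far.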
Conclusion
Part 4 was about giving tokens context.
Part 5 is about turning that context into a probability distribution over the vocabulary.
This is where the model finally becomes generative.
Not because it understands language the way humans do, but because it has learned a powerful statistical game:
predict the next token well enough, and language starts to emerge.