Building my own language model: Predicting the next token (Part 5)

So far, every token enters the model as an ID, becomes an embedding, receives position information, and passes through Transformer layers.

At the end of that process, every token has a contextual vector.

But a language model does not output text directly. It outputs probabilities.

The final linear layer

After the Transformer, we take the final hidden vector at each position and pass it through a linear layer that produces one score per vocabulary token:

logits = self.fc(h)

If the vocabulary has 24,000 tokens, then each position gets 24,000 numbers.

Those numbers are not probabilities yet. They are raw scores, called logits.
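To make the shapes concrete, here is a minimal sketch of that projection in plain Python. The sizes, the random weights, and the helper name linear are assumptions for illustration only; in the real model this work is done by self.fc.

```python
import random

hidden_size = 4   # assumed toy size; real models use hundreds or thousands
vocab_size = 6    # assumed toy size; the example above uses 24,000

random.seed(0)
# One row of weights per vocabulary token, plus one bias per token.
weight = [[random.uniform(-1, 1) for _ in range(hidden_size)]
          for _ in range(vocab_size)]
bias = [0.0] * vocab_size

def linear(h):
    """Project one hidden vector to one raw score (logit) per vocabulary token."""
    return [sum(w * x for w, x in zip(row, h)) + b
            for row, b in zip(weight, bias)]

# Hypothetical hidden vectors for a 3-token sequence, one per position.
hidden_states = [[random.uniform(-1, 1) for _ in range(hidden_size)]
                 for _ in range(3)]
logits = [linear(h) for h in hidden_states]

print(len(logits), len(logits[0]))  # prints "3 6": 3 positions, 6 logits each
```

Every position gets its own row of scores, and the length of each row equals the vocabulary size.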

Logits to probabilities

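To turn logits into probabilities, we use softmax. It exponentiates every logit and divides by the sum, so the results are all positive and add up to 1.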

A higher logit means the model thinks that token is more likely.

Example:

"the cat sat on the"

might produce high scores for:

mat
floor
chair
sofa

and low scores for unrelated tokens.
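A small sketch of softmax on that example. The logit values are made up for illustration, and "banana" stands in for an unrelated token.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that are positive and sum to 1."""
    m = max(logits)  # subtracting the max avoids overflow; the result is unchanged
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the prompt "the cat sat on the".
tokens = ["mat", "floor", "chair", "sofa", "banana"]
logits = [4.0, 3.2, 2.5, 2.1, -1.0]

probs = softmax(logits)
for token, p in zip(tokens, probs):
    print(f"{token:>6}: {p:.3f}")
```

The ordering of the tokens does not change. Softmax only rescales the scores so they can be read as a distribution.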

Training: comparing prediction with reality

During training, we already know the real next token, because it comes straight from the training text.

If the input is:

the cat sat on the

and the real next token is:

mat

then the model is rewarded for assigning high probability to “mat”.

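This is done with cross-entropy loss: the loss is the negative log of the probability the model assigned to the real token, so a higher probability means a lower loss.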
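For a single position, cross-entropy can be sketched in a few lines. The two probability lists below are hypothetical model outputs, not real measurements.

```python
import math

def cross_entropy(probs, target_index):
    """Loss for one position: negative log of the probability of the real token."""
    return -math.log(probs[target_index])

tokens = ["mat", "floor", "chair", "sofa"]
target = tokens.index("mat")

confident = [0.70, 0.15, 0.10, 0.05]  # hypothetical: most probability on "mat"
unsure = [0.10, 0.40, 0.30, 0.20]     # hypothetical: little probability on "mat"

loss_confident = cross_entropy(confident, target)
loss_unsure = cross_entropy(unsure, target)
print(round(loss_confident, 3))  # 0.357
print(round(loss_unsure, 3))     # 2.303
```

The loss shrinks toward zero as the probability of the real token approaches 1. In training, this value is averaged over all positions in the batch.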

The model is not told “understand cats” or “learn grammar”.

It only gets one job:

given the previous tokens, make the real next token more likely.

Repeated billions of times, this simple task creates surprisingly rich behavior.

Why we shift the targets

The model input might be:

[the, cat, sat]

But the target is shifted by one:

[cat, sat, down]

So each position learns to predict what comes next.

This is the core trick behind language model training.
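The shift is just two slices of the same sequence. A minimal sketch, using words instead of token IDs to keep it readable:

```python
# Hypothetical token sequence; in practice these would be integer token IDs.
sequence = ["the", "cat", "sat", "down"]

inputs = sequence[:-1]   # everything except the last token
targets = sequence[1:]   # everything except the first token

# At each position the model has seen the inputs so far
# and is asked to predict the matching target.
for position, expected in enumerate(targets):
    context = inputs[: position + 1]
    print(f"context {context} -> predict {expected!r}")
```

One sequence of four tokens gives three training examples at once, one per position.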

Generation: choosing one token

At inference time, there is no known target.

The model gives us a probability distribution, and we choose a token.

There are different ways to choose:

greedy decoding: always pick the most likely token
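temperature: divide the logits by a number before softmax; below 1 makes choices more predictable, above 1 makes them more random
top-k: only sample from the k most likely tokens
top-p: only sample from the smallest set of tokens whose probabilities add up to at least p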
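These strategies can be combined in one small function. This is a sketch in plain Python under assumed names (greedy, sample) and made-up logits; real libraries differ in details such as the order in which the filters are applied.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    """Always pick the index with the highest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample(logits, temperature=1.0, top_k=None, top_p=None):
    """Sample one token index after optional temperature, top-k and top-p filtering."""
    # Temperature must be greater than 0.
    probs = softmax([x / temperature for x in logits])
    # Rank token indices from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in ranked:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        ranked = kept
    # random.choices treats the weights as relative, so no renormalizing is needed.
    weights = [probs[i] for i in ranked]
    return random.choices(ranked, weights=weights, k=1)[0]

logits = [4.0, 3.2, 2.5, 2.1, -1.0]  # hypothetical scores
print(greedy(logits))  # 0
print(sample(logits, temperature=0.7, top_k=3))
print(sample(logits, top_p=0.9))
```

Greedy decoding is deterministic. The sampling variants can return a different token on every call, which is why the last two lines have no fixed output.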

Then we append the chosen token and repeat the process.

That loop is text generation.

Important realization

The model never writes a sentence in one step.

It writes one token.

Then another.

Every answer is just this loop:

encode prompt
run transformer
predict next token
append token
repeat
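Here is the same loop as a runnable toy. The lookup table stands in for the Transformer and is purely hypothetical: it only looks at the last word, while a real model conditions on the whole sequence. Decoding is greedy to keep the output predictable.

```python
# Hypothetical next-word scores, keyed by the previous word only.
NEXT_WORD_SCORES = {
    "the": {"cat": 2.0, "mat": 1.5},
    "cat": {"sat": 3.0},
    "sat": {"on": 3.0},
    "on": {"the": 3.0},
    "mat": {"<eos>": 3.0},
}

def toy_model(tokens):
    """Return next-token scores for the sequence (stand-in for the real model)."""
    return NEXT_WORD_SCORES.get(tokens[-1], {"<eos>": 0.0})

def generate(prompt, max_new_tokens=10):
    tokens = prompt.split()                        # encode prompt
    for _ in range(max_new_tokens):                # repeat
        scores = toy_model(tokens)                 # run model
        next_token = max(scores, key=scores.get)   # predict next token (greedy)
        if next_token == "<eos>":                  # stop at end-of-sequence
            break
        tokens.append(next_token)                  # append token
    return " ".join(tokens)

print(generate("the cat sat on", max_new_tokens=1))  # the cat sat on the
```

Two things end the loop in this sketch: a special end-of-sequence token, or a limit on the number of new tokens.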

Conclusion

Part 4 was about giving tokens context.

Part 5 is about turning that context into a probability distribution over the vocabulary.

This is where the model finally becomes generative.

Not because it understands language the way humans do, but because it has learned a powerful statistical game:

predict the next token well enough, and language starts to emerge.