Many of us have been using ChatGPT and co. for a few years now. These LLMs are fascinating, and we can use them for many interesting tasks, with agents being the next big thing.
But one thing I've always wanted to try is building my own language model, trained entirely on my local machine.
Why? To learn more about LLMs.
I don't expect this language model to be practically useful, but I will learn a lot along the way.
What do we need?
I'm not sure whether this list is complete or what will have to change along the way, but I think it's a good start:
- a dataset with a lot of text, e.g. Wikipedia
- (maybe some data cleansing)
- a tokenizer that translates text into numbers a neural net can be trained on (see the sketch after this list)
- a model architecture
- a training program that teaches the model to predict the next token
- (bonus: fine-tuning on an instruction dataset to get an actual conversational experience)
- a program to run the model
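To make the tokenizer and the next-token objective a bit more concrete, here is a rough sketch of what I have in mind: a simple character-level tokenizer plus the shifted input/target pairs used for training. This is just an illustration in Python (names like CharTokenizer and make_training_pairs are placeholders I made up), not the actual project code:

```python
# Sketch only: character-level tokenizer + next-token training pairs.
# All names here (CharTokenizer, make_training_pairs) are placeholders.

class CharTokenizer:
    def __init__(self, text: str):
        # Vocabulary = every distinct character seen in the training text.
        chars = sorted(set(text))
        self.stoi = {ch: i for i, ch in enumerate(chars)}   # char -> id
        self.itos = {i: ch for ch, i in self.stoi.items()}  # id -> char

    def encode(self, s: str) -> list[int]:
        return [self.stoi[ch] for ch in s]

    def decode(self, ids: list[int]) -> str:
        return "".join(self.itos[i] for i in ids)


def make_training_pairs(ids: list[int], block_size: int):
    """Yield (input, target) pairs: the target is the input shifted by one,
    which is exactly the 'predict the next token' objective."""
    for start in range(0, len(ids) - block_size):
        x = ids[start : start + block_size]
        y = ids[start + 1 : start + block_size + 1]
        yield x, y


if __name__ == "__main__":
    corpus = "hello world, hello model"
    tok = CharTokenizer(corpus)
    ids = tok.encode(corpus)
    print(ids[:10], tok.decode(ids[:10]))
    x, y = next(make_training_pairs(ids, block_size=8))
    print(x, "->", y)  # y is x shifted one character to the right
```

For the real project I might switch to a subword tokenizer (something BPE-like), but a character-level one is the simplest possible starting point.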
Will it work?
Most probably not. It might do something, but I'm sceptical it will produce any valuable or useful results. My hardware resources are pretty limited (I'm running on a MacBook Pro). I will use AI to help me build everything and push all the code to GitHub.
Next step: setting up a project and trying to build a simple tokenizer.
Of course it will work! I recommend starting from Karpathy's nanoGPT.
A must-watch, which I'm sure you've already seen: https://www.youtube.com/watch?v=kCc8FmEb1nY
"Let's build GPT: from scratch, in code, spelled out." by Andrej Karpathy
Not yet, thanks for sharing!