Many of us have been using ChatGPT and co. for a few years now. These LLMs are fascinating, and we can use them for many interesting tasks, with agents being the next big thing.
But one thing I've always wanted to try is building my own language model, trained entirely on my local machine.
Why? To learn more about LLMs.
I don't expect this language model to be practically useful, but I will learn a lot along the way.
What do we need?
I'm not sure whether this list is complete or what will have to change along the way, but I think it's a good start:
- a dataset with a lot of text, e.g. Wikipedia
- (maybe some data cleansing)
- a tokenizer that translates text into numbers a neural net can be trained on (see the sketch after this list)
- a model architecture
- a training program that teaches the model to predict the next token
- (bonus: fine-tuning on an instruction dataset to get an actual conversational experience)
- a program to run the model
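To make the tokenizer and the next-token objective a bit more concrete, here is a rough sketch of what I have in mind: a simple character-level tokenizer plus the shifted input/target pairs used for training. This is just an illustration in Python (names like CharTokenizer and make_training_pairs are placeholders I made up), not the actual project code:

```python
# Sketch only: character-level tokenizer + next-token training pairs.
# All names here (CharTokenizer, make_training_pairs) are placeholders.

class CharTokenizer:
    def __init__(self, text: str):
        # Vocabulary = every distinct character seen in the training text.
        chars = sorted(set(text))
        self.stoi = {ch: i for i, ch in enumerate(chars)}   # char -> id
        self.itos = {i: ch for ch, i in self.stoi.items()}  # id -> char

    def encode(self, s: str) -> list[int]:
        return [self.stoi[ch] for ch in s]

    def decode(self, ids: list[int]) -> str:
        return "".join(self.itos[i] for i in ids)


def make_training_pairs(ids: list[int], block_size: int):
    """Yield (input, target) pairs: the target is the input shifted by one,
    which is exactly the 'predict the next token' objective."""
    for start in range(0, len(ids) - block_size):
        x = ids[start : start + block_size]
        y = ids[start + 1 : start + block_size + 1]
        yield x, y


if __name__ == "__main__":
    corpus = "hello world, hello model"
    tok = CharTokenizer(corpus)
    ids = tok.encode(corpus)
    print(ids[:10], tok.decode(ids[:10]))
    x, y = next(make_training_pairs(ids, block_size=8))
    print(x, "->", y)  # y is x shifted one character to the right
```

For the real project I might switch to a subword tokenizer (something BPE-like), but a character-level one is the simplest possible starting point.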
Will it work?
Most probably not. It might do something, but I'm sceptical it will produce any valuable or useful results. My hardware resources are pretty limited (I'm running on a MacBook Pro). I will use AI to help me build everything and push all the code to GitHub.
Next step: setting up a project and trying to build a simple tokenizer.
Of course it will work! I recommend starting from Karpathy's nanoGPT.
A must-watch, which I'm sure you've already seen: https://www.youtube.com/watch?v=kCc8FmEb1nY
"Let's build GPT: from scratch, in code, spelled out." by Andrej Karpathy
Not yet, thanks for sharing!