Building my own language model: Part 1

Many of us have been using ChatGPT and co. for a few years now. These LLMs are fascinating, and we can use them for many interesting tasks, with agents shaping up to be the next big thing.

But one thing I have always wanted to try is building my own language model, trained entirely on my local machine.

Why? To learn more about LLMs.

I don’t expect this language model to be practically useful, but I will learn a lot along the way.

What do we need?

I’m not sure whether this list is complete or what will have to change along the way, but I think it’s a good start:

  • a dataset with a lot of text, e.g. Wikipedia
  • (maybe some data cleansing)
  • a tokenizer, which translates text into numbers that a neural net can work with (see the first sketch after this list)
  • a model architecture
  • a training program that learns to predict the next token (see the second sketch after this list)
  • (bonus: fine-tuning on an instruction dataset to get an actual conversational experience)
  • a program to run the model
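
To make the tokenizer idea concrete, here is a minimal sketch of what one might look like. This is a toy character-level version (real LLM tokenizers typically use subword schemes such as BPE); the CharTokenizer name and the sample corpus are made up for illustration:

# A toy character-level tokenizer: every distinct character in the
# corpus gets its own integer id.

class CharTokenizer:
    def __init__(self, text: str):
        # Build the vocabulary from the characters seen in the corpus.
        chars = sorted(set(text))
        self.stoi = {ch: i for i, ch in enumerate(chars)}   # char -> id
        self.itos = {i: ch for ch, i in self.stoi.items()}  # id -> char

    def encode(self, text: str) -> list[int]:
        return [self.stoi[ch] for ch in text]

    def decode(self, ids: list[int]) -> str:
        return "".join(self.itos[i] for i in ids)

tok = CharTokenizer("hello world")
ids = tok.encode("hello")
print(ids)              # [3, 2, 4, 4, 5] for this corpus
print(tok.decode(ids))  # "hello"

In practice the vocabulary would be built once over the whole dataset and saved, so the same mapping can be reused for training and for running the model later.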

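And to make the next-token prediction part concrete as well: the "labels" for a language model are just the tokenized text shifted by one position, so the target at each position is the token that follows it. A minimal sketch, with made-up token ids and a made-up block_size (context length):

# Build (input, target) pairs for next-token prediction: the target
# sequence is the input sequence shifted one position to the right.

ids = [3, 2, 4, 4, 5, 0, 7, 5]  # tokenized text (made-up ids)
block_size = 4                  # tokens of context per training example

for start in range(len(ids) - block_size):
    x = ids[start : start + block_size]          # model input
    y = ids[start + 1 : start + block_size + 1]  # expected next tokens
    print(x, "->", y)
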
Will it work?

Probably not. It might do something, but I am sceptical it will produce anything valuable or useful. My hardware resources are pretty limited (I’m running on a MacBook Pro). I will use AI to help me build everything and push all the code to GitHub.

Next step: setting up a project and trying to build a simple tokenizer.

