Building my own language model: Data & Tokenizer (Part 2)

As outlined in my plan for building my own language model, the first step is to find a dataset to train the model on and then build a tokenizer.

Why do we need this?

When interacting with an LLM, we typically use natural language – both as input and output. Neural nets, though, don’t understand words or sentences the way we do; they work in a numerical space. A tokenizer translates text into a sequence of tokens, which may be full words or subwords, using a mapping that allows both encoding and decoding.

Hello → 0
World → 1
and → 2
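
To make this concrete, here is a minimal illustration of such a mapping in plain Python. The three-word vocabulary is just for the example and has nothing to do with the tokenizer we build below:

# Toy vocabulary: each known word maps to a number, and back.
word2idx = {"hello": 0, "world": 1, "and": 2}
idx2word = {i: w for w, i in word2idx.items()}

def encode(text):
    # Lowercase, split on whitespace, look up each known word.
    return [word2idx[w] for w in text.lower().split() if w in word2idx]

def decode(indices):
    return " ".join(idx2word[i] for i in indices)

print(encode("Hello World"))  # [0, 1]
print(decode([0, 2, 1]))      # hello and world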

Here is a visual representation to put it all into context. Please note that this is highly simplified, especially the output handling: in practice, LLMs generate output autoregressively, predicting each token one at a time based on the full prior context. We will dive deeper into that topic later.
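
As a rough sketch of what that autoregressive loop looks like, something along these lines; the model function here is purely hypothetical and stands in for whatever network we end up building later:

# Hypothetical autoregressive generation loop (illustration only).
def generate(model, tokenizer, prompt, max_new_tokens=20, eos_id=None):
    tokens = tokenizer.encode(prompt)   # start from the encoded prompt
    for _ in range(max_new_tokens):
        next_id = model(tokens)         # predict the next token from the full context so far
        if next_id == eos_id:           # stop if the model signals end of sequence
            break
        tokens.append(next_id)          # feed the prediction back in as context
    return tokenizer.decode(tokens)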

Data

First, we need some data to create and test our tokenizer.

Initially, I was thinking of using the Wikipedia content, but soon realized it is probably a bit too much data to start with (20-25 GB compressed and around 100 GB extracted).

After doing some research I found WikiText-2, a much smaller, ready-to-use curated dataset. Here is the code I generated using AI to download the data. I think it’s usable as is; nothing out of the ordinary, it was just convenient to do this with a Python script too:

import os
import tarfile

import requests

# Use Hugging Face mirror (tar.gz format)
WIKITEXT2_URL = "https://huggingface.co/datasets/wikitext/resolve/main/wikitext-2-v1.tgz"
TGZ_FILE = "wikitext-2-v1.tgz"
EXTRACTED_DIR = "wikitext-2"

def download_wikitext2():
    if not os.path.exists(TGZ_FILE):
        print(f"Downloading {TGZ_FILE}...")
        with requests.get(WIKITEXT2_URL, stream=True) as r:
            r.raise_for_status()
            with open(TGZ_FILE, 'wb') as f:
                for chunk in r.iter_content(chunk_size=8192):
                    f.write(chunk)
        print("Download complete.")
    else:
        print(f"{TGZ_FILE} already exists.")

def extract_wikitext2():
    if not os.path.exists(EXTRACTED_DIR):
        print("Extracting Wikitext-2...")
        with tarfile.open(TGZ_FILE, 'r:gz') as tar:
            tar.extractall()
        print("Extraction complete.")
    else:
        print(f"{EXTRACTED_DIR} already exists.")

if __name__ == "__main__":
    download_wikitext2()
    extract_wikitext2()
    print("Wikitext-2 dataset is ready for tokenization.")

After running the code, the data is on my machine:

Tokenizer

Common tokenization methods include Byte-Pair Encoding (BPE) and Unigram models, which can be implemented using tools like SentencePiece or Hugging Face’s Tokenizers library.
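
For reference, training a BPE tokenizer with Hugging Face’s Tokenizers library looks roughly like this. Treat it as a sketch: the vocabulary size is arbitrary and the training file path assumes the layout from the download step above.

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Train a BPE tokenizer on the extracted text file.
bpe_tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
bpe_tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=10000, special_tokens=["[UNK]"])
bpe_tokenizer.train(["wikitext-2/train.txt"], trainer)

# Encode and decode a sample sentence.
ids = bpe_tokenizer.encode("Hello world, this is a test.").ids
print(ids)
print(bpe_tokenizer.decode(ids))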

But for now, let’s keep it even simpler: we can just map every single (and complete) word to a number.

Here is the code for the basic tokenizer we are going to use for our own tiny language model. It is also purely AI generated and works out of the box, with no dependencies on bidict or similar; it just keeps two dictionaries for mapping from word to number and vice versa.

import re

class SimpleTokenizer:
    def __init__(self, lower: bool = True):
        self.lower = lower
        self.vocab = None
        self.word2idx = None
        self.idx2word = None

    def fit(self, texts):
        """Build vocabulary from a list of texts."""
        words = []
        for text in texts:
            if self.lower:
                text = text.lower()
            # Split on whitespace and punctuation
            words.extend(re.findall(r"\b\w+\b", text))
        self.vocab = sorted(set(words))
        self.word2idx = {w: i for i, w in enumerate(self.vocab)}
        self.idx2word = {i: w for w, i in self.word2idx.items()}

    def encode(self, text):
        if self.lower:
            text = text.lower()
        words = re.findall(r"\b\w+\b", text)
        # Words not seen during fit are silently dropped.
        return [self.word2idx[w] for w in words if w in self.word2idx]

    def decode(self, indices):
        return ' '.join(self.idx2word[i] for i in indices)

if __name__ == "__main__":
    # Example usage: fit on wikitext-2/train.txt
    with open("wikitext-2/train.txt", encoding="utf-8") as f:
        lines = [line.strip() for line in f if line.strip()]
    tokenizer = SimpleTokenizer()
    tokenizer.fit(lines)
    print(f"Vocab size: {len(tokenizer.vocab)}")
    # Encode and decode a sample
    sample = lines[0]
    encoded = tokenizer.encode(sample)
    print(f"Sample: {sample}")
    print(f"Encoded: {encoded}")
    print(f"Decoded: {tokenizer.decode(encoded)}")

Here is the output of the main script on my machine; it’s really quick given the small dataset we use:

Conclusion

As a first step toward building our tiny language model, I think this makes sense: we now have a way to encode and decode natural language. I deliberately chose a naive approach to make the basic idea easy to understand; we can switch to a more sophisticated solution later. You can find the code and project here on GitHub.
