LLMs for Developers, Part 2: Tokenization and Embeddings, Moving from Characters to Subwords (2026)

Why Tokenization Matters in Language Models

Tokenization is the gatekeeper between raw text and any modern language model. In the first part of this series, we built a character-level language model from scratch in Python. That exercise showed how to map characters to indices and predict the next character. While this approach is intuitive and easy to code, it exposes major limitations for real-world workloads, which subword tokenization is designed to address.

From Characters to Subwords: Byte-Pair Encoding Explained

How BPE Works: Step-by-Step Example

Suppose your corpus is:

["hug", "pug", "pun", "bun", "hugs"]

Let’s assign frequencies:

("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)

Initialize: Split words into characters:

(
 ("h", "u", "g", 10),
 ("p", "u", "g", 5),
 ("p", "u", "n", 12),
 ("b", "u", "n", 4),
 ("h", "u", "g", "s", 5)
)

Base vocabulary: ["b", "g", "h", "n", "p", "s", "u"]

Count Pairs:

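(“u”, “g”) appears 20 times
(“p”, “u”) appears 17 times
(“u”, “n”) appears 16 times
(“h”, “u”) appears 15 times

Merge Most Frequent Pair: (“u”, “g”) → “ug”
Update the corpus and vocabulary.
Repeat: After this merge, “pug” is split as (“p”, “ug”), so (“p”, “u”) now occurs only in “pun” (12 times). The most frequent pair is therefore (“u”, “n”) → “un” (16 times), followed by (“h”, “ug”) → “hug” (15 times).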
Tokenize New Words:
- “bug” → [“b”, “ug”]
- “thug” → [“[UNK]”, “hug”] (if “t” not in base vocab)
- “unhug” → [“un”, “hug”]

This process continues until the vocabulary reaches a preset size, typically 30,000-50,000 tokens for models such as BERT and GPT-2.
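The walkthrough above can be reproduced in a few lines of plain Python. The sketch below is a minimal illustration rather than a production tokenizer: the helper names (`count_pairs`, `apply_merge`, `tokenize`) are invented for this example, ties between equally frequent pairs are broken by first occurrence, and characters outside the base vocabulary are mapped to "[UNK]" as in the example.

```python
from collections import Counter

# Toy corpus from the walkthrough: word -> frequency
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

# Start with every word split into single characters
splits = {word: list(word) for word in word_freqs}
base_vocab = sorted({ch for word in word_freqs for ch in word})


def count_pairs(splits):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, symbols in splits.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += word_freqs[word]
    return pairs


def apply_merge(symbols, pair):
    """Replace each adjacent occurrence of `pair` with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


first_counts = count_pairs(splits)  # pair counts before any merge

merges = []
for _ in range(3):  # three merges, as in the walkthrough
    pairs = count_pairs(splits)
    best = max(pairs, key=pairs.get)  # ties resolved by first occurrence
    merges.append(best)
    splits = {word: apply_merge(symbols, best) for word, symbols in splits.items()}


def tokenize(word):
    """Split into characters, map unknown ones to [UNK], replay merges in order."""
    symbols = [ch if ch in base_vocab else "[UNK]" for ch in word]
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


print(merges)
for word in ["bug", "thug", "unhug"]:
    print(word, tokenize(word))
```

Running it learns the merges ("u", "g"), ("u", "n"), ("h", "ug") in that order and produces the same three tokenizations listed above. Replaying merges in the order they were learned is what makes tokenization deterministic.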


Code Example: Training BPE and Tokenizing Text

The following code shows a simplified BPE learning loop in Python, using Hugging Face’s GPT-2 tokenizer for pre-tokenization:

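from collections import defaultdict
from transformers import AutoTokenizer

corpus = [
    "This is Hugging Face Course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]

# Pre-tokenize with GPT-2's pre-tokenizer and count how often each word occurs
tokenizer = AutoTokenizer.from_pretrained("gpt2")
word_freqs = defaultdict(int)
for text in corpus:
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    for word, _ in words_with_offsets:
        word_freqs[word] += 1

# Base vocabulary: every character seen in the corpus
# (the first entry is a placeholder slot for a special token)
alphabet = sorted({c for word in word_freqs for c in word})
vocab = [""] + alphabet.copy()
splits = {word: list(word) for word in word_freqs}

def compute_pair_freqs(splits):
    # Count adjacent symbol pairs, weighted by word frequency
    pair_freqs = defaultdict(int)
    for word, freq in word_freqs.items():
        split = splits[word]
        for i in range(len(split) - 1):
            pair = (split[i], split[i + 1])
            pair_freqs[pair] += freq
    return pair_freqs

def merge_pair(a, b, splits):
    # Replace every adjacent (a, b) in every word with the merged symbol a + b
    for word in word_freqs:
        split = splits[word]
        i = 0
        while i < len(split) - 1:
            if split[i] == a and split[i + 1] == b:
                split = split[:i] + [a + b] + split[i + 2:]
            else:
                i += 1
        splits[word] = split
    return splits

merges = {}
target_vocab_size = 50

# Merge the most frequent pair until the vocabulary reaches the target size
while len(vocab) < target_vocab_size:
    pair_freqs = compute_pair_freqs(splits)
    if not pair_freqs:  # every word is already a single symbol
        break
    best_pair = max(pair_freqs, key=pair_freqs.get)
    splits = merge_pair(*best_pair, splits)
    merges[best_pair] = best_pair[0] + best_pair[1]
    vocab.append(best_pair[0] + best_pair[1])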

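Choosing the vocabulary size is a trade-off:

Too small: Breaks words into many tiny tokens, which lengthens sequences and makes it harder for the model to learn from context.
Too large: Wastes memory, increases model size, and fills the vocabulary with rare tokens that are seldom seen in training and transfer poorly to new domains.
English-centric models such as BERT and GPT-2 use roughly 30,000-50,000 subword tokens; more recent LLMs often use vocabularies of 100,000 tokens or more.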
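To make the memory side of this trade-off concrete: the input embedding table alone holds vocab_size × embedding_dim parameters. The sketch below runs that arithmetic. The embedding width of 768 and the 4 bytes per parameter (float32) are assumptions chosen for illustration, not figures taken from any model discussed in this article.

```python
def embedding_table_size(vocab_size, embedding_dim=768, bytes_per_param=4):
    """Return (parameter count, size in MiB) for a single embedding table."""
    params = vocab_size * embedding_dim
    return params, params * bytes_per_param / 2**20


for vocab_size in (1_000, 30_000, 50_000, 250_000):
    params, mib = embedding_table_size(vocab_size)
    print(f"{vocab_size:>7,} tokens -> {params:>11,} parameters, {mib:7.1f} MiB")
```

Under these assumptions a 50,000-token vocabulary costs 38,400,000 parameters (about 146.5 MiB) before any other layer is counted, and the cost scales linearly with vocabulary size.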

Tokenization Differences Across Models

Different tokenizers split the same text in different ways. For example (the splits below are illustrative; actual output depends on each model’s trained vocabulary):

BPE (as used in GPT-2): “unhuggable” → [“un”, “hug”, “gable”]
WordPiece (as in BERT): “unhuggable” → [“un”, “##hug”, “##gable”]
SentencePiece: might segment it as [“un”, “hugg”, “able”]
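The “##” prefix in the WordPiece line marks a piece that continues a word rather than starting one. A rough sketch of WordPiece-style inference, greedy longest-match-first over a fixed vocabulary, is shown below. The function name and the tiny vocabulary are invented for this example, and the vocabulary was chosen so that the output matches the illustration above; a real BERT vocabulary may split the word differently.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split; non-initial pieces carry a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary
toy_vocab = {"un", "hug", "##hug", "##gable", "##able", "##g"}

print(wordpiece_tokenize("unhuggable", toy_vocab))
print(wordpiece_tokenize("hug", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

Unlike BPE, which replays learned merges in order, this procedure only needs the final vocabulary: it repeatedly takes the longest piece that fits.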

For more implementation comparisons, see BPE vs SentencePiece vs Tiktoken.

Figure: Visualization of token embeddings in vector space. Dense embeddings allow similar words to cluster together, capturing semantic similarity.

Dense Embeddings vs One-Hot Encodings: How Models Understand Tokens

After tokenization, each token (subword) is mapped to a numeric index. The next step is to convert these indices into input vectors for the model. There are two classic approaches:

Representation	Dimension	Memory Usage	Semantic Meaning	Generalization
One-Hot Encoding	Vocabulary size (e.g., 50,000)	High (sparse)	None (orthogonal)	Poor
Dense Embedding	128-1024 (typical)	Low (compact)	Rich (semantic relationships)	Strong

One-hot encoding produces a sparse vector with a single “1” at the token’s index. It is easy to understand but has two fatal flaws:

Scalability: Memory and compute cost grow linearly with vocabulary size.
No similarity: All tokens are equally distant; “run” and “running” are as different as “run” and “table”.

Dense embeddings are compact, learnable vectors (e.g., 256-dimensional float vectors) that capture similarity and context. They are stored in a lookup table indexed by token.
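One way to see why a lookup table can replace one-hot vectors: multiplying a one-hot vector by an embedding matrix simply selects one row of that matrix. The plain-Python sketch below shows this with a made-up 5-token vocabulary and 3-dimensional embeddings; the numbers and helper names are arbitrary and only serve the illustration.

```python
vocab_size, embedding_dim = 5, 3

# Arbitrary embedding matrix: one row per token index
embedding_matrix = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
    [1.3, 1.4, 1.5],
]


def one_hot(index, size):
    """Sparse vector with a single 1 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def vector_times_matrix(vector, matrix):
    """Row vector (length V) times matrix (V x D) -> vector of length D."""
    return [
        sum(v * row[d] for v, row in zip(vector, matrix))
        for d in range(len(matrix[0]))
    ]


token_index = 3
via_matmul = vector_times_matrix(one_hot(token_index, vocab_size), embedding_matrix)
via_lookup = embedding_matrix[token_index]

print(via_matmul)
print(via_lookup)
```

Both lines print the same vector. The matrix product touches every row to return one of them, while the lookup reads a single row, which is why frameworks implement embeddings as indexed tables.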

Example: Embedding Lookup

import torch
embedding_table = torch.nn.Embedding(num_embeddings=50000, embedding_dim=256)
token_indices = torch.LongTensor([24, 457, 1023])
vectors = embedding_table(token_indices) # shape: (3, 256)

This mechanism underpins every modern language model, from GPT to BERT.

Why Dense Embeddings Outperform One-Hot Vectors

Semantic similarity: “run”, “running”, and “runner” end up with nearby vectors.
Efficient computation: Multiplications and additions are fast with dense arrays.
Contextualization: Embeddings learned jointly with the model adapt to task-specific patterns (as shown in Embeddings vs One-Hot Tradeoffs).
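The similarity point can be illustrated with cosine similarity. In the sketch below the dense vectors are hand-picked toy values, not learned embeddings; they were chosen only to show what “nearby” means. The one-hot vectors, by contrast, are exact.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# One-hot vectors for a 3-token vocabulary: every distinct pair is orthogonal
one_hot = {"run": [1, 0, 0], "running": [0, 1, 0], "table": [0, 0, 1]}

# Hypothetical dense vectors (illustrative values only, not trained)
dense = {
    "run": [0.9, 0.8, 0.1],
    "running": [0.8, 0.9, 0.2],
    "table": [-0.7, 0.1, 0.9],
}

for name, vectors in (("one-hot", one_hot), ("dense", dense)):
    near = cosine_similarity(vectors["run"], vectors["running"])
    far = cosine_similarity(vectors["run"], vectors["table"])
    print(f"{name}: run~running = {near:.2f}, run~table = {far:.2f}")
```

With one-hot vectors both similarities are zero, so the representation cannot express that “run” is closer to “running” than to “table”. With dense vectors the first similarity is high and the second is not.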

Embedding Tables and Joint Training with Model

Embedding tables are not static: they are trainable parameters of the model, updated during training to minimize prediction error.

How does this work?

Each token index is mapped to its embedding vector.
The embedding vectors form the input to the model (e.g., a transformer block).
During backpropagation, the model updates both the network weights and the embedding vectors, so tokens that behave similarly move closer in vector space when that helps the model perform better.

Joint training enables embeddings to capture highly task-specific meaning, not just raw co-occurrence statistics.
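A minimal sketch of this joint update, written in plain Python so that every gradient is visible. The setup is deliberately tiny and hypothetical: a 3-token embedding table, a single weight vector standing in for the rest of the network, a squared-error loss, and hand-derived gradients. Real models rely on automatic differentiation, but the principle is the same: the looked-up embedding row receives a gradient just like any other weight.

```python
def predict(weights, row):
    """Toy model: dot product of the weight vector and one embedding row."""
    return sum(w * e for w, e in zip(weights, row))


embedding_table = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.2]]  # 3 tokens, 2 dims
weights = [0.5, -0.3]          # stands in for the rest of the network
token_index, target = 1, 1.0   # a single training example
learning_rate = 0.1

initial_table = [row[:] for row in embedding_table]
initial_loss = (predict(weights, embedding_table[token_index]) - target) ** 2

for _ in range(50):
    row = embedding_table[token_index]
    error = predict(weights, row) - target
    # Gradients of (prediction - target)^2 with respect to each parameter
    grad_row = [2 * error * w for w in weights]
    grad_weights = [2 * error * e for e in row]
    # Both the embedding row and the network weights take a gradient step
    embedding_table[token_index] = [e - learning_rate * g for e, g in zip(row, grad_row)]
    weights = [w - learning_rate * g for w, g in zip(weights, grad_weights)]

final_loss = (predict(weights, embedding_table[token_index]) - target) ** 2

print(f"loss: {initial_loss:.4f} -> {final_loss:.6f}")
print("row 1 changed:", embedding_table[1] != initial_table[1])
print("rows 0 and 2 unchanged:",
      embedding_table[0] == initial_table[0] and embedding_table[2] == initial_table[2])
```

Only the row for the token that appeared in the training example moves; rows for unseen tokens keep their initial values until those tokens show up in a batch.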

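Real-world result: After training, tokens that appear in similar contexts end up with similar embeddings. Note that a token with several senses, such as “bank” (financial) and “bank” (riverbank), still has a single row in the embedding table; the two senses are separated only by the later, context-aware layers of the model.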

Production Tokenization: Tiktoken and SentencePiece

Several open-source libraries implement BPE and other tokenization strategies at scale:

Tiktoken: Developed by OpenAI, tiktoken is tightly optimized for GPT model inference and training. It uses byte-level BPE and is the default tokenizer for OpenAI APIs.
SentencePiece: Developed by Google, it supports BPE and Unigram LM tokenization. It is language-agnostic and well suited to multilingual and speech applications.

Both tools provide fast encoding and decoding and handle edge cases in production deployments; SentencePiece additionally covers vocabulary training. For a detailed head-to-head comparison, see BPE vs SentencePiece vs Tiktoken: How Tokenizers Actually Work.
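The “byte-level” part of byte-level BPE can be shown without either library. The sketch below is plain Python and does not use tiktoken's or SentencePiece's API; the helper names are invented. It only illustrates that when the base vocabulary is the 256 possible byte values, every UTF-8 string has a representation before a single merge is learned, so no [UNK] token is needed.

```python
def to_byte_symbols(text):
    """Base symbols for byte-level BPE: the UTF-8 bytes of the text."""
    return list(text.encode("utf-8"))


def from_byte_symbols(symbols):
    """Inverse mapping: byte values back to text."""
    return bytes(symbols).decode("utf-8")


for text in ["hug", "héllo", "日本語"]:
    symbols = to_byte_symbols(text)
    assert from_byte_symbols(symbols) == text  # lossless round trip
    print(f"{text!r}: {len(text)} characters -> {len(symbols)} byte symbols {symbols}")
```

Non-ASCII characters expand to several byte symbols each, which is the price of full coverage; learned merges then recombine frequent byte sequences into longer tokens.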

Comparison Table: Tokenization and Embedding Approaches

Approach	Vocabulary Size	OOV Handling	Main Strength	Production Use	Reference
Character-level	~100-256	None needed	Handles any input	Rare	Part 1
Word-level	30,000-1M+	[UNK] token	Direct semantic mapping	Not scalable	Hugging Face
BPE (subword)	30,000-50,000	Decomposes to subwords	Balance of flexibility and efficiency	Standard	Hugging Face
Dense Embedding	128-1024 dims	Not applicable	Semantic similarity, efficient training	Universal	ML Journey

Conclusion and Next Steps

Subword tokenization with BPE is the backbone of modern language model pipelines, enabling models to handle enormous, dynamic vocabularies efficiently and to encode any text input robustly. Dense embeddings, trained jointly with the model, allow neural architectures to learn, generalize, and cluster similar meanings in vector space, something one-hot vectors could never achieve.

In the next part of this series, we will dive into self-attention and transformer blocks, breaking down how these networks use token embeddings and contextual information to process sequences in parallel.

Key Takeaways:

BPE tokenization creates a compact, flexible vocabulary by merging frequent subword pairs.
Dense embeddings outperform one-hot vectors by encoding similarity and context.
Embedding tables are jointly trained with the model, not static dictionaries.
Tools like Tiktoken and SentencePiece bring production-grade tokenization to developers.

For more on BPE mechanics and code, see the Hugging Face course.
