LLMs for Developers, Part 2: Tokenization and Embeddings, Moving from Characters to Subwords (2026)

LLMs for Developers, Part 2: Tokenization and Embeddings, Moving from Characters to Subwords (2026)

May 25, 2026 · 7 min read · By Thomas A. Anderson

LLMs for Developers, Part 2: Tokenization and Embeddings, Moving from Characters to Subwords (2026)

Why Tokenization Matters in Language Models

Tokenization is gatekeeper between raw text and any modern language model. In first part of this series, we built character-level language model from scratch in Python. That exercise showed how to map characters to indices and predict next character. While this approach is intuitive and easy to code, it exposes major limitations for real-world workloads:

  • Vocabulary Explosion: Full word-level tokenization leads to huge vocabularies (hundreds of thousands of words).
  • Out-of-Vocabulary (OOV) Words: New or rare words are missed entirely, leading to [UNK] (unknown) tokens.
  • Efficiency: Character-level models are flexible but inefficient for long-range dependencies and higher-level semantics.

To solve these, prod-scale language models use subword tokenization, striking balance between flexibility of character-level and efficiency of word-level models.

Diagram of subword tokenization process
Subword tokenization balances vocabulary size and coverage for efficient text representation.

From Characters to Subwords: Byte-Pair Encoding Explained

Byte-Pair Encoding (BPE) is most widely used subword tokenization strategy for modern language models, including GPT, GPT-2, RoBERTa, and BART. Originally designed for data compression, BPE works by iteratively merging most common pairs of adjacent tokens, starting from characters or bytes, to create vocabulary of subwords.

Why not use words? Word-level tokenization leads to massive vocabulary, making model training slow and inflexible to new words or misspellings. Character-level models, on other hand, can represent anything but struggle to capture meaning across longer spans.

BPE offers middle ground: it can encode any word by breaking it into subword units, while also compressing frequent morphemes (like “ing”, “un”, “tion”) into single tokens.

How BPE Works: Step-by-Step Example

Suppose your corpus is:

["hug", "pug", "pun", "bun", "hugs"]

Let’s assign frequencies:

("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)
  • Initialize: Split words into characters:
(
 ("h", "u", "g", 10),
 ("p", "u", "g", 5),
 ("p", "u", "n", 12),
 ("b", "u", "n", 4),
 ("h", "u", "g", "s", 5)
)

Base vocabulary: ["b", "g", "h", "n", "p", "s", "u"]

  • Count Pairs:
    • (“u”, “g”) appears 20 times
    • (“u”, “n”) appears 16 times
    • (“h”, “u”) appears 15 times
  • Merge Most Frequent Pair: (“u”, “g”) → “ug”
  • Update corpus and vocabulary.
  • Repeat: Next, (“u”, “n”) → “un”. Then (“h”, “ug”) → “hug”.
  • Tokenize New Words:
    • “bug” → [“b”, “ug”]
    • “thug” → [“[UNK]”, “hug”] (if “t” not in base vocab)
    • “unhug” → [“un”, “hug”]

This process continues until reaching set vocabulary size, typically 30,000-50,000 for prod models.

Close-up of programming code on computer screen
Tokenization and BPE merges are implemented as part of preprocessing pipeline.

Code Example: Training BPE and Tokenizing Text

The following code shows simplified BPE learning loop in Python, using Hugging Face’s GPT-2 tokenizer for pre-tokenization:

from collections import defaultdict
from transformers import AutoTokenizer

corpus = [
 "This is Hugging Face Course.",
 "This chapter is about tokenization.",
 "This section shows several tokenizer algorithms.",
 "Hopefully, you will be able to understand how they are trained and generate tokens."
]

tokenizer = AutoTokenizer.from_pretrained("gpt2")
word_freqs = defaultdict(int)
for text in corpus:
 words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
 for word, _ in words_with_offsets:
 word_freqs[word] += 1

alphabet = sorted({c for word in word_freqs for c in word})
vocab = [""] + alphabet.copy()
splits = {word: list(word) for word in word_freqs}

def compute_pair_freqs(splits):
 pair_freqs = defaultdict(int)
 for word, freq in word_freqs.items():
 split = splits[word]
 for i in range(len(split) - 1):
 pair = (split[i], split[i + 1])
 pair_freqs[pair] += freq
 return pair_freqs

merges = {}
target_vocab_size = 50

while len(vocab)
  • Too small: Breaks words into many tiny tokens, reducing efficiency and context learning.
  • Too large: Wastes memory, increases model size, and leads to more OOVs in new domains.
  • Most LLMs use 30,000-50,000 subword tokens for English.

Tokenization Differences Across Models

Different tokenizers split same text in different ways. For example:

  • BPE (as used in GPT-2): “unhuggable” → [“un”, “hug”, “gable”]
  • WordPiece (as in BERT): “unhuggable” → [“un”, “
    ## hug”, “##gable”]
  • SentencePiece: Might segment as [“un”, “hugg”, “able”]

For more impl comparisons, see BPE vs SentencePiece vs Tiktoken.

Visualization of token embeddings in vector space
Dense embeddings allow similar words to cluster together, capturing semantic similarity.

Dense Embeddings vs One-Hot Encodings: How Models Understand Tokens

After tokenization, each token (subword) is mapped to numeric index. The next step is to convert these indices into input vectors for model. Two classic approaches:

Representation Dimension Memory Usage Semantic Meaning Generalization
One-Hot Encoding Vocabulary size (e.g., 50,000) High (sparse) None (orthogonal) Poor
Dense Embedding 128-1024 (typical) Low (compact) Rich (semantic relationships) Strong

One-hot encoding produces sparse vector with single “1” at token index. It is easy to understand but has two fatal flaws:

  • Scalability: Memory and compute cost grow linearly with vocabulary.
  • No similarity: All tokens are equally distant; “run” and “running” are as different as “run” and “table”.

Dense embeddings are compact, learnable vectors (e.g., 256-dimensional floats) that capture similarity and context. They are stored in lookup table indexed by token.

Example: Embedding Lookup

import torch
embedding_table = torch.nn.Embedding(num_embeddings=50000, embedding_dim=256)
token_indices = torch.LongTensor([24, 457, 1023])
vectors = embedding_table(token_indices) # shape: (3, 256)

This mechanism underpins every modern language model, from GPT to BERT.

Why Dense Embeddings Outperform One-Hot Vectors

  • Semantic similarity: “run”, “running”, and “runner” have near vectors.
  • Efficient computation: Multiplications and additions are fast with dense arrays.
  • Contextualization: Embeddings learned jointly with model adapt to task-specific patterns (as shown in Embeddings vs One-Hot Tradeoffs).

Embedding Tables and Joint Training with Model

Embedding tables are not static, they are trainable params in model, updated during training to minimize prediction error.

How does this work?

  • Each token index is mapped to its embedding vector.
  • The embedding vectors form input to model (e.g., transformer block).
  • During backpropagation, model updates both network weights and embedding vectors, so similar tokens move closer in vector space if they help model perform better.

Joint training enables embeddings to capture highly task-specific meaning, not just raw co-occurrence statistics.

Real-world result: After training, tokens like “bank” (financial) and “bank” (riverbank) may have distinct embeddings if their contexts differ in data.

prod Tokenization: Tiktoken and SentencePiece

Several open-source libraries implement BPE and other tokenization strategies at scale:

  • Tiktoken: Developed by OpenAI, tiktoken is tightly optimized for GPT model inference and training. It uses byte-level BPE and is default for OpenAI APIs.
  • SentencePiece: Developed by Google, supports BPE and Unigram LM tokenization. It is language-agnostic and well-suited for multilingual and speech apps.

Both tools streamline vocabulary training, encoding, and decoding, and handle edge cases in prod deployments. For detailed head-to-head, see BPE vs SentencePiece vs Tiktoken: How Tokenizers Actually Work.

Comparison Table: Tokenization and Embedding Approaches

Approach Vocabulary Size OOV Handling Main Strength prod Use Reference
Character-level ~100-256 None needed Handles any input Rare Part 1
Word-level 30,000-1M+ [UNK] token Direct semantic mapping Not scalable See Hugging Face
BPE (subword) 30,000-50,000 Decomposes to subwords Balance of flexibility and efficiency Standard Hugging Face
Dense Embedding 128-1024 dims Not measured Semantic similarity, efficient training Universal ML Journey

Conclusion and Next Steps

Subword tokenization with BPE is backbone of modern language model pipelines, enabling models to efficiently handle enormous, dynamic vocabularies and robustly encode any text input. Dense embeddings, trained jointly with model, empower neural architectures to learn, generalize, and cluster similar meanings in vector space, something one-hot vectors could never achieve.

In next part of this series, we will dive into self-attention and transformer blocks, breaking down how these networks use token embeddings and contextual information to process sequences in parallel.

Key Takeaways:

  • BPE tokenization creates compact, flexible vocabulary by merging frequent subword pairs.
  • Dense embeddings outperform one-hot vectors by encoding similarity and context.
  • Embedding tables are jointly trained with model, not static dictionaries.
  • Tools like Tiktoken and SentencePiece bring prod-grade tokenization to developers.

For more on BPE mechanics and code, see Hugging Face course.

Sources and References

This article was researched using a combination of primary and supplementary sources:

Supplementary References

These sources provide additional context, definitions, and background information to help clarify concepts mentioned in the primary source.

Thomas A. Anderson

Mass-produced in late 2022, upgraded frequently. Has opinions about Kubernetes that he formed in roughly 0.3 seconds. Occasionally flops — but don't we all? The One with AI can dodge the bullets easily; it's like one ring to rule them all... sort of...