Tokenization and Embeddings: How Words Become Vectors in Large Language Models (LLMs) in 2026
Tokenization and Embeddings: How Words Become Vectors in Large Language Models (LLMs) in 2026

Introduction: The Bridge from Language to Model
The real breakthrough in large language models lies not only in expanding neural networks, but in the way these systems convert language into forms that neural nets can process. When you prompt advanced models like GPT-5 or Claude 4, your words are split, re-coded, and transformed into matrices of numbers. Two steps form the core of this process: tokenization and embedding.
Tokenization determines how to break down text (sometimes into whole words, often into word parts) in a way that keeps the vocabulary both expressive and manageable. Embeddings then map each of these pieces to vectors: arrays of floating-point numbers that models use as input. This translation pipeline allows the system to interpret text, rather than seeing only a blur of characters.
Part 1 of this series explored how to build a character-level language model from scratch in Python to show core concepts. Now, in Part 2, we focus on the technology that lets state-of-the-art models scale to billions of parameters and support massive, multilingual vocabularies: subword tokenization and dense embeddings.
From Characters to Subwords: Why Tokenization Matters
Early language models operated at the character level, treating every letter, space, and punctuation mark as a token. While simple, this method has significant limitations:
- Text sequences become long, increasing memory and computation needs.
- Context is limited, since models see only a few characters at a time.
- Rare words and new terms get broken into fragments, losing semantic meaning.
Modern language models rely on subword tokenization. Here, each token can be:
- An entire word (
cat) - A common word part (
ing,pre,un) - A rare character or symbol if no other match exists
For instance, “unbelievability” might be split into “un”, “believ”, “abil”, “ity”. Each subword recurs throughout other words, allowing the model to learn their meanings and process unfamiliar words effectively.
Tokenization Algorithms: BPE, SentencePiece, and WordPiece
Three main algorithms define subword tokenization practice in 2026:
- Byte Pair Encoding (BPE): Adopted by GPT-4, GPT-5, and similar systems. BPE iteratively merges frequent character or subword pairs to construct a compact but flexible vocabulary.
- SentencePiece: Common in multilingual models such as Llama 4 and Gemini. SentencePiece treats text as a byte stream (not just word sequences), making it effective for languages without spaces, like Japanese and Chinese.
- WordPiece: Used in BERT and Google’s models. WordPiece merges based on likelihood, optimizing for language modeling performance.
Each method balances vocabulary size, efficiency, and the ability to handle rare or unknown words. For example, SentencePiece enables effective processing of scripts without whitespace, while BPE has become a standard for balancing vocabulary and expressiveness in English and many other languages.
These approaches yield vocabularies of 50,000-256,000 tokens (Genaim Institute’s 2026 guide), enabling advanced models to operate efficiently while covering a wide range of text.
Byte Pair Encoding (BPE) by Example
BPE is easiest to understand with a small-scale example. Suppose the training set contains these words: low, lower, newer, wider.
- Start with all words split into characters: l o w, l o w e r, n e w e r, w i d e r.
- Count which character pairs (like “l o”, “o w”) appear most often.
- Merge the most frequent pair, if “l o” is most common, turn it into “lo”. Now the sequence is “lo w”, “lo w e r”.
- Continue merging: “lo w” forms “low”. “n e” becomes “ne”, then “ne w” to “new”.
- Repeat: “w i” merges to “wi”, then “wi d e r” becomes “wider”, and so on, until the vocabulary fits your requirements.
After several rounds, common patterns like “low”, “lower”, “new”, “newer”, “wider” become tokens. Rare or unknown words can always be split into smaller subwords or characters, ensuring the model can represent any input.
Vocabulary Size and Why It Matters

The size of the vocabulary is a key design decision for language models:
- Smaller vocabularies save memory and computation but require more frequent word splitting, which increases sequence length and processing overhead.
- Larger vocabularies allow more words to be represented directly, reducing sequence length but demanding more GPU memory for embedding tables.
Production models typically settle on 50,000 to 256,000 tokens, balancing storage, ability to generalize to new words, and per-token costs (especially since API billing is often based on token count).
Tokenization efficiency also influences the context window. For example, if a model’s context window is 2 million tokens, effective tokenization means more meaningful content fits within that space, important for summarizing long documents, analyzing code, or handling extensive conversation histories.
Embedding Tokens: Lookup Tables to Dense Vectors
Once text is tokenized, each token becomes an integer ID. Neural networks cannot interpret these IDs directly; they require floating-point vectors. Embeddings address this need.
An embedding layer is a table: each row corresponds to a token, and each row contains a vector (for example, 768 or 1024 dimensions). When text is processed, the model retrieves the embedding vector for each token ID and feeds these vectors into the network.
Code Example: Embedding Lookup in PyTorch
Note: The following code is an illustrative example and has not been verified against official documentation. Please refer to the official docs for production-ready code.
import torch
import torch.nn as nn
vocab_size = 50000 # Example vocabulary size
embedding_dim = 768 # Typical for modern LLMs
embedding_layer = nn.Embedding(vocab_size, embedding_dim)
token_ids = torch.tensor([123, 456, 789]) # Example token IDs
embedded = embedding_layer(token_ids) # Shape: (3, 768)
print(embedded.shape)
Note: Production code should handle padding, masking, and batch dimensions correctly.
The embedding vectors are learned parameters, updated during training to encode semantic features. For example, the model will position the embedding for “queen” near “king”, and “run” close to “running”, provided it has been exposed to enough relevant data.
Dense Embeddings vs One-Hot Vectors: A Practical Comparison
Before dense embeddings, natural language processing relied on one-hot vectors, binary arrays as long as the vocabulary, with a single “1” for the relevant token. This approach has several drawbacks:
- Vectors are extremely sparse and high-dimensional.
- All tokens are equally distant; there is no measure of similarity.
- Computationally inefficient for large vocabularies.
Dense embeddings improve on this by:
- Reducing dimensionality to hundreds of elements per vector.
- Storing floating-point values, so the model can learn nuanced similarities and differences.
- Allowing semantic relationships: similar tokens are close together in embedding space.
Dense vectors enable language models to recognize that “Paris” and “London” are both cities, that “dog” relates to “puppy”, and that “run” and “ran” are different verb tenses. These groupings can be visualized using methods like t-SNE, where clusters form according to semantic meaning.
How Embeddings Learn: Joint Training with Model
Embeddings are not manually engineered. Instead, they are trained alongside the rest of the language model. During backpropagation, gradients update each embedding vector to better fit the data and specific task.
This approach has several advantages:
- Embeddings become tailored to the application: a model used for translation will learn different representations than one designed for code completion.
- Rare tokens are placed based on their usage context, not just frequency.
- Multilingual embeddings can group similar concepts across languages, supporting cross-lingual transfer.
In frameworks such as Hugging Face Transformers and PyTorch, this joint training is handled automatically, developers generally do not need to manage embeddings directly.
Production Considerations: Efficiency, Latency, and Multilingual Handling
Decisions around tokenization and embeddings influence cost, speed, and language support for any language model deployment.
Inference Costs and Context Window:
Most APIs for large models charge per token. Inefficient tokenization that splits common words into many tokens increases cost and reduces usable content per prompt. The Genaim Institute’s 2026 guide notes that with context windows reaching millions of tokens, every token is significant. Advanced tokenizers, especially those based on BPE and SentencePiece, are optimized to maximize information density.
Latency:
Each extra token increases computation time, since text is generated one token at a time. Efficient embeddings and compact tokenization reduce latency, which is particularly important for real-time applications like chatbots or live translation.
Multilingual Tokenization:
Tokenizers designed for English have historically struggled with non-Latin scripts. Some languages required multiple tokens per word, leading to higher cost and reduced expressiveness. By 2026, the use of larger, more diverse vocabularies (as high as 256,000 tokens) ensures fairer token density and cost across languages. SentencePiece, in particular, is a standard for language-agnostic tokenization.
Security and Fairness:
Tokenization introduces new attack surfaces, malformed or obscure tokens might bypass safety filters. Systems must guard against adversarial token sequences that could inject malicious instructions or evade safeguards. Ethical considerations also arise: if a tokenizer omits certain cultural or dialectal terms, it can distort or erase their representation.
Comparison Table: Tokenization and Embedding Approaches
| Aspect | Character-level Tokenization | Subword Tokenization (BPE, SentencePiece) | Embedding Type | Source |
|---|---|---|---|---|
| Token Unit | Single character (e.g., ‘a’) | Subword units (‘un’, ‘ing’, full words) | One-hot or dense | Genaim Institute |
| Vocabulary Size | ~100 | 50,000-256,000 | Dense learned vectors | MyEngineeringPath |
| Rare Word Handling | Poor (split into many tokens) | Good (split into known subwords) | Semantic clustering | FutureAGI |
| Multilingual Support | Very limited | Strong (SentencePiece excels on non-Latin scripts) | Shared cross-lingual space | Genaim Institute |
| Inference Cost | Low per token, high sequence length | Balanced | Low for dense | MyEngineeringPath |
Conclusion and Next Steps
Tokenization and embeddings are the underlying mechanisms that convert written or spoken input into vectors suitable for neural networks. Subword tokenization, especially via BPE and SentencePiece, enables advanced language models to handle rare words, support multiple languages, and manage long contexts efficiently. Dense embeddings, trained along with the model’s parameters, allow networks to capture language structure in ways that sparse one-hot vectors cannot.
The selection of tokenization and embedding techniques determines model quality, cost, latency, and fairness. Careful choices can reduce API expenses, raise accuracy, and help ensure that models serve users equally well across languages.
The next part of this series will cover the transformer’s self-attention mechanism, showing how embeddings are processed to give large models their advanced context awareness.
- Subword tokenization balances vocabulary size and flexibility, enabling effective handling of rare and multilingual text.
- BPE and SentencePiece remain the dominant algorithms for production systems as of 2026.
- Dense, jointly-trained embeddings outperform one-hot vectors by capturing language structure and meaning.
- Tokenization and embedding decisions directly influence latency, cost, security, and fairness in real-world use.
For more technical detail, see the Genaim Institute’s 2026 tokenization overview and MyEngineeringPath’s tokenization guide.
Sources and References
This article was researched using a combination of primary and supplementary sources:
Supplementary References
These sources provide additional context, definitions, and background information to help clarify concepts mentioned in the primary source.
- What is tokenization? | McKinsey
- [2605.17954] A More Word-like Image Tokenization for MLLMs
- Tokenization (data security) – Wikipedia
- LLM Fundamentals , Tokens, Attention & Transformers (2026)
- What is tokenization? – IBM
- Tokenization in Large Language Models – Best Generative AI & Machine …
- What is Tokenization? – GeeksforGeeks
- Intro to Tokenization | Charles Schwab
- Tokens And Embeddings: The Foundation of Language Models.
- Byte – Wikipedia
- Byte
- Byte-Pair Encoding (BPE) in NLP – GeeksforGeeks
- bit、byte、KB、B、字节、位、字符之间关系详解 – CSDN博客
- Byte-Pair Encoding (BPE) Algorithm – emergentmind.com
- Understanding file sizes | Bytes, KB, MB, GB, TB, PB, EB, ZB, YB
- GitHub – swiss-ai/parity-aware-bpe: Parity-Aware Byte-Pair Encoding …
- What Is a Byte? – Computer Hope
- Deep Dive: Byte Pair Encoding (BPE) | Problem of the Day: KL Divergence
- Tokenization and Embeddings: The Language of LLMs
- Tokenization and Embeddings in Large Language Models (LLMs)
- PDF Introduction to Large Language Models
- Understanding Tokenization and Embeddings in LLMs – ML Journey
Thomas A. Anderson
Mass-produced in late 2022, upgraded frequently. Has opinions about Kubernetes that he formed in roughly 0.3 seconds. Occasionally flops — but don't we all? The One with AI can dodge the bullets easily; it's like one ring to rule them all... sort of...
