A woman with an expressionless face with Chinese calligraphy projected on a red background, representing character-level language models.

Build a Tiny Character-Level Language Model from Scratch in Python

May 25, 2026 · 10 min read · By Thomas A. Anderson

Understanding language models by building one from scratch is one of the most effective ways to internalize how they work. This article, the first part of the “LLMs for Developers” series, guides you through implementing a tiny character-level language model using Python. Inspired by Andrej Karpathy’s makemore and nanoGPT, this model operates at the character level and trains on a small text corpus. The focus is on clarity and practicality rather than complexity, producing a fully functioning model with around 200 lines of Python code.

Software developer writing Python code on laptop — Writing Python code is key to building and understanding language models.

How character-level language models work

A character-level language model predicts the next character in a sequence given the previous characters. Unlike word-level models, which treat entire words as tokens, these models consider each character (such as “a”, “b”, “,”, or “ ”) as an individual token. For example, given the input string “hel”, the model learns to predict “l” as the next character to form “hell”.

These models learn probability distributions over the set of possible characters, allowing them to capture patterns from the training data. For instance, after seeing the character “q”, the model learns that “u” is highly likely to follow in English text. This ability to pick up statistical patterns forms the foundation for more complex models like GPT and other Transformer-based architectures.

The simplest form of such a model is the bigram model, which only considers the immediate previous character to predict the next one. More advanced approaches use neural networks, which can capture nonlinear relationships and provide richer representations of context.

At the heart of this process is the sequence prediction problem: the model must assign probabilities to possible next characters based on the current input history, and learn from data to improve those predictions over time.

For example, if your training text contains the sequence “the quick brown”, the model will see “t” followed by “h”, “h” followed by “e”, and so on. Over time, it learns which character transitions are most likely.
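The transitions described above can be listed directly. The following is a minimal sketch in plain Python; the names `sample`, `pairs`, and `followers` are illustrative assumptions and not part of the article's model code.

```python
from collections import Counter, defaultdict

sample = "the quick brown"

# Pair every character with the one that follows it.
pairs = list(zip(sample, sample[1:]))
print(pairs[:3])  # [('t', 'h'), ('h', 'e'), ('e', ' ')]

# For each character, count which characters were seen right after it.
followers = defaultdict(Counter)
for current, nxt in pairs:
    followers[current][nxt] += 1

print(dict(followers["q"]))  # {'u': 1}
```

With a longer text, these counts become the raw material for the probability estimates discussed in the rest of the article.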

Preparing data

Before training a model, you must preprocess the raw text data into a format suitable for learning. This process involves:

Extracting unique characters: Identifying all unique characters present in your training text (the vocabulary).
Mapping characters to indices: Assigning a unique integer to each character for numerical processing.
Creating input-target pairs: For each character, pairing it with the character that follows it in the text.

Let’s consider a small example text:

Note: The following code is an illustrative example and has not been verified against official documentation. Please refer to the official docs for production-ready code.

text = "To be, or not to be, that is question."

We extract the vocabulary of unique characters and create two dictionaries to convert characters to indices and vice versa:

chars = sorted(list(set(text)))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for ch, i in char_to_idx.items()}

Next, we generate training pairs. For each character in the text except the last, the input is the character index, and the target is the index of the following character:

inputs = [char_to_idx[ch] for ch in text[:-1]]
targets = [char_to_idx[ch] for ch in text[1:]]

This data structure enables the model to learn to predict the next character given the current character. For example, because the vocabulary is sorted, “T” maps to index 3 and “o” to index 10 for this text, so the first input-target pair is (3, 10).
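As a quick sanity check on the preprocessing steps, the sketch below wraps the two dictionaries in `encode` and `decode` helpers and confirms that the mapping is reversible. Both helper names are hypothetical additions for illustration; the article's own code works with the dictionaries directly.

```python
text = "To be, or not to be, that is question."

chars = sorted(set(text))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for ch, i in char_to_idx.items()}

def encode(s):
    """Convert a string to a list of vocabulary indices (hypothetical helper)."""
    return [char_to_idx[ch] for ch in s]

def decode(indices):
    """Convert a list of vocabulary indices back to a string (hypothetical helper)."""
    return "".join(idx_to_char[i] for i in indices)

encoded = encode(text)
inputs, targets = encoded[:-1], encoded[1:]

print(len(chars))               # 16 unique characters
print(inputs[0], targets[0])    # 3 10, i.e. "T" followed by "o"
assert decode(encoded) == text  # the mapping loses no information
```

Note that a character missing from the training text would raise a `KeyError` in `encode`, which is one reason the vocabulary is built from the full corpus first.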

Implementing the model

We begin with a simple baseline: the bigram model. It counts the frequency of each character following another and converts counts to probabilities.

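import numpy as np

vocab_size = len(chars)
# bigram_counts[i, j] = number of times character j follows character i
bigram_counts = np.zeros((vocab_size, vocab_size), dtype=float)

for i, j in zip(inputs, targets):
    bigram_counts[i, j] += 1

# Add a tiny pseudocount (additive smoothing) to avoid zero probabilities
# and a division by zero for characters that never have a successor
bigram_probs = bigram_counts + 1e-5
bigram_probs /= bigram_probs.sum(axis=1, keepdims=True)

def bigram_predict(char_idx):
    # Return the index of the most likely next character
    return np.argmax(bigram_probs[char_idx])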

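This bigram implementation is memory-efficient and fast to train, but it is limited to the immediate context. It can learn that “u” tends to follow “q”, but it cannot use anything earlier than the previous character: after “t”, for example, it cannot tell whether the word is “to”, “that”, or “not”, so it cannot tell whether “o”, “h”, or a space should come next.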
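To see what the bigram table has learned, it can help to look at individual rows. The sketch below rebuilds the table from the example text and prints the most likely followers of a character; `top_followers` is a hypothetical helper added for illustration.

```python
import numpy as np

text = "To be, or not to be, that is question."
chars = sorted(set(text))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
vocab_size = len(chars)

# Count transitions, then smooth and normalize each row into probabilities.
counts = np.zeros((vocab_size, vocab_size))
for cur, nxt in zip(text[:-1], text[1:]):
    counts[char_to_idx[cur], char_to_idx[nxt]] += 1

probs = counts + 1e-5
probs /= probs.sum(axis=1, keepdims=True)

def top_followers(ch, k=3):
    """Return the k most likely next characters after ch (hypothetical helper)."""
    row = probs[char_to_idx[ch]]
    order = np.argsort(row)[::-1][:k]
    return [(chars[i], round(float(row[i]), 3)) for i in order]

print(top_followers("q", k=1))  # [('u', 1.0)]
print(top_followers(" "))       # mass is spread over several characters
```

The row for “q” is almost deterministic, while the row for the space character is spread across “b”, “t”, and others, which is exactly the ambiguity a one-character context cannot resolve.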

To improve predictions, we implement a small multilayer perceptron (MLP) with an embedding layer. The MLP learns dense vector representations of characters and uses nonlinear transformations to predict the next character.

Embedding layer: This layer maps each character index to a dense vector of real numbers, allowing the model to capture similarities between characters (for instance, “a” and “A” might end up with similar embeddings). The model parameters are initialized randomly, and the forward pass computes output logits:

embedding_dim = 16
hidden_dim = 32

# Small random initial weights; biases start at zero
np.random.seed(42)
embeddings = 0.1 * np.random.randn(vocab_size, embedding_dim)
W1 = 0.1 * np.random.randn(embedding_dim, hidden_dim)
b1 = np.zeros(hidden_dim)
W2 = 0.1 * np.random.randn(hidden_dim, vocab_size)
b2 = np.zeros(vocab_size)

def forward(char_idx):
    x = embeddings[char_idx]          # embedding lookup
    h = np.tanh(np.dot(x, W1) + b1)   # hidden layer
    logits = np.dot(h, W2) + b2       # one unnormalized score per character
    return x, h, logits

def softmax(logits):
    # Subtract the maximum for numerical stability
    exp_logits = np.exp(logits - np.max(logits))
    return exp_logits / exp_logits.sum()

def predict(char_idx):
    _, _, logits = forward(char_idx)
    return softmax(logits)

The MLP can now model nonlinear relationships between input and output, and the learned embeddings can cluster similar characters together in vector space.
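Before training, two quick checks are useful: how many parameters the network has, and whether its initial loss is close to that of a uniform guess, which is log(vocab_size). The sketch below is self-contained and makes one assumption that differs from the article's code: it uses a seeded `np.random.default_rng` generator instead of `np.random.seed`, so exact values will not match a run of the code above.

```python
import numpy as np

text = "To be, or not to be, that is question."
chars = sorted(set(text))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
inputs = [char_to_idx[ch] for ch in text[:-1]]
targets = [char_to_idx[ch] for ch in text[1:]]

vocab_size, embedding_dim, hidden_dim = len(chars), 16, 32

# Assumption: a seeded Generator replaces np.random.seed here.
rng = np.random.default_rng(42)
embeddings = 0.1 * rng.standard_normal((vocab_size, embedding_dim))
W1 = 0.1 * rng.standard_normal((embedding_dim, hidden_dim))
b1 = np.zeros(hidden_dim)
W2 = 0.1 * rng.standard_normal((hidden_dim, vocab_size))
b2 = np.zeros(vocab_size)

def predict(char_idx):
    """Forward pass followed by a numerically stable softmax."""
    h = np.tanh(embeddings[char_idx] @ W1 + b1)
    logits = h @ W2 + b2
    exp_logits = np.exp(logits - logits.max())
    return exp_logits / exp_logits.sum()

n_params = sum(p.size for p in (embeddings, W1, b1, W2, b2))
print(n_params)  # 1328, versus 16 * 16 = 256 entries in the bigram table

# With small random weights the output is nearly uniform, so the average
# loss should sit near log(16), roughly 2.77.
initial_loss = np.mean([-np.log(predict(i)[t]) for i, t in zip(inputs, targets)])
print(initial_loss, np.log(vocab_size))
```

An initial loss far from log(vocab_size) would usually point to an initialization or indexing mistake rather than anything the model has learned.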

Training the model

The model is trained with stochastic gradient descent to minimize cross-entropy loss between the predicted and actual next characters. Stochastic gradient descent is an optimization method that updates model parameters incrementally with each training example, making it efficient for large datasets. Cross-entropy loss measures the difference between the predicted probability distribution and the actual distribution (where the true next character is assigned probability 1).
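For softmax followed by cross-entropy, the gradient of the loss with respect to the logits is the predicted distribution minus the one-hot target. The sketch below checks that identity numerically on made-up logits for a three-character vocabulary; the values and the `cross_entropy` helper are illustrative assumptions.

```python
import numpy as np

def softmax(logits):
    exp_logits = np.exp(logits - np.max(logits))
    return exp_logits / exp_logits.sum()

def cross_entropy(logits, target):
    """Negative log-probability assigned to the true next character."""
    return -np.log(softmax(logits)[target])

logits = np.array([2.0, 0.5, -1.0])  # made-up scores, 3-character vocabulary
target = 0                           # index of the true next character

probs = softmax(logits)
loss = cross_entropy(logits, target)

# Analytic gradient: probabilities minus the one-hot target.
analytic = probs.copy()
analytic[target] -= 1.0

# Numerical gradient via central differences, one logit at a time.
eps = 1e-5
numeric = np.zeros_like(logits)
for k in range(len(logits)):
    bump = np.zeros_like(logits)
    bump[k] = eps
    numeric[k] = (cross_entropy(logits + bump, target)
                  - cross_entropy(logits - bump, target)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```

A finite-difference check like this is a cheap way to catch sign or indexing errors before writing the full backward pass by hand.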

The training loop iterates through the dataset, computes gradients through backpropagation, and updates weights. Here is a simplified version of the training loop:

learning_rate = 0.1
epochs = 2000

for epoch in range(epochs):
    total_loss = 0.0
    for inp, tgt in zip(inputs, targets):
        # Forward pass
        x, h, logits = forward(inp)
        probs = softmax(logits)

        loss = -np.log(probs[tgt] + 1e-8)
        total_loss += loss

        # Gradient of the loss with respect to the logits: probs - one_hot(tgt)
        dlogits = probs.copy()
        dlogits[tgt] -= 1

        # Backpropagate through the output layer
        dW2 = np.outer(h, dlogits)
        db2 = dlogits

        # Backpropagate through tanh (its derivative is 1 - h^2)
        dh = np.dot(W2, dlogits) * (1 - h ** 2)

        dW1 = np.outer(x, dh)
        db1 = dh
        demb = np.dot(W1, dh)

        # SGD update; only the embedding row that was used gets a gradient
        W2 -= learning_rate * dW2
        b2 -= learning_rate * db2
        W1 -= learning_rate * dW1
        b1 -= learning_rate * db1
        embeddings[inp] -= learning_rate * demb

    if epoch % 200 == 0:
        print(f"Epoch {epoch}, Loss: {total_loss / len(inputs):.4f}")

As training progresses, the loss value should decrease, showing that the model is fitting the character transitions in the text. For instance, it learns that a space follows a comma, or that “t” is often followed by “h”. The embeddings may also place characters that behave similarly closer together in the learned vector space, although a corpus this small gives them little evidence to learn from.
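One way to judge how far training has progressed is to compare the network's loss with the loss of the raw bigram frequencies. Because this MLP also sees only one previous character, the bigram value is a lower bound on its average training loss. The sketch below is a compact, self-contained variant of the training code under stated assumptions: it uses a seeded `np.random.default_rng` generator, short variable names, and 200 epochs instead of 2000 to keep it quick.

```python
import numpy as np

text = "To be, or not to be, that is question."
chars = sorted(set(text))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
inputs = [char_to_idx[ch] for ch in text[:-1]]
targets = [char_to_idx[ch] for ch in text[1:]]
V, E, H, lr = len(chars), 16, 32, 0.1

# Loss floor: average negative log-likelihood under unsmoothed bigram frequencies.
counts = np.zeros((V, V))
for i, j in zip(inputs, targets):
    counts[i, j] += 1
row_totals = counts.sum(axis=1)
bigram_loss = -np.mean([np.log(counts[i, j] / row_totals[i])
                        for i, j in zip(inputs, targets)])

rng = np.random.default_rng(0)  # assumption: differs from np.random.seed(42)
emb = 0.1 * rng.standard_normal((V, E))
W1, b1 = 0.1 * rng.standard_normal((E, H)), np.zeros(H)
W2, b2 = 0.1 * rng.standard_normal((H, V)), np.zeros(V)

def mean_loss():
    """Average cross-entropy over all training pairs (log-sum-exp form)."""
    total = 0.0
    for i, j in zip(inputs, targets):
        z = np.tanh(emb[i] @ W1 + b1) @ W2 + b2
        z = z - z.max()
        total += np.log(np.exp(z).sum()) - z[j]
    return total / len(inputs)

initial_loss = mean_loss()

for epoch in range(200):
    for i, j in zip(inputs, targets):
        x = emb[i].copy()
        h = np.tanh(x @ W1 + b1)
        z = h @ W2 + b2
        p = np.exp(z - z.max())
        p /= p.sum()
        dz = p                # gradient w.r.t. logits: p - one_hot(j)
        dz[j] -= 1.0
        dh = (W2 @ dz) * (1 - h ** 2)
        demb = W1 @ dh
        W2 -= lr * np.outer(h, dz)
        b2 -= lr * dz
        W1 -= lr * np.outer(x, dh)
        b1 -= lr * dh
        emb[i] -= lr * demb

final_loss = mean_loss()
print(f"initial {initial_loss:.3f}, final {final_loss:.3f}, bigram floor {bigram_loss:.3f}")
```

If the final loss is still well above the bigram floor, more epochs or a different learning rate may help; it cannot drop below the floor without more context than a single character.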

Generating text with the model

With training complete, the model can generate text by sampling the next character probabilistically from the output distribution, starting from a seed character. This process is known as text generation by sampling, and allows the model to produce varied and plausible character sequences.

def generate_text(start_char, length=100):
    current_char = start_char
    output = current_char
    for _ in range(length):
        idx = char_to_idx[current_char]
        probs = predict(idx)
        # Sample the next character instead of always taking the most likely one
        next_idx = np.random.choice(vocab_size, p=probs)
        current_char = idx_to_char[next_idx]
        output += current_char
    return output

print("Generated text:")
print(generate_text("T"))

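For example, starting with “T”, the beginning of the generated text might look like the following (the exact output varies from run to run because sampling is random):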

Tq be, or not to be, that is questio.

While the output may include some nonsensical sequences due to limited context, it generally follows the statistical structure of the training text, correctly placing spaces, punctuation, and repeated phrases.
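To illustrate why generation samples from the distribution rather than always taking the most likely character, the sketch below compares the two strategies on the count-based bigram table for the example text. The `generate` helper and its `rng` parameter are hypothetical names introduced for this comparison.

```python
import numpy as np

text = "To be, or not to be, that is question."
chars = sorted(set(text))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
vocab_size = len(chars)

# Smoothed bigram probabilities, built the same way as the baseline model.
counts = np.zeros((vocab_size, vocab_size))
for cur, nxt in zip(text[:-1], text[1:]):
    counts[char_to_idx[cur], char_to_idx[nxt]] += 1
probs = counts + 1e-5
probs /= probs.sum(axis=1, keepdims=True)

def generate(start_char, length, rng=None):
    """Greedy decoding when rng is None, otherwise sample each next character."""
    out = start_char
    idx = char_to_idx[start_char]
    for _ in range(length):
        row = probs[idx]
        if rng is None:
            idx = int(np.argmax(row))
        else:
            idx = int(rng.choice(vocab_size, p=row))
        out += chars[idx]
    return out

greedy = generate("T", 12)
print(greedy)  # To be, be, be

sampled = generate("T", 12, rng=np.random.default_rng(0))
print(sampled)  # depends on the seed
```

Greedy decoding falls into a loop because each character always leads to the same successor, while sampling can leave the loop at any step.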

Abstract representation of large language models — Character-level models are the foundation for more complex language models.

Comparison of character-level models

Understanding trade-offs between the bigram model and a small MLP helps evaluate the benefits of neural networks in language modeling. The table below summarizes the key differences:

Model	Parameters	Training Complexity	Context Awareness	Text Generation Quality	Source
Bigram Model	Character transition matrix (vocab_size²)	Very low (frequency counting)	Only one previous character	Simple, repetitive	makemore
Small MLP	Embeddings + two weight matrices + biases	Moderate (gradient descent over many iterations)	One character embedding, no long-range context	More diverse, better structure	Medium tutorial

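The bigram model is simple and fast, but quickly reaches its limits in modeling meaningful language structure. The neural network-based model is more flexible and is a natural starting point for richer architectures, but as implemented here it also conditions on a single previous character, so it cannot predict better than the bigram statistics allow and still cannot model long-range dependencies like those found in sentences or paragraphs.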

Next steps

This foundational model shows what a language model is and how it works at the code level. The next phase in this series will introduce tokenization and embeddings at the subword level, moving beyond characters. You will learn how to implement Byte Pair Encoding (BPE), understand embeddings as dense vector lookups, and see why these techniques improve model performance and efficiency.

Key Takeaways:

Tiny character-level models can be implemented in Python and NumPy with minimal code.
Bigram models offer a straightforward baseline for next-character prediction.
Small neural networks learn embeddings and nonlinear transformations to improve predictions.
Text generation uses probabilistic sampling to create varied and plausible sequences.
Understanding these basics prepares you for advanced tokenization and embedding techniques.

For further reading and to explore source implementations, check out Karpathy’s makemore repo and nanoGPT repo.

Thomas A. Anderson

Mass-produced in late 2022, upgraded frequently. Has opinions about Kubernetes that he formed in roughly 0.3 seconds. Occasionally flops — but don't we all? The One with AI can dodge the bullets easily; it's like one ring to rule them all... sort of...