Build a Tiny Character-Level Language Model from Scratch in Python
Understanding language models by building one from scratch is one of the most effective ways to internalize how they work. This article, the first part of the “LLMs for Developers” series, guides you through implementing a tiny character-level language model using Python. Inspired by Andrej Karpathy’s makemore and nanoGPT, this model operates at the character level and trains on a small text corpus. The focus is on clarity and practicality rather than complexity, producing a fully functioning model with around 200 lines of Python code.

How character-level language models work
A character-level language model predicts the next character in a sequence given the previous characters. Unlike word-level models, which treat entire words as tokens, these models consider each character (such as “a”, “b”, “,”, or “ ”) as an individual token. For example, given the input string “hel”, the model learns to predict “l” as the next character to form “hell”.
These models learn probability distributions over the set of possible characters, allowing them to capture patterns from the training data. For instance, after seeing the character “q”, the model learns that “u” is highly likely to follow in English text. This ability to pick up statistical patterns forms the foundation for more complex models like GPT and other Transformer-based architectures.
The simplest form of such a model is the bigram model, which only considers the immediate previous character to predict the next one. More advanced approaches use neural networks, which can capture nonlinear relationships and provide richer representations of context.
At the heart of this process is the sequence prediction problem: the model must assign probabilities to possible next characters based on the current input history, and learn from data to improve those predictions over time.
For example, if your training text contains the sequence “the quick brown”, the model will see “t” followed by “h”, “h” followed by “e”, and so on. Over time, it learns which character transitions are most likely.
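The transitions described above can be listed directly. The following is a minimal sketch using only the standard library; the variable name `sample` is an assumption made for illustration, and the string is the phrase from the example rather than a real training corpus.

```python
# Minimal sketch: list the character transitions in a short string.
# `sample` is an illustrative string, not the article's training text.
sample = "the quick brown"

# zip pairs each character with the one that follows it
transitions = list(zip(sample, sample[1:]))

# Show the first few (current, following) pairs
for current, following in transitions[:4]:
    print(f"{current!r} -> {following!r}")
```

A string of n characters yields n - 1 transitions, which is why the last character never appears as an input.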
Preparing data
Before training a model, you must preprocess the raw text data into a format suitable for learning. This process involves:
- Extracting unique characters: Identifying all unique characters present in your training text (the vocabulary).
- Mapping characters to indices: Assigning a unique integer to each character for numerical processing.
- Creating input-target pairs: For each character, pairing it with the character that follows it in the text.
Let’s consider a small example text:
Note: The following code is an illustrative example and has not been verified against official documentation. Please refer to the official docs for production-ready code.
text = "To be, or not to be, that is question."
We extract the vocabulary of unique characters and create two dictionaries to convert characters to indices and vice versa:
chars = sorted(list(set(text)))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for ch, i in char_to_idx.items()}
Next, we generate training pairs. For each character in the text except the last, the input is the character index, and the target is the index of the following character:
inputs = [char_to_idx[ch] for ch in text[:-1]]
targets = [char_to_idx[ch] for ch in text[1:]]
This data structure enables the model to learn to predict the next character given the current character. Because the vocabulary is sorted, the space, comma, and period come first, so “T” maps to index 3 and “o” to index 10, and the first input-target pair is (3, 10).
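To check the mapping, it can help to encode the text and decode it back. The sketch below repeats the preprocessing steps so that it runs on its own; the helper names `encode` and `decode` are hypothetical and introduced only for illustration.

```python
# Minimal sketch: encode text to indices and decode it back.
# `encode` and `decode` are hypothetical helpers, not part of the article's code.
text = "To be, or not to be, that is question."

chars = sorted(set(text))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for ch, i in char_to_idx.items()}

def encode(s):
    """Convert a string to a list of integer indices."""
    return [char_to_idx[ch] for ch in s]

def decode(indices):
    """Convert a list of integer indices back to a string."""
    return "".join(idx_to_char[i] for i in indices)

ids = encode(text)
inputs, targets = ids[:-1], ids[1:]

print(len(chars))                       # vocabulary size
print(list(zip(inputs, targets))[:3])   # first three (input, target) pairs
print(decode(ids) == text)              # round trip should be lossless
```

A lossless round trip is a quick sanity check that every character in the text has an index and that the two dictionaries are consistent.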
Implementing model
We begin with a simple baseline: the bigram model. It counts the frequency of each character following another and converts counts to probabilities.
import numpy as np

vocab_size = len(chars)
bigram_counts = np.zeros((vocab_size, vocab_size), dtype=float)

# Count how often character j follows character i
for i, j in zip(inputs, targets):
    bigram_counts[i, j] += 1

# Add a small constant (additive smoothing) to avoid zero probabilities
bigram_probs = bigram_counts + 1e-5
bigram_probs /= bigram_probs.sum(axis=1, keepdims=True)

def bigram_predict(char_idx):
    # Return the index of the most likely next character
    return np.argmax(bigram_probs[char_idx])
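This bigram implementation is memory-efficient and fast to train, but it is limited to the immediately preceding character. It can learn that “u” tends to follow “q”, but it cannot use anything earlier in the sequence: after a “t” it makes the same prediction whether that “t” ended “not” or “that”.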
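The one-character limit can be seen directly in the counts. The following sketch uses `collections.Counter` instead of the NumPy matrix purely for readability; the names `followers` and `greedy_next` are hypothetical and not part of the article's code.

```python
from collections import Counter, defaultdict

text = "To be, or not to be, that is question."

# followers[c] counts which characters appear right after c
followers = defaultdict(Counter)
for current, following in zip(text, text[1:]):
    followers[current][following] += 1

print(followers["q"].most_common())   # everything seen after "q"
print(followers["t"].most_common())   # everything seen after "t"

def greedy_next(context):
    # A bigram model looks only at the last character of the context
    return followers[context[-1]].most_common(1)[0][0]

# Both contexts end in "t", so the prediction is identical
print(greedy_next("not") == greedy_next("that"))
```

Whatever came before the final character is discarded, which is the limitation that longer-context models are meant to address.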
To improve predictions, we implement a small multilayer perceptron (MLP) with an embedding layer. The MLP learns dense vector representations of characters and uses nonlinear transformations to predict the next character.
Embedding layer: This layer maps each character index to a dense vector of real numbers, allowing the model to capture similarities between characters (for instance, “a” and “A” might end up with similar embeddings). The model parameters are initialized randomly, and the forward pass computes output logits:
embedding_dim = 16
hidden_dim = 32

np.random.seed(42)

# Small random weights; biases start at zero
embeddings = 0.1 * np.random.randn(vocab_size, embedding_dim)
W1 = 0.1 * np.random.randn(embedding_dim, hidden_dim)
b1 = np.zeros(hidden_dim)
W2 = 0.1 * np.random.randn(hidden_dim, vocab_size)
b2 = np.zeros(vocab_size)

def forward(char_idx):
    x = embeddings[char_idx]            # look up the character's embedding
    h = np.tanh(np.dot(x, W1) + b1)     # hidden layer with tanh nonlinearity
    logits = np.dot(h, W2) + b2         # unnormalized score per character
    return x, h, logits

def softmax(logits):
    # Subtract the max for numerical stability
    exp_logits = np.exp(logits - np.max(logits))
    return exp_logits / exp_logits.sum()

def predict(char_idx):
    _, _, logits = forward(char_idx)
    return softmax(logits)
The MLP can now model nonlinear relationships between input and output, and the learned embeddings can cluster similar characters together in vector space.
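The softmax step subtracts the largest logit before exponentiating. A small standalone sketch, with made-up logit values chosen only for illustration, suggests why this is safe: shifting every logit by the same amount leaves the probabilities unchanged while keeping `np.exp` away from overflow.

```python
import numpy as np

def softmax(logits):
    # Subtracting the maximum does not change the result mathematically,
    # but it keeps np.exp from overflowing on large logits.
    shifted = logits - np.max(logits)
    exp_logits = np.exp(shifted)
    return exp_logits / exp_logits.sum()

small = np.array([1.0, 2.0, 3.0])   # illustrative logits
large = small + 1000.0              # same differences, much larger magnitude

p_small = softmax(small)
p_large = softmax(large)

print(p_small.sum())                   # probabilities sum to 1
print(np.allclose(p_small, p_large))   # shifting all logits changes nothing
```

Without the subtraction, `np.exp(1003.0)` would overflow to infinity and the division would produce invalid values.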
Training model
The model is trained with stochastic gradient descent to minimize cross-entropy loss between the predicted and actual next characters. Stochastic gradient descent is an optimization method that updates model parameters incrementally with each training example, making it efficient for large datasets. Cross-entropy loss measures the difference between the predicted probability distribution and the actual distribution (where the true next character is assigned probability 1).
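As a rough illustration of cross-entropy with a one-hot target, the sketch below compares two hand-made distributions. The vocabulary size, the target index, and the helper name `cross_entropy` are assumptions chosen for the example.

```python
import numpy as np

vocab_size = 16   # assumed size, for illustration
target = 3        # hypothetical index of the true next character

# A model that knows nothing spreads probability evenly
uniform = np.full(vocab_size, 1.0 / vocab_size)

# A model that puts 90% of its probability on the correct character
confident = np.full(vocab_size, 0.1 / (vocab_size - 1))
confident[target] = 0.9

def cross_entropy(probs, target_idx):
    # With a one-hot target, the loss reduces to the negative log of the
    # probability assigned to the correct character.
    return -np.log(probs[target_idx])

print(cross_entropy(uniform, target))     # equals log(vocab_size)
print(cross_entropy(confident, target))   # much smaller
```

The uniform case gives a useful reference point: an untrained model's average loss should start near the logarithm of the vocabulary size.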
The training loop iterates through the dataset, computes gradients through backpropagation, and updates weights. Here is a simplified version of the training loop:
learning_rate = 0.1
epochs = 2000

for epoch in range(epochs):
    total_loss = 0.0
    for inp, tgt in zip(inputs, targets):
        # Forward pass
        x, h, logits = forward(inp)
        probs = softmax(logits)
        loss = -np.log(probs[tgt] + 1e-8)
        total_loss += loss

        # Backward pass: gradient of the loss w.r.t. the logits is probs - one_hot
        dlogits = probs.copy()
        dlogits[tgt] -= 1
        dW2 = np.outer(h, dlogits)
        db2 = dlogits
        dh = np.dot(W2, dlogits) * (1 - h ** 2)   # tanh derivative
        dW1 = np.outer(x, dh)
        db1 = dh
        demb = np.dot(W1, dh)

        # SGD parameter updates
        W2 -= learning_rate * dW2
        b2 -= learning_rate * db2
        W1 -= learning_rate * dW1
        b1 -= learning_rate * db1
        embeddings[inp] -= learning_rate * demb

    # Report the average loss every 200 epochs
    if epoch % 200 == 0:
        print(f"Epoch {epoch}, Loss: {total_loss / len(inputs):.4f}")
As training progresses, the loss value should decrease, indicating that the model is fitting the character transitions in the training text. For instance, it should assign high probability to a space after a comma, and learn that “t” is often followed by “h”. With more training data, the embeddings may also place characters that behave similarly closer together in the learned vector space.
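Hand-written backpropagation is easy to get wrong, so a numerical check is a common safeguard. The sketch below tests only the first backward step, the gradient of the loss with respect to the logits (softmax output minus the one-hot target), on random toy logits. The 5-character vocabulary, the target index, and the name `loss_fn` are assumptions for illustration.

```python
import numpy as np

def softmax(logits):
    exp_logits = np.exp(logits - np.max(logits))
    return exp_logits / exp_logits.sum()

def loss_fn(logits, target_idx):
    # Cross-entropy loss for a single example
    return -np.log(softmax(logits)[target_idx])

rng = np.random.default_rng(0)
logits = rng.normal(size=5)   # toy logits for a 5-character vocabulary
target = 2                    # hypothetical target index

# Analytic gradient: softmax output minus the one-hot target
analytic = softmax(logits)
analytic[target] -= 1.0

# Numerical gradient by central differences
eps = 1e-5
numeric = np.zeros_like(logits)
for k in range(len(logits)):
    bumped_up = logits.copy()
    bumped_down = logits.copy()
    bumped_up[k] += eps
    bumped_down[k] -= eps
    numeric[k] = (loss_fn(bumped_up, target) - loss_fn(bumped_down, target)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # should be very small
```

The same central-difference idea can be applied to any single weight to spot-check the other gradients, at the cost of two extra forward passes per parameter tested.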
Generating text with model
With training complete, the model can generate text by sampling the next character probabilistically from the output distribution, starting from a seed character. This process is known as text generation by sampling, and allows the model to produce varied and plausible character sequences.
def generate_text(start_char, length=100):
    current_char = start_char
    output = current_char
    for _ in range(length):
        idx = char_to_idx[current_char]
        probs = predict(idx)
        # Sample the next character in proportion to its predicted probability
        next_idx = np.random.choice(vocab_size, p=probs)
        current_char = idx_to_char[next_idx]
        output += current_char
    return output

print("Generated text:")
print(generate_text("T"))
For example, starting with “T”, the generated text might look like:
Tq be, or not to be, that is questio.
While the output may include some nonsensical sequences due to limited context, it generally follows the statistical structure of the training text, correctly placing spaces, punctuation, and repeated phrases.
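To see why sampling produces varied output, it can be contrasted with always choosing the most likely character. The sketch below uses a toy three-character vocabulary and a made-up probability vector, both assumptions for illustration, together with NumPy's `default_rng` so the draws are repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)       # seeded for repeatable draws

vocab = ["a", "b", "c"]              # toy vocabulary, for illustration only
probs = np.array([0.5, 0.3, 0.2])    # hypothetical next-character distribution

# Greedy decoding: always the single most likely character
greedy = [vocab[int(np.argmax(probs))] for _ in range(10)]

# Sampling: draw in proportion to the predicted probabilities
sampled = [vocab[i] for i in rng.choice(len(vocab), size=10, p=probs)]

print("".join(greedy))
print("".join(sampled))
```

Greedy decoding repeats the same choice for the same input, which with a one-character context tends to produce short loops, while sampling lets less likely characters appear occasionally.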

Comparison of character-level models
Understanding trade-offs between the bigram model and a small MLP helps evaluate the benefits of neural networks in language modeling. The table below summarizes the key differences:
| Model | Parameters | Training Complexity | Context Awareness | Text Generation Quality | Source |
|---|---|---|---|---|---|
| Bigram Model | Character transition matrix (vocab_size²) | Very low (frequency counting) | Only one previous character | Simple, repetitive | makemore |
| Small MLP | Embeddings + two weight matrices + biases | Moderate (gradient descent over many iterations) | One character embedding, no long-range context | More diverse, better structure | Medium tutorial |
The bigram model is simple and fast, but quickly reaches its limits in modeling meaningful language structure. The neural network-based model is more flexible and extends naturally to longer inputs, but as implemented here it also sees only one previous character, so it still cannot model long-range dependencies like those found in sentences or paragraphs.
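The parameter counts in the comparison can be worked out for the example text. This sketch assumes the same `embedding_dim` and `hidden_dim` values used earlier and recomputes the vocabulary size from the text so that it runs on its own.

```python
text = "To be, or not to be, that is question."
vocab_size = len(set(text))

embedding_dim = 16
hidden_dim = 32

# Bigram model: one probability per (previous, next) character pair
bigram_params = vocab_size * vocab_size

# MLP: embedding table, two weight matrices, two bias vectors
mlp_params = (
    vocab_size * embedding_dim      # embeddings
    + embedding_dim * hidden_dim    # W1
    + hidden_dim                    # b1
    + hidden_dim * vocab_size       # W2
    + vocab_size                    # b2
)

print(bigram_params, mlp_params)
```

For a vocabulary this small the MLP has more parameters than the bigram table. The bigram table grows with the square of the vocabulary size, while the MLP's parameter count grows linearly in it for fixed layer widths.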
Next steps
This foundational model shows what a language model is and how it works at the code level. The next phase in this series will introduce tokenization and embeddings at the subword level, moving beyond characters. You will learn how to implement Byte Pair Encoding (BPE), understand embeddings as dense vector lookups, and see why these techniques improve model performance and efficiency.
Key Takeaways:
- Tiny character-level models can be implemented in Python with NumPy and minimal code.
- Bigram models offer a straightforward baseline for next-character prediction.
- Small neural networks learn embeddings and nonlinear transformations to improve predictions.
- Text generation uses probabilistic sampling to create varied and plausible sequences.
- Understanding these basics prepares you for advanced tokenization and embedding techniques.
For further reading and to explore source implementations, check out Karpathy’s makemore repo and nanoGPT repo.
Thomas A. Anderson
Mass-produced in late 2022, upgraded frequently. Has opinions about Kubernetes that he formed in roughly 0.3 seconds. Occasionally flops — but don't we all? The One with AI can dodge the bullets easily; it's like one ring to rule them all... sort of...
