Build a Tiny Character-Level Language Model from Scratch in Python

Large language models have revolutionized natural language processing, but their core concept rests on a simple foundation: predicting the next character or token based on prior context. This post, the first in our LLMs for Developers series, walks you through building a tiny character-level language model in Python (with NumPy for the neural network part), inspired by Andrej Karpathy’s celebrated “makemore” project and his educational lectures. The goal is to show the essential workings of language modeling in roughly 200 lines of code, starting from a basic bigram model and advancing to a small neural network implementation. No deep learning experience beyond basic machine learning concepts is required.

What Is a Character-Level Language Model?

A character-level language model predicts the next character in a sequence given the characters that came before. Unlike models working on words or subwords, character-level models operate on the smallest unit of text: individual letters, punctuation, or symbols. For instance, after seeing “hel,” the model might learn that the next character is most often “l” (as in “hello”) or “p” (as in “help”).

This kind of model learns statistical regularities and patterns in text, such as common letter pairs, word shapes, or punctuation usage. Karpathy’s “makemore” project illustrated how such models can be built from scratch to generate new text resembling training data, such as names or code snippets, by sampling one character at a time.
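
To make "statistical regularities" concrete, here is a minimal sketch that counts which character follows a given context. The word list and the helper name next_char_counts are made up for illustration and are not part of makemore.

```python
from collections import Counter

# Hypothetical mini-corpus; any list of strings would do.
words = ["hello", "help", "held", "shell"]

def next_char_counts(corpus, context):
    """Count which character follows each occurrence of `context`."""
    counts = Counter()
    for word in corpus:
        start = word.find(context)
        while start != -1:
            end = start + len(context)
            if end < len(word):  # skip matches at the very end of a word
                counts[word[end]] += 1
            start = word.find(context, start + 1)
    return counts

counts = next_char_counts(words, "hel")
print(counts.most_common())  # [('l', 2), ('p', 1), ('d', 1)]
```

Dividing each count by the total turns this into a probability distribution over the next character, which is the quantity a language model learns to estimate.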

Building a Bigram Baseline in Python

The simplest approach to language modeling is the bigram model. It uses only the current character to predict the next one, ignoring longer context. This statistical model is easy to implement and is a solid baseline that reveals fundamental concepts.

Below is a Python example that counts bigram frequencies from sample text and generates new text by sampling probable next characters:

Note: The following code is an illustrative example and has not been verified against official documentation. Please refer to the official docs for production-ready code.

import collections
import random

# Sample training text
text = "quick brown fox jumps over lazy dog"

# Count bigram frequencies: bigram_counts[a][b] = number of times b follows a
bigram_counts = collections.defaultdict(collections.Counter)
for i in range(len(text) - 1):
    curr_char = text[i]
    next_char = text[i + 1]
    bigram_counts[curr_char][next_char] += 1

# Convert counts to probabilities
bigram_probs = {}
for ch, next_chars in bigram_counts.items():
    total = sum(next_chars.values())
    bigram_probs[ch] = {c: count / total for c, count in next_chars.items()}

def generate_bigram(start_char='q', max_length=100):
    # start_char must be a character that occurs in the training text
    result = start_char
    current_char = start_char
    for _ in range(max_length - 1):
        next_char_probs = bigram_probs.get(current_char)
        if not next_char_probs:
            break  # no observed successor (here, the final 'g')
        chars, probs = zip(*next_char_probs.items())
        current_char = random.choices(chars, probs)[0]
        result += current_char
    return result

print(generate_bigram())

This script builds a probability table mapping each character to the distribution of characters that follow it in the training text. The generate_bigram() function samples characters sequentially to produce new text. While effective at capturing letter-level statistics, the bigram method cannot model dependencies beyond one character, which limits coherence in longer outputs.
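
To see what one row of that probability table looks like, the sketch below rebuilds the counts for the same sample sentence and prints the distribution for two characters. The names sample, pair_counts and row are illustrative only.

```python
import collections

sample = "quick brown fox jumps over lazy dog"  # same sentence as the example above

# Count how often each character follows each other character.
pair_counts = collections.defaultdict(collections.Counter)
for curr_char, next_char in zip(sample, sample[1:]):
    pair_counts[curr_char][next_char] += 1

def row(ch):
    """Return the next-character distribution for `ch` as a dict."""
    total = sum(pair_counts[ch].values())
    return {c: n / total for c, n in pair_counts[ch].items()}

print(row("o"))  # {'w': 0.25, 'x': 0.25, 'v': 0.25, 'g': 0.25}
print(row("u"))  # {'i': 0.5, 'm': 0.5}
```

Because "o" is followed by four different characters exactly once each, the model has no preference among them. A training text this short gives very flat rows.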

For example, if you use the string “hello world”, the bigram model might generate outputs like “helllo worlld” or “hello worlll”, which resemble the original but often break down in longer sequences. This limitation stems from the fact that only the current character is considered, not the characters that came before it.
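
The “hello world” behavior can be reproduced with a short, self-contained sketch. The function names and the fixed seed are assumptions made for this illustration. The exact strings you get depend on the seed.

```python
import collections
import random

def build_bigram_probs(corpus):
    """Map each character to the distribution of characters that follow it."""
    counts = collections.defaultdict(collections.Counter)
    for a, b in zip(corpus, corpus[1:]):
        counts[a][b] += 1
    return {a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
            for a, nxt in counts.items()}

def sample_bigram(probs, start_char, max_length, rng):
    """Sample until max_length or until a character with no known successor."""
    out = start_char
    while len(out) < max_length and out[-1] in probs:
        chars, weights = zip(*probs[out[-1]].items())
        out += rng.choices(chars, weights)[0]
    return out

corpus = "hello world"
probs = build_bigram_probs(corpus)
rng = random.Random(0)  # fixed seed so runs are repeatable
samples = [sample_bigram(probs, "h", 20, rng) for _ in range(5)]
for s in samples:
    print(s)
```

Every adjacent pair in the output also appears in “hello world”, yet the strings as a whole rarely match it: after an “l” the model cannot know whether it has already emitted one “l” or three.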

Introducing a Small Neural Network for Better Results

To capture longer context and more complex patterns, we can replace the bigram model with a small neural network, a multilayer perceptron (MLP), that takes a fixed-length sequence of previous characters as input and predicts the next character. This enables the model to learn richer representations and dependencies.

In this approach, context length refers to the number of preceding characters (for example, five) considered when predicting the next one. The MLP consists of input, hidden, and output layers, using nonlinear activation functions like tanh to model complex relationships.
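
As a quick illustration of context length, this sketch slides a window of five characters over a short string and lists the (context, target) pairs the network would be trained on. The helper name context_windows is hypothetical.

```python
def context_windows(text, n):
    """Return (context, target) pairs for a context length of n."""
    return [(text[i:i + n], text[i + n]) for i in range(len(text) - n)]

pairs = context_windows("quick brown", 5)
for context, target in pairs[:3]:
    print(repr(context), "->", repr(target))
# 'quick' -> ' '
# 'uick ' -> 'b'
# 'ick b' -> 'r'
```

A string of length L yields L - n training pairs, so longer contexts leave slightly fewer examples from the same text.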

This example uses only Python and NumPy to illustrate core concepts without relying on deep learning frameworks. It reuses the text variable from the bigram example. First, we need to encode characters as integers and prepare input-target training pairs:

import numpy as np

# Create character vocabulary
chars = sorted(set(text))
char2idx = {c: i for i, c in enumerate(chars)}
idx2char = {i: c for i, c in enumerate(chars)}
vocab_size = len(chars)

# Encode text into integer indices
def encode(seq):
    return [char2idx[c] for c in seq]

def decode(indices):
    return ''.join(idx2char[i] for i in indices)

# Prepare sequences for training
N = 5  # Context length
inputs = []
targets = []
for i in range(len(text) - N):
    inputs.append(encode(text[i:i + N]))   # N character indices
    targets.append(char2idx[text[i + N]])  # index of the character that follows

X = np.array(inputs)   # shape: (num_examples, N)
Y = np.array(targets)  # shape: (num_examples,)

# One-hot encode targets
Y_onehot = np.eye(vocab_size)[Y]  # shape: (num_examples, vocab_size)

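Here, one-hot encoding turns each target index into a vector of length vocab_size, with a 1 at the position of the target character and 0 everywhere else. The N input characters get the same treatment inside the training loop, where their one-hot vectors are concatenated into a single input of length N * vocab_size. This format is suitable for feeding into a neural network.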
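
The concatenated one-hot input can be built without an explicit Python loop. The following is a small sketch with a made-up vocabulary of four symbols. The helper name one_hot_context is an assumption for illustration.

```python
import numpy as np

def one_hot_context(indices, vocab_size):
    """Concatenate one one-hot vector per context position."""
    indices = np.asarray(indices)
    x = np.zeros((len(indices), vocab_size))
    x[np.arange(len(indices)), indices] = 1.0  # one 1 per row
    return x.reshape(-1)  # shape: (len(indices) * vocab_size,)

vec = one_hot_context([2, 0, 1], vocab_size=4)
print(vec)
# [0. 0. 1. 0. 1. 0. 0. 0. 0. 1. 0. 0.]
```

Each block of four entries encodes one context position, so the position of a character in the window is preserved in the input vector.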

Next, define neural network parameters and forward pass:

np.random.seed(42)

# Initialize weights (small random values) and biases (zeros)
W1 = np.random.randn(N * vocab_size, 128) * 0.01  # input -> hidden
b1 = np.zeros(128)
W2 = np.random.randn(128, vocab_size) * 0.01      # hidden -> output
b2 = np.zeros(vocab_size)

def forward(x):
    # x has shape (batch, N * vocab_size)
    h = np.tanh(np.dot(x, W1) + b1)
    logits = np.dot(h, W2) + b2
    return logits, h

def softmax(logits):
    # Expects a 1-D vector of logits; subtracting the max avoids overflow
    exp_logits = np.exp(logits - np.max(logits))
    return exp_logits / np.sum(exp_logits)

The forward function processes the input through the neural network layers, computing the hidden representation and final output logits. The softmax function converts these logits into probabilities for each possible next character.
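
The softmax above normalizes over the whole array, so it is meant for a single vector of logits. If you want to process a batch of rows at once, a row-wise variant along these lines is one option. The name softmax_rows is hypothetical.

```python
import numpy as np

def softmax_rows(logits):
    """Row-wise softmax; works for a 1-D vector or a 2-D batch."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=-1, keepdims=True)  # avoid overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

p = softmax_rows([1.0, 2.0, 3.0])
print(p.round(3))  # roughly 0.090, 0.245, 0.665

# Very large logits stay finite because of the max subtraction.
big = softmax_rows([[1000.0, 1001.0], [0.0, 0.0]])
print(big.round(3))
```

Subtracting the row maximum does not change the result, since softmax is invariant to adding a constant to all logits, but it keeps np.exp from overflowing.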

Training involves iterating through the data, performing forward passes, computing cross-entropy loss, and updating weights with gradient descent. Full backpropagation code is lengthy, so here we outline the training loop structure:

learning_rate = 0.1
for epoch in range(1000):
    total_loss = 0
    for i in range(len(X)):
        # One-hot encode input sequence
        x_input = np.zeros(N * vocab_size)
        for j, idx_char in enumerate(X[i]):
            x_input[j * vocab_size + idx_char] = 1

        logits, hidden = forward(x_input.reshape(1, -1))
        probs = softmax(logits.flatten())

        # Cross-entropy loss: negative log-probability of the correct next character
        loss = -np.log(probs[Y[i]])
        total_loss += loss

        # Backpropagation and param updates go here (omitted for brevity).
        # Until they are added, the weights never change and the printed loss stays constant.
    if epoch % 100 == 0:
        print(f"Epoch {epoch}: Loss {total_loss / len(X):.4f}")

# Note: Implementing backpropagation with numpy requires detailed gradient calculations.
# For prod or experimentation, consider PyTorch or TensorFlow instead.
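
For readers who want to see the omitted step, here is one possible way to fill it in: a self-contained sketch of full-batch gradient descent with hand-derived gradients for a tanh hidden layer and a softmax cross-entropy loss. It is an assumption-laden illustration, not the article's reference implementation. It processes all windows at once instead of one example at a time, and names such as loss_and_grads, V and H are invented here.

```python
import numpy as np

text = "quick brown fox jumps over lazy dog"
chars = sorted(set(text))
char2idx = {c: i for i, c in enumerate(chars)}
V, N, H = len(chars), 5, 128  # vocab size, context length, hidden units

# Build one-hot inputs (one row per training window) and integer targets.
contexts = [[char2idx[c] for c in text[i:i + N]] for i in range(len(text) - N)]
Y = np.array([char2idx[text[i + N]] for i in range(len(text) - N)])
X = np.zeros((len(contexts), N * V))
for r, ctx in enumerate(contexts):
    for pos, idx in enumerate(ctx):
        X[r, pos * V + idx] = 1.0

rng = np.random.default_rng(42)
W1 = rng.standard_normal((N * V, H)) * 0.01
b1 = np.zeros(H)
W2 = rng.standard_normal((H, V)) * 0.01
b2 = np.zeros(V)

def loss_and_grads(X, Y):
    # Forward pass
    h = np.tanh(X @ W1 + b1)
    logits = h @ W2 + b2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(len(Y)), Y]).mean()

    # Backward pass: d(loss)/d(logits) = (probs - one_hot(Y)) / batch_size
    dlogits = probs.copy()
    dlogits[np.arange(len(Y)), Y] -= 1.0
    dlogits /= len(Y)
    dW2 = h.T @ dlogits
    db2 = dlogits.sum(axis=0)
    dh = dlogits @ W2.T
    dpre = dh * (1.0 - h ** 2)  # derivative of tanh
    dW1 = X.T @ dpre
    db1 = dpre.sum(axis=0)
    return loss, (dW1, db1, dW2, db2)

learning_rate = 0.1
losses = []
for epoch in range(1000):
    loss, grads = loss_and_grads(X, Y)
    losses.append(loss)
    for param, grad in zip((W1, b1, W2, b2), grads):
        param -= learning_rate * grad  # in-place gradient descent step

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

The first loss should sit near ln(25) ≈ 3.22, the value for a uniform guess over 25 characters, and it should fall as training proceeds. With only 30 training windows the network can largely memorize the sentence, so a low final loss here says nothing about generalization.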

Writing neural network code from scratch deepens understanding of language models. For example, you could set N = 3 and see how shorter context affects output quality, or increase the hidden layer size from 128 to 256 units and observe changes in learning speed.

Generating Text with the Trained Model

Text generation proceeds by feeding an initial seed sequence, predicting the next character distribution, sampling from it, appending the result, and sliding the input window forward. This iterative process continues for the desired output length.
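
The sliding-window loop is independent of the model that supplies the probabilities. The sketch below shows the mechanics with a hypothetical stand-in predictor (toy_predictor) in place of a trained network, so it runs on its own.

```python
import random

def generate(predict_probs, seed, n, length, rng):
    """Sample `length` characters, feeding the last n characters back in."""
    seq = list(seed)
    for _ in range(length):
        context = "".join(seq[-n:])     # sliding window over the output so far
        probs = predict_probs(context)  # dict: char -> probability
        chars, weights = zip(*probs.items())
        seq.append(rng.choices(chars, weights)[0])
    return "".join(seq)

def toy_predictor(context):
    """Hypothetical stand-in for a trained model: favors alternating a/b."""
    if context.endswith("b"):
        return {"a": 0.9, "b": 0.1}
    return {"a": 0.1, "b": 0.9}

out = generate(toy_predictor, seed="ab", n=2, length=20, rng=random.Random(0))
print(out)
```

Swapping toy_predictor for a function that runs the network's forward pass and softmax gives the same loop as the model-specific version.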

def generate_text(start_seq, length=100):
    # start_seq should contain at least N characters, all from the training vocabulary
    seq = list(start_seq)
    for _ in range(length):
        # One-hot encode the last N characters
        x_input = np.zeros(N * vocab_size)
        for j, c in enumerate(seq[-N:]):
            x_input[j * vocab_size + char2idx[c]] = 1
        logits, _ = forward(x_input.reshape(1, -1))
        probs = softmax(logits.flatten())
        # Sample the next character index from the predicted distribution
        next_idx = np.random.choice(vocab_size, p=probs)
        next_char = idx2char[next_idx]
        seq.append(next_char)
    return ''.join(seq)

print(generate_text("quick", 200))

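Once the weights have been trained, this model can generate text that resembles the training data more closely than the bigram baseline does, because each prediction sees N characters of context instead of one. For example, seeding with “quick” might produce “quick brown fox ju…” and continue in a plausible style, while the bigram model may quickly create nonsensical sequences. With the training loop above left as an outline, the weights stay at their random initial values and the output is close to uniformly random characters.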

Understanding neural networks is foundational for advanced language modeling. As you experiment, try varying the seed sequence or context length and observe how the generated output changes.

Comparison Table: Bigram Model vs. Small Neural Network

Aspect	Bigram Model	Small Neural Network (MLP)	Source / Reference
Context Length	1 character	Fixed length (e.g., 5 characters)	SesameDisk LLM Series
Model Type	Statistical probability table	Feed-forward neural network with nonlinear activation	Karpathy’s makemore tutorials
Pattern Learning	Immediate character pairs only	Longer-range dependencies and character combinations	Karpathy’s lectures
Training Complexity	Simple counting, no gradient descent	Gradient descent with backpropagation	PyTorch tutorials
Generation Quality	Basic, local coherence	Improved coherence, longer text patterns	Practical developer experience
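
To put rough numbers on the two columns, this sketch compares the size of the bigram table with the parameter count of the MLP, using the sample sentence, context length and hidden size from the examples in this post.

```python
text = "quick brown fox jumps over lazy dog"
vocab_size = len(set(text))  # distinct characters, including the space
N, hidden = 5, 128           # context length and hidden units used in this post

# The bigram model stores at most one probability per (current, next) pair.
bigram_entries = vocab_size * vocab_size

# MLP: W1 and b1 for the hidden layer, W2 and b2 for the output layer.
mlp_params = (N * vocab_size * hidden + hidden) + (hidden * vocab_size + vocab_size)

print(vocab_size, bigram_entries, mlp_params)  # 25 625 19353
```

The MLP has about 30 times more parameters than the table has cells, which is part of why it can capture longer patterns and also why it needs gradient-based training.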

Teaser for Next Part: Tokenization and Embeddings

Although character-level models provide a solid foundation, real-world language tasks require handling words and subwords efficiently. Our next post will introduce tokenization techniques such as Byte Pair Encoding (BPE), which compress frequent character sequences into manageable tokens. We will explore how dense embeddings convert these tokens into vectors that neural networks can process effectively, improving model scalability and performance.

Understanding these concepts is essential for moving from toy models to production-scale language applications. Stay tuned for Part 2: “Tokenization and embeddings: how words become vectors in LLMs.”

Key Takeaways:

A simple bigram model illuminates the basics of next-character prediction using statistical counts.
A small neural network can learn more complex patterns across multiple characters, improving text generation.
Implementing models from scratch enhances understanding of the core mechanics behind modern large language models.
The next step involves tokenization and embeddings, which are crucial for real-world language processing.

For more comprehensive coverage of building modern language models from scratch, see the full SesameDisk LLM Series overview.
