Loss Function and Training Loop: Foundations of Language Model Training

Training large language models (LLMs) centers on the task of next-token prediction. The cornerstone of this process is cross-entropy loss, which measures the discrepancy between the model’s predicted probability distribution for the next token and the actual token observed in the training data. Formally, cross-entropy loss quantifies the negative log probability assigned to the correct next token, incentivizing the model to assign higher probabilities to true tokens over the course of training.
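
To make the definition concrete, here is a minimal sketch in plain Python that computes the cross-entropy for a single prediction. The four-token vocabulary, the logit values, and the helper names `softmax` and `cross_entropy` are illustrative assumptions, not part of any library.

```python
import math

def softmax(logits):
    """Convert unnormalized scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log probability assigned to the correct next token."""
    probs = softmax(logits)
    return -math.log(probs[target_index])

# Hypothetical 4-token vocabulary; the correct next token is index 2.
confident = cross_entropy([0.1, 0.2, 3.0, 0.3], target_index=2)
unsure = cross_entropy([1.0, 1.0, 1.0, 1.0], target_index=2)

print(confident < unsure)  # the confident, correct prediction has the lower loss
```

A uniform guess over four tokens gives a loss of ln(4); the loss falls toward zero as the probability on the correct token approaches one.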

The training procedure unfolds as a repetitive cycle known as the training loop. This loop involves four critical steps:

Forward pass: The model receives an input batch of token sequences and computes logits, which are unnormalized scores over the vocabulary for each possible next token.
Loss calculation: The cross-entropy loss function compares predicted logits against ground truth next tokens, producing a scalar loss value that summarizes the model’s prediction error.
Backward pass: Using automatic differentiation, gradients of the loss with respect to every model parameter are computed. This step involves propagating error signals backward through the network.
Optimizer step: An optimization algorithm (most commonly Adam or one of its variants) updates model parameters by applying computed gradients to minimize loss.

This process repeats iteratively across all batches and epochs, gradually improving the model’s ability to predict the next token in diverse contexts.

Here is a simplified example of a training loop using PyTorch syntax:

Note: The following code is an illustrative example and has not been verified against official documentation. Please refer to the official docs for production-ready code.
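
A minimal sketch of the four steps, assuming a toy embedding-plus-linear model and a single random batch in place of a real LLM and dataset. The model, sizes, and learning rate are hypothetical stand-ins chosen so the loop runs quickly on a CPU.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, seq_len, batch_size = 50, 8, 4  # toy sizes, chosen for illustration

# Hypothetical stand-in for a language model: an embedding followed by a linear head.
model = nn.Sequential(nn.Embedding(vocab_size, 32), nn.Linear(32, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)

# One fixed random batch reused at every step; real training iterates over a dataloader.
tokens = torch.randint(0, vocab_size, (batch_size, seq_len + 1))
input_ids, labels = tokens[:, :-1], tokens[:, 1:]  # labels are the inputs shifted by one

losses = []
for step in range(20):
    logits = model(input_ids)  # forward pass: (batch, seq, vocab)
    loss = F.cross_entropy(    # loss calculation over all positions
        logits.reshape(-1, vocab_size), labels.reshape(-1)
    )
    optimizer.zero_grad()      # clear gradients from the previous step
    loss.backward()            # backward pass
    optimizer.step()           # optimizer step
    losses.append(loss.item())

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

Because the same batch is reused, the loss should fall as the toy model memorizes it. With real data, the batch changes every step and the loss curve is noisier.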

Batching and Gradient Accumulation: Managing Scale and Efficiency

Scaling training to large models requires careful management of computational resources, particularly GPU or TPU memory. Two key techniques enable efficient training at scale: batching and gradient accumulation.

Batching refers to processing multiple training examples simultaneously instead of one by one. This is essential for making full use of parallel hardware capabilities. For instance, rather than processing a single sentence, the model processes a batch of 256 sequences in parallel. Larger batch sizes generally improve gradient estimates, stabilize training, and use hardware more efficiently.

However, hardware memory often limits the maximum batch size. This is where gradient accumulation is useful. Instead of increasing batch size beyond memory limits, the training process splits a large batch into smaller micro-batches. It computes gradients on each micro-batch, accumulates them, and performs a single optimizer update once the micro-batches processed add up to the target batch size.

The following code snippet shows gradient accumulation:

Note: The following code is an illustrative example and has not been verified against official documentation. Please refer to the official docs for production-ready code.

# Assumes model, optimizer, dataloader, cross_entropy_loss, and
# accumulation_steps are defined elsewhere.
optimizer.zero_grad()
for i, batch in enumerate(dataloader):
    outputs = model(batch['input_ids'])
    loss = cross_entropy_loss(outputs, batch['labels'])
    loss = loss / accumulation_steps  # Normalize loss
    loss.backward()  # Accumulate gradients

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()  # Update weights
        optimizer.zero_grad()  # Reset gradients

This approach balances hardware constraints against the statistical benefits of large-batch training. By accumulating gradients, it effectively simulates a large batch size while keeping memory usage manageable.
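
The equivalence can be checked by hand on a tiny problem. The sketch below uses a one-parameter model y = w * x with mean squared error (a hypothetical stand-in for a language model and its cross-entropy loss) and assumes equal-sized micro-batches, which is what makes dividing by `accumulation_steps` exact.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean squared error for the one-parameter model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]

# Gradient computed on the full batch of 8 examples at once.
full_batch_grad = mse_grad(w, xs, ys)

# Same batch split into 4 micro-batches of 2, scaled like `loss / accumulation_steps`.
accumulation_steps = 4
micro = len(xs) // accumulation_steps
accumulated_grad = 0.0
for i in range(accumulation_steps):
    chunk = slice(i * micro, (i + 1) * micro)
    accumulated_grad += mse_grad(w, xs[chunk], ys[chunk]) / accumulation_steps

print(abs(full_batch_grad - accumulated_grad) < 1e-9)
```

If the micro-batches differ in size, a plain average of their gradients no longer matches the full-batch gradient, and each contribution would need to be weighted by its example count.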

In real-world training of billion-parameter models, global batch sizes on the order of millions of tokens are common, spread across many GPUs or TPU cores with synchronized gradient updates.

Scaling Laws, Data Requirements, and Compute Costs

Understanding how model size, dataset size, and compute resources interact is critical to planning effective LLM training. Scaling laws describe empirical relationships discovered through extensive experimentation, guiding how to allocate resources for optimal model performance.

The Fundamental Compute Estimate

A practical and widely used formula estimates the total number of floating-point operations (FLOPs) required to train a model as:

\[
\text{FLOPs} \approx 6 \times N \times D
\]

where:

\(N\) is the number of parameters in the model,
\(D\) is the total number of tokens processed during training,
The factor 6 reflects roughly 2 FLOPs per parameter per token for the forward pass and about 4 for the backward pass; optimizer updates add comparatively little and are ignored.

This estimate is a starting point for budgeting compute resources. For instance, a model with 10 billion parameters trained on 1 trillion tokens would require roughly \(6 \times 10^{10} \times 10^{12} = 6 \times 10^{22}\) FLOPs.
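
The arithmetic is simple enough to wrap in a helper for quick budgeting. This is a sketch of the rule of thumb only; the function name `training_flops` is an assumption made for illustration.

```python
def training_flops(num_params, num_tokens):
    """Approximate training compute: 6 FLOPs per parameter per token."""
    return 6 * num_params * num_tokens

# The example from the text: 10 billion parameters, 1 trillion tokens.
flops = training_flops(num_params=10e9, num_tokens=1e12)
print(f"{flops:.1e}")  # prints 6.0e+22
```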

The Chinchilla Scaling Correction

The Chinchilla paper from DeepMind (2022) revised earlier assumptions by showing that many large models were undertrained with respect to their size. It found that, for a fixed compute budget, the best results come from a ratio of approximately 20 training tokens per parameter, balancing model size and dataset quantity.

This insight led to more efficient training strategies, such as Llama 2’s 70-billion-parameter model trained on about two trillion tokens, outperforming larger but undertrained models.
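
Combining the 20-tokens-per-parameter ratio with the 6 × N × D estimate gives a quick way to split a compute budget. The sketch below treats the ratio as exact, which is a simplification, and the function names are hypothetical. Substituting D = 20N into C = 6ND gives C = 120N², so N = sqrt(C / 120).

```python
import math

TOKENS_PER_PARAM = 20  # approximate ratio from the Chinchilla result

def compute_optimal_tokens(num_params):
    """Training tokens suggested for a model of the given size."""
    return TOKENS_PER_PARAM * num_params

def compute_optimal_split(flops_budget):
    """Solve C = 6 * N * D with D = 20 * N, which gives C = 120 * N**2."""
    num_params = math.sqrt(flops_budget / (6 * TOKENS_PER_PARAM))
    return num_params, TOKENS_PER_PARAM * num_params

n, d = compute_optimal_split(6e22)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
```

Under these assumptions, a budget of 6 × 10^22 FLOPs points to roughly 22 billion parameters trained on roughly 450 billion tokens, rather than 10 billion parameters on 1 trillion tokens.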

Data Quality and Quantity

Recent studies in 2026 reinforce that data quality and diversity are decisive for downstream performance. Simply scaling parameters or compute without sufficient or well-curated data leads to diminishing returns. Synthetic data, multi-modal datasets including code, audio, and video, and carefully filtered training corpora are now standard tools to extend effective dataset size and boost generalization.

For a deeper understanding of how tokenization and data choices affect language model training, see LLMs for Developers, Part 2: Tokenization and Embeddings, Moving from Characters to Subwords (2026).

Architectural Impact on Scaling

Research by Bian et al. (2025) expands scaling laws to include architectural factors. By tuning hidden sizes, layer ratios, and attention mechanisms, models can achieve better accuracy-inference cost trade-offs. This shift recognizes inference costs as a first-class constraint, not just training cost.

Mixture of Experts (MoE) architectures further decouple model size from inference cost by activating only a subset of experts per input token, enabling trillion-parameter models with manageable inference latency.
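
A rough way to see the decoupling is to count total versus active parameters. The sketch below is a simplification that lumps all layers together, and the configuration is hypothetical rather than taken from any published model.

```python
def moe_param_counts(shared_params, params_per_expert, num_experts, experts_per_token):
    """Return (total, active) parameter counts for a simplified MoE model."""
    total = shared_params + num_experts * params_per_expert
    active = shared_params + experts_per_token * params_per_expert
    return total, active

# Hypothetical configuration: 64 experts, 2 routed per token.
total, active = moe_param_counts(
    shared_params=20e9,
    params_per_expert=15e9,
    num_experts=64,
    experts_per_token=2,
)
print(f"total ~ {total:.2e}, active per token ~ {active:.2e}")
```

In this example the model stores about 980 billion parameters but uses only about 50 billion for any one token, so per-token compute tracks the smaller figure while memory requirements track the larger one.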

Train-to-Test Scaling Laws

The Train-to-Test (T2) framework integrates training and inference compute considerations. It shows that smaller models trained on more data and paired with repeated reasoning or sampling at inference can outperform massive models trained traditionally.

This has practical implications for AI app developers, who may prefer lighter models that generate multiple responses or reasoning chains rather than investing heavily in extreme training compute.
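
The inference side of this trade-off can be approximated with a common rule of thumb: a forward pass costs roughly 2 FLOPs per parameter per generated token, ignoring attention over long contexts. The sketch below applies that assumption to two hypothetical setups; the model sizes, response length, and sample count are illustrative and do not come from the T2 work.

```python
def inference_flops(num_params, tokens_per_response, num_samples=1):
    """Forward-pass-only cost: roughly 2 FLOPs per parameter per generated token."""
    return 2 * num_params * tokens_per_response * num_samples

# Hypothetical comparison: one response from a 70B model
# versus eight sampled responses from a 7B model.
large_single = inference_flops(70e9, tokens_per_response=500)
small_sampled = inference_flops(7e9, tokens_per_response=500, num_samples=8)

print(f"large, 1 sample:  {large_single:.1e}")
print(f"small, 8 samples: {small_sampled:.1e}")
```

Here the smaller model with eight samples still uses less inference compute per query than a single response from the larger one. Whether it also answers better is an empirical question that this arithmetic does not settle.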

Summary Table of Scaling Law Insights

Study	Year	Focus	Key Finding	Practical Advice
Kaplan et al.	2020	Params, tokens, compute	Loss follows power-law decay with compute	Scale model and data together
Chinchilla (DeepMind)	2022	Optimal tokens per param	Best results at ~20 tokens per param	Balance model size and dataset size
Bian et al.	2025	Architecture and inference cost	Architectural tuning affects scaling curves	Optimize architecture for inference efficiency
Zhang et al.	2026	Reinforcement learning scaling	Diminishing returns at scale	Carefully budget compute for RL post-training
Roberts et al. (T2)	2026	Training + inference compute	Smaller models + repeated inference sampling effective	Jointly optimize training and inference compute

Training Large Language Models in Practice: Strategies and Trade-offs

Training models at scale involves more than just raw compute. Developers must orchestrate data pipelines, optimize hardware use, and balance cost versus performance. Here are some practical considerations:

Hardware and Parallelism

Large models require distributed training across GPUs or TPUs with techniques such as data parallelism, model parallelism, and pipeline parallelism to split workloads efficiently.

Memory constraints often dictate batch sizes and require gradient accumulation strategies, as discussed previously.

Data Pipeline and Curation

Feeding vast datasets efficiently requires reliable data pipelines that can preprocess, shuffle, and batch data continuously. Quality filtering, deduplication, and domain balancing enhance the training signal.

Mixed Precision and Optimization Tricks

Training at scale regularly employs mixed precision (float16 or bfloat16) to reduce memory usage and increase throughput, typically with little or no loss of accuracy.

Techniques like gradient clipping, learning rate warm-up, and adaptive schedulers stabilize training and improve convergence rates.
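
Two of these techniques are short enough to sketch in plain Python. The schedule below pairs linear warm-up with cosine decay, which is one common choice and an assumption here, not something the article prescribes. The clipping function rescales a flat list of gradient values by their global L2 norm. In PyTorch, `torch.nn.utils.clip_grad_norm_` plays the role of the clipping helper.

```python
import math

def lr_at_step(step, base_lr, warmup_steps, total_steps):
    """Linear warm-up to base_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))

def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so their combined L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (norm + 1e-6))
    return [g * scale for g in grads]

# Hypothetical settings: 10 warm-up steps out of 100 total.
schedule = [lr_at_step(s, base_lr=1e-3, warmup_steps=10, total_steps=100) for s in range(100)]
clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)  # original norm is 5
print(f"peak lr {max(schedule):.1e}, clipped {clipped}")
```

Warm-up keeps early updates small while optimizer statistics are still unreliable, and clipping caps the size of any single update when a batch produces an unusually large gradient.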

Cost Management and Compute Budgeting

With training runs potentially costing tens of millions of dollars, estimating compute needs upfront using the FLOPs formula is essential.

Developers increasingly apply Train-to-Test scaling laws to allocate compute optimally across training and inference, often preferring smaller, overtrained models with efficient inference-time sampling.

Real-World Example: Training a Medium-Sized Model

Suppose you plan to train a 10-billion-parameter model on a dataset of 500 billion tokens. Using the FLOPs estimation:

\[
6 \times 10^{10} \times 5 \times 10^{11} = 3 \times 10^{22} \text{ FLOPs}
\]

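How long this takes depends on the hardware and how well it is utilized. Assuming A100-class GPUs with a bfloat16 peak of 312 TFLOP/s running at 40% utilization, the run needs roughly 67,000 GPU-hours, or about three days on a 1,000-GPU cluster. At cloud prices of a few dollars per GPU-hour, that is on the order of hundreds of thousands of dollars; slower hardware, lower utilization, or repeated runs raise the figure quickly.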
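
To turn a FLOPs figure into GPU time, divide by the sustained throughput of the hardware. The sketch below assumes a peak of 312 TFLOP/s per GPU (the bfloat16 figure for an A100-class accelerator) and 40% utilization. Both values are assumptions to replace with measurements from your own cluster, and the function name `gpu_hours` is hypothetical.

```python
def gpu_hours(total_flops, peak_flops_per_gpu, utilization):
    """GPU-hours needed at a given peak throughput and utilization fraction."""
    sustained = peak_flops_per_gpu * utilization  # FLOP/s actually achieved
    return total_flops / sustained / 3600

hours = gpu_hours(total_flops=3e22, peak_flops_per_gpu=312e12, utilization=0.4)
days_on_1000_gpus = hours / 1000 / 24

print(f"{hours:,.0f} GPU-hours, {days_on_1000_gpus:.1f} days on 1,000 GPUs")
```

Halving the utilization or the per-GPU peak doubles the GPU-hours, so the estimate is only as good as the throughput assumption behind it.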

Mixed precision can reduce wall-clock time and cost, gradient accumulation keeps memory use within hardware limits at the target batch size, and data curation improves model quality.

Conclusion and Next Steps

Training large language models is a complex process requiring mastery of loss functions, training loops, batching, and compute budgeting. Cross-entropy loss remains standard for next-token prediction, while batching and gradient accumulation allow scaling training to massive datasets and models.

Scaling laws provide critical guidance on balancing model size, dataset volume, and compute. The rule of thumb relating total FLOPs to model parameters and tokens remains useful for planning. However, the field increasingly emphasizes data quality, architectural tuning, and inference-time compute optimization.

By applying these principles, developers can train models that perform well while managing costs effectively. The emerging Train-to-Test framework offers a new lens to jointly optimize training and inference compute for practical applications.

In the next installment of this series, we will explore fine-tuning methods, instruction tuning, and alignment techniques that improve model usability and safety.

Key Takeaways:

Cross-entropy loss guides LLM training by penalizing low probability assigned to the correct next token.
The training loop cycles through the forward pass, loss calculation, backward pass, and optimizer update.
Batching and gradient accumulation enable training with large effective batch sizes despite hardware memory limits.
A common rule of thumb estimates training compute as roughly six times model parameters multiplied by training tokens, guiding resource planning.
Data quality, model architecture, and inference-time compute strategies increasingly drive performance and cost-efficiency.

For a deeper dive into the practical implications of scaling laws for training and inference compute budgets, see the VentureBeat article explaining Train-to-Test scaling.
