LFM2-24B-A2B: Scaling Up the LFM2 Architecture for Real Deployment
24 billion total parameters, about 2.3 billion active per forward pass, support for 32GB RAM deployments, and roughly 26.8K tokens per second on a single H100 SXM5. Those are the numbers that make LFM2-24B-A2B worth a close look right now. In a market where model launches often chase size for its own sake, Liquid AI is making a different argument: model capacity can grow faster than inference cost if the architecture is built for sparse execution from the start.
That matters for teams making budget and deployment decisions this quarter. AI projects are now judged less by benchmark screenshots and more by whether they can run inside latency, memory, and concurrency limits. LFM2-24B-A2B lands squarely in that gap between research ambition and production reality.
The release also updates the broader scaling debate covered in our earlier look at parameters versus computation. The central question has shifted from “How big can a model get?” to “How much useful capacity can you add without making serving unaffordable?” LFM2-24B-A2B is one of the clearer attempts to answer that with architecture, not just hardware.
Key Takeaways:
- LFM2-24B-A2B uses a sparse Mixture of Experts design with 24B total parameters but only about 2.3B active per forward pass.
- Liquid AI says the model was designed to fit in 32GB of RAM, which puts laptops, desktops, and edge hardware into scope for some deployment setups.
- The scaling recipe increases depth from 24 to 40 layers and expert count from 32 to 64 while keeping top-4 routing.
- The model reached about 26.8K total tokens per second on a single H100 SXM5 in a vLLM benchmark with 1,024 concurrent requests.
- Support for llama.cpp, vLLM, and SGLang makes the release more relevant to deployment teams than many research-only checkpoints.
Why LFM2-24B-A2B matters now
Most large-model announcements still force buyers into an old trade: if you want more quality, you pay more at inference time. That trade breaks down quickly in production. A model that looks strong in a static benchmark can still fail the real test if it blows through memory limits, stalls under concurrency, or needs infrastructure that only a handful of teams can afford.

Liquid AI’s official release frames LFM2-24B-A2B as an early checkpoint of its largest LFM2 model so far. It is open-weight and distributed through Hugging Face, with links to deployment documentation and a hosted playground. That matters because adoption tends to follow tooling support. Models with clean paths into llama.cpp, vLLM, and SGLang move faster from curiosity to actual pilots.
There is also a timing angle here. AI hiring is rising again because companies are pushing past prototypes and shipping systems into products, as discussed in our recent analysis of the 2026 software engineer job market. That shifts attention toward models that can run efficiently under enterprise constraints. Teams need throughput, memory discipline, and portability as much as raw benchmark scores.

LFM2-24B-A2B is also a reminder that architecture choices are back at the center of AI competition. For a while, the industry story was mostly larger clusters and longer training runs. This release puts the focus back on the model graph itself: what blocks are used, how routing works, and how much work each token actually triggers.
How the model scales without tripling inference cost
The easiest way to understand the scaling strategy is to separate total capacity from active compute. A dense model uses the same weights for every token. If you triple parameter count, you usually take a large hit in latency, memory traffic, and serving cost. Sparse models change that by activating only a subset of weights on each pass.
LFM2-24B-A2B uses a Mixture of Experts setup with 24 billion total parameters and about 2.3 billion active per forward pass. Liquid AI compares that with the earlier LFM2-8B-A1B recipe, where active parameters were about 1.5 billion. The important point is the ratio: total parameters rise by about 3x, while active parameters rise by about 1.5x. That is the core economic claim behind the model.
An intuitive analogy helps. Imagine a consulting firm with 64 specialists on staff. You do not bring all 64 into every client meeting. You route each problem to a small group with the right expertise. The firm can know more overall without turning every assignment into a giant, expensive committee call. That is what sparse routing tries to do for token processing.
Liquid AI says the scaling recipe has three main ingredients:
- Increase model depth from 24 layers to 40 layers.
- Increase the number of experts per MoE block from 32 to 64.
- Keep top-4 routing, while making each expert slightly narrower.
The narrowing matters. Each expert's intermediate size drops from 1792 in the earlier 8B recipe to 1536 here, which keeps the active path lean enough to preserve the edge-friendly deployment goal. The first two layers stay dense for training stability, a common practical move in sparse systems even when the rest of the stack leans heavily on routing.
From an engineering point of view, this is a very different scaling philosophy from simply widening every layer. It says: add representational headroom through specialization, but protect the per-token compute path. That is why the model can aim at both cloud serving and smaller local machines at the same time.
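To make the arithmetic concrete, here is a minimal Python sketch of how total and active parameters diverge in a top-k MoE feed-forward block. The expert count, top-4 routing, and per-expert intermediate size come from the release notes; the hidden size and the gated three-matrix expert shape are assumptions for illustration, not published model details.

```python
# Sketch of total-vs-active parameter arithmetic for a top-k MoE block.
# hidden_size and the gated (SwiGLU-style) expert shape are assumptions;
# num_experts, top_k, and intermediate_size are the figures Liquid AI cites.
hidden_size = 2048          # placeholder, not a published value
intermediate_size = 1536    # per-expert width in the 24B recipe
num_experts = 64
top_k = 4

# A gated expert MLP has three projection matrices (up, gate, down).
params_per_expert = 3 * hidden_size * intermediate_size

total_expert_params = num_experts * params_per_expert
active_expert_params = top_k * params_per_expert

print(f"expert params per block: {total_expert_params / 1e6:.0f}M")
print(f"active per token:        {active_expert_params / 1e6:.0f}M "
      f"({top_k / num_experts:.1%} of the block)")
```

The point of the sketch is the ratio: capacity grows with the expert count, while per-token compute grows only with the number of experts actually routed to.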
Inside the hybrid LFM2 design
The LFM2 family does not use a plain transformer stack. Liquid AI describes it as a hybrid architecture that combines gated short convolution blocks with a smaller number of grouped query attention blocks. The claimed benefit is faster prefill and decode at lower memory cost.
If you want a simple mental model, think of convolutions as local pattern readers and attention as a broader context lookup tool. Convolutional blocks can be cheaper and more hardware-friendly for some parts of the sequence-processing workload. Attention blocks still handle the longer-range interactions that matter for language tasks. The hybrid tries to use each where it earns its keep.
One numerical design detail stands out: the attention-to-convolution ratio stays at roughly 1:3, with 10 attention layers out of 40 total. That means most of the stack is still built around the cheaper block type. For teams that care about prefill speed and memory footprint, that ratio may matter more than the headline parameter count.
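The exact interleaving of attention and convolution blocks is not spelled out in the release, but a short sketch shows how a 1:3 split could be laid out across 40 layers. The even spacing below is purely illustrative; only the 10-of-40 split comes from Liquid AI.

```python
# Hypothetical hybrid layer plan: 10 grouped-query-attention blocks in a
# 40-layer stack, evenly spaced. The real placement may differ.
TOTAL_LAYERS = 40
ATTENTION_LAYERS = 10

stride = TOTAL_LAYERS // ATTENTION_LAYERS  # one attention block per 4 layers
layer_plan = [
    "gqa_attention" if i % stride == stride - 1 else "gated_short_conv"
    for i in range(TOTAL_LAYERS)
]

print(layer_plan.count("gqa_attention"), "attention layers")    # 10
print(layer_plan.count("gated_short_conv"), "convolution layers")  # 30
```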
This architectural direction fits a larger pattern in model development. The last few years trained the market to think every improvement had to come from larger dense transformers. That story is wearing thin because the deployment bill keeps rising. Hybrid backbones and sparse activation are a direct response to that pressure.
It also changes what implementers should optimize. With dense transformers, people obsess over matrix throughput and KV cache pressure. With a hybrid sparse model, routing behavior, expert balance, kernel efficiency, and quantization choices become just as important. That is a more complicated operational picture, but it also opens up more ways to improve real serving performance.
| Model characteristic | LFM2-8B-A1B recipe | LFM2-24B-A2B recipe | Source |
|---|---|---|---|
| Total parameters | 8.3B | 24B | Liquid AI |
| Active parameters per forward pass | 1.5B | 2.3B | Liquid AI |
| Layer count | 24 | 40 | Liquid AI |
| Experts per MoE block | 32 | 64 | Liquid AI |
| Intermediate size per expert | 1792 | 1536 | Liquid AI |
Benchmarks, throughput, and what the numbers imply
Liquid AI says quality improves log-linearly across the LFM2 family from 350M to 24B parameters on a benchmark set that includes GPQA Diamond, MMLU-Pro, IFEval, IFBench, GSM8K, and MATH-500. The broader point is more important than any single score: the architecture appears to keep scaling in a fairly predictable way across nearly two orders of magnitude.
The stronger signal for many buyers, though, is serving throughput. In a vLLM benchmark on a single H100 SXM5, LFM2-24B-A2B reached about 26.8K total tokens per second at 1,024 concurrent requests with 1,024 max input tokens and 512 max output tokens. Liquid AI says that setup used realistic interleaved prefill-and-decode workloads intended to reflect production serving and RL rollout generation.
Those numbers matter because they tie architecture to budget. If a model can sustain higher throughput at the same hardware tier, the cost per generated token falls. That is often more important than slightly stronger offline benchmark performance.
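A quick back-of-the-envelope calculation shows why throughput is the budget lever. The throughput figure below is the published one; the hourly GPU price is an assumption you should replace with your own rate.

```python
# Rough serving-cost estimate at full utilization.
throughput_tok_per_s = 26_800   # published vLLM figure, single H100 SXM5
gpu_cost_per_hour = 3.00        # USD, illustrative cloud rate (assumption)

tokens_per_hour = throughput_tok_per_s * 3600
cost_per_million_tokens = gpu_cost_per_hour / (tokens_per_hour / 1e6)

print(f"{tokens_per_hour / 1e6:.1f}M tokens per GPU-hour")
print(f"~${cost_per_million_tokens:.3f} per million tokens")
```

Real deployments rarely run at full utilization, so treat the result as a floor rather than a quote, but the direction holds: higher sustained throughput at the same hardware tier directly lowers the per-token bill.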
The company also compares the release against Qwen3-30B-A3B-Instruct-2507 and gpt-oss-20b in throughput testing. The useful verified figures from that comparison are the model sizes and active-parameter counts, which show why the efficiency claim is plausible: the LFM2 checkpoint is serving with a smaller active path than those alternatives.
| Model | Total parameters | Active parameters | Serving stack mentioned | Source |
|---|---|---|---|---|
| LFM2-24B-A2B | 24B | 2.3B | vLLM, llama.cpp, SGLang | Liquid AI |
| Qwen3-30B-A3B-Instruct-2507 | 30.5B | 3.3B | llama.cpp benchmark comparison | Liquid AI |
| gpt-oss-20b | 21B | 3.6B | llama.cpp benchmark comparison | Liquid AI |
There is also a practical signal in the quantization support. Liquid AI says GGUF builds are available for llama.cpp in Q4_0, Q4_K_M, Q5_K_M, Q6_K, Q8_0, and F16. That expands the range of target hardware and gives operators more ways to trade accuracy for footprint and speed.
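To see what those quantization levels mean in practice, here is a rough footprint estimate. The bits-per-weight values are approximate community figures for llama.cpp quant formats, not numbers from the Liquid AI release, and real GGUF files add overhead for metadata and non-quantized tensors.

```python
# Approximate weight footprint for each listed GGUF variant of a 24B model.
TOTAL_PARAMS = 24e9

approx_bits_per_weight = {   # rough community estimates, not official figures
    "Q4_0": 4.5,
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
    "F16": 16.0,
}

for name, bpw in approx_bits_per_weight.items():
    gib = TOTAL_PARAMS * bpw / 8 / 2**30
    print(f"{name:>7}: ~{gib:5.1f} GiB of weights")
```

The pattern is the familiar one: the 4- and 5-bit builds are the ones that plausibly fit a 32GB machine with room left for the runtime and KV cache, while F16 stays in data-center territory.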
How to run it in practice
One reason this release matters is that it arrives with day-zero support across widely used inference stacks: llama.cpp, vLLM, and SGLang. That shortens the path from evaluation to deployment. Teams can test on CPU, move to GPU, and compare quantized and full-precision variants without switching model families.
Liquid AI also says the model was designed to fit in 32GB of RAM. For edge and local deployments, that is arguably the biggest business detail in the release. It does not mean every setup will behave the same way across laptops, desktops, integrated GPUs, and NPUs, but it does mean the target class is broader than a standard data-center-only checkpoint.

Here is a practical example of how an engineering team might expose the model through a vLLM-compatible API. The command structure follows the vLLM module path referenced in common deployments, and the model identifier comes from the public Hugging Face release.
```bash
# Install vLLM in your Python environment.
# For exact platform requirements and supported CUDA versions,
# refer to the official documentation for vLLM and Liquid AI.
python -m vllm.entrypoints.openai.api_server \
    --model LiquidAI/LFM2-24B-A2B

# Example request against the server:
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "LiquidAI/LFM2-24B-A2B",
        "prompt": "Summarize the benefits of sparse mixture of experts models for inference efficiency.",
        "max_tokens": 120
    }'

# Note: production use should add auth, concurrency controls,
# observability, caching policy, prompt validation,
# and careful tuning for batch size and memory limits.
```
That example is intentionally narrow. In production, the bigger challenge is not “Can the model answer a prompt?” It is “How does it behave under concurrency, with realistic context lengths, and with the quantization level that fits your hardware?” Those questions decide the deployment bill.
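One way to start answering the concurrency question is a small probe against the server started above. This is a sketch, not a load-testing tool; it assumes the httpx package is installed, that the server exposes the OpenAI-compatible completions endpoint on localhost, and the concurrency level is deliberately modest next to the 1,024 requests in the published benchmark.

```python
# Minimal concurrency probe against a local OpenAI-compatible server.
import asyncio
import time

import httpx  # assumed available: pip install httpx

URL = "http://localhost:8000/v1/completions"
PAYLOAD = {
    "model": "LiquidAI/LFM2-24B-A2B",
    "prompt": "Explain sparse mixture of experts routing in two sentences.",
    "max_tokens": 64,
}
CONCURRENCY = 32  # far below the 1,024 used in the published benchmark

async def one_request(client: httpx.AsyncClient) -> int:
    resp = await client.post(URL, json=PAYLOAD, timeout=120)
    resp.raise_for_status()
    return resp.json()["usage"]["completion_tokens"]

async def main() -> None:
    start = time.perf_counter()
    async with httpx.AsyncClient() as client:
        counts = await asyncio.gather(
            *(one_request(client) for _ in range(CONCURRENCY))
        )
    elapsed = time.perf_counter() - start
    total = sum(counts)
    print(f"{total} completion tokens in {elapsed:.1f}s "
          f"(~{total / elapsed:.0f} tok/s across {CONCURRENCY} requests)")

asyncio.run(main())
```

Sweeping the concurrency level, prompt length, and output budget against the quantization you actually plan to run gives a far better picture of the deployment bill than any single published number.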
Trade-offs and where simpler options may still win
Sparse hybrids have clear strengths, but they also make system behavior more complex. Routing introduces another moving part. Expert balance matters. Kernel efficiency matters. Quantization interacts with architecture choices. That means a model like this may beat a dense alternative on throughput and memory, yet still require more tuning effort to get the best result.
There is also a product-level trade-off in the current checkpoint. Liquid AI says the release used a lightweight post-training path to ship a traditional instruct model without reasoning traces. That likely helps time-to-release and keeps the model aligned with common chat and instruction-following workloads. But teams whose use cases lean heavily on chain-style reasoning or visible intermediate steps may evaluate it differently than a model explicitly tuned around those behaviors.
Another important point: simpler approaches still win in many cases. If your task is narrow, deterministic, and latency-sensitive, a smaller dense model or even a non-generative pipeline may remain the better engineering choice. AI teams still overuse large models for jobs that are closer to retrieval, classification, or templated generation. The existence of a more efficient 24B-class sparse model does not erase that design discipline.
This is where the article ties back to broader industry behavior. Companies are hiring for deployment, infra, and AI integration because model choice is now an operational decision, not just an R&D one. Efficient architectures widen the set of feasible use cases, but they do not remove the need to match tools to workload.
What to watch next
The next milestone is already clear. Liquid AI says LFM2-24B-A2B has been trained on 17T tokens so far and that pre-training is still running. The company says to expect LFM2.5-24B-A2B after training completes, with additional post-training and reinforcement learning. That means the current release should be read as both a usable checkpoint and a signal about the direction of the family.
There are three things worth watching from here.
- First, whether the architecture keeps its throughput lead as more independent operators test it across llama.cpp, vLLM, and SGLang.
- Second, whether the 32GB RAM target translates into consistent local deployment wins on real consumer systems, not just controlled demos.
- Third, whether the follow-up post-training phase moves quality enough to make the architecture competitive not just on efficiency, but on preference-based application outcomes.
The bigger market implication is hard to miss. AI deployment is moving toward models that separate capacity from active compute. That is a direct response to cost pressure, hardware limits, and the growing push toward local and edge inference. LFM2-24B-A2B is one of the sharper examples of that shift so far.
For engineers, the lesson is simple: architecture literacy is back. Understanding sparse routing, hybrid backbones, quantization options, and serving behavior under concurrency now matters as much as knowing parameter count. The teams that learn to evaluate models on those terms will make better deployment bets than the teams still buying benchmarks at face value.