LFM2-24B-A2B: Scaling Up the LFM2 Architecture for Real Deployment
24 billion total parameters, about 2.3 billion active per forward pass, support for 32GB RAM deployments, and roughly 26.8K tokens per second on a single H100 SXM5. Those are the numbers that make LFM2-24B-A2B worth a close look right now. In a market where model launches often chase size for its own sake, Liquid AI is making a different argument: model capacity can grow faster than inference cost if the architecture is built for sparse execution from the start.
That matters for teams making budget and deployment decisions this quarter. AI projects are now judged less by benchmark screenshots and more by whether they can run inside latency, memory, and concurrency limits. LFM2-24B-A2B lands squarely in that gap between research ambition and production reality.
The release also updates the broader scaling debate covered in our earlier look at parameters versus computation. The central question has shifted from “How big can a model get?” to “How much useful capacity can you add without making serving unaffordable?” LFM2-24B-A2B is one of the clearer attempts to answer that with architecture, not just hardware.
Key Takeaways:
- LFM2-24B-A2B uses a sparse Mixture of Experts design with 24B total parameters but only about 2.3B active per forward pass.
- Liquid AI says the model was designed to fit in 32GB of RAM, which puts laptops, desktops, and edge hardware into scope for some deployment setups.
- The scaling recipe increases depth from 24 to 40 layers and expert count from 32 to 64 while keeping top-4 routing.
- The model reached about 26.8K total tokens per second on a single H100 SXM5 in a vLLM benchmark with 1,024 concurrent requests.
- Support for llama.cpp, vLLM, and SGLang makes the release more relevant to deployment teams than many research-only checkpoints.
Why LFM2-24B-A2B matters now
Most large-model announcements still force buyers into an old trade: if you want more quality, you pay more at inference time. That trade breaks down quickly in production. A model that looks strong in a static benchmark can still fail the real test if it blows through memory limits, stalls under concurrency, or needs infrastructure that only a handful of teams can afford.

Liquid AI’s official release frames LFM2-24B-A2B as an early checkpoint of its largest LFM2 model so far. It is open-weight and distributed through Hugging Face, with links to deployment documentation and a hosted playground. That matters because adoption tends to follow tooling support. Models with clean paths into llama.cpp, vLLM, and SGLang move faster from curiosity to actual pilots.
There is also a timing angle here. AI hiring is rising again because companies are pushing past prototypes and shipping systems into products, as discussed in our recent analysis of the 2026 software engineer job market. That shifts attention toward models that can run efficiently under enterprise constraints. Teams need throughput, memory discipline, and portability as much as raw benchmark scores.

LFM2-24B-A2B is also a reminder that architecture choices are back at the center of AI competition. For a while, the industry story was mostly larger clusters and longer training runs. This release puts the focus back on the model graph itself: what blocks are used, how routing works, and how much work each token actually triggers.
How the model scales without tripling inference cost
The easiest way to understand the scaling strategy is to separate total capacity from active compute. A dense model uses the same weights for every token. If you triple parameter count, you usually take a large hit in latency, memory traffic, and serving cost. Sparse models change that by activating only a subset of weights on each pass.
LFM2-24B-A2B uses a Mixture of Experts setup with 24 billion total parameters and about 2.3 billion active per forward pass. Liquid AI compares that with the earlier LFM2-8B-A1B recipe, where active parameters were about 1.5 billion. The important point is the ratio: total parameters rise by about 3x, while active parameters rise by about 1.5x. That is the core economic claim behind the model.
An intuitive analogy helps. Imagine a consulting firm with 64 specialists on staff. You do not bring all 64 into every client meeting. You route each problem to a small group with the right expertise. The firm can know more overall without turning every assignment into a giant, expensive committee call. That is what sparse routing tries to do for token processing.
Liquid AI says the scaling recipe has three main ingredients:
- Increase model depth from 24 layers to 40 layers.
- Increase the number of experts per MoE block from 32 to 64.
- Keep top-4 routing, while making each expert slightly narrower.
The narrowing matters. Each expert's intermediate size drops from 1792 in the earlier 8B recipe to 1536 here, which keeps the active path lean enough to preserve the edge-friendly deployment goal. The first two layers stay dense for training stability, a common practical move in sparse systems even when the rest of the stack leans heavily on routing.
From an engineering point of view, this is a very different scaling philosophy from simply widening every layer. It says: add representational headroom through specialization, but protect the per-token compute path. That is why the model can aim at both cloud serving and smaller local machines at the same time.
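To make the arithmetic concrete, here is a minimal Python sketch of how total and active parameters diverge in a top-k MoE feed-forward block. The expert count, top-4 routing, and per-expert intermediate size come from the release notes; the hidden size and the gated three-matrix expert shape are assumptions for illustration, not published model details.

```python
# Sketch of total-vs-active parameter arithmetic for a top-k MoE block.
# hidden_size and the gated (SwiGLU-style) expert shape are assumptions;
# num_experts, top_k, and intermediate_size are the figures Liquid AI cites.
hidden_size = 2048          # placeholder, not a published value
intermediate_size = 1536    # per-expert width in the 24B recipe
num_experts = 64
top_k = 4

# A gated expert MLP has three projection matrices (up, gate, down).
params_per_expert = 3 * hidden_size * intermediate_size

total_expert_params = num_experts * params_per_expert
active_expert_params = top_k * params_per_expert

print(f"expert params per block: {total_expert_params / 1e6:.0f}M")
print(f"active per token:        {active_expert_params / 1e6:.0f}M "
      f"({top_k / num_experts:.1%} of the block)")
```

The point of the sketch is the ratio: capacity grows with the expert count, while per-token compute grows only with the number of experts actually routed to.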
Inside the hybrid LFM2 design
The LFM2 family does not use a plain transformer stack. Liquid AI describes it as a hybrid architecture that combines gated short convolution blocks with a smaller number of grouped query attention blocks. The claimed benefit is faster prefill and decode at lower memory cost.
If you want a simple mental model, think of convolutions as local pattern readers and attention as a broader context lookup tool. Convolutional blocks can be cheaper and more hardware-friendly for some parts of the sequence-processing workload. Attention blocks still handle the longer-range interactions that matter for language tasks. The hybrid tries to use each where it earns its keep.
One numerical design detail stands out: the attention-to-convolution ratio stays at roughly 1:3, with 10 attention layers out of 40 total. That means most of the stack is still built around the cheaper block type. For teams that care about prefill speed and memory footprint, that ratio may matter more than the headline parameter count.
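The exact interleaving of attention and convolution blocks is not spelled out in the release, but a short sketch shows how a 1:3 split could be laid out across 40 layers. The even spacing below is purely illustrative; only the 10-of-40 split comes from Liquid AI.

```python
# Hypothetical hybrid layer plan: 10 grouped-query-attention blocks in a
# 40-layer stack, evenly spaced. The real placement may differ.
TOTAL_LAYERS = 40
ATTENTION_LAYERS = 10

stride = TOTAL_LAYERS // ATTENTION_LAYERS  # one attention block per 4 layers
layer_plan = [
    "gqa_attention" if i % stride == stride - 1 else "gated_short_conv"
    for i in range(TOTAL_LAYERS)
]

print(layer_plan.count("gqa_attention"), "attention layers")    # 10
print(layer_plan.count("gated_short_conv"), "convolution layers")  # 30
```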
This architectural direction fits a larger pattern in model development. The last few years trained the market to think every improvement had to come from larger dense transformers. That story is wearing thin because the deployment bill keeps rising. Hybrid backbones and sparse activation are a direct response to that pressure.
It also changes what implementers should optimize. With dense transformers, people obsess over matrix throughput and KV cache pressure. With a hybrid sparse model, routing behavior, expert balance, kernel efficiency, and quantization choices become just as important. That is a more complicated operational picture, but it also opens up more ways to improve real serving performance.
| Model characteristic | LFM2-8B-A1B recipe | LFM2-24B-A2B recipe | Source |
|---|---|---|---|
| Total parameters | 8.3B | 24B | Liquid AI |
| Active parameters per forward pass | 1.5B | 2.3B | Liquid AI |
| Layer count | 24 | 40 | Liquid AI |
| Experts per MoE block | 32 | 64 | Liquid AI |
| Intermediate size per expert | 1792 | 1536 | Liquid AI |
Benchmarks, throughput, and what the numbers imply
Liquid AI says quality improves log-linearly across the LFM2 family from 350M to 24B parameters on a benchmark set that includes GPQA Diamond, MMLU-Pro, IFEval, IFBench, GSM8K, and MATH-500. The broader point is more important than any single score: the architecture appears to keep scaling in a fairly predictable way across nearly two orders of magnitude.
The stronger signal for many buyers, though, is serving throughput. In a vLLM benchmark on a single H100 SXM5, LFM2-24B-A2B reached about 26.8K total tokens per second at 1,024 concurrent requests with 1,024 max input tokens and 512 max output tokens. Liquid AI says that setup used realistic interleaved prefill-and-decode workloads intended to reflect production serving and RL rollout generation.
Those numbers matter because they tie architecture to budget. If a model can sustain higher throughput at the same hardware tier, the cost per generated token falls. That is often more important than slightly stronger offline benchmark performance.
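A quick back-of-the-envelope calculation shows why throughput is the budget lever. The throughput figure below is the published one; the hourly GPU price is an assumption you should replace with your own rate.

```python
# Rough serving-cost estimate at full utilization.
throughput_tok_per_s = 26_800   # published vLLM figure, single H100 SXM5
gpu_cost_per_hour = 3.00        # USD, illustrative cloud rate (assumption)

tokens_per_hour = throughput_tok_per_s * 3600
cost_per_million_tokens = gpu_cost_per_hour / (tokens_per_hour / 1e6)

print(f"{tokens_per_hour / 1e6:.1f}M tokens per GPU-hour")
print(f"~${cost_per_million_tokens:.3f} per million tokens")
```

Real deployments rarely run at full utilization, so treat the result as a floor rather than a quote, but the direction holds: higher sustained throughput at the same hardware tier directly lowers the per-token bill.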
The company also compares the release against Qwen3-30B-A3B-Instruct-2507 and gpt-oss-20b in throughput testing. The useful verified figures from that comparison are the model sizes and active-parameter counts, which show why the efficiency claim is plausible: the LFM2 checkpoint is serving with a smaller active path than those alternatives.
| Model | Total parameters | Active parameters | Serving stack mentioned | Source |
|---|---|---|---|---|
| LFM2-24B-A2B | 24B | 2.3B | vLLM, llama.cpp, SGLang | Liquid AI |
| Qwen3-30B-A3B-Instruct-2507 | 30.5B | 3.3B | llama.cpp benchmark comparison | Liquid AI |
| gpt-oss-20b | 21B | 3.6B | llama.cpp benchmark comparison | Liquid AI |
There is also a practical signal in the quantization support. Liquid AI says GGUF builds are available for llama.cpp in Q4_0, Q4_K_M, Q5_K_M, Q6_K, Q8_0, and F16. That expands the range of target hardware and gives operators more ways to trade accuracy for footprint and speed.
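To see what those quantization levels mean in practice, here is a rough footprint estimate. The bits-per-weight values are approximate community figures for llama.cpp quant formats, not numbers from the Liquid AI release, and real GGUF files add overhead for metadata and non-quantized tensors.

```python
# Approximate weight footprint for each listed GGUF variant of a 24B model.
TOTAL_PARAMS = 24e9

approx_bits_per_weight = {   # rough community estimates, not official figures
    "Q4_0": 4.5,
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
    "F16": 16.0,
}

for name, bpw in approx_bits_per_weight.items():
    gib = TOTAL_PARAMS * bpw / 8 / 2**30
    print(f"{name:>7}: ~{gib:5.1f} GiB of weights")
```

The pattern is the familiar one: the 4- and 5-bit builds are the ones that plausibly fit a 32GB machine with room left for the runtime and KV cache, while F16 stays in data-center territory.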
How to run it in practice
One reason this release matters is that it arrives with day-zero support across widely used inference stacks: llama.cpp, vLLM, and SGLang. That shortens the path from evaluation to deployment. Teams can test on CPU, move to GPU, and compare quantized and full-precision variants without switching model families.
Liquid AI also says the model was designed to fit in 32GB of RAM. For edge and local deployments, that is arguably the biggest business detail in the release. It does not mean every setup will behave the same way across laptops, desktops, integrated GPUs, and NPUs, but it does mean the target class is broader than a standard data-center-only checkpoint.

Here is a practical example of how an engineering team might expose the model through a vLLM-compatible API. The command structure follows the vLLM module path referenced in common deployments, and the model identifier comes from the public Hugging Face release.
```bash
# Install vLLM in your Python environment.
# For exact platform requirements and supported CUDA versions,
# refer to the official documentation for vLLM and Liquid AI.
python -m vllm.entrypoints.openai.api_server \
    --model LiquidAI/LFM2-24B-A2B

# Example request against the server:
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "LiquidAI/LFM2-24B-A2B",
        "prompt": "Summarize the benefits of sparse mixture of experts models for inference efficiency.",
        "max_tokens": 120
    }'

# Note: production use should add auth, concurrency controls,
# observability, caching policy, prompt validation,
# and careful tuning for batch size and memory limits.
```
That example is intentionally narrow. In production, the bigger challenge is not “Can the model answer a prompt?” It is “How does it behave under concurrency, with realistic context lengths, and with the quantization level that fits your hardware?” Those questions decide the deployment bill.
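One way to start answering the concurrency question is a small probe against the server started above. This is a sketch, not a load-testing tool; it assumes the httpx package is installed, that the server exposes the OpenAI-compatible completions endpoint on localhost, and the concurrency level is deliberately modest next to the 1,024 requests in the published benchmark.

```python
# Minimal concurrency probe against a local OpenAI-compatible server.
import asyncio
import time

import httpx  # assumed available: pip install httpx

URL = "http://localhost:8000/v1/completions"
PAYLOAD = {
    "model": "LiquidAI/LFM2-24B-A2B",
    "prompt": "Explain sparse mixture of experts routing in two sentences.",
    "max_tokens": 64,
}
CONCURRENCY = 32  # far below the 1,024 used in the published benchmark

async def one_request(client: httpx.AsyncClient) -> int:
    resp = await client.post(URL, json=PAYLOAD, timeout=120)
    resp.raise_for_status()
    return resp.json()["usage"]["completion_tokens"]

async def main() -> None:
    start = time.perf_counter()
    async with httpx.AsyncClient() as client:
        counts = await asyncio.gather(
            *(one_request(client) for _ in range(CONCURRENCY))
        )
    elapsed = time.perf_counter() - start
    total = sum(counts)
    print(f"{total} completion tokens in {elapsed:.1f}s "
          f"(~{total / elapsed:.0f} tok/s across {CONCURRENCY} requests)")

asyncio.run(main())
```

Sweeping the concurrency level, prompt length, and output budget against the quantization you actually plan to run gives a far better picture of the deployment bill than any single published number.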
Trade-offs and where simpler options may still win
Sparse hybrids have clear strengths, but they also make system behavior more complex. Routing introduces another moving part. Expert balance matters. Kernel efficiency matters. Quantization interacts with architecture choices. That means a model like this may beat a dense alternative on throughput and memory, yet still require more tuning effort to get the best result.
There is also a product-level trade-off in the current checkpoint. Liquid AI says the release used a lightweight post-training path to ship a traditional instruct model without reasoning traces. That likely helps time-to-release and keeps the model aligned with common chat and instruction-following workloads. But teams whose use cases lean heavily on chain-style reasoning or visible intermediate steps may evaluate it differently than a model explicitly tuned around those behaviors.
Another important point: simpler approaches still win in many cases. If your task is narrow, deterministic, and latency-sensitive, a smaller dense model or even a non-generative pipeline may remain the better engineering choice. AI teams still overuse large models for jobs that are closer to retrieval, classification, or templated generation. The existence of a more efficient 24B-class sparse model does not erase that design discipline.
This is where the article ties back to broader industry behavior. Companies are hiring for deployment, infra, and AI integration because model choice is now an operational decision, not just an R&D one. Efficient architectures widen the set of feasible use cases, but they do not remove the need to match tools to workload.
What to watch next
The next milestone is already clear. Liquid AI says LFM2-24B-A2B has been trained on 17T tokens so far and that pre-training is still running. The company says to expect LFM2.5-24B-A2B after training completes, with additional post-training and reinforcement learning. That means the current release should be read as both a usable checkpoint and a signal about the direction of the family.
There are three things worth watching from here.
- First, whether the architecture keeps its throughput lead as more independent operators test it across llama.cpp, vLLM, and SGLang.
- Second, whether the 32GB RAM target translates into consistent local deployment wins on real consumer systems, not just controlled demos.
- Third, whether the follow-up post-training phase moves quality enough to make the architecture competitive not just on efficiency, but on preference-based application outcomes.
The bigger market implication is hard to miss. AI deployment is moving toward models that separate capacity from active compute. That is a direct response to cost pressure, hardware limits, and the growing push toward local and edge inference. LFM2-24B-A2B is one of the sharper examples of that shift so far.
For engineers, the lesson is simple: architecture literacy is back. Understanding sparse routing, hybrid backbones, quantization options, and serving behavior under concurrency now matters as much as knowing parameter count. The teams that learn to evaluate models on those terms will make better deployment bets than the teams still buying benchmarks at face value.