Running Llama 3.1 70B on RTX 3090 via NVMe-to-GPU

Learn how to run Llama 3.1 70B on an RTX 3090 using NVMe-to-GPU technology, bypassing the CPU for efficient local AI inference.

Running Llama 3.1 70B on a single consumer GPU was once considered out of reach, but a recent project demonstrates that, with the right optimizations, it’s now possible—albeit with tradeoffs. Using an NVMe-to-GPU data path to bypass the CPU, this approach achieves inference of the 70B-parameter model on an NVIDIA RTX 3090, pushing the boundaries for local AI experimentation and edge inference.

Key Takeaways:

  • Demonstrates Llama 3.1 70B inference on a single RTX 3090 using NVMe-to-GPU direct transfer, bypassing the CPU in the model data path
  • NVMe-to-GPU architecture allows models much larger than VRAM capacity to run by streaming data directly to the GPU
  • Achieves approximately 0.3 tokens/sec on 70B Q4_K_M quantization—too slow for production, but a breakthrough for consumer hardware
  • Relies on a three-tier adaptive caching system: VRAM, pinned RAM, and NVMe/mmap
  • Enables affordable local LLM experimentation and research on hardware previously considered insufficient for models of this scale

Innovation Overview: Llama 3.1 70B on RTX 3090

The Show HN project, covered by AIToolly, presents a working demonstration of Llama 3.1 70B inference on a single NVIDIA RTX 3090 (24GB VRAM). The technical leap is the use of NVMe-to-GPU direct data transfer, which allows model weights to move from fast NVMe storage directly into GPU memory—bypassing CPU intervention in the main data path. This method overcomes RAM and PCIe bottlenecks, letting the system dynamically “page” model weights into GPU memory as needed. According to the AIToolly report, the project is hosted on GitHub as xaskasdf/ntransformer.
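To see why streaming is unavoidable, a quick back-of-envelope calculation shows that the quantized weights alone exceed the card's 24GB of VRAM. The bits-per-weight figure below is an approximation for Q4_K_M files, not a number taken from the project:

# Rough footprint estimate for Llama 3.1 70B at Q4_K_M quantization.
# ~4.8 bits/weight is an approximation for Q4_K_M, not a project figure.
PARAMS = 70e9            # ~70 billion parameters
BITS_PER_WEIGHT = 4.8    # approximate effective size of Q4_K_M
VRAM_GB = 24             # RTX 3090

weights_gb = PARAMS * BITS_PER_WEIGHT / 8 / 1e9
print(f"Quantized weights: ~{weights_gb:.0f} GB vs. {VRAM_GB} GB VRAM")
# -> roughly 42 GB of weights, so ~20 GB or more must live off-GPU
#    and be streamed in from NVMe (or pinned RAM) during inference.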

This advance has drawn attention on platforms like Hacker News, highlighting the growing interest in democratizing access to large language models beyond enterprise or cloud deployments. Practitioners exploring efficient AI on minimal hardware will find that this project aligns with the ethos of extracting maximum value from consumer-grade GPUs.

For context on pushing hardware boundaries in related areas, see our posts on header-only rasterizers and AI workflow optimization.

Expanding Local AI Possibilities

Running high-parameter LLMs locally opens doors to privacy-focused applications, rapid prototyping, and research where cloud access is impractical or undesirable. With NVMe-to-GPU techniques, tasks like secure document processing and air-gapped deployments become viable even on single-GPU workstations.

Technical and Operational Hurdles

Despite this breakthrough, deploying massive models on consumer GPUs still faces hurdles: thermal limits, power draw, model quantization requirements, and complex caching logic must all be addressed for real-world usability. The current approach is a proof-of-concept rather than a plug-and-play production solution.

How NVMe-to-GPU Bypasses the CPU

Traditional Data Flow Barriers

In standard LLM inference, model weights are read from storage, copied into system RAM by the CPU, and then transferred over PCIe to the GPU. For models that exceed VRAM, every active block of weights makes this trip, and the CPU must orchestrate constant paging, which introduces bottlenecks (a minimal sketch of this path follows the list below):

  • Model size is constrained by RAM and PCIe bandwidth
  • CPU is heavily involved in data movement and paging logic
  • Latency and throughput are held back by memory copy overhead
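For contrast, here is a minimal sketch of that conventional two-hop path using PyTorch; the checkpoint path is a placeholder, not anything from the project:

# Conventional path: NVMe -> CPU/system RAM -> PCIe -> GPU VRAM.
# Placeholder checkpoint path; PyTorch used for illustration only.
import torch

# Hop 1: the CPU reads the checkpoint from storage into system RAM.
state_dict = torch.load("weights.pt", map_location="cpu")  # hypothetical file

# Hop 2: each tensor is copied again, over PCIe, into GPU memory.
gpu_weights = {name: t.to("cuda") for name, t in state_dict.items()}

# Every byte crosses storage -> RAM -> PCIe -> VRAM with the CPU driving
# both hops; for a >40 GB model this double copy is the bottleneck the
# NVMe-to-GPU approach is designed to avoid.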

Direct NVMe-to-GPU Transfer Explained

The new method bypasses the CPU for weight transfers by streaming data directly from NVMe storage to the GPU. The project implements a three-tier adaptive caching strategy:

  • VRAM Resident: Holds as much of the model as possible in GPU memory for fastest access
  • Pinned RAM: Uses system RAM as a secondary cache for recently used blocks
  • NVMe/mmap: Fetches model weights from NVMe storage on-demand and maps them into memory as needed

This approach essentially reimplements the Linux kernel’s page cache, but with explicit GPU-awareness. The CPU’s role is minimized in the inference data path, allowing the GPU to operate largely independently during inference. For precise implementation details, consult the xaskasdf/ntransformer repository referenced in the AIToolly article.
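The NVMe/mmap tier can be pictured as memory-mapping the weight file and touching only the blocks a given layer needs. The sketch below uses numpy.memmap with invented offsets, dtype, and block size, not the project's real file format:

# Sketch of the NVMe/mmap tier: map the quantized weight file and read only
# the block needed right now. Offsets, dtype, and sizes are invented for
# illustration; see xaskasdf/ntransformer for the real format.
import numpy as np

weights = np.memmap("model-q4_k_m.bin", dtype=np.uint8, mode="r")  # hypothetical file

def load_block(offset_bytes: int, size_bytes: int) -> np.ndarray:
    """Touch only the pages backing this block; the OS faults them in from NVMe."""
    return np.asarray(weights[offset_bytes:offset_bytes + size_bytes])

block = load_block(offset_bytes=0, size_bytes=4 << 20)  # e.g. a 4 MiB block
# In the real pipeline this block would then be uploaded to (or already be
# cached in) pinned RAM or VRAM rather than copied into ordinary heap memory.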

| Data Flow | Traditional LLM Inference | NVMe-to-GPU Bypass |
| --- | --- | --- |
| Storage to VRAM | Storage → CPU → RAM → GPU | Storage → GPU (direct) |
| CPU Involvement | High (orchestration, paging) | Minimal (orchestration only) |
| Major Bottlenecks | PCIe, RAM, CPU overhead | NVMe throughput, VRAM size |
| Max Model Size | <24B (VRAM-limited) | 70B (quantized, streamed) |

This architecture is not suitable for latency-sensitive production services, but it represents a major step forward for large model accessibility on mainstream hardware. (Hacker News)

Practical Benchmarks and Performance

Understanding the 0.3 Tokens/Sec Result

Per the primary sources, the project achieves roughly 0.3 tokens per second on Llama 3.1 70B Q4_K_M quantization, running on a single RTX 3090. For interactive chat, this means a single response may take several minutes. The significance is not raw speed, but the fact that a 70B model can run at all without a GPU cluster or cloud backend (Hacker News).
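To put 0.3 tokens/sec in perspective, here is the wall-clock time for a chat-length reply; the 512-token response length is just an example, not a project benchmark:

# Wall-clock time for one response at the reported ~0.3 tokens/sec.
# The 512-token response length is an arbitrary example.
tokens_per_sec = 0.3
response_tokens = 512

seconds = response_tokens / tokens_per_sec
print(f"~{seconds / 60:.0f} minutes per response")  # roughly 28 minutes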

Key Performance Factors

  • NVMe bandwidth and latency are the primary constraints, not PCIe or RAM
  • Q4_K_M quantization or similar is required to fit weights within available VRAM
  • The three-tier caching algorithm (VRAM, pinned RAM, NVMe) determines effective throughput and cache miss rates

These benchmarks should be viewed as a proof-of-concept for hardware-limited environments, not a production-grade deployment.
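Still, a rough sanity check shows why NVMe throughput is the binding constraint. All of the numbers below (weight size, resident fraction, drive bandwidth) are illustrative assumptions, not measurements from the project:

# Back-of-envelope: if the weights that don't fit in VRAM must be streamed
# from NVMe for every token, drive bandwidth caps throughput.
# All figures are illustrative assumptions, not project measurements.
weights_gb = 42          # ~70B params at roughly 4.8 bits/weight
vram_resident_gb = 20    # assume ~20 GB of weights kept in VRAM
nvme_gbps = 7            # sequential read of a fast PCIe 4.0 NVMe drive

streamed_per_token_gb = weights_gb - vram_resident_gb
seconds_per_token = streamed_per_token_gb / nvme_gbps
print(f"~{1 / seconds_per_token:.2f} tokens/sec upper bound")  # ~0.32
# Caching in pinned RAM and access-pattern tricks move the real figure up
# or down, but the order of magnitude matches the ~0.3 tok/s report.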

Pipeline Example: Adaptive Caching (Conceptual)

# Pseudocode for model paging and inference (actual APIs and code: see xaskasdf/ntransformer)
# 1. Load quantized weight blocks from NVMe into GPU VRAM as they are needed
# 2. If VRAM is full, evict the least-recently-used blocks (the weights remain on NVMe); optionally keep hot blocks pinned in system RAM
# 3. During inference, always fetch weights from the highest available tier (VRAM > pinned RAM > NVMe)
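Expanding on this pseudocode, here is a minimal, hypothetical Python sketch of a three-tier lookup with LRU eviction; the class and method names are invented and do not mirror the project's actual implementation:

# Hypothetical three-tier weight cache (VRAM > pinned RAM > NVMe/mmap).
# Names and structure are invented for illustration; the real logic lives
# in xaskasdf/ntransformer.
from collections import OrderedDict

class TieredWeightCache:
    def __init__(self, vram_budget: int, ram_budget: int, nvme_reader):
        self.vram = OrderedDict()       # block_id -> GPU buffer (LRU order)
        self.ram = OrderedDict()        # block_id -> pinned host buffer
        self.vram_budget = vram_budget  # max blocks resident in VRAM
        self.ram_budget = ram_budget    # max blocks resident in pinned RAM
        self.nvme_reader = nvme_reader  # callable: block_id -> bytes (mmap-backed)

    def get(self, block_id):
        # Fastest tier first: a VRAM hit needs no transfer at all.
        if block_id in self.vram:
            self.vram.move_to_end(block_id)
            return self.vram[block_id]
        # Second tier: copy from pinned RAM over PCIe.
        if block_id in self.ram:
            data = self.ram.pop(block_id)
        else:
            # Slowest tier: fault the block in from NVMe via mmap.
            data = self.nvme_reader(block_id)
        return self._promote(block_id, data)

    def _promote(self, block_id, data):
        # Weights are read-only, so "eviction" just demotes a block to
        # pinned RAM or drops it; the copy on NVMe is always authoritative.
        while len(self.vram) >= self.vram_budget:
            old_id, old_data = self.vram.popitem(last=False)
            if len(self.ram) < self.ram_budget:
                self.ram[old_id] = old_data
        self.vram[block_id] = data  # in the real system: an upload to GPU memory
        return data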

The primary documentation does not provide CLI commands or config snippets; refer to the xaskasdf/ntransformer repository for setup and usage instructions. (AIToolly)

Implications for Local AI and Edge Inference

Relevance for Practitioners

Making 70B-parameter LLMs run on a single GPU changes the economics and accessibility of local AI. You can experiment with large models outside of cloud infrastructure, reducing both costs and data privacy risks. The architecture invites new research into:

  • Custom GPU-aware caching strategies for massive models
  • Edge deployment of LLMs for environments where bandwidth, not latency, is the limiting factor
  • Prototype development for AI startups and academic research with minimal hardware investment
  • Privacy-focused use cases, including air-gapped and regulated settings

Consumer GPU LLM Scaling: Comparison Table

| Method | Max Model Size (RTX 3090, 24GB VRAM) | Typical Token Speed | CPU Required in Inference Path? |
| --- | --- | --- | --- |
| Standard RAM/CPU/GPU | <24B | >1 token/sec | Yes |
| NVMe-to-GPU Bypass | 70B (Q4_K_M quantized) | ~0.3 tokens/sec | No (bypassed for data path) |

To explore resource-constrained AI further, see our posts on minimalist 2D graphics and efficient AI workflow design.

Common Pitfalls and Pro Tips

  • Be realistic about speed: 0.3 tokens/sec is impractically slow for interactive applications; use this approach for research, not production chatbots.
  • Quantization is required: Only heavily quantized weights (Q4_K_M or similar) will fit a 70B model on a 24GB GPU. Output quality may suffer.
  • Monitor NVMe drive health: Heavy random access can accelerate wear on consumer NVMe SSDs. Watch drive temperatures and SMART status regularly (see the monitoring sketch after this list).
  • Cache configuration matters: Default cache tier sizes may not suit your workload—profile and tune for best performance on your hardware.
  • CPU bypass is not total CPU elimination: The CPU still handles orchestration and OS tasks; “bypass” refers to the main data movement path during inference.
  • Model format compatibility: Only certain model formats are supported. Review the ntransformer documentation for conversion requirements.
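For the drive-health point above, smartctl's JSON output can be polled from a script. The field names below follow smartmontools' usual NVMe output but may vary by drive and version:

# Poll NVMe temperature and wear via smartmontools (requires smartctl,
# usually run as root). JSON field names follow typical smartctl NVMe
# output and may differ across drives and versions.
import json
import subprocess

out = subprocess.run(
    ["smartctl", "-j", "-a", "/dev/nvme0"],
    capture_output=True, text=True,
).stdout
health = json.loads(out).get("nvme_smart_health_information_log", {})

print("Temperature (°C):", health.get("temperature"))
print("Percentage used:", health.get("percentage_used"))
print("Data units read:", health.get("data_units_read"))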

For background on edge and air-gapped LLM deployment in the context of cloud risks, see our analysis of cloud networking risks.

Conclusion and Next Steps

Running Llama 3.1 70B on a single RTX 3090 via NVMe-to-GPU bypass is a genuine milestone for local AI accessibility. While performance remains insufficient for production, this architecture sets the stage for further innovation in affordable, high-scale LLM experimentation and deployment beyond the cloud.

Assess your hardware, experiment with cache strategies, and track updates at the xaskasdf/ntransformer repo. For related deep dives into pushing the limits of graphics and AI workflows, start with our posts on resource-constrained graphics and efficient AI development. The next breakthroughs in AI accessibility may come from clever engineering—projects like this show just how far you can push existing hardware.

By Heimdall Bifrost
