Hacker News - Newest: "SSD"

Show HN: WayInfer – Native GGUF engine that runs models larger than your RAM

By: ahmedm24
2 April 2026 at 14:24

We built a native inference engine that runs quantized LLMs directly from SSD using memory-mapped I/O. The model never fully loads into RAM; the OS pages weights on demand as each layer executes.

*What it does:*

  - Mixtral 8x22B (80GB, 141B params) runs on a machine with 48GB RAM
  - Model loads in 0.3 seconds (vs 190s with llama.cpp)
  - Produces correct output: "What is 2+2?" → "The sum of 2 and 2 is 4."
  - Zero dependencies: custom tensor engine, custom GGUF parser, no ggml/llama.cpp

*How it works:*

  - `mmap()` the GGUF file. The OS handles SSD→RAM paging transparently
  - Quantize the input to Q8_K, compute dot products directly against Q4_K/Q5_K/Q6_K weights in the quantized domain, with no dequantization to float32
  - AVX2 SIMD + 8-thread parallel matvec
  - For MoE models: only 2 of 8 experts are active per token, so most weights stay cold on disk
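The `mmap()` step can be sketched as follows. This is a minimal POSIX sketch, not WayInfer's code (the post says the engine is Windows-only, where `CreateFileMapping`/`MapViewOfFile` play the same role), and `map_weights` is a hypothetical helper name:

```c
/* Minimal sketch of demand-paged weight access on a POSIX system.
   "Loading" is only establishing a mapping; the kernel pages chunks
   from SSD into RAM the first time each weight region is touched. */
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map an entire GGUF file read-only and return its base pointer. */
static const uint8_t *map_weights(const char *path, size_t *len) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return NULL; }
    *len = (size_t)st.st_size;
    void *base = mmap(NULL, *len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping keeps the file contents reachable */
    return base == MAP_FAILED ? NULL : (const uint8_t *)base;
}
```

Because the mapping is read-only and demand-paged, startup cost is essentially independent of model size, which would explain a near-instant "load" even for an 80GB file.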

*The hard part we solved:* GGUF models are calibrated for a specific dot product computation path (ggml's "quantize input → integer multiply-accumulate → late float conversion"). If you naively dequantize weights to float32 and do a standard dot product, the per-operation error is tiny (~0.001%) but compounds across 56 transformer layers into completely wrong output. We had to reverse-engineer and match ggml's exact scalar computation, block-level integer accumulation with 8-lane parallel reduction, to get correct results.
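The quantized-domain path can be illustrated with a simplified block-wise dot product. This is illustrative only: the real Q4_K/Q8_K block layouts, sub-scales, and ggml's 8-lane reduction are more involved, and `block_dot` with a 32-element block is an assumption, not ggml's actual code:

```c
/* Sketch of "integer multiply-accumulate, late float conversion".
   Each block holds int8 values plus one float scale; products are
   accumulated in int32 and converted to float once per block,
   rather than dequantizing every weight to float32 up front. */
#include <stdint.h>

#define BLOCK 32

static float block_dot(const int8_t *a, const float *sa,
                       const int8_t *b, const float *sb, int nblocks) {
    float acc = 0.0f;
    for (int i = 0; i < nblocks; i++) {
        int32_t s = 0;                       /* integer accumulator */
        for (int j = 0; j < BLOCK; j++)
            s += (int32_t)a[i * BLOCK + j] * b[i * BLOCK + j];
        acc += sa[i] * sb[i] * (float)s;     /* late float conversion */
    }
    return acc;
}
```

The key property is that rounding happens at block granularity rather than per weight, so a model calibrated against this path reproduces the same numerics.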

*What it doesn't do (yet):*

  - Speed: ~0.08 tok/s on the 80GB model (CPU-only, no GPU offload)
  - No interactive chat UI
  - Only K-quant GGUF formats (Q4_K_M, Q5_K_M, Q6_K; covers ~90% of models on HuggingFace)
  - Windows only (Linux stubs exist but untested)

The architecture comes from my work-in-progress WayOS (https://github.com/cloudlinqed/WayOS), an AI-first OS that treats SSD/RAM/VRAM as a unified memory hierarchy.

GitHub: https://github.com/cloudlinqed/WayInfer


Comments URL: https://news.ycombinator.com/item?id=47614947

Points: 1

# Comments: 0

Show HN: Open-source encrypted backup CLI

By: loichrn
16 March 2026 at 13:13

I’ve been building an open-source backup CLI in Go: https://github.com/Cloudstic/cli

Docs: https://docs.cloudstic.com

Features:

  - encrypted backups
  - content-addressed deduplication
  - local / S3 / B2 / SFTP storage
  - local / Google Drive / OneDrive / SFTP sources
  - restore to ZIP or directory

One thing I wanted to get right was portable drives. If the same external SSD moves between machines, the tool uses its GPT partition UUID to keep the backup history tied to the drive itself, instead of treating every new mount path as a different source.
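The identity rule reduces to a small fallback choice. The actual CLI is written in Go; `source_key` below is a hypothetical sketch of the idea, not the tool's code:

```c
/* Hedged sketch: key backup history by GPT partition UUID when one is
   available, falling back to the mount path otherwise. */
#include <string.h>

static const char *source_key(const char *part_uuid, const char *mount_path) {
    /* A partition UUID survives the drive moving between machines;
       a mount path (e.g. /media/usb0 vs E:\) does not. */
    if (part_uuid && part_uuid[0] != '\0')
        return part_uuid;
    return mount_path;
}
```

Keying on the UUID means two machines mounting the same SSD at different paths still append to one backup history.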

Recent posts:

  - https://blog.cloudstic.com/2026/03/12/backing-up-portable-drives/
  - https://blog.cloudstic.com/2026/03/16/practical-backups-with-cloudstic-profiles/

Would love feedback.

Comments URL: https://news.ycombinator.com/item?id=47398576

Points: 1

# Comments: 0

70M vectors searched in 48ms on a single consumer GPU – results you won't believe

16 March 2026 at 16:12

I built a prototype GPU-based vector search system that runs locally on a consumer PC.

Hardware:

RTX 3090, consumer CPU, NVMe SSD

Dataset:

~70 million vectors (384 dimensions)

Performance:

~48 ms search latency for top-k results.

This corresponds to roughly 1.45 billion vector comparisons per second on a single GPU.

The system uses a custom GPU kernel and a two-stage search pipeline (binary filtering + floating-point reranking).
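A two-stage pipeline of that shape can be sketched on the CPU (the author's version is a custom GPU kernel; the names, the 64-bit code width, and the radius threshold here are illustrative assumptions):

```c
/* Sketch of binary filtering + floating-point reranking:
   stage 1 rejects candidates by Hamming distance on compact sign
   codes, stage 2 reranks the survivors with an exact dot product. */
#include <stddef.h>
#include <stdint.h>

/* Stage 1: cheap Hamming distance between 64-bit binary codes. */
static int hamming64(uint64_t a, uint64_t b) {
    uint64_t x = a ^ b;
    int n = 0;
    while (x) { n += (int)(x & 1u); x >>= 1; }
    return n;
}

/* Stage 2: exact float dot product for reranking. */
static float dotf(const float *a, const float *b, size_t d) {
    float s = 0.0f;
    for (size_t i = 0; i < d; i++) s += a[i] * b[i];
    return s;
}

/* Return the index of the best match within the Hamming radius. */
static int search_best(uint64_t qcode, const uint64_t *codes,
                       const float *q, const float *vecs,
                       size_t n, size_t d, int radius) {
    int best = -1;
    float best_score = -1e30f;
    for (size_t i = 0; i < n; i++) {
        if (hamming64(qcode, codes[i]) > radius) continue; /* filtered */
        float s = dotf(q, vecs + i * d, d);
        if (s > best_score) { best_score = s; best = (int)i; }
    }
    return best;
}
```

The economics come from stage 1: XOR + popcount over packed bits is orders of magnitude cheaper than a 384-dimensional float dot product, so only a small fraction of the 70M vectors ever reach the expensive rerank.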

My goal was to explore whether large-scale vector search could run efficiently on consumer hardware instead of large datacenter clusters.

After thousands of hours of work and many failed attempts, the results finally became stable enough to benchmark.

I'm currently exploring how far this approach can scale.

I'd be very interested to hear how others approach large-scale vector search on consumer hardware.

Happy to answer questions.


Comments URL: https://news.ycombinator.com/item?id=47400954

Points: 1

# Comments: 4

Show HN: Efficient LLM Architectures for 32GB RAM (Ternary and Sparse Inference)

9 March 2026 at 20:30

Hi HN,

I’ve been exploring how far large language models can be pushed on machines with limited memory.

I built an experimental runtime and architecture approach focused on making extremely large models more feasible on systems with around 32GB of RAM.

The core idea is combining several efficiency techniques:

  - ternary weight representation {-1, 0, +1} (~1.58 bits per weight)
  - sparse execution that skips zero weights
  - memory-mapped layer streaming from NVMe storage
  - lightweight tensor unpacking optimized for Apple Silicon
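Packing close to the ~1.58-bit entropy limit (log2 3) can be done by storing five {-1, 0, +1} weights per byte in base 3, since 3^5 = 243 ≤ 256, i.e. 1.6 bits per weight. The layout below is one possible scheme, not necessarily this project's format:

```c
/* Sketch of base-3 ternary packing: five {-1,0,+1} weights per byte. */
#include <stdint.h>

/* Pack 5 ternary values (stored as -1/0/+1) into one byte. */
static uint8_t pack5(const int8_t w[5]) {
    uint8_t v = 0;
    for (int i = 4; i >= 0; i--)
        v = (uint8_t)(v * 3 + (uint8_t)(w[i] + 1)); /* map to digits 0..2 */
    return v;
}

/* Unpack one byte back to five -1/0/+1 values. */
static void unpack5(uint8_t v, int8_t w[5]) {
    for (int i = 0; i < 5; i++) {
        w[i] = (int8_t)(v % 3) - 1;
        v /= 3;
    }
}
```

At 1.6 bits/weight versus FP16's 16 bits, this gives up to 10x compression before per-block metadata, which is consistent in magnitude with the TinyLlama numbers below.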

Instead of keeping the entire model in RAM, weights can be streamed from fast SSD storage and unpacked during execution. This shifts the bottleneck from memory capacity toward storage bandwidth and compute efficiency.
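Once a layer's weights are streamed in and unpacked, sparse execution reduces to skipping zeros in the dot product, and ternary weights turn every multiply into an add or subtract. A minimal sketch (not the project's actual kernel):

```c
/* Sketch of a sparse ternary dot product: zero weights cost nothing,
   and +1/-1 weights need no multiplication at all. */
#include <stdint.h>

static float ternary_dot(const int8_t *w, const float *x, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++) {
        if (w[i] == 0) continue;            /* sparse skip */
        s += (w[i] > 0) ? x[i] : -x[i];     /* ternary: add or subtract */
    }
    return s;
}
```

With this formulation, compute scales with the number of nonzero weights rather than the total parameter count, which is what makes sparsity a throughput win and not just a storage win.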

Early experiments show significant compression compared to FP16 weights (for example TinyLlama-1.1B shrinking from ~2.05GB to ~0.24GB with ternary packing).

The project is still experimental, but the goal is to explore whether extreme compression + sparsity + SSD streaming can make much larger models practical on consumer machines.

Paper: https://opengraviton.github.io/paper.html

Runtime: https://github.com/opengraviton/graviton-native

I’d really appreciate feedback from people working on inference engines, quantization, or efficient model architectures.


Comments URL: https://news.ycombinator.com/item?id=47315029

Points: 2

# Comments: 1
