Hacker News - Newest: "SSD"

Show HN: WayInfer – Native GGUF engine that runs models larger than your RAM

By: ahmedm24
2 April 2026 at 14:24

We built a native inference engine that runs quantized LLMs directly from SSD using memory-mapped I/O. The model never fully loads into RAM; the OS pages weights on demand as each layer executes.

*What it does:*

  - Mixtral 8x22B (80GB, 141B params) runs on a machine with 48GB RAM
  - Model loads in 0.3 seconds (vs 190s with llama.cpp)
  - Produces correct output: "What is 2+2?" → "The sum of 2 and 2 is 4."
  - Zero dependencies: custom tensor engine, custom GGUF parser, no ggml/llama.cpp

*How it works:*

  - `mmap()` the GGUF file. The OS handles SSD→RAM paging transparently
  - Quantize the input to Q8_K, compute dot products directly against Q4_K/Q5_K/Q6_K weights in the quantized domain, with no dequantization to float32
  - AVX2 SIMD + 8-thread parallel matvec
  - For MoE models: only 2 of 8 experts are active per token, so most weights stay cold on disk
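The `mmap()` step can be sketched as follows. This is a minimal POSIX sketch, not WayInfer's code (the post says the engine is Windows-only, where `CreateFileMapping`/`MapViewOfFile` play the same role), and `map_weights` is a hypothetical helper name:

```c
/* Minimal sketch of demand-paged weight access on a POSIX system.
   "Loading" is only establishing a mapping; the kernel pages chunks
   from SSD into RAM the first time each weight region is touched. */
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map an entire GGUF file read-only and return its base pointer. */
static const uint8_t *map_weights(const char *path, size_t *len) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return NULL; }
    *len = (size_t)st.st_size;
    void *base = mmap(NULL, *len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping keeps the file contents reachable */
    return base == MAP_FAILED ? NULL : (const uint8_t *)base;
}
```

Because the mapping is read-only and demand-paged, startup cost is essentially independent of model size, which would explain a near-instant "load" even for an 80GB file.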

*The hard part we solved:* GGUF models are calibrated for a specific dot product computation path (ggml's "quantize input → integer multiply-accumulate → late float conversion"). If you naively dequantize weights to float32 and do a standard dot product, the per-operation error is tiny (~0.001%) but compounds across 56 transformer layers into completely wrong output. We had to reverse-engineer and match ggml's exact scalar computation, block-level integer accumulation with 8-lane parallel reduction, to get correct results.
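The quantized-domain path can be illustrated with a simplified block-wise dot product. This is illustrative only: the real Q4_K/Q8_K block layouts, sub-scales, and ggml's 8-lane reduction are more involved, and `block_dot` with a 32-element block is an assumption, not ggml's actual code:

```c
/* Sketch of "integer multiply-accumulate, late float conversion".
   Each block holds int8 values plus one float scale; products are
   accumulated in int32 and converted to float once per block,
   rather than dequantizing every weight to float32 up front. */
#include <stdint.h>

#define BLOCK 32

static float block_dot(const int8_t *a, const float *sa,
                       const int8_t *b, const float *sb, int nblocks) {
    float acc = 0.0f;
    for (int i = 0; i < nblocks; i++) {
        int32_t s = 0;                       /* integer accumulator */
        for (int j = 0; j < BLOCK; j++)
            s += (int32_t)a[i * BLOCK + j] * b[i * BLOCK + j];
        acc += sa[i] * sb[i] * (float)s;     /* late float conversion */
    }
    return acc;
}
```

The key property is that rounding happens at block granularity rather than per weight, so a model calibrated against this path reproduces the same numerics.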

*What it doesn't do (yet):*

  - Speed: ~0.08 tok/s on the 80GB model (CPU-only, no GPU offload)
  - No interactive chat UI
  - Only K-quant GGUF formats (Q4_K_M, Q5_K_M, Q6_K; covers ~90% of models on HuggingFace)
  - Windows only (Linux stubs exist but untested)

The architecture comes from my work-in-progress WayOS (https://github.com/cloudlinqed/WayOS), an AI-first OS that treats SSD/RAM/VRAM as a unified memory hierarchy.

GitHub: https://github.com/cloudlinqed/WayInfer


Comments URL: https://news.ycombinator.com/item?id=47614947

Points: 1

# Comments: 0

Show HN: Open-source encrypted backup CLI

By: loichrn
16 March 2026 at 13:13

I’ve been building an open-source backup CLI in Go: https://github.com/Cloudstic/cli

Docs: https://docs.cloudstic.com

Features:

  - encrypted backups
  - content-addressed deduplication
  - local / S3 / B2 / SFTP storage
  - local / Google Drive / OneDrive / SFTP sources
  - restore to ZIP or directory

One thing I wanted to get right was portable drives. If the same external SSD moves between machines, the tool uses its GPT partition UUID to keep the backup history tied to the drive itself, instead of treating every new mount path as a different source.
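The identity rule reduces to a small fallback choice. The actual CLI is written in Go; `source_key` below is a hypothetical sketch of the idea, not the tool's code:

```c
/* Hedged sketch: key backup history by GPT partition UUID when one is
   available, falling back to the mount path otherwise. */
#include <string.h>

static const char *source_key(const char *part_uuid, const char *mount_path) {
    /* A partition UUID survives the drive moving between machines;
       a mount path (e.g. /media/usb0 vs E:\) does not. */
    if (part_uuid && part_uuid[0] != '\0')
        return part_uuid;
    return mount_path;
}
```

Keying on the UUID means two machines mounting the same SSD at different paths still append to one backup history.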

Recent posts:

  - https://blog.cloudstic.com/2026/03/12/backing-up-portable-drives/
  - https://blog.cloudstic.com/2026/03/16/practical-backups-with-cloudstic-profiles/

Would love feedback.

Comments URL: https://news.ycombinator.com/item?id=47398576

Points: 1

# Comments: 0

70M vectors searched in 48ms on a single consumer GPU – results you won't believe

16 March 2026 at 16:12

I built a prototype GPU-based vector search system that runs locally on a consumer PC.

Hardware:

RTX 3090, consumer CPU, NVMe SSD

Dataset:

~70 million vectors (384 dimensions)

Performance:

~48 ms search latency for top-k results.

This corresponds to roughly 1.45 billion vector comparisons per second on a single GPU.

The system uses a custom GPU kernel and a two-stage search pipeline (binary filtering + floating-point reranking).
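A two-stage pipeline of that shape can be sketched on the CPU (the author's version is a custom GPU kernel; the names, the 64-bit code width, and the radius threshold here are illustrative assumptions):

```c
/* Sketch of binary filtering + floating-point reranking:
   stage 1 rejects candidates by Hamming distance on compact sign
   codes, stage 2 reranks the survivors with an exact dot product. */
#include <stddef.h>
#include <stdint.h>

/* Stage 1: cheap Hamming distance between 64-bit binary codes. */
static int hamming64(uint64_t a, uint64_t b) {
    uint64_t x = a ^ b;
    int n = 0;
    while (x) { n += (int)(x & 1u); x >>= 1; }
    return n;
}

/* Stage 2: exact float dot product for reranking. */
static float dotf(const float *a, const float *b, size_t d) {
    float s = 0.0f;
    for (size_t i = 0; i < d; i++) s += a[i] * b[i];
    return s;
}

/* Return the index of the best match within the Hamming radius. */
static int search_best(uint64_t qcode, const uint64_t *codes,
                       const float *q, const float *vecs,
                       size_t n, size_t d, int radius) {
    int best = -1;
    float best_score = -1e30f;
    for (size_t i = 0; i < n; i++) {
        if (hamming64(qcode, codes[i]) > radius) continue; /* filtered */
        float s = dotf(q, vecs + i * d, d);
        if (s > best_score) { best_score = s; best = (int)i; }
    }
    return best;
}
```

The economics come from stage 1: XOR + popcount over packed bits is orders of magnitude cheaper than a 384-dimensional float dot product, so only a small fraction of the 70M vectors ever reach the expensive rerank.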

My goal was to explore whether large-scale vector search could run efficiently on consumer hardware instead of large datacenter clusters.

After thousands of hours of work and many failed attempts, the results finally became stable enough to benchmark.

I'm currently exploring how far this approach can scale.

I'd be very interested to hear how others approach large-scale vector search on consumer hardware.

Happy to answer questions.


Comments URL: https://news.ycombinator.com/item?id=47400954

Points: 1

# Comments: 4

Show HN: Efficient LLM Architectures for 32GB RAM (Ternary and Sparse Inference)

9 March 2026 at 20:30

Hi HN,

I’ve been exploring how far large language models can be pushed on machines with limited memory.

I built an experimental runtime and architecture approach focused on making extremely large models more feasible on systems with around 32GB of RAM.

The core idea is combining several efficiency techniques:

  - ternary weight representation {-1, 0, +1} (~1.58 bits per weight)
  - sparse execution that skips zero weights
  - memory-mapped layer streaming from NVMe storage
  - lightweight tensor unpacking optimized for Apple Silicon
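Packing close to the ~1.58-bit entropy limit (log2 3) can be done by storing five {-1, 0, +1} weights per byte in base 3, since 3^5 = 243 ≤ 256, i.e. 1.6 bits per weight. The layout below is one possible scheme, not necessarily this project's format:

```c
/* Sketch of base-3 ternary packing: five {-1,0,+1} weights per byte. */
#include <stdint.h>

/* Pack 5 ternary values (stored as -1/0/+1) into one byte. */
static uint8_t pack5(const int8_t w[5]) {
    uint8_t v = 0;
    for (int i = 4; i >= 0; i--)
        v = (uint8_t)(v * 3 + (uint8_t)(w[i] + 1)); /* map to digits 0..2 */
    return v;
}

/* Unpack one byte back to five -1/0/+1 values. */
static void unpack5(uint8_t v, int8_t w[5]) {
    for (int i = 0; i < 5; i++) {
        w[i] = (int8_t)(v % 3) - 1;
        v /= 3;
    }
}
```

At 1.6 bits/weight versus FP16's 16 bits, this gives up to 10x compression before per-block metadata, which is consistent in magnitude with the TinyLlama numbers below.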

Instead of keeping the entire model in RAM, weights can be streamed from fast SSD storage and unpacked during execution. This shifts the bottleneck from memory capacity toward storage bandwidth and compute efficiency.
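Once a layer's weights are streamed in and unpacked, sparse execution reduces to skipping zeros in the dot product, and ternary weights turn every multiply into an add or subtract. A minimal sketch (not the project's actual kernel):

```c
/* Sketch of a sparse ternary dot product: zero weights cost nothing,
   and +1/-1 weights need no multiplication at all. */
#include <stdint.h>

static float ternary_dot(const int8_t *w, const float *x, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++) {
        if (w[i] == 0) continue;            /* sparse skip */
        s += (w[i] > 0) ? x[i] : -x[i];     /* ternary: add or subtract */
    }
    return s;
}
```

With this formulation, compute scales with the number of nonzero weights rather than the total parameter count, which is what makes sparsity a throughput win and not just a storage win.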

Early experiments show significant compression compared to FP16 weights (for example TinyLlama-1.1B shrinking from ~2.05GB to ~0.24GB with ternary packing).

The project is still experimental, but the goal is to explore whether extreme compression + sparsity + SSD streaming can make much larger models practical on consumer machines.

Paper: https://opengraviton.github.io/paper.html

Runtime: https://github.com/opengraviton/graviton-native

I’d really appreciate feedback from people working on inference engines, quantization, or efficient model architectures.


Comments URL: https://news.ycombinator.com/item?id=47315029

Points: 2

# Comments: 1
