Executive Summary
The pitch for DirectStorage in 2021 was faster loading screens. The pitch in 2026 is structurally different. With the 1.4 release announced at GDC 2026 on March 11, Microsoft has reframed the runtime as a real-time content streaming pipeline: NVMe storage holds compressed asset chunks, the runtime moves them in flight, the GPU (or CPU, or platform-specific silicon) decompresses them, and the result lands in VRAM without round-tripping through a CPU staging buffer. Loading screens are a side effect. The actual product is open worlds that don't pause to think.
Three concrete things changed in DirectStorage 1.4. First, Microsoft added Zstandard alongside GDeflate, with both CPU and GPU decompression paths and an open-source GPU compute shader optimized for "content chunked to 256KB or smaller, consistent with modern game packaging patterns for streaming workloads." Second, it shipped the Game Asset Conditioning Library (GACL), which preconditions BC1, BC3, BC4, and BC5 textures so Zstd extracts up to a 50% better ratio out of them. Third, it added DStorageSetConfiguration2 with global D3D12 CreatorID support so the driver can schedule decompression alongside rendering work.
The reason any of this is worth a findings page is that it isn't only a PC initiative. The Xbox Wire announcement on March 11, 2026 commits the next Xbox to a custom AMD SoC with developer alpha hardware in 2027 and an "order of magnitude leap in ray tracing performance." The broader GDC 2026 keynote coverage (tbreak's writeup is the most detailed) adds DirectStorage and Zstd compression to the platform feature list, alongside Neural Texture Compression, GPU Directed Work Graph Execution, and FSR Next-class upscaling. GamingBolt's coverage of Jason Ronald's spring Xbox Game Dev Update has him saying Microsoft is "leaning in very heavily" on Zstandard for direct SSD asset streaming. The same content pipeline is being aimed at Xbox, Windows 11, and the handheld and PC-style devices in between.
The thesis: Microsoft is standardizing the codec, the conditioning, and the runtime so a single asset packaging strategy scales from a fixed Helix console to a heterogeneous PC install base. Zstd is the format the rest of the industry already speaks (Linux kernel, Btrfs, ZFS, package managers, IETF RFC 8878), and the early shipped titles using GPU decompression on PC have made it clear why a tuned, fixed pipeline buys real value over the open one.
Streaming, Not Loading
A modern open-world title at 4K with high-detail textures can need 8 to 16 GB of unique asset data resident at any given time, against a typical PC GPU memory budget of 8 to 16 GB total. The math doesn't allow for "load everything once and hold it." It demands constant eviction and refill, with the working set turning over as the camera moves.
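The arithmetic behind that turnover is worth making explicit. A minimal sketch, with illustrative numbers rather than measurements from any specific title: the sustained streaming bandwidth a title needs is simply the resident working set multiplied by how much of it the camera churns per second.

```python
# Illustrative working-set turnover math. The 12 GB set and 10%/s churn
# are assumptions for the example, not figures from a shipped title.
def required_stream_rate_gbps(resident_gb: float, turnover_per_sec: float) -> float:
    """GB/s of decompressed asset data needed to sustain a given churn rate."""
    return resident_gb * turnover_per_sec

# A 12 GB resident set where fast traversal replaces 10% of it every
# second needs 1.2 GB/s of delivered (post-decompression) asset data.
rate = required_stream_rate_gbps(12.0, 0.10)
print(f"{rate:.1f} GB/s sustained")
```

The point of the sketch is that even modest churn rates land in the GB/s range, which is why the orchestration cost per request, not raw drive bandwidth, becomes the limiting factor.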
The legacy I/O path makes that turnover expensive. Files are read by Win32 file APIs into pageable system memory, copied into a CPU-side staging buffer, decompressed by a CPU thread (typically zlib- or LZ4-derived), copied again into a GPU upload heap, and finally DMA-ed across PCIe into VRAM. Every step burns CPU cycles, RAM bandwidth, and PCIe transactions. On a 7 GB/s NVMe drive the SSD is rarely the bottleneck; the orchestration around it is.
DirectStorage cuts most of that orchestration. The runtime batches small reads into the queue depth NVMe is built for, bypasses the legacy I/O stack via BypassIO where the driver supports it, and lets compressed payloads land in a GPU-accessible buffer that the decompression shader can read directly. The relevant figure isn't peak bandwidth, it's request rate: an asset streamer paging textures, geometry, and material parameters at LOD granularity is issuing thousands of small reads per second, and the legacy I/O stack falls over at high request rates well before the drive does.
Compression earns its keep here twice. Smaller payloads mean lower install size and less data pulled per request, but they also act as a bandwidth multiplier. A 1.5x average ratio against a 10 GB/s drive yields ~15 GB/s of effective asset throughput at the GPU, and that's before any of the conditioning tricks below. The constraint that matters then becomes how fast the platform can decompress, and what it has to give up to do it.
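The multiplier effect can be stated as a one-line calculation. This sketch reproduces the article's example numbers (a 1.5x ratio on a 10 GB/s drive) and then stacks the "up to 50%" GACL lift on top, purely as arithmetic:

```python
def effective_throughput_gbps(drive_gbps: float, ratio: float) -> float:
    """Compression as a bandwidth multiplier: the drive moves compressed
    bytes, but the GPU receives decompressed bytes."""
    return drive_gbps * ratio

# The article's example: 1.5x average ratio against a 10 GB/s NVMe drive.
print(effective_throughput_gbps(10.0, 1.5))        # 15.0 GB/s at the GPU

# If GACL conditioning delivers its best-case 50% ratio improvement on
# top of that, the same drive feeds 22.5 GB/s of decompressed assets.
print(effective_throughput_gbps(10.0, 1.5 * 1.5))  # 22.5 GB/s at the GPU
```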
What Zstandard Actually Is
Zstandard, or Zstd, is a lossless compressor authored by Yann Collet at Meta and released as an open-source reference implementation under a BSD/GPLv2 dual license. The wire format was published by the IETF as RFC 8878 in February 2021, which is the format version Microsoft is referencing in the DirectStorage SDK. Outside of games, Zstd is everywhere worth caring about: it's the default transparent-compression option for Btrfs, a first-class option for ZFS, the format Linux kernel images have been compressed with since 5.9, a supported codec in Hadoop, Kafka, and most modern container registries, and the format the FreeBSD installer ships in.
Zstd uses a hybrid of LZ77-family dictionary compression with a finite state entropy coder (Duda's tabled ANS, the same family of entropy coding behind Oodle Kraken on PlayStation). Tabled ANS gets within a fraction of a percent of arithmetic coding's theoretical limit at a fraction of the decode cost, which is the actual reason Zstd decompresses as fast as it does. A modern libzstd reference decoder lands in the 500 MB/s to 2 GB/s range per CPU core for general data, with the SIMD-accelerated paths in recent versions substantially higher. For comparison, zlib's inflate is in the 200 to 400 MB/s range and not particularly SIMD-friendly.
The other practical reason Microsoft is centering Zstd is that it has tunable compression level (1 to 22, plus negative "fast" levels), supports trained dictionaries that can lift small-payload ratios significantly, and can be implemented as a streaming decoder with bounded memory. That last property matters when the runtime is feeding the codec 256 KB chunks instead of multi-megabyte files. Zstd's frame structure cleanly handles small independent frames, which is exactly the granularity the GPU shader operates on.
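The small-independent-frame idea is easy to sketch. The code below chunks an asset into 256 KB pieces and compresses each independently, so any single chunk can be fetched and decoded without touching its neighbors. Python's standard library has no zstd bindings (until the `compression.zstd` module lands in 3.14), so `zlib` stands in for the codec here; the chunking and random-access structure, not the codec, is the point:

```python
import zlib

CHUNK = 256 * 1024  # the granularity the DirectStorage GPU shader targets

def pack_chunks(asset: bytes, chunk_size: int = CHUNK) -> list:
    """Compress each chunk as an independent frame so any chunk can be
    decompressed in isolation. zlib stands in for zstd in this sketch."""
    return [zlib.compress(asset[i:i + chunk_size], 6)
            for i in range(0, len(asset), chunk_size)]

def unpack_chunk(frames: list, index: int) -> bytes:
    # Random access: only the requested frame is decoded.
    return zlib.decompress(frames[index])

asset = bytes(range(256)) * 4096           # 1 MiB of synthetic data
frames = pack_chunks(asset)                # four independent 256 KB frames
assert unpack_chunk(frames, 2) == asset[2 * CHUNK:3 * CHUNK]
```

Real Zstd frames would additionally benefit from trained dictionaries, which recover the cross-chunk redundancy that independent framing gives up.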
None of this is to say Zstd dominates GDeflate (the GPU-friendly Deflate variant Microsoft shipped first in DirectStorage 1.1). GDeflate is more aggressively parallelized for SIMT execution and tends to win on raw GPU throughput per watt of compute. The point of adding Zstd alongside it is choice: Zstd has a wider ecosystem, a better CPU decompression story when the GPU is busy, and better tooling, while GDeflate retains the highest GPU-side throughput when the SMs have headroom. A real shipping game will probably use both, dispatched by workload.
DirectStorage 1.4: The Pipeline
DirectStorage 1.4 reached public preview at version 1.4.0-preview1-2603.504 on March 11, 2026, and the DirectX Developer Blog post is the cleanest place to see what Microsoft actually shipped. The runtime exposes file and memory queues, operates on small chunks with priorities and cancellation, and supplies first-class decompression hooks for both Zstd and GDeflate. The headline additions versus 1.2 / 1.3:
- Zstd codec support, with both CPU and GPU decompression paths, callable through the same enqueue API as the existing GDeflate path.
- Open-source GPU decompression compute shader for Zstd, redistributable with the SDK. Microsoft's own framing: "optimized for content chunked to 256KB or smaller."
- Game Asset Conditioning Library (GACL) for build-time preconditioning of BC1, BC3, BC4, and BC5 textures. BC7 conditioning support is explicitly called out as a future update.
- Global D3D12 CreatorID support via DStorageSetConfiguration2, enabling "D3D12 command queue grouping to properly account for DirectStorage workloads" so the driver can schedule decompression alongside the rest of the GPU's work.
- Updated GpuDecompressionBenchmark sample in the microsoft/DirectStorage repository, now capable of comparing Zstd and GDeflate throughput and CPU overhead.
Functionally, the pipeline looks like this: an engine submits a request to a DirectStorage queue naming a file, an offset, a length, a destination resource, and a codec. The runtime issues the NVMe read, lands the compressed bytes in a staging area accessible to the GPU, and dispatches the decompression compute shader (or runs it on the CPU if the policy says so). The decompressed bytes end up in the destination GPU resource without an explicit upload heap copy. The CPU's role is queue management, not data shoveling.
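The shape of a request can be sketched in a few lines. The real API is C++ (a DSTORAGE_REQUEST enqueued on an IDStorageQueue); the Python below is only a hypothetical mirror of the fields the paragraph names, with invented names and values:

```python
from dataclasses import dataclass
from enum import Enum, auto

class Codec(Enum):
    NONE = auto()
    GDEFLATE = auto()
    ZSTD = auto()

@dataclass
class StreamRequest:
    """Hypothetical mirror of what a DirectStorage enqueue names: a file,
    an offset, a length, a destination resource, and a codec."""
    file: str
    offset: int          # byte offset of the compressed chunk on disk
    length: int          # compressed length on disk
    destination: str     # GPU resource the decompressed bytes land in
    codec: Codec
    priority: int = 0

queue = []
queue.append(StreamRequest("world.pak", 0x40000, 180_224,
                           "tex_rock_albedo_mip0", Codec.ZSTD, priority=1))
```

The CPU-side cost of the whole transaction is building and submitting records like this one; the read, the decompression dispatch, and the placement into VRAM happen off the application's threads.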
The GPU shader's optimization for sub-256 KB chunks isn't arbitrary. It's the size at which a single compute dispatch can decompress the chunk with good occupancy across a typical modern GPU's SM count without spilling enough state to hurt cache behavior, and it's also roughly the granularity an asset streamer wants to issue at: a streaming geometry chunk, an audio bank slice, or a single 4K BC7 texture mip is in that ballpark. A pipeline aimed at sub-file streaming wants the codec, the runtime, and the engine all agreeing on roughly the same chunk shape.
The driver-side optimization for these new paths is staged. AMD, NVIDIA, and Qualcomm all have H2 2026 timelines for shipping driver-level Zstd decompression tuning. Intel's quote in the same Developer Blog post is more open-ended: they will "share performance improvements in the months ahead." In spring 2026, then, the runtime is shipped, the shader is shipped, and the driver-level optimization is still in flight. That has practical consequences, covered below.
GACL and the BCn Problem
Block Compressed (BCn) texture formats are how GPUs actually consume textures in 2026, and they are notoriously hostile to general-purpose compressors. BC1 through BC7 store textures as fixed 16-byte blocks (8 bytes for BC1 / BC4) containing endpoint colors and per-pixel index bits, with the index and endpoint fields tightly bit-packed. The bit-level layout is great for GPU sampling and bad for an entropy coder, because semantically similar pixels (smoothly varying color, repeated texture patterns) end up with bit patterns that look uncorrelated to a byte-oriented dictionary.
The standard answer is to recondition the data before compression: separate the endpoint and index streams, deinterleave the bit fields, pre-apply a delta or shuffle pass, and feed the result to the general-purpose compressor. Each stream then has stronger internal structure, and the compressor can exploit it. Microsoft's GACL automates this for BC formats and adds a machine-learning-guided variant on top.
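A minimal sketch of the idea, operating at byte rather than bit granularity: gather the endpoint bytes and index bytes of each BC1 block into separate contiguous streams, and provide the exact inverse the runtime would apply after decode. GACL's real transforms are bit-level and ML-guided; this only shows the shape of the shuffle / inverse-shuffle pair:

```python
BLOCK = 8  # a BC1 block: two 16-bit endpoint colors + 4 bytes of 2-bit indices

def shuffle_bc1(data: bytes) -> bytes:
    """Byte-plane deinterleave: endpoint bytes first, index bytes second.
    Each stream is more internally uniform than the interleaved original."""
    endpoints, indices = bytearray(), bytearray()
    for i in range(0, len(data), BLOCK):
        endpoints += data[i:i + 4]
        indices += data[i + 4:i + 8]
    return bytes(endpoints + indices)

def unshuffle_bc1(data: bytes) -> bytes:
    """The inverse transform, conceptually what DirectStorage folds into
    the decompression dispatch so engines see normal BCn bytes."""
    n = len(data) // 2                      # first half endpoints, second half indices
    out = bytearray()
    for i in range(0, n, 4):
        out += data[i:i + 4] + data[n + i:n + i + 4]
    return bytes(out)

blocks = bytes(range(64))                   # 8 synthetic BC1 blocks
assert unshuffle_bc1(shuffle_bc1(blocks)) == blocks
```

The compressor then sees a long run of endpoint data (smoothly varying in most textures) followed by a long run of index data (often repetitive), instead of the two alternating every eight bytes.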
| GACL Technique | Microsoft's description | Where the lift comes from |
|---|---|---|
| Shuffling | "Transforms BCn bit streams to promote additional and lower cost matches for Zstd." | Each separated bit-stream is more uniform, so the entropy coder sees lower per-symbol entropy. |
| Block / component entropy reduction | Uses "machine learning to improve outcomes" on top of static shuffling. | Learns asset-specific structure that hand-tuned shuffles miss; small offline training cost for a runtime-free win. |
| Inverse shuffle (runtime) | "After a Zstd stream is decompressed at runtime, any shuffle transforms applied during content conditioning are seamlessly reversed by DirectStorage." | The inverse pass is folded into the decompression dispatch, so engines see normal BCn bytes. |
Microsoft's stated lift is "up to a 50% improvement in Zstd compression ratios for your assets," with no runtime cost beyond the inverse shuffle the decompression shader applies after Zstd decode. That's a build-time change, not a runtime one: ship size and bandwidth get smaller, decompression cost stays roughly the same. For a 100 GB texture-heavy install, "up to 50%" is the difference between fitting in a console's working set or not, even when the BC7-shaped portion of the texture budget waits for the next DirectStorage update.
Where Decompression Runs (And What It Costs)
The most misunderstood part of the DirectStorage pitch is the assumption that GPU decompression is free. It isn't. On PC, the Zstd decompression shader runs on the same SMs that draw the frame. Compute time spent decompressing a chunk is compute time not spent on shading, ray tracing, post-processing, or DLSS / FSR upscaling. The cost is real and it matters. The reason developers still come out ahead is that the alternative (a CPU thread doing the decompression and copying the result across PCIe) is much worse, both in absolute throughput and in main-thread frame impact.
DirectStorage 1.4 lets the application place each request on the GPU path, the CPU path, or let the runtime decide. The right answer is workload-dependent. A title with heavy ray tracing on a midrange GPU can be SM-bound during the frame and prefer CPU decompression on idle cores; a title with a strong CPU bound (open-world simulation, AI, physics) wants the GPU path. A title with both bounds wants asynchronous compute and careful queue scheduling. The new DStorageSetConfiguration2 CreatorID is part of how the driver makes that scheduling decision sanely.
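The workload-dependent choice can be illustrated as a tiny policy function. The thresholds below are invented for the sketch; a real engine would drive this from driver feedback and per-frame profiling, not fixed numbers:

```python
def choose_decompression_path(gpu_busy_pct: float, idle_cpu_cores: int,
                              chunk_kb: int) -> str:
    """Hypothetical per-request policy sketch. All thresholds are
    illustrative assumptions, not values from the DirectStorage SDK."""
    if gpu_busy_pct > 90 and idle_cpu_cores >= 2:
        return "cpu"    # SM-bound frame: spend idle cores instead of SM time
    if chunk_kb <= 256:
        return "gpu"    # within the GPU shader's tuned chunk size
    return "cpu"        # oversized chunks would occupy SMs for too long

assert choose_decompression_path(95.0, 4, 256) == "cpu"   # RT-heavy frame
assert choose_decompression_path(60.0, 0, 128) == "gpu"   # CPU-bound sim
```

The Resident Evil Requiem behavior described below is a reminder of what happens when a heuristic like this ships without documentation: the runtime's choices look random from the outside.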
The PC and console pictures diverge on this exact axis. The Xbox Series X|S Velocity Architecture ships with a hardware decompression block Microsoft has publicly described as equivalent to roughly four to five Zen 2 CPU cores worth of decompression throughput, sitting outside both the CPU and the GPU. The SSD-to-RAM path on those consoles never burns SM time on decompression at all.
Microsoft has not confirmed that Helix carries forward a dedicated decompression block of the same kind, but the precedent is strong, the codec target (Zstd) is amenable to a fixed-function implementation, and the platform incentive is obvious: a console with a fixed AMD SoC can afford to spend a few square millimeters of die on a decompression engine and reclaim significant GPU compute for upscaling, NPU-driven NPC behavior, and physics. On a heterogeneous PC install base, Microsoft cannot rely on dedicated decompression silicon, which is why DirectStorage 1.4 ships GPU and CPU decompression paths by default.
The compute-reclamation argument matters specifically for gaming workloads that are getting hungrier for GPU ML capacity: DLSS 4 / FSR Next+ frame generation, neural texture compression decoders, ML-driven NPC behavior, neural radiance caches. Every milliwatt of GPU compute spent on decompression is one not spent on those, and the platform that gets to spend its GPU on rendering instead of pipeline plumbing wins per-frame budget the others don't have. The next section is about how that abstract argument has already shown up in shipping titles.
The Messy Reality on PC
DirectStorage with GPU decompression isn't theoretical. It's also not yet a clean win. The three most-discussed shipped titles using the GPU path each tell a different story, and together they explain why a tuned, fixed-platform implementation is going to feel so much better than the open PC version when Helix arrives.
Ratchet & Clank: Rift Apart (2023, Nixxes)
The success case. The first PC title to ship with GDeflate GPU decompression, built on DirectStorage 1.2 with the GPU path enabled at high settings for background asset streaming. The dimension-transition sequences (the title's signature mechanic) are the single most cited example of "this is what DirectStorage was built for." NVIDIA's launch case study for GDeflate uses Rift Apart's load-time numbers as the headline.
Marvel's Spider-Man 2 (2025, Nixxes)
The cautionary tale. Spider-Man 2 PC ships with GDeflate GPU decompression enabled, and on the RTX 4090 independent testing found the feature actively hurts performance:
| Resolution | Effect of disabling DirectStorage on RTX 4090 |
|---|---|
| 4K | +10% average framerate |
| 1440p | +6% average framerate |
| 1080p | +3% average framerate |
| 4K (1% lows) | +18 to 25% |
Tom's Hardware retested on the RTX 5090 and reported the regression had disappeared, with minor gains in some scenes. Their honest framing was that the 5090 is fast enough to absorb the cost regardless, so Blackwell's actual aptitude for the workload is still an open question pending broader testing across the midrange stack.
Resident Evil Requiem (2026, Capcom)
The weird case. RE:Requiem ships with GDeflate-compressed assets and the GPU decompression path enabled, but independent SpecialK traces show the runtime randomly choosing whether to actually use the GPU. On RTX 5090, 5070, and 5060 the GPU path engages; on a 4060 laptop the runtime falls back to CPU decompression despite the GPU fully supporting the feature. No published explanation for the heuristic exists.
Conspicuously absent from any of this: a public, shipped title using the Zstd path. As of the 1.4 release, Zstd is documented and downloadable but not yet in any retail title's runtime. The first wave of titles built against 1.4 will be the empirical test of whether Zstd's CPU decompression performance and ecosystem fit translate into better real-world streaming behavior than GDeflate's GPU-friendliness. Expect that question to be answered, with measurements, sometime in late 2026 or early 2027.
What Helix Officially Adds
The Xbox Wire post is short on silicon detail and long on platform framing. The wider GDC 2026 keynote coverage fills in the technical feature list. Splitting the two sources cleanly:
Stated in Xbox Wire
- A custom AMD SoC "codesigned for the next generation," with developer alpha hardware shipping in 2027. AMD separately confirmed development is on track for 2027.
- An order-of-magnitude leap in ray tracing performance (Jason Ronald's framing, no specific RT-core or BVH-traversal numbers attached).
- FSR Next, AMD's next-generation upscaler, framed as the underpinning for "what comes next." The "Next+" variant with neural rendering and ML multi-frame generation appears in third-party GDC coverage rather than the Xbox Wire post itself.
Reported from the GDC 2026 keynote
The following features are sourced from third-party coverage of the keynote (tbreak, Tom's Hardware) rather than the Xbox Wire post. Treat as Microsoft's GDC messaging:
- GPU Directed Work Graph Execution, the DirectX-12 work graph extension that lets the GPU schedule its own subsequent draws and dispatches without a CPU round-trip.
- Neural Texture Compression (some Microsoft framing uses "Deep Texture Compression"), a separate compression pipeline operating on the texture content rather than on the bitstream.
- DirectStorage and Zstd compression, listed alongside the rendering features as a platform-level capability and confirmed independently by Ronald's GamingBolt-quoted Game Dev Update remarks.
What Microsoft did not disclose: the SoC's commercial name, the CPU and GPU microarchitecture generations, the NPU's TOPS budget, the memory architecture (unified vs split, bandwidth, capacity), or whether a dedicated decompression block carries forward from the Velocity Architecture. The "Magnus" / "Zen 6" / "RDNA 5" / "FSR Diamond" specifications widely repeated in the trade press (tbreak and others) come from third-party reporting, not Microsoft's announcement, and should be treated as unverified until a later disclosure confirms them.
The interesting part of Helix isn't the silicon, anyway. It's that Microsoft is shipping the same content pipeline across a fixed-target console and a wildly variable PC install base, and has the ability to tune the pipeline differently at each end while keeping the asset format portable. That's the structural advantage a fixed-hardware platform has always had, and it shows up everywhere asset streaming touches the silicon:
- Storage class is fixed. Microsoft can specify a minimum NVMe class, sequential bandwidth, queue depth, and latency floor. PC has to handle everything from a SATA SSD to a PCIe 5.0 drive and pick reasonable defaults across all of it.
- GPU model is fixed. The Zstd decompression shader can be hand-tuned for one SM count, one wave size, one cache hierarchy. The PC shader has to be acceptable across two vendors and many generations, and it shows (see "The Messy Reality on PC" above).
- Driver path is fixed. BypassIO support is guaranteed; queue scheduling can be co-designed with the I/O stack rather than negotiated through CreatorID hints.
- Decompression silicon is plausibly present. See "Where Decompression Runs (And What It Costs)" above. A dedicated decompression block on Helix would let the GPU spend its compute on rendering and ML inference rather than pipeline plumbing.
The asset format, though, is the same. A title built against DirectStorage 1.4 with GACL-conditioned Zstd chunks ships one set of asset packages and lets the runtime resolve them differently per platform. That's the standardization play, and the GamingBolt coverage of Ronald's spring Game Dev Update reinforces it: Microsoft is "leaning in very heavily" on Zstd precisely because the codec works the same everywhere it runs. The variable is where the decompression executes, not what gets decompressed.
Advanced Shader Delivery
Asset streaming is one of two pipelines Microsoft has been quietly rebuilding. The other is Advanced Shader Delivery, and it's worth describing here because the same pattern (move work off the client device, ship a smaller payload, decode just-in-time) shows up in both. The two pipelines are independent in the SDK but converge on the same end goal: a frame that doesn't stutter and a session that doesn't pause.
Just-in-time shader compilation has been the single most visible source of stutter in shipping DX12 / Vulkan titles since the API generation that introduced PSOs. Microsoft's August 2025 introduction post describes the fix: collect each title's shader requirements into a State Object Database (SODB) at build time, and on the server side pair the SODB with a specific GPU and driver profile to produce a Precompiled Shader Database (PSDB). When a player downloads the title through the Xbox app on PC, the app picks the PSDB matching the local GPU and driver and ships precompiled binaries instead of source-level shaders. JIT compilation is bypassed entirely.
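The client-side step reduces to a lookup keyed on the local hardware. A hypothetical sketch of that matching, with invented database names and key shapes (the actual SODB / PSDB formats are not public):

```python
from typing import Optional

def pick_psdb(available: dict, gpu: str, driver: str) -> Optional[str]:
    """Hypothetical sketch: the client app picks the precompiled shader
    database built for the local GPU + driver pair, falling back to
    JIT compilation (None) when no matching PSDB was published."""
    return available.get((gpu, driver))

# Invented example data: one PSDB published for one GPU/driver pair.
psdbs = {("NAVI48", "25.10.1"): "title_navi48_25101.psdb"}
assert pick_psdb(psdbs, "NAVI48", "25.10.1") == "title_navi48_25101.psdb"
assert pick_psdb(psdbs, "AD104", "552.22") is None   # no match: JIT fallback
```

The fallback case is why the Stats API matters: a title can observe its PSDB cache hit rate in the field and know how much of its install base is still paying the JIT cost.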
The GDC 2026 update formalized the developer integration story: AgilitySDK 1.619 ships an App Identity API (apps declare identity to D3D12 before device creation), a Stats API (PSDB cache hit rates exposed to runtime), PIX integration in the May 2026 release showing those stats as real-time counters, and a feature called Partial Graphics Programs that splits pipeline creation in two so titles with very large PSO counts can reuse common graphics-program prefixes. The initial PSDB delivery is debuting on the ROG Xbox Ally and ROG Xbox Ally X, distributed through the Xbox PC app.
The relevance to asset streaming is the topology, not the mechanism. Both pipelines push a pre-conditioned, platform-aware payload from a server (or local SSD) to the device, do a small amount of decode work near the GPU, and avoid running an expensive process on the client at the moment the player needs the result. Together they're the difference between a 2024-era open-world title that hitches into a new biome and a 2026-era one that doesn't.
What the Research Says
GPU decompression and texture conditioning are active research areas, and the published work explains why Microsoft's design choices in DirectStorage 1.4 land where they do.
GPU decompression architecture: CODAG
CODAG (Sitar et al., 2023) is the most useful recent paper on architecting GPU decompression kernels. Its central finding pushes back on a common assumption: prior GPU decompression schemes assigned specialized thread groups to different decoding stages, which left most of the SMs idle waiting on the critical path. CODAG eliminates the specialization, frees compute resources to run more parallel decompression streams, and lets the GPU's hardware scheduling absorb the multi-latency profile of an LZ-family decoder. They report 13.46x and 5.69x speedups over NVIDIA's RAPIDS RLE implementations, plus a 1.18x lift on Deflate, on the codecs they evaluated.
Worth flagging: CODAG benchmarks RLE and Deflate, not Zstd directly. The architectural lesson (skip thread specialization, exploit GPU hardware scheduling) is what carries over to the DirectStorage 1.4 GPU shader, and it's why a 256 KB chunk size is reasonable: small enough that the dispatch can occupy SMs efficiently, large enough that the entropy coder's setup cost amortizes.
Texture compression that targets the right bottleneck
Two recent papers are directly relevant to the Helix-era "Neural Texture Compression" line. Neural Graphics Texture Compression Supporting Random Access (2024) tackles the constraint that makes neural texture codecs hard for real-time rendering: a sampler needs to fetch a single texel without decoding the entire image. The paper pairs a convolutional encoder with a fully connected decoder operating on positional features, so the GPU can sample at run-time from a learned latent representation rather than reconstructing the whole texture upfront. Hardware Accelerated Neural Block Texture Compression with Cooperative Vectors (2025) takes the next step, mapping the decode onto NVIDIA's cooperative-vectors hardware path in a form usable inside the rendering pipeline.
These are the techniques the Helix "Deep Texture Compression" framing is pointing at. Note the layering: Zstd compresses the bitstream of an asset (BCn block bytes, mesh data, audio); a neural texture codec compresses the image content at a higher level, before the bitstream stage. They're complementary, not competing. A future asset pipeline could ship neural-texture-compressed content packaged as Zstd-compressed chunks streamed via DirectStorage, with each layer recovering a different kind of redundancy.
Entropy coding context
The entropy stage at the heart of Zstd is finite state entropy (Yann Collet's tabled-ANS implementation), descending from Duda's 2013 paper on asymmetric numeral systems. The same family underlies Oodle Kraken on PlayStation. ANS reaches within fractions of a percent of arithmetic coding's theoretical optimum at decode speeds substantially closer to Huffman's, which is the property that lets Zstd compete with Deflate on speed while handily beating it on ratio. The state-table form of tANS is also more amenable to fixed-function implementation than arithmetic coding's range arithmetic, which is one reason a console SoC could plausibly bake a Zstd decoder into silicon. That's an inference, not a confirmed Helix design point.
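The gap ANS closes is easy to show numerically. A prefix code like Huffman must spend a whole number of bits per symbol, so on a heavily skewed source it overshoots the Shannon bound that arithmetic coding and ANS approach. A toy three-symbol example (the distribution is invented for illustration):

```python
from math import log2

def shannon_entropy(probs) -> float:
    """Bits per symbol an ideal entropy coder (arithmetic, ANS) approaches."""
    return -sum(p * log2(p) for p in probs if p > 0)

# A skewed 3-symbol source. The best Huffman code assigns 1, 2, 2 bits;
# a fractional-bit coder can spend ~0.57 bits per symbol instead.
probs = [0.90, 0.05, 0.05]
huffman_bits = 0.90 * 1 + 0.05 * 2 + 0.05 * 2   # 1.10 bits/symbol
ideal_bits = shannon_entropy(probs)             # ~0.57 bits/symbol
print(f"Huffman {huffman_bits:.2f} vs entropy {ideal_bits:.2f} bits/symbol")
```

Tabled ANS gets within a fraction of a percent of the entropy figure while decoding via table lookups, which is the speed/ratio combination the section above attributes to Zstd's FSE stage.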
Project Helix and DirectStorage 1.4 together are best understood as a single platform bet rather than two adjacent announcements. Microsoft is committing to a content pipeline (Zstd-compressed, GACL-conditioned, small-chunk, GPU-decompressable) that scales from a fixed AMD console to a heterogeneous PC install base, with the same asset packages working in both. The codec choice (Zstd over a proprietary alternative) is a deliberate ecosystem play.
What's solid
- DirectStorage 1.4 ships Zstd CPU and GPU paths, with an open-source GPU shader optimized for chunks of 256 KB or smaller, and a new DStorageSetConfiguration2 CreatorID for driver-side scheduling.
- GACL delivers up to 50% better Zstd ratio on BC1, BC3, BC4, and BC5 textures, with the inverse shuffle folded into the decompression dispatch. BC7 conditioning is in flight, not shipped.
- Advanced Shader Delivery is shipping with the AgilitySDK 1.619 toolchain, debuting on the ROG Xbox Ally / Ally X, with Avowed's 85% load-time reduction as the headline case study.
- Project Helix is officially confirmed as a custom AMD SoC console with developer alpha in 2027, and DirectStorage + Zstd is on the GDC 2026 platform-feature list.
What's still open
- The widely repeated Helix specs (Magnus / Zen 6 / RDNA 5 / FSR Diamond) are third-party reporting, not Microsoft's announcement. Treat as unverified.
- Whether Helix carries forward a Velocity-Architecture-style dedicated decompression block. Microsoft hasn't said. The precedent and incentive are both strong; the silicon area cost is modest. Best guess: yes, but unconfirmed.
- Whether Zstd or GDeflate becomes the primary GPU codec in shipping titles. Likely both, with Zstd preferred when CPU decompression is acceptable and GDeflate preferred when SM headroom allows.
- How 2026 PC titles handle the integration challenges that Spider-Man 2 and Resident Evil Requiem have surfaced. Both are policy / runtime issues, not codec issues, but they're affecting players right now.
- BC7 conditioning support and its arrival timeline. The largest portion of a modern AAA texture budget is BC7, and GACL doesn't help it yet.
The thesis, restated
The SSD is becoming an extension of the GPU's memory hierarchy. Zstd is the codec that makes the cost of using it acceptable. DirectStorage 1.4 is the runtime that lets engines reach it at the request rates and chunk sizes modern asset streamers actually issue. Project Helix is the platform where Microsoft can tune the whole path end-to-end while shipping the same asset format everywhere else. None of the pieces are individually new. The bet is on shipping them as a coherent whole, and on closing the gap between "what the runtime can do" and "what the runtime does on a given player's machine."
Sources & Further Reading
Primary Microsoft sources
- DirectStorage 1.4 release adds support for Zstandard — DirectX Developer Blog, March 11, 2026. The canonical writeup of Zstd, GACL, and the new CreatorID API.
- Introducing Advanced Shader Delivery — DirectX Developer Blog, August 2025. Original SODB → PSDB pipeline announcement and the Avowed 85% claim.
- Advanced Shader Delivery: What's New at GDC 2026 — DirectX Developer Blog, March 2026. AgilitySDK 1.619, App Identity API, Stats API, PIX integration, Partial Graphics Programs.
- From GDC: Building the Next Generation of Xbox — Xbox Wire, March 11, 2026. The official Project Helix announcement.
- BypassIO — Microsoft Learn. The driver-level mechanism DirectStorage relies on for low-CPU-overhead reads.
- microsoft/DirectStorage — DirectStorage SDK, samples, and the public Zstd / GDeflate shaders. Includes the updated GpuDecompressionBenchmark sample.
Reporting
- Project Helix Will Lean Heavily Into Zstandard To Directly Stream Assets From The SSD — GamingBolt, May 2026, covering Jason Ronald's spring Xbox Game Dev Update remarks.
- Microsoft Debuts DirectStorage 1.4 at GDC 2026 — Tom's Hardware, March 2026.
- Spider-Man 2 PC's DirectStorage Issues — PC Gamer, 2025.
- Testing DirectStorage with GPU Decompression on Blackwell — Tom's Hardware.
- Resident Evil Requiem and Inconsistent GPU Decompression — PC Gamer, 2026.
Zstandard as a format
- RFC 8878 — Zstandard Compression and the application/zstd Media Type. The IETF wire-format specification.
- Zstandard project site — reference implementation, benchmarks, and design notes.
Scholarly references (verified, on arxiv)
- Asymmetric Numeral Systems (Duda, 2013) — the entropy-coding family Zstd's FSE stage descends from, and the same family Oodle Kraken's entropy coder is built on.
- CODAG: Characterizing and Optimizing Decompression Algorithms for GPUs (Sitar et al., 2023) — kernel architecture for high-throughput GPU decompression. Evaluates RLE and Deflate, not Zstd directly, but the architectural lesson applies.
- Neural Graphics Texture Compression Supporting Random Access (2024) — neural texture codec designed to be sampled at run-time, without full-image reconstruction.
- Hardware Accelerated Neural Block Texture Compression with Cooperative Vectors (2025) — maps a neural block-texture decoder onto NVIDIA's cooperative-vectors path inside the rendering pipeline.
NVIDIA Technical Blog (GDeflate)
- Accelerating Load Times for DirectX Games and Apps with GDeflate for DirectStorage — the GDeflate launch case study, with Ratchet & Clank: Rift Apart numbers.