
AMD Announces the Ryzen 9 9950X3D2 Dual Edition: History Made in Cache

AMD has officially confirmed the processor enthusiasts have been waiting months for. The Ryzen 9 9950X3D2 Dual Edition is the world’s first desktop chip to feature 3D V-Cache on both compute chiplets, and it arrives April 22nd.

There have been few processor announcements in recent memory that felt genuinely, structurally different from what came before. AMD’s Ryzen 9 9950X3D2 Dual Edition is one of them. Announced this morning by Jack Huynh, AMD’s SVP and GM of Computing and Graphics, through a video posted to AMD’s YouTube channel, this chip does something no desktop processor has ever done: it applies 3D V-Cache technology across both of its compute chiplets simultaneously, not just one.

The result is 208MB of total on-chip cache, comprising 192MB of L3 and 16MB of L2. That is the highest cache figure ever shipped in a consumer Ryzen processor, and AMD is not shy about leaning into that record. Full official specifications are available on the AMD product page.

What Is Dual 3D V-Cache, Exactly?

To understand why this matters, a quick bit of context helps. AMD’s 3D V-Cache technology works by stacking additional SRAM directly on top of a compute chiplet via through-silicon vias. Previous X3D processors, including the Ryzen 9 9950X3D, applied this extra cache to only one of the two CCDs in a dual-chiplet design. The second CCD ran with standard cache allocations. That asymmetry created real-world trade-offs, with certain workloads performing better depending on which CCD a task was scheduled to.

The 9950X3D2 eliminates that compromise entirely. Both CCDs now carry stacked V-Cache, meaning every core on the chip has equal access to the expanded cache pool. A useful way to picture it: imagine two Ryzen 7 9850X3D processors fused onto a single package. That is essentially what AMD has engineered here.

“208MB of cache means more game data, more assets, and more working data sitting right next to the CPU cores.” — Jack Huynh, AMD SVP

Specifications at a Glance

| Specification | Detail |
| --- | --- |
| Architecture | Zen 5 (Granite Ridge) |
| Cores / Threads | 16 / 32 |
| Base / Boost Clock | 4.3 GHz / up to 5.6 GHz |
| L1 / L2 / L3 Cache | 1280 KB / 16 MB / 192 MB |
| Total Cache | 208 MB |
| 3D V-Cache | Both CCDs (second generation) |
| TDP | 200W |
| Socket | AM5 |
| Memory | DDR5, up to 256 GB |
| PCIe Version | PCIe 5.0 |
| Instruction Set Extensions | AVX-512, AVX2, AES-NI |
| Cooler Included | No (liquid cooler recommended) |
| Overclocking | Unlocked |
| Launch Date | April 22, 2026 |
| Price | TBA |

Performance Claims and Target Workloads

AMD is quoting 5 to 10 percent performance gains over the existing Ryzen 9 9950X3D across creative workloads, with applications like DaVinci Resolve, Blender, and large-scale source code compilations such as Unreal Engine and Chromium cited specifically. Some third-party analyses of AMD’s slide deck suggest gains of up to 13 percent in select tasks, though those figures come with the usual caveats around controlled testing conditions.

It is worth stressing that AMD is not pitching this primarily as a gaming upgrade. The company acknowledges that standard X3D gaming performance is already exceptional, and incremental frame rate gains from the dual-cache design are unlikely to be the headline story. Where AMD says the 9950X3D2 will really distinguish itself is in workloads that live and die by data access latency: large software builds, game engine compiles, AI model training on-device, 3D rendering pipelines, and complex content creation workflows.

That 200W TDP is a number worth paying close attention to, and AMD’s note that a liquid cooler is recommended for optimal performance underscores just how much heat this chip is designed to manage. AMD has historically made strong arguments for the efficiency credentials of its Zen architecture; we have covered that ground in detail at BonTech Labs, including why Intel continues to trail AMD on power efficiency. Whether dual 3D V-Cache pushes AMD’s thermal story in a direction that complicates those comparisons will be worth examining carefully when review hardware is in hand.

The Long Road to Announcement

The 9950X3D2 has had a peculiar path to official existence. Rumours of a dual X3D design circulated throughout 2025, and by the time CES 2026 arrived, most enthusiasts expected AMD to confirm it alongside the Ryzen 7 9850X3D. It did not. The chip was conspicuously absent from AMD’s CES press event, and it remained officially unacknowledged for weeks afterwards.

Then, just days ago, ASRock inadvertently published BIOS support documentation for the 9950X3D2 across its motherboard lineup, effectively announcing the chip before AMD had a chance to. This morning’s video from Jack Huynh brings everything to light at last, with a confirmed worldwide launch date of April 22nd.

For owners of existing AM5 motherboards, the transition should be straightforward. The 9950X3D2 supports the same chipset lineup as current Ryzen 9000 series processors, including X870E, X870, B650E, B650, and A620, so a BIOS update should be all that is required.

What About Zen 6?

The 9950X3D2 arrives at an interesting moment on AMD’s roadmap. Zen 5 has proven competitive across both consumer and professional segments, but speculation about what comes next is already building. If you are trying to decide whether now is the right time to invest in an AM5 flagship or whether it is worth waiting, BonTech Labs has a detailed breakdown of what Zen 6 and AVX-512 support means for PC and server buyers, and it is worth reading before committing to a platform decision.

Pricing and Availability

AMD has not yet revealed the retail price, and that remains the most consequential unknown right now. The Ryzen 9 9950X3D currently sits at around $675 to $700 in most markets. The 9950X3D2 carries additional manufacturing complexity with its three-die package design, so a premium over the existing flagship is virtually certain. Whether AMD prices it to compete with its own Threadripper line or positions it closer to the top of the mainstream desktop stack will define its appeal to the professional and enthusiast markets it is clearly targeting.

We will be pursuing review hardware ahead of the April 22nd launch and will have full performance analysis live as close to release as possible.

Dell XPS 13 9345 Review: Snapdragon X Elite Does the Business in Dell’s Thinnest Laptop Yet

The Dell XPS 13 9345 is one of the most compelling compact Windows laptops released in recent memory, and the Snapdragon X Elite X1E-80-100 at its heart is a significant part of why. At 1.19kg and just 15.3mm thick, Dell has built its thinnest and lightest XPS ever, and it genuinely does not feel like anything meaningful has been sacrificed to get there. The CNC-machined aluminium chassis is premium throughout, the 13.4-inch 2K 120Hz InfinityEdge display is sharp and colour-accurate, and the quad-speaker system punches well above its weight for a machine this size. This laptop makes a strong first impression and, for the most part, backs it up once you get into the details.

Design and Build (Snapdragon X Elite version)

Dell has always known how to make an attractive laptop, and the XPS 13 9345 continues that tradition with what is arguably the best-looking XPS the company has produced. The CNC-machined aluminium chassis feels genuinely premium from the moment you pick it up, with a fit and finish that competes with anything at this price point and well above it. At 1.19kg, it is light enough that you legitimately forget it is in your bag, and the 15.3mm profile at its thickest point puts it in direct physical competition with the MacBook Air in a way that very few Windows laptops can claim. The graphite finish on our review unit is clean and professional, without tipping into the kind of anonymous corporate grey that plagues many ultrabooks in this space.


The 13.4-inch InfinityEdge display makes the machine feel smaller than its screen size suggests, and that is a deliberate design outcome rather than an accident. Dell has pushed the bezels down to the point where the laptop chassis itself is not much wider than the screen, which gives the whole package a cohesion that larger-bezel competitors lack. The 2K 120Hz non-touch panel in our configuration delivers 500 nits of brightness, and the adaptive refresh rate keeps things fluid whether you are scrolling through a document or watching streaming content. Colour accuracy is good for general use and video consumption, and outdoor visibility is reasonable at higher brightness settings, although the glossy finish makes direct sunlight challenging. Overall, the display technology is well matched to the kind of user this laptop is aimed at.


The keyboard is where opinions will diverge. Dell has gone with a flat, low-travel design that prioritises the thin chassis over typing depth, and whether that works for you depends heavily on how much time you spend typing and what you’re coming from. It is functional for extended use once you have adjusted your expectations, but anyone coming from a ThinkPad or a MacBook with more traditional key travel will notice the difference. The haptic touchpad similarly takes an adjustment period, with no physical border and a click mechanism that is entirely software-driven. Once calibrated to your preferences, it works well, the glass surface is accurate and smooth, and multi-finger gestures are responsive. It is just different enough from a conventional touchpad that you will spend a day or two relearning muscle memory.


The speaker system is a genuine highlight that deserves more attention than it typically gets in coverage of this machine. The quad-speaker arrangement with a tweeter and woofer configuration produces 8W of total output with actual bass presence, noticeable stereo separation, and volume levels more than adequate for a room environment. Most laptops at this size and weight produce thin, tinny audio that is barely acceptable for a video call. The XPS 13 9345 sounds meaningfully better than that, and it is one of those details that makes day-to-day use noticeably more pleasant.

The port situation needs to be addressed directly and honestly because it is one of the most significant real-world limitations of this machine. There are two USB-C ports, both supporting USB4 at 40Gbps with DisplayPort and Power Delivery. That is the entire connectivity story. No USB-A, no headphone jack, no SD card slot, no native HDMI out. Both ports are on the same side of the machine, which creates cable management annoyances at a desk. Dell has clearly made a deliberate design choice to strip the chassis to its minimum to meet dimension and weight targets, and it is a defensible decision. Still, it is also one that will require most buyers to purchase a USB-C hub as an effectively mandatory accessory, a cost that belongs in the total when comparing prices against competitors.

CPU Performance (Snapdragon X Elite)

The Snapdragon X Elite X1E-80-100 is a 12-core Arm-based SoC built on TSMC’s 4nm process, with a dual-core boost ceiling of 4GHz, a 45 TOPS NPU for Copilot+ PC features, and Qualcomm’s Adreno integrated graphics handling display output and light GPU workloads. The architecture draws on the same design philosophy that has made Apple Silicon compelling since the M1, prioritising energy-efficient compute over brute-force clock speeds, and the results show that approach paying off in the traditionally x86-dominated Windows laptop market at a level that genuinely competes for the first time.

[Chart: Cinebench 2024 results]

Starting with Cinebench 2024, the XPS 13 9345 posts a multi-core score of 943 and a single-core score of 118. The multi-core result leads every direct competitor in this class by a meaningful margin. The Microsoft Surface Laptop 2024 13-inch, which runs the same Snapdragon X Elite silicon, scores 798 on multi-core and 123 on single-core. That is a 145-point gap on multi-core between two machines with the same processor, which suggests that Dell’s thermal configuration allows the chip to sustain higher frequencies under load for longer than Microsoft’s chassis permits. The HP OmniBook X 14 lands at 742 multi-core, the Asus Zenbook 14 OLED at 507, and the ThinkPad X1 Nano Gen 3, running an Intel Core Ultra processor, trails the field significantly at 318. The XPS 13 is the class leader here, and it is not particularly close.

[Chart: Geekbench 6.3 results]

Geekbench Pro 6.3 tells a broadly consistent story. The XPS 13 leads multi-core with 14,574, fractionally ahead of the Surface Laptop at 14,432 and more clearly ahead of the HP OmniBook X 14 at 13,233, the Asus Zenbook at 12,310, and the ThinkPad at 10,744. Single-core scores compress into a tight cluster at the top of the chart, with the Surface Laptop marginally ahead at 2,825 versus 2,821 for the XPS 13, a difference so small as to be statistically meaningless. It is worth noting that the ThinkPad posts a competitive single-core score of 2,469 despite being demolished on multi-core, which reflects the architectural difference between Intel’s hybrid-core design and Qualcomm’s more homogeneous approach rather than any particular tuning issue.

[Chart: HandBrake 1.8.0 results]

The HandBrake 1.8.0 real-world encoding test is where the performance picture becomes most compelling. Transcoding a 12-minute 4K source file down to 1080p using the Fast 1080p30 preset, the XPS 13 9345 completes the task in 4 minutes 35 seconds. Every other machine in this comparison takes longer, and some take considerably longer. The Surface Laptop, again running identical silicon, needs 5 minutes 9 seconds, which is 34 seconds behind despite the same chip. The HP OmniBook X 14 takes 6 minutes 1 second, the Asus Zenbook 6 minutes 24 seconds, and the ThinkPad X1 Nano Gen 3 takes over 10 minutes. The HandBrake result is a sustained multi-threaded workload that punishes machines with poor thermal management, and the XPS 13 handles it better than any of its direct competition. Dell’s engineers have clearly done careful work on the thermal solution inside a chassis this thin, and it produces real, measurable results.

[Chart: UL Procyon Computer Vision results]

On the UL Procyon Computer Vision benchmark, which specifically tests Arm CPU performance on AI- and machine-learning-adjacent workloads, the XPS 13 9345 scores 1,708. This lands it third in a group that is honestly very tightly bunched, behind the HP OmniBook X 14 at 1,771 and the HP EliteBook Ultra G1q at 1,762, and just ahead of the Surface Laptop 2024 at 1,705 and the Surface Pro 2024 at 1,694. The entire field spans fewer than 80 points, indicating the Snapdragon X Elite platform delivers essentially consistent AI compute performance regardless of which OEM has configured it. The differentiator across this field on AI workloads is the platform, not the implementation.

Battery Life and Real-World Use

Battery life is the single strongest argument the Snapdragon X Elite makes against x86 competition, and the XPS 13 9345 demonstrates it convincingly. The 55Wh cell is not large by the standards of bigger laptops, but the combination of Arm efficiency and Qualcomm’s power management means it goes considerably further than equivalent x86 machines with similar or larger batteries. Across extended sessions of streaming BBC iPlayer over hospital Wi-Fi, light document work, and several hours of browsing without access to a charger, the machine kept going from a morning charge well into the evening without complaint. Dell’s all-day battery claim is not marketing language in this case; it holds up in real-world conditions, and that is genuinely more than can be said for most x86 ultrabooks in this size class.

The 60W USB-C power adapter that ships with the machine is appropriately compact for travel and charges the battery at a reasonable rate when you do plug in. Because both USB-C ports support Power Delivery, you also have flexibility about which side of the machine you charge from, depending on where the socket is. That sounds like a small detail, but it matters in the kinds of environments where you are working without a proper desk setup.


The Windows on Arm Situation

Windows on Arm in 2024 is a substantially different proposition from what it was even two years ago, and the XPS 13 9345 benefits from that progress. Qualcomm’s Prism emulation layer handles most mainstream x86 applications without any visible degradation in the experience, and native Arm64 support now covers the applications that most users spend most of their time in. Chrome, Edge, Firefox, Microsoft Office, Spotify, Visual Studio Code, Slack, Teams, Zoom, and a broad and growing catalogue of productivity tools all run natively and feel snappy. The platform distinction largely disappears for day-to-day use if your workflow sits within the mainstream.

Where the cracks still show is at the edges. Older applications with kernel-level components, certain VPN clients, some creative and media production tools, legacy enterprise software, and a meaningful chunk of the PC gaming catalogue either run with limitations or do not run at all. Driver support for some peripherals can also be inconsistent. None of this is Qualcomm’s or Dell’s fault specifically; it is the reality of a platform transition that is still in progress, but it is also not something that can be glossed over in an honest review. If your workflow depends on any software in those categories, Dell’s Snapdragon compatibility checker at dell.com is a necessary first stop before purchase, not an optional one.


All photos were taken while in hospital for life-threatening surgery and a nasty infection; we’re dedicated to tech even in life and death!

For users whose needs are covered, the Copilot+ PC features add a layer of increasingly useful on-device AI functionality. Live Captions, AI image generation in Paint and Photos, and the NPU-accelerated features in supported applications all work as advertised. Whether they change how you work depends entirely on what you do, but they are present and functional rather than being a marketing checkbox.

Dell XPS 13 9345 (Snapdragon X Elite) Verdict

The sustained multi-core lead in Cinebench and the HandBrake encoding result make a clear case for what Dell’s thermal tuning has achieved here. The fact that the XPS 13 outpaces machines running the same chip is not a small detail. It means Dell has done the engineering work properly rather than simply dropping the Snapdragon X Elite into a thin chassis and hoping for the best. The battery life is the other headline result, and in genuine extended real-world use, it delivers on every claim made for it.

The port situation is the one area that requires honest acknowledgement. Two USB-C ports and nothing else is a deliberate design trade-off, and for desk users, it means a hub is a required purchase rather than an optional one. The Windows on Arm compatibility question is also worth taking seriously before buying, rather than after. Neither issue is a reason to avoid this laptop, but both deserve more than a footnote.

For the right user, someone who wants a compact, light, beautifully built Windows laptop with genuine all-day battery life and a workflow that sits within the mainstream, the Dell XPS 13 9345 is one of the best ultrabooks available right now at any price. The benchmarks make the case, and real-world use backs it up without reservation. Buy the hub at the same time.


Dell provided the XPS 13 9345 used in this review for testing. Dell had no input on the contents of this article. Pricing correct at time of writing.

At the time of review, the Dell XPS 13 Copilot+ 9345 laptop can be purchased for £999.99 from Amazon UK.


Korean AI Startup Upstage Is in Talks to Buy 10,000 AMD MI355X Accelerators

South Korean AI startup Upstage is in discussions with AMD to purchase 10,000 MI355X accelerators, according to Bloomberg, and the story is worth paying attention to for reasons that go beyond the headline number.

Upstage CEO Sung Kim confirmed the talks after meeting AMD CEO Lisa Su in Seoul last week. The quote he gave Bloomberg is the most interesting part of the whole thing: “We have a lot of Nvidia chips in Korea, but we want to diversify to other chips, including AMD’s.” That is not a complaint about Nvidia. It is a deliberate infrastructure strategy, and it is one that more organisations are starting to think seriously about as GPU supply constraints and vendor concentration risk become real operational concerns.

AMD is already an investor in Upstage, having participated in its Series B funding round. So this is not a cold commercial negotiation. It is a deepening of an existing relationship, with Upstage looking to put AMD silicon to work on its Solar language model and on Korea’s national AI foundation model programme. That programme, which the press has taken to calling the “AI Squid Game” after the Netflix series, pits four teams against each other in a government-backed competition evaluated every six months by the Ministry of Science and ICT. Two finalists will be selected by early next year, with the winners receiving additional allocations of Nvidia GPUs. Upstage is currently preparing a model with around 200 billion parameters for the upcoming summer evaluation round.

The MI355X is AMD’s latest Instinct accelerator, built on CDNA 4 architecture with 288 GB of HBM3E memory per card and 8 TB/s of memory bandwidth. Those are serious specifications, and the memory capacity, in particular, is the one that matters most for large-model inference. Running a 200B parameter model requires memory headroom that most accelerators simply cannot provide without aggressive quantisation or multi-node splitting. The MI355X’s memory capacity addresses that directly, and it is part of why AMD has been picking up large-scale enterprise AI commitments from organisations looking for alternatives to Nvidia’s HGX lineup.
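As a rough sketch of why that capacity figure matters, the arithmetic below sizes raw weight storage for a 200B-parameter model against a single MI355X. The byte-per-parameter figures are standard; ignoring KV cache, activations, and framework overhead is our simplification, not anything Upstage or AMD has published.

```python
# Back-of-the-envelope sizing: can a 200B-parameter model's weights fit on
# one MI355X? Assumption (ours): weights only; KV cache and activations
# would push the real requirement higher.

PARAMS = 200e9                # 200 billion parameters
MI355X_HBM_GB = 288           # HBM3E capacity per card, per the article

for label, bytes_per_param in [("FP16", 2), ("INT8", 1)]:
    weights_gb = PARAMS * bytes_per_param / 1e9
    cards_needed = -(-weights_gb // MI355X_HBM_GB)    # ceiling division
    print(f"{label}: {weights_gb:.0f} GB of weights -> at least {cards_needed:.0f} card(s)")

# FP16: 400 GB -> at least 2 cards; INT8: 200 GB -> fits on a single card,
# which is the headroom argument in practice.
```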

The scale of this potential deal, 10,000 cards, would represent a meaningful deployment by any standard. At 288 GB per card, that is 2.88 petabytes of HBM3E accelerator memory if the full order goes through. That level of compute density requires serious infrastructure planning, and it signals that Upstage is not treating this as a trial run. Kim also confirmed that the company is targeting international expansion into markets such as Vietnam and the UAE with sovereign AI systems, which means this compute build-out is not just for domestic competition.

From AMD’s perspective, this is exactly the kind of deal the company needs to keep building momentum in a market that Nvidia still dominates by a considerable margin. AMD’s ROCm software stack has historically been the sticking point for organisations considering a switch, but the gap has narrowed enough that enterprises and startups alike are increasingly willing to run mixed deployments rather than treating Nvidia as the only viable option. The process technology underpinning the MI355X also matters here: TSMC’s capacity constraints at advanced nodes affect every major chip customer, and AMD’s ability to deliver at scale is a real consideration for any organisation planning a large procurement.

It is also worth noting the broader context AMD is operating in. The company has been under market pressure recently, with its stock declining in premarket trading on Monday despite the Upstage news, largely due to macro concerns around Middle East tensions and their effect on supply chains. But the underlying demand signal from deals like this one is clear: there is a growing pool of organisations that want high-performance AI compute, have specific memory and throughput requirements, and are actively looking beyond Nvidia to meet them. AMD’s efficiency advantage over competing x86 architectures is part of what makes its Instinct lineup credible for power-conscious deployments at scale.

The discussions are ongoing rather than finalised, and procurement at this scale involves logistics, software support commitments, and pricing negotiations that take time to close. But the direction of travel is clear. Korea is building serious AI infrastructure, Upstage is positioning itself as a competitive player in that environment, and AMD is the chip partner they are turning to for the next phase of that build-out.


The PC Hardware Industry Has a Memory Problem, and Nobody Is Talking About It Honestly

The past few months in PC hardware have been eventful by any measure. Apple shipped the M5 Pro and M5 Max, Intel clarified its core architecture roadmap, and anyone trying to build a new PC has been quietly suffering through DRAM pricing that refuses to behave. These stories look separate on the surface. They are not.

The thread connecting all of them is memory, specifically the growing gap between what compute silicon can do and what the memory feeding it can keep up with.

Start with Apple. The M5 Pro and M5 Max are genuinely interesting chips, not just because of the performance numbers but because of what Apple was forced to do architecturally to get there. Fusion Architecture, Apple’s move to a dual-die SoC design, exists primarily because a single monolithic die cannot accommodate 40 GPU cores, 614 GB/s of unified memory bandwidth, and 18 CPU cores without hitting yield and cost walls. The memory bandwidth figure is the one that matters most for AI workloads running locally, and Apple knows it. The M5 Max at 614 GB/s is not chasing gaming benchmarks. It is chasing large language model inference throughput, and bandwidth is the bottleneck that determines how fast it runs.


That bandwidth problem is not unique to Apple. It is an industry-wide crisis, and the full picture of why is considerably more complicated than most coverage lets on. The AI memory crisis running through the data centre right now traces back to physics: DRAM scaling has not kept pace with compute scaling, HBM production is constrained by TSV fabrication yields and advanced packaging capacity, and the most powerful AI systems on the planet spend more time waiting for data than actually processing it. That is not a software problem. It is a silicon and packaging problem, and it does not have a quick fix.

For anyone building a PC right now, the consequences land differently, but they are still real. DRAM pricing has been pulled in two directions simultaneously: AI infrastructure demand is bidding directly on supply at the high end, while consumer DDR5 pricing has been volatile enough to meaningfully change the calculus on a new build from one month to the next. If you have been holding off on a memory upgrade, waiting for prices to settle, the honest answer is that the market dynamics driving this are structural rather than cyclical. Prices may ease, but the pressure from AI demand on overall DRAM supply is not going away.

On the Intel side, there has been a lot of noise about the company killing off its hybrid core architecture in favour of a unified core design. The reality, as is usually the case with Intel roadmap speculation, is more nuanced. Intel is not killing P-cores, at least not in the timeframe the headlines suggest. The unified core concept is a longer-term architectural direction, and the practical implications for anyone buying an Intel platform in the next year or two are limited. What matters more right now is whether Intel’s current generation delivers the performance-per-watt improvements it needs to stay competitive, particularly in a market where Apple Silicon has reset expectations for mobile efficiency and AMD’s desktop Zen 5 parts are putting pressure on the high end.

The bigger picture across all of this is straightforward: memory is the constraint that determines where performance goes next, whether that is Apple designing a new packaging approach to get more bandwidth, hyperscalers paying premiums to secure HBM allocation, or a consumer trying to figure out whether now is a sensible time to buy a 32 GB DDR5 kit. The compute side of the industry has never been more capable. The memory side is struggling to keep up, and that tension is shaping every major hardware decision being made right now.


Apple M5 Pro and M5 Max: Everything You Need to Know

Apple has officially announced the M5 Pro and M5 Max, the new chips powering the latest 14-inch and 16-inch MacBook Pro, with pre-orders opening March 4 and availability from March 11. On the surface, this looks like another generational chip update, but dig into what Apple has actually done here, and it is a more interesting story than the spec sheet alone suggests.

A New Architecture at the Core of It

The biggest change with M5 Pro and M5 Max is not the core count or the clock speeds; it is how Apple has built the chips in the first place. Both are constructed using what Apple calls Fusion Architecture, which bonds two third-generation 3nm dies together into a single package using TSMC’s advanced SoIC packaging technology. Every M1 through M4 Pro and Max before this was a single monolithic die. That changes here, and the reason why matters.

TSMC’s manufacturing process limits how large a single die can be while maintaining reasonable yield. By splitting the design across two smaller dies and bonding them together, Apple sidesteps that constraint and can reach memory bandwidth and core count figures that a single N3P die could not have delivered economically. The key claim Apple is making is that the inter-die interconnect is fast and low-latency enough that the operating system and applications treat the package as a single unified device, preserving the unified memory model that Apple Silicon has always depended on. Apple has done something similar at the Ultra tier since the M1 Ultra in 2022, but bringing it to the Pro and Max tiers for laptop chips is a harder engineering challenge, and apparently one Apple has now solved.

Apple M5 Pro and M5 Max CPUs: More Cores, New Core Design

M5 Pro and M5 Max both run an 18-core CPU comprising six super cores and twelve all-new performance cores. The super core is Apple’s highest-performance core design, the same one introduced in the base M5, and Apple claims it delivers the world’s fastest single-threaded performance, driven by increased front-end bandwidth, a revised cache hierarchy, and better branch prediction.

The twelve performance cores are a new design built specifically for the Pro and Max tiers and are not the same as the efficiency cores in previous generations. Where M4 Pro’s E-cores were optimised primarily for power gating, the new performance cores in M5 Pro and M5 Max are designed to deliver sustained multithreaded throughput at lower power than the super cores. That distinction is important for professionals running long compilation jobs, simulations, and rendering workloads that sit somewhere between light background tasks and all-out burst workloads.

Apple is claiming a 30 percent multithreaded uplift for M5 Pro over M4 Pro, which is the largest single-generation CPU gain at the Pro tier since the original M1 Pro. M4 Pro had 14 cores and M5 Pro jumps to 18, a 29 percent increase in core count alone, so the claim is internally consistent. M5 Max gets a more modest 15 percent MT uplift over M4 Max, reflecting the smaller core count jump from 16 to 18.

GPU and Neural Accelerators

M5 Pro gets up to a 20-core GPU, and M5 Max scales to 40 cores. The M5 Pro GPU core count matches the M4 Pro exactly, so the graphics performance gains here are entirely from architectural improvements per core rather than from adding more cores. Apple puts that at around 20 percent better conventional graphics performance and up to 35 percent for ray-traced workloads, with the ray-tracing improvement specifically attributed to Apple’s third-generation ray-tracing engine alongside second-generation dynamic caching and hardware-accelerated mesh shading support.


The more significant GPU addition is the Neural Accelerator that sits inside each GPU core. This is separate from the Neural Engine that handles background Apple Intelligence and Core ML workloads. The Neural Accelerators are dedicated to accelerating matrix multiplication operations that dominate large-model inference when they run through the GPU compute pipeline, as they do in applications like LM Studio and ComfyUI. Apple claims over 4x the peak GPU compute for AI relative to M4 Pro and M4 Max. However, it is worth noting that this figure reflects the Neural Accelerator path specifically, not the conventional shader performance improvement, which is the more measured 20 percent figure.

Memory Bandwidth: The Number That Actually Matters

M5 Pro supports up to 64 GB of unified memory with 307 GB/s of bandwidth, up from 48 GB and 273 GB/s on M4 Pro. M5 Max holds at a maximum capacity of 128 GB but raises bandwidth from 546 GB/s to 614 GB/s in the top 40-core configuration.

For a growing number of professional workloads, memory bandwidth is more important than raw compute performance, and local LLM inference is the clearest example of why. When generating tokens, a large language model must load its full parameter weights from memory on every forward pass. For a 70B-parameter model in 16-bit floating point, that is roughly 140 GB of data moving per token generated, with comparatively little computation performed on it. That makes the workload bandwidth-bound rather than compute-bound, which means 614 GB/s translates directly into faster token generation. For context, AMD’s Ryzen AI Max Plus in the best Windows laptop configuration delivers around 273 GB/s, less than M5 Pro and considerably less than M5 Max. M5 Max also has the memory capacity to run models that cannot fit on any discrete GPU configuration available today, making the bandwidth advantage meaningful in practice rather than just on paper.
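That relationship is simple enough to sketch. The estimate below treats decode as purely bandwidth-bound, streaming the full FP16 weights once per token; the bandwidth figures are the ones quoted above, and real-world throughput will be lower once KV-cache traffic and other overheads are included.

```python
# Upper bound on decode speed when token generation is bandwidth-bound:
# tokens/s <= memory bandwidth / bytes read per token (one full weight pass).

def max_tokens_per_sec(bandwidth_gb_s: float, params_billion: float,
                       bytes_per_param: float) -> float:
    bytes_per_token = params_billion * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

for chip, bw in [("M5 Pro", 307), ("M5 Max", 614), ("Ryzen AI Max (quoted)", 273)]:
    rate = max_tokens_per_sec(bw, params_billion=70, bytes_per_param=2)
    print(f"{chip} at {bw} GB/s: <= {rate:.1f} tokens/s for a 70B FP16 model")

# M5 Max: ~4.4 tokens/s ceiling vs ~2.2 for M5 Pro, scaling linearly with bandwidth.
```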

Everything Else Worth Knowing About Apple’s new M5 Pro and M5 Max SoCs


Thunderbolt 5 is standard across M5 Pro and M5 Max, and Apple specifies that each port has its own dedicated on-chip controller rather than sharing bandwidth through a motherboard switch. That means each port gets the full 120 Gb/s bandwidth independently. The Media Engine handles H.264, HEVC, and AV1 decode, and ProRes encode and decode, with the Max tier doubling the encode and ProRes throughput, as it has in previous generations. Internal SSD speeds are claimed at up to 14.5 GB/s, roughly double the previous generation, which matters for model loading and high-bitrate video workflows. The new MacBook Pro also picks up Apple’s N1 wireless chip, bringing Wi-Fi 7 and Bluetooth 6.

One feature that tends to get overlooked in launch coverage is Memory Integrity Enforcement, which Apple’s platform security documentation confirms is available on M5-class processors. It is an always-on, hardware-level memory safety mechanism that does not compromise device performance and is specifically designed to protect the kernel attack surface. For enterprise and research users, that is a meaningful security addition that no competing laptop platform currently matches.

The Competitive Picture

No Windows laptop in 2026 combines the memory bandwidth, memory capacity, and power efficiency of M5 Max in a laptop form factor. AMD Strix Halo is the closest competitor for the LLM inference use case and deserves credit for meaningfully closing the gap over recent generations, but the bandwidth deficit remains a structural disadvantage within laptop thermal and form-factor constraints. Qualcomm’s Snapdragon X2 Elite is a credible CPU competitor at the consumer tier, but its GPU and memory bandwidth are not on the same level as Apple’s Pro and Max tiers.

Wrapping it Up: Apple M5 Pro and M5 Max look the part on paper

M5 Pro and M5 Max are genuine steps forward, not just tick-tock updates. Fusion Architecture is the most important Apple Silicon architectural change since M1 Ultra, now applied to the chips that actually go into MacBook Pros. The memory bandwidth figures are the highest available in any laptop, the CPU gains at the Pro tier are the strongest in years, and the Neural Accelerator addition positions both chips well for the continued growth of local AI inference as a professional workload.

Whether Apple’s claimed numbers hold up in independent testing is the question that matters most right now, and that answer starts arriving when hardware ships on March 11.

The AI Memory Crisis: A Deep Technical Analysis of HBM3E, HBM4, DRAM Process Technology, and the Bandwidth Wall Constraining AI

For the better part of a decade, the semiconductor industry’s AI narrative centered on compute. More TFLOPS. Bigger dies. Denser transistors. The assumption, implicit in countless product launches and architectural deep-dives, was that if we could build enough compute, the rest would follow. Memory would scale. Bandwidth would keep pace. The limiting factor was always arithmetic capability.

That assumption is now demonstrably false.

The AI memory crisis is not a theoretical concern or a problem for the next generation. It is the binding constraint on current deployments, the primary driver of AI accelerator pricing, and the reason hyperscalers are signing multi-billion-dollar prepayment agreements with memory vendors. Understanding this crisis (its technical roots, its supply chain manifestations, and its potential solutions) requires going deep into the physics of memory, the engineering of High Bandwidth Memory stacks, and the economics of advanced semiconductor packaging.

Part I: The Physics of Memory-Bound AI

Before examining HBM technology itself, we must establish precisely why memory bandwidth has become the constraining factor. This requires understanding the fundamental memory access patterns of modern AI workloads and how they differ from the compute patterns that shaped previous generations of processor design.

Arithmetic Intensity and the Roofline Model

The roofline model, introduced by Williams, Waterman, and Patterson in 2009, provides a framework for understanding performance limits. The model plots achievable performance (in FLOPS) against arithmetic intensity (FLOPS per byte of memory traffic), with two ceilings: a horizontal compute ceiling and a sloped memory bandwidth ceiling.

For a given processor with peak compute capability C (in FLOPS) and memory bandwidth B (in bytes/second), the achievable performance P for a workload with arithmetic intensity I (in FLOPS/byte) is:

P = min(C, B × I)

The inflection point, where a workload transitions from memory-bound to compute-bound, occurs at arithmetic intensity I* = C/B. For NVIDIA’s H100 SXM:

  • Peak FP16 Tensor Core performance: 1,979 TFLOPS
  • HBM3 bandwidth: 3.35 TB/s
  • Inflection point: I* = 1979/3.35 ≈ 591 FLOPS/byte

Any workload with arithmetic intensity below 591 FLOPS/byte is memory-bound on H100. This seems like a high bar. Surely most workloads perform more than 591 operations per byte accessed? For traditional HPC, often yes. For transformer inference, catastrophically no.
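The model is compact enough to state in a few lines of code. A minimal sketch using the H100 SXM figures quoted above:

```python
# Roofline model: achievable performance P = min(C, B × I).
# C and B are the H100 SXM figures from the text.

PEAK_TFLOPS = 1979.0        # C: peak FP16 Tensor Core throughput
BANDWIDTH_TB_S = 3.35       # B: HBM3 bandwidth

def roofline_tflops(intensity: float) -> float:
    """Achievable TFLOPS at a given arithmetic intensity (FLOPS/byte)."""
    return min(PEAK_TFLOPS, BANDWIDTH_TB_S * intensity)   # TB/s × FLOP/B = TFLOPS

inflection = PEAK_TFLOPS / BANDWIDTH_TB_S                 # I* = C/B ≈ 591
print(f"inflection point: {inflection:.0f} FLOPS/byte")
for i in (1, 10, 100, 591):
    p = roofline_tflops(i)
    print(f"I = {i:4d}: {p:7.1f} TFLOPS ({p / PEAK_TFLOPS:.1%} of peak)")

# At I = 1 (FP16 GEMV), the H100 sustains ~3.35 TFLOPS: 0.2% of its peak.
```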

Transformer Memory Access Patterns: A Detailed Analysis

The transformer architecture, which underlies virtually all modern large language models, exhibits memory access patterns that systematically produce low arithmetic intensity. Understanding why requires examining each phase of transformer computation.

Linear Projections (QKV and Output)

The core computation in transformers involves matrix-vector multiplications for generating Query, Key, Value, and Output projections. For a model with hidden dimension d_model and a single token:

  • Weight matrix size: d_model × d_model parameters
  • Computation: 2 × d_model² FLOPs (multiply-accumulate)
  • Memory access: d_model² × bytes_per_param (weights) + d_model × bytes_per_activation (input)

In the batch-size-1 case (common in interactive inference), arithmetic intensity is approximately:

I = 2 × d_model² / (d_model² × bytes_per_param) = 2 / bytes_per_param

For FP16 weights (2 bytes), this yields I ≈ 1 FLOP/byte. For INT8 (1 byte), I ≈ 2 FLOPS/byte. This is two to three orders of magnitude below the H100’s inflection point.

Batching helps. Processing B tokens simultaneously amortizes weight loading:

I_batched ≈ 2B / bytes_per_param

To reach compute-bound operation on H100 with FP16 weights would require batch sizes on the order of 600 tokens per weight matrix (B ≥ I* × bytes_per_param / 2 ≈ 591 by the formula above), and in practice more once attention and other memory traffic are counted. For latency-sensitive interactive applications, such batch sizes are often impractical.
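A short sketch of that batching arithmetic, assuming FP16 weights and the I* ≈ 591 inflection point derived earlier:

```python
# Arithmetic intensity of a weight-streamed linear layer: I ≈ 2B / bytes_per_param.
# Assumes activation traffic is negligible, per the approximation above.

H100_INFLECTION = 591    # FLOPS/byte, from Part I

def intensity(batch: int, bytes_per_param: float = 2.0) -> float:  # FP16 default
    return 2 * batch / bytes_per_param

for batch in (1, 8, 64, 600):
    i = intensity(batch)
    regime = "compute-bound" if i >= H100_INFLECTION else "memory-bound"
    print(f"batch = {batch:4d}: I ≈ {i:6.0f} FLOPS/byte -> {regime}")

# Only at batch sizes around 600 does the weight-streaming term alone cross
# into compute-bound territory on H100.
```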

Attention and KV Cache

The attention mechanism introduces additional memory pressure through the KV (Key-Value) cache. During autoregressive generation, previously computed keys and values must be stored and accessed for each new token.

Total KV cache size across all layers:

KV_size = 2 × batch_size × seq_length × num_heads × head_dim × bytes_per_element × num_layers

For a 70B-class model with full multi-head attention (80 layers, 64 heads, 128 head_dim) and a 4K context in FP16 (Llama 2 70B itself trims this with grouped-query attention, but the scaling behaviour is identical):

KV_size = 2 × 1 × 4096 × 64 × 128 × 2 × 80 = 10.7 GB

This grows linearly with sequence length. At 128K context (increasingly common for modern models), KV cache alone reaches 343 GB, exceeding the capacity of even the B200’s 192GB HBM.
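The formula translates directly into code. A sketch using the full multi-head-attention dimensions assumed above (a grouped-query model would divide the head count accordingly):

```python
# Total KV cache size across all layers, following the formula in the text.

def kv_cache_gb(batch: int, seq_len: int, num_heads: int, head_dim: int,
                bytes_per_elem: int, num_layers: int) -> float:
    # Factor of 2 covers keys and values
    return 2 * batch * seq_len * num_heads * head_dim * bytes_per_elem * num_layers / 1e9

for context in (4_096, 131_072):     # 4K and 128K contexts
    size = kv_cache_gb(1, context, num_heads=64, head_dim=128,
                       bytes_per_elem=2, num_layers=80)
    print(f"{context:>7}-token context: {size:6.1f} GB of KV cache")

#   4096-token context:   10.7 GB
# 131072-token context:  343.6 GB, beyond a single B200's 192 GB of HBM
```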

Worse, the attention computation itself (softmax(QK^T)V) exhibits poor arithmetic intensity because the attention matrix must be fully materialized or computed in tiles, with substantial memory traffic for the softmax normalization.

MLP Layers

The feed-forward (MLP) layers in transformers typically use a 4× expansion factor. For a model with hidden dimension d_model:

  • Up-projection weights: d_model × 4×d_model
  • Down-projection weights: 4×d_model × d_model
  • Total: 8 × d_model² parameters

These layers exhibit the same poor arithmetic intensity as the attention projections when processing small batches. The massive parameter count (MLP layers typically comprise roughly two-thirds of total model parameters) makes them the dominant source of memory bandwidth pressure.
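A quick tally under the shapes above (4× expansion, separate up- and down-projections; biases, norms, and gated variants ignored) shows where the two-thirds figure comes from:

```python
# Per-layer parameter split between attention and MLP for a simple
# transformer block: 4·d² of attention projections vs. 8·d² of MLP weights.

def layer_param_split(d_model: int) -> tuple[int, int, float]:
    attn = 4 * d_model * d_model      # Q, K, V, and output projections
    mlp = 8 * d_model * d_model       # up (d -> 4d) plus down (4d -> d)
    return attn, mlp, mlp / (attn + mlp)

attn, mlp, share = layer_param_split(8192)   # d_model of a 70B-class model
print(f"attention: {attn / 1e6:.0f}M, MLP: {mlp / 1e6:.0f}M, MLP share: {share:.0%}")
# MLP share: 67%, i.e. roughly two-thirds of per-layer parameters.
```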

Quantifying Real-World Arithmetic Intensity

Empirical measurements of LLM inference consistently show effective arithmetic intensities between 0.5 and 10 FLOPS/byte, depending on batch size, model architecture, and quantization scheme. Even optimized inference frameworks like vLLM, TensorRT-LLM, or custom CUDA kernels cannot escape the fundamental math: moving weights from HBM to compute units dominates execution time.

This creates a situation where adding more compute capability provides diminishing returns. Doubling the TFLOPS of a memory-bound workload yields zero speedup. The industry has reached the point where next-generation accelerators are increasingly defined by their memory subsystems rather than their compute arrays.

Part II: HBM Architecture, A Deep Dive

High Bandwidth Memory represents the semiconductor industry’s most aggressive attempt to address the memory bandwidth problem. Understanding its capabilities and limitations requires examining the technology at the physical, circuit, and system levels.

The DRAM Cell: Foundation of Everything

HBM, like all DRAM, stores data in capacitors. Each bit cell consists of a single transistor (the access device) and a single capacitor (the storage element), forming the “1T1C” cell that has defined DRAM for decades.

The challenge: capacitors discharge over time due to leakage currents, requiring periodic refresh. Smaller capacitors (necessary for density scaling) leak faster and store less charge, making them harder to read reliably. This fundamental physics constrains how small DRAM cells can become and, consequently, how much capacity can be achieved per die.

Modern HBM uses DRAM cells fabricated on the most advanced DRAM process nodes:

  • 1α (1-alpha): Approximately 14-15nm class, current mainstream production
  • 1β (1-beta): Approximately 12-13nm class, entering volume production 2024-2025
  • 1γ (1-gamma): Approximately 10nm class, expected 2026+

These designations refer to the minimum feature pitch in the cell array, not the node name used in logic fabrication. DRAM “10nm” is fundamentally different from logic “10nm,” and direct comparisons are misleading.

EUV Adoption in DRAM

DRAM manufacturers have historically delayed EUV (Extreme Ultraviolet) lithography adoption due to cost sensitivity, but the transition is now underway:

  • SK Hynix: EUV implementation in 1β node, expanding in 1γ
  • Samsung: Initial EUV use in 1α, broader adoption in 1β
  • Micron: 1γ will be their first EUV node (coming from a different lithography strategy)

EUV enables tighter patterning in peripheral circuits and can simplify certain cell array features, but the cost adder (roughly $150M per EUV tool) pressures margins in what remains a cost-sensitive business.

HBM Stack Architecture

An HBM stack consists of multiple DRAM dies vertically stacked atop a base logic die, interconnected via Through-Silicon Vias (TSVs). The JEDEC HBM3 specification defines the interface; implementations vary by vendor.

Die Stack Composition

| Generation | DRAM Dies | Capacity/Die | Stack Capacity | Die Thickness |
| --- | --- | --- | --- | --- |
| HBM2e 8-Hi | 8 | 2GB | 16GB | ~40μm |
| HBM3 8-Hi | 8 | 2-3GB | 16-24GB | ~40μm |
| HBM3E 8-Hi | 8 | 3-4GB | 24-32GB | ~35μm |
| HBM3E 12-Hi | 12 | 3GB | 36GB | ~30μm |
| HBM4 12-Hi (proj.) | 12 | 4GB | 48GB | ~25-30μm |
| HBM4 16-Hi (proj.) | 16 | 4GB | 64GB | ~25μm |

Die thinning is critical for tall stacks. Standard DRAM wafers are approximately 775μm thick; HBM dies must be ground to 30-40μm (thinner than a human hair) without damaging the active circuitry. This thinning process introduces mechanical fragility and yield loss.

Through-Silicon Vias: The Vertical Interconnect

TSVs are the defining technology enabler for HBM, providing thousands of vertical electrical connections through each die. Key parameters:

  • TSV diameter: Approximately 5-10μm for current HBM
  • TSV pitch: Approximately 40-55μm (center-to-center spacing)
  • TSVs per stack: More than 5,000 for HBM3
  • TSV resistance: Approximately 50-200mΩ depending on aspect ratio
  • TSV capacitance: Approximately 20-50fF

The TSV fabrication process involves:

  1. Via formation: Deep Reactive Ion Etching (DRIE) creates high-aspect-ratio holes through the silicon
  2. Dielectric liner: SiO₂ deposited to insulate the TSV from the silicon substrate
  3. Barrier/seed layer: Typically TaN/Ta/Cu stack deposited via PVD
  4. Copper fill: Electrochemical deposition (ECD) fills the via with copper
  5. CMP: Chemical-mechanical planarization removes overburden

TSV-induced stress is a persistent challenge. The copper fill has a different coefficient of thermal expansion than silicon, creating mechanical stress during thermal cycling. This stress can affect transistor performance in nearby circuits (the “keep-out zone”) and creates reliability risks over time.

The Base Logic Die

The bottom layer of each HBM stack is not a DRAM die but a logic die containing:

  • PHY (Physical Layer) circuitry: Serializers/deserializers, clock distribution, signal conditioning
  • Repair logic: Redundancy management for defective DRAM cells
  • Built-in Self-Test (BIST): Testing infrastructure for manufacturing
  • Mode registers: Configuration storage for timing and operating modes

The base die is fabricated on a logic process (typically 28nm-12nm class), not a DRAM process. This enables faster logic and better I/O circuits than would be possible on a DRAM process optimized for cell density.

In HBM4, the base die takes on increased importance. The JEDEC specification allows for (and vendors are implementing) application-specific logic in the base die, potentially including compute elements (for processing-in-memory), advanced ECC, or protocol translation. This represents a fundamental architectural shift, with the memory becoming an active participant in computation rather than passive storage.

Electrical Interface Specifications

The HBM interface is radically different from traditional DDR memory, designed for maximum bandwidth in a constrained physical space.

HBM3 Electrical Specifications

| Parameter | HBM3 Specification |
| --- | --- |
| Interface width | 1024 bits (128 bits × 8 channels) |
| Data rate | Up to 6.4 Gbps per pin |
| Signaling | Single-ended, 1.1V VDDQ |
| Channels per stack | 16 pseudo-channels (8 independent) |
| Burst length | 4 (BL4) |
| Prefetch | 8n (256 bits per channel per access) |
| Row buffer size | 2KB per bank (typical) |
| Banks per channel | 16 (4 bank groups × 4 banks) |

HBM3E Evolutionary Changes

HBM3E maintains pin-compatibility with HBM3 while increasing data rates:

  • Data rate increase: 6.4 Gbps to 8.0-9.2 Gbps
  • Achieved via improved PHY design, better signal conditioning, tighter timing margins
  • Some vendors implement additional ECC capabilities
  • 12-Hi stacks introduced for capacity scaling

The bandwidth calculation:

BW = Data Rate × Interface Width / 8 bits per byte

BW_HBM3E_9.2 = 9.2 Gbps × 1024 bits / 8 = 1,178 GB/s per stack
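The same calculation applies across generations; the HBM4 entry below uses the projected doubled interface and a mid-range projected data rate, so treat it as illustrative rather than specified:

```python
# Per-stack bandwidth: BW = data rate × interface width / 8 bits per byte.

def stack_bandwidth_gb_s(data_rate_gbps: float, width_bits: int) -> float:
    return data_rate_gbps * width_bits / 8

configs = [
    ("HBM3, 6.4 Gbps × 1024 bits", 6.4, 1024),
    ("HBM3E, 9.2 Gbps × 1024 bits", 9.2, 1024),
    ("HBM4, 8.0 Gbps × 2048 bits (projected)", 8.0, 2048),
]
for name, rate, width in configs:
    print(f"{name}: {stack_bandwidth_gb_s(rate, width):,.0f} GB/s per stack")

# HBM3: 819 GB/s, HBM3E: 1,178 GB/s, projected HBM4: 2,048 GB/s (~2 TB/s).
```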

HBM4 Architectural Changes

HBM4 represents a more significant departure:

| Parameter | HBM4 Specification (Projected) |
| --- | --- |
| Interface width | 2048 bits (doubled from HBM3) |
| Data rate | 6-8 Gbps initial, roadmap to 12+ Gbps |
| Independent channels | 16 (up from 8) |
| Bandwidth per stack | 1.5-2 TB/s |
| Base die | Customizable logic integration |

The doubled interface width is the critical change. It nearly doubles available bandwidth without requiring proportional increases in signaling rate. However, this comes with significant implementation challenges:

  • Interposer routing: 2× more traces required between HBM and processor
  • Bump count: Approximately 2× micro-bumps per stack
  • Power delivery: Higher aggregate I/O power
  • Controller complexity: More channels to manage simultaneously

Power Consumption Analysis

HBM power consumption is a frequently overlooked constraint that becomes increasingly important as stack counts and data rates increase.

Power Breakdown

HBM power consists of several components:

  • I/O power: Driving signals between HBM and processor; scales with data rate and activity
  • Core power: Activating rows, sensing data, refresh; relatively constant per bit stored
  • Peripheral power: Clocking, command decode, PHY; scales with frequency

Typical HBM3E stack power consumption:

| Operating Mode | Power (approximate) |
| --- | --- |
| Idle (self-refresh) | 2-3W |
| Read-intensive (high BW) | 12-18W |
| Write-intensive | 14-20W |
| Peak (sustained max BW) | 18-25W |

For a B200 with eight HBM3E stacks, memory alone can consume 100-200W under load, representing a substantial fraction of the total package power budget.

Energy Efficiency Metrics

The industry typically measures memory energy efficiency in picojoules per bit (pJ/bit):

  • DDR5: Approximately 15-25 pJ/bit
  • GDDR6: Approximately 8-15 pJ/bit
  • HBM3: Approximately 3-5 pJ/bit
  • HBM3E: Approximately 2.5-4 pJ/bit
  • HBM4 target: Approximately 2-3 pJ/bit

HBM’s efficiency advantage over alternatives stems from its wide interface (moving more bits per clock cycle) and short signaling distance (microbumps versus package traces). For AI workloads that move enormous amounts of data, this efficiency advantage compounds into meaningful total power savings.
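To make pJ/bit concrete, the sketch below prices a single full weight read of the 70B FP16 model from Part I (roughly 140 GB) at the midpoint of each technology's quoted range; the midpoints are our interpolation, not vendor figures:

```python
# Energy cost of moving one full 70B FP16 weight set (~140 GB) at each
# memory technology's approximate efficiency midpoint.

WEIGHT_BYTES = 140e9
WEIGHT_BITS = WEIGHT_BYTES * 8

for tech, pj_per_bit in [("DDR5", 20.0), ("GDDR6", 11.5), ("HBM3E", 3.25)]:
    joules = WEIGHT_BITS * pj_per_bit * 1e-12
    print(f"{tech:6s} (~{pj_per_bit} pJ/bit): {joules:5.1f} J per full weight read")

# DDR5 ~22.4 J vs HBM3E ~3.6 J per pass; repeated thousands of times per
# second across a cluster, that efficiency gap compounds into kilowatts.
```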

Part III: Advanced Packaging, The True Bottleneck

If HBM is the strategic resource constraining AI hardware, advanced packaging is the strategic bottleneck constraining HBM deployment. The ability to physically integrate HBM stacks alongside logic dies is limited by packaging technology, and that packaging technology is limited primarily by TSMC’s manufacturing capacity.

Silicon Interposer Technology

The silicon interposer is the foundation of HBM integration. It is a large piece of silicon (larger than either the logic die or HBM stacks) that provides fine-pitch interconnect between components.

Interposer Specifications

| Parameter | Typical Values |
| --- | --- |
| Interposer thickness | 100-150μm |
| Interposer size (H100) | ~2,500 mm² |
| Interposer size (B200) | ~4,000+ mm² |
| RDL layers | 3-4 layers typical |
| RDL pitch | 0.4-1.0μm line/space |
| TSV pitch (interposer) | 40-50μm |
| Micro-bump pitch | 40-55μm (current), 25-40μm (advanced) |
The interposer is fabricated using a dedicated process on 65nm-class equipment. It doesn’t require cutting-edge lithography for the RDL layers, but the TSV formation and planarization steps are complex. Larger interposers face reticle limits (the maximum exposure area of a lithography tool), requiring stitching (multiple exposures) for interposers exceeding approximately 800mm².

Micro-Bump Interconnect

Micro-bumps connect the HBM stacks and logic die to the interposer. These are small solder spheres (typically SnAg alloy) that are reflowed to form electrical connections.

Current micro-bump specifications:

  • Diameter: Approximately 25-40μm
  • Pitch: Approximately 40-55μm
  • Height: Approximately 20-35μm after reflow
  • Resistance: Less than 50mΩ per bump

Micro-bump count scales with interface width: an HBM3 stack requires roughly 1,500-2,000 bumps including power and ground, while HBM4’s doubled interface will approach 3,000+ bumps per stack. Yield loss in micro-bump formation is a persistent challenge, as a single failed bump can render the assembly non-functional.

CoWoS Variants in Detail

TSMC’s CoWoS (Chip-on-Wafer-on-Substrate) is the dominant platform for HBM integration, with multiple variants optimized for different use cases.

CoWoS-S (Standard)

CoWoS-S uses a monolithic silicon interposer:

  • Interposer: Single silicon die with RDL and TSVs
  • Size limit: Approximately 1,700mm² (reticle-limited) without stitching, up to approximately 2,500mm² with stitching
  • Applications: NVIDIA H100, AMD MI250/MI300
  • Yield: Mature process with reasonable yields
  • Cost: High but predictable

The H100 SXM uses CoWoS-S with an approximately 2,350mm² interposer carrying the GH100 die and five HBM3 stacks.

CoWoS-L (Local Silicon Interconnect)

CoWoS-L enables larger effective interposer areas by using multiple Local Silicon Interconnect (LSI) chips on an RDL interposer:

  • Architecture: Small silicon chips (LSIs) provide fine-pitch routing in critical regions; coarser RDL connects LSIs
  • Size capability: 3,000-5,000mm² effective area
  • Applications: NVIDIA Blackwell B100/B200 with dual-die configuration
  • Complexity: Higher than CoWoS-S; requires careful design partitioning
  • Yield: Can be better than very large CoWoS-S due to smaller silicon pieces

Blackwell’s architecture (two compute dies connected via NVLink) demands CoWoS-L. The alternative (a single interposer large enough for both dies plus eight HBM stacks) would face severe reticle and yield challenges.

CoWoS-R (RDL Interposer)

CoWoS-R replaces the silicon interposer with an organic RDL structure:

  • Interposer: Multi-layer organic RDL on substrate
  • Pitch: Coarser than silicon (approximately 2μm vs. approximately 0.4μm)
  • Cost: Lower than silicon interposer
  • Performance: Slightly lower bandwidth potential due to coarser routing
  • Applications: Cost-sensitive products, lower HBM stack counts

Capacity Constraints and Expansion

CoWoS capacity has been the binding constraint on AI accelerator shipments since 2023. TSMC’s capacity evolution:

| Year | Approximate CoWoS Capacity (monthly starts) |
| --- | --- |
| 2023 | ~12-15K |
| 2024 | ~25-35K |
| 2025 (projected) | ~50-60K |
| 2026 (projected) | ~80-100K |

Even with aggressive expansion, demand continues to outstrip supply. Every Blackwell unit, every MI300X, every Google TPU v5, every Amazon Trainium 2 competes for the same CoWoS lines. The expansion requires not just floor space but specialized equipment (high-accuracy die bonders, precision dispensing systems for underfill, advanced inspection tools) with lead times exceeding 12 months.

Die Bonding and Assembly Process

The CoWoS assembly process flow reveals the complexity involved:

  1. Interposer fabrication: TSV formation, RDL patterning, passivation
  2. Interposer wafer probe: Electrical testing to identify good interposer sites
  3. Micro-bump formation: UBM (under-bump metallurgy) deposition, bump plating on interposer
  4. HBM stack attachment: Thermocompression bonding of tested/known-good HBM stacks to interposer
  5. Logic die attachment: Thermocompression bonding of GPU/accelerator die
  6. Underfill dispense: Capillary flow of epoxy underfill for mechanical stability
  7. Underfill cure: Thermal cure of underfill material
  8. Interposer thin and TSV reveal: Backside grinding to expose TSVs
  9. Substrate attach: Flip-chip bonding to organic substrate
  10. Substrate-level testing: Full functional testing
  11. Lid attach: Thermal interface and protective lid installation
  12. Final test: Comprehensive characterization

Each step introduces potential defects and yield loss, and the losses compound: even at 99% yield per step, the twelve steps above multiply out to roughly 89% cumulative yield. In practice, overall CoWoS assembly yields are significantly lower, though exact figures are closely guarded.
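
The compounding is easy to make concrete:

```python
# Cumulative yield across sequential assembly steps: total = per_step ** n.
def cumulative_yield(per_step: float, steps: int = 12) -> float:
    return per_step ** steps

print(f"{cumulative_yield(0.99):.1%}")   # twelve 99% steps  -> ~88.6%
print(f"{cumulative_yield(0.995):.1%}")  # twelve 99.5% steps -> ~94.2%
```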

Hybrid Bonding: The Future of High-Density Interconnect

Micro-bump pitch is approaching physical limits. At pitches below approximately 25μm, bump bridging (adjacent bumps shorting together) and non-wet opens (bumps failing to form connections) become increasingly problematic. The industry is transitioning toward hybrid bonding for future high-density applications.

Hybrid Bonding Technology

Hybrid bonding directly connects copper pads on two surfaces without solder, achieved through:

  1. Surface preparation: Ultra-flat surfaces (CMP to less than 0.5nm roughness)
  2. Plasma activation: Surface treatment to enable low-temperature bonding
  3. Alignment: Sub-micron accuracy placement
  4. Direct bonding: Dielectric-to-dielectric bonding at room temperature
  5. Anneal: Thermal treatment to form copper-copper bonds

Key parameters:

  • Pitch: Less than 10μm demonstrated, approximately 5μm in production for image sensors
  • Density: More than 10,000 connections/mm² vs. approximately 500/mm² for micro-bumps (see the pitch-to-density sketch after this list)
  • Resistance: Less than 20mΩ, lower than micro-bumps
  • Bandwidth potential: More than 1 TB/s/mm² demonstrated
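
The density gap follows directly from pitch. A minimal sketch, assuming a square pad grid and a typical ~45μm micro-bump pitch (my assumption; actual pitches vary by product):

```python
# Interconnect density from pad pitch, assuming a square grid:
# connections per mm² = (1000 µm per mm / pitch in µm)².
def density_per_mm2(pitch_um: float) -> float:
    return (1000 / pitch_um) ** 2

print(f"{density_per_mm2(45):,.0f} /mm²")  # ~45 µm micro-bumps -> ~494/mm²
print(f"{density_per_mm2(9):,.0f} /mm²")   # 9 µm hybrid bonds  -> ~12,346/mm²
```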

TSMC’s SoIC platform uses hybrid bonding for die-to-die connections. For HBM, hybrid bonding could enable:

  • Wider interfaces without proportional area increase
  • Lower I/O power due to shorter interconnects
  • Tighter integration between compute and memory

However, hybrid bonding’s stringent surface and alignment requirements make it challenging for large-area applications. Volume deployment for HBM-to-logic interfaces likely remains in the 2027+ timeframe.

Part IV: Vendor Roadmaps and Competitive Dynamics

The HBM market is an oligopoly with three players: SK Hynix, Samsung, and Micron. Their competitive positions, technology roadmaps, and capacity plans will determine HBM availability for the remainder of the decade.

SK Hynix: Technical Leadership and Capacity Constraints

SK Hynix has maintained HBM leadership through consistent execution across process technology, TSV integration, and customer relationships.

Technology Position

  • Process node: HBM3E on 1α DRAM process, transitioning to 1β in 2025
  • Stack height: 12-Hi HBM3E in volume production; 16-Hi sampling for HBM4
  • Data rate: Production HBM3E at 9.2 Gbps, industry-leading
  • Capacity per stack: 36GB (12-Hi HBM3E), 48GB+ roadmap for HBM4

Manufacturing Footprint

  • DRAM fabs: Icheon (M10, M14, M15, M16), Cheongju (M11, M12)
  • HBM packaging: Primarily Icheon, expanding to Cheongju
  • Capacity expansion: New M15X fab for HBM, operational 2025-2026

Strategic Relationships

SK Hynix is NVIDIA’s primary HBM supplier, with multi-year agreements covering H100, H200, and Blackwell generations. This relationship provides revenue visibility but also creates concentrated customer risk. SK Hynix is working to diversify with AMD and hyperscaler custom silicon programs.

Financial Profile

  • HBM revenue share: Expected to exceed 40% of DRAM revenue in 2025
  • HBM margin premium: Estimated 60-70% gross margin vs. approximately 40% for commodity DRAM
  • Capex intensity: More than $15B annually, with substantial allocation to HBM

Samsung: Recovery and Catch-Up

Samsung’s HBM challenges have been well-documented: yield issues, delayed qualification, lost market share. The company’s recovery efforts represent one of the most significant competitive dynamics in the memory industry.

Technology Timeline

  • 2023: HBM3 yield issues; NVIDIA qualification delayed
  • H1 2024: HBM3E qualification for NVIDIA reportedly failed thermal testing
  • H2 2024: Claimed HBM3E qualification achieved; volume ramp begins
  • 2025: 12-Hi HBM3E expansion; HBM4 development acceleration

Technical Approach

Samsung is pursuing several differentiated strategies:

  • Advanced packaging investment: Expanded in-house HBM packaging to reduce TSMC dependency
  • Thermal solutions: New thermal interface materials and heat spreader designs
  • HBM-PIM: Processing-in-Memory variants with in-DRAM compute acceleration
  • Alternative stacking: Investigating non-TSV 3D DRAM approaches for future generations

Manufacturing Capacity

  • Fabs: Hwaseong (Lines 13-18), Pyeongtaek (P1, P2, P3)
  • New construction: P4 (Pyeongtaek); the Taylor, Texas fab under construction is foundry logic rather than DRAM
  • HBM packaging: Expanding dedicated HBM lines at multiple sites

Strategic Position

Samsung’s DRAM market leadership (by total bits shipped) provides scale advantages in wafer production, but HBM success hinges on execution in packaging and qualification, not just the wafer fab. The company’s integrated device manufacturer (IDM) model enables end-to-end control, but it also means component quality is validated in-house rather than by an external foundry or OSAT partner.

Micron: The American Option

Micron occupies a unique position as the only U.S.-headquartered HBM manufacturer, providing supply chain diversification and potential advantages under evolving semiconductor policy.

Technology Status

  • HBM3E: 8-Hi product in volume production; claimed 9.2 Gbps performance leadership
  • 12-Hi HBM3E: Sampling in late 2024, volume 2025
  • HBM4: Development on track for 2026 production
  • Process: 1β transition occurring 2024-2025

Manufacturing Strategy

Micron takes a different approach from its Korean competitors:

  • Wafer fab: Hiroshima (Japan), Taichung (Taiwan), Boise (Idaho), expanding with CHIPS Act support
  • HBM packaging: Primarily outsourced (TSMC, others), with some in-house capability
  • New capacity: Idaho and New York fabs supported by CHIPS Act funding

CHIPS Act and Geopolitical Positioning

Micron has secured approximately $6.1B in CHIPS Act grants and up to $7.5B in loans, supporting:

  • Boise, Idaho: Expanded HBM-capable DRAM production
  • Clay, New York: New megafab for advanced DRAM (long-term)
  • Domestic supply chain: Reduces dependence on Korea and Taiwan for critical AI components

For hyperscalers and defense applications with supply chain security requirements, Micron’s American footprint provides strategic value beyond pure price/performance competition.

Market Share and Pricing Dynamics

Current and projected HBM market share:

Vendor | 2024 Estimated Share | 2025 Projected Share
SK Hynix | ~50-55% | ~45-50%
Samsung | ~35-40% | ~35-40%
Micron | ~10-15% | ~15-20%

Pricing remains elevated relative to commodity DRAM (a per-accelerator cost sketch follows the list):

  • HBM3E ASP: Approximately $15-20 per GB (vs. approximately $2-3/GB for DDR5)
  • Price premium: 5-10× commodity DRAM on a per-bit basis
  • Contract structures: Long-term agreements with prepayments becoming standard
  • Price trends: Expected to remain elevated through 2026 due to supply constraints
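
Using only the figures above, the memory bill-of-materials per accelerator is straightforward to estimate (a sketch, not vendor pricing):

```python
# HBM bill-of-materials per accelerator at the ASPs quoted above.
def hbm_cost(capacity_gb: int, asp_low: float = 15.0, asp_high: float = 20.0):
    return capacity_gb * asp_low, capacity_gb * asp_high

low, high = hbm_cost(192)  # B200-class accelerator with 192 GB of HBM3E
print(f"${low:,.0f} - ${high:,.0f} of HBM per accelerator")  # $2,880 - $3,840
```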

Part V: Future Trajectories and Alternative Architectures

The HBM roadmap provides a clear near-term path, but the fundamental tension between memory bandwidth demand and deliverable supply suggests the need for architectural innovation beyond incremental HBM improvements.

The Terabytes-per-GPU Challenge

Major AI developers have publicly and privately signaled memory requirements approaching terabyte scale per accelerator by 2027-2028:

  • Model scaling: 10+ trillion parameter models in development
  • Long context: 1M+ token context windows becoming standard
  • Mixture-of-experts: MoE models require full model weight residency
  • Multi-modal: Vision and video processing dramatically increase memory footprint

Current state-of-the-art (B200: 192GB) must scale approximately 5× to reach terabyte class. How might this occur?

Path 1: More HBM Stacks

  • 8 stacks to 12 or 16 stacks per GPU
  • Requires dramatically larger interposers (CoWoS-L or beyond)
  • Power delivery becomes critical (400W+ from HBM alone)
  • Physical package size may exceed practical limits

Path 2: Higher Capacity Stacks

  • 16-Hi and 24-Hi stacks under development
  • Die thinning below approximately 25μm introduces extreme mechanical fragility
  • Thermal dissipation through tall stacks becomes limiting
  • TSV aspect ratios increase, complicating fabrication

Path 3: Higher Density DRAM

  • 1γ (10nm class) and beyond
  • 3D DRAM (vertical channel transistors) enables approximately 3× density improvement
  • Timeline: Volume production likely 2027-2028
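
Combining the first three paths shows why no single lever suffices; a minimal capacity model (stack count × stack height × per-die density, with hypothetical die capacities) makes the point:

```python
# Package capacity from the three scaling levers above:
# total GB = HBM stacks × dies per stack × GB per DRAM die.
def package_capacity_gb(stacks: int, dies_per_stack: int, gb_per_die: int) -> int:
    return stacks * dies_per_stack * gb_per_die

print(package_capacity_gb(8, 8, 3))    # B200-class today: 192 GB
print(package_capacity_gb(12, 16, 4))  # more stacks + 16-Hi + denser dies: 768 GB
print(package_capacity_gb(16, 16, 4))  # all three levers at maximum: 1,024 GB
```

Only by pushing stack count, stack height, and die density together does terabyte-class capacity come into reach, which is why the fourth path matters.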

Path 4: Architectural Innovation

Alternative memory architectures may supplement or partially replace HBM:

CXL-Attached Memory

Compute Express Link (CXL) enables memory expansion beyond the package boundary. CXL memory provides a tiered memory architecture:

  • Tier 1: On-package HBM (highest bandwidth, lowest latency)
  • Tier 2: CXL-attached memory (moderate bandwidth, medium latency)
  • Tier 3: Storage-class memory/SSD (lowest bandwidth, highest latency)

CXL 3.0 specifications:

  • Bandwidth: Approximately 128 GB/s per direction on an x16 link (PCIe 6.0 electrical at 64 GT/s)
  • Latency: Approximately 150-200ns additional vs. local memory
  • Topology: Supports memory pooling across multiple hosts

CXL’s bandwidth is one to two orders of magnitude below HBM’s, making it unsuitable for bandwidth-critical operations. For capacity expansion, however (parking cold model weights in CXL memory while hot weights stay resident in HBM), it provides a viable path to terabyte-class systems.
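
A minimal sketch of such a tiering policy, with entirely hypothetical layer names and sizes, greedily pins the hottest weights in HBM and spills the remainder to CXL:

```python
# Sketch of HBM/CXL weight tiering. Layer names, sizes, and access rates
# are hypothetical; real runtimes profile access patterns dynamically.

HBM_BUDGET_GB = 192.0

def place_weights(layers: list[tuple[str, float, float]]) -> dict[str, str]:
    """layers: (name, size_gb, access_rate). Hottest layers claim HBM first."""
    placement, used = {}, 0.0
    for name, size_gb, _rate in sorted(layers, key=lambda l: -l[2]):
        if used + size_gb <= HBM_BUDGET_GB:
            placement[name], used = "HBM", used + size_gb
        else:
            placement[name] = "CXL"
    return placement

model = [("attention", 60, 1.0), ("mlp", 100, 1.0),
         ("embeddings", 20, 0.1), ("expert_bank", 400, 0.05)]
print(place_weights(model))  # the rarely-touched MoE expert bank lands in CXL
```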

Samsung, SK Hynix, and Micron all have CXL memory products in production or sampling. The ecosystem challenge is software: efficiently tiering data between HBM and CXL memory requires runtime intelligence that remains immature.

Processing-in-Memory (PIM)

Instead of moving data to compute, PIM moves compute to data. By integrating processing elements within or adjacent to memory arrays, PIM reduces data movement for suitable workloads.

Samsung HBM-PIM

Samsung’s HBM-PIM adds programmable compute units to the HBM base die:

  • Compute capability: SIMD units for vector operations
  • Target workloads: Element-wise operations, embeddings, attention
  • Bandwidth advantage: Data processed before leaving HBM stack
  • Programming model: Custom SDK required; limited ecosystem

HBM-PIM has seen limited adoption due to programming complexity and restricted operation support. For transformer inference, where the bottleneck is feeding weights to matrix multiply units, PIM’s element-wise strengths are not ideally matched.

GDDR-Based Alternatives

GDDR6 and GDDR7 offer an alternative path with different tradeoffs:

Parameter | HBM3E | GDDR6X | GDDR7 (projected)
Bandwidth per chip | ~1.2 TB/s | ~84 GB/s | ~192 GB/s
Pins per chip | 1024 | 32 | 48
Power efficiency | ~3 pJ/bit | ~10 pJ/bit | ~8 pJ/bit
Cost per GB | $$$ | $ | $$
Package complexity | High (interposer) | Low (package) | Low

GDDR requires more physical chips to achieve equivalent bandwidth (16-24 chips versus 6-8 HBM stacks), consuming substantially more board area and power. For data center accelerators where density and efficiency are paramount, HBM remains preferred. GDDR is more suitable for consumer GPUs and edge devices where cost sensitivity dominates.
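
The table makes the chip-count arithmetic easy to check: matching even a single HBM3E stack takes a handful of GDDR devices.

```python
import math

# GDDR chips required to match one HBM3E stack's bandwidth (table values).
HBM3E_STACK_GBPS = 1200  # ~1.2 TB/s per stack

for name, per_chip_gbps in (("GDDR6X", 84), ("GDDR7", 192)):
    chips = math.ceil(HBM3E_STACK_GBPS / per_chip_gbps)
    print(f"{name}: {chips} chips per HBM3E stack equivalent")
```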

Optical Memory Interfaces

Looking further ahead, optical interconnects could fundamentally change memory architecture:

  • Bandwidth potential: Terabit-class per fiber
  • Distance: Meters instead of millimeters, enabling disaggregated memory
  • Power: Potentially lower at high bandwidth-distance products
  • Latency: Propagation at the speed of light, but electro-optic conversion at each end adds overhead

Intel, NVIDIA, and startups like Ayar Labs are developing co-packaged optical I/O. Production deployment for memory interfaces remains in the 2028+ timeframe at earliest, but the technology could enable architectures where memory is physically separated from compute while maintaining high bandwidth.

Part VI: The Strategic Picture

The AI memory crisis is not merely a technical challenge. It is reshaping competitive dynamics across the semiconductor industry and influencing the pace of AI capability development.

Industry Implications

  • Memory vendors: Transformed from commodity suppliers to strategic partners; pricing power and margin expansion
  • TSMC: Advanced packaging has become as strategic as leading-edge logic; CoWoS capacity expansion is a capital priority
  • NVIDIA/AMD: Memory subsystem design increasingly differentiates products; software optimization for memory efficiency becomes critical
  • Hyperscalers: Supply chain security requires multi-vendor strategies and forward commitments
  • AI developers: Algorithm research increasingly targets memory efficiency (quantization, sparsity, efficient architectures)

The Efficiency Imperative

The memory wall is driving innovation in AI efficiency that will have lasting impact regardless of whether hardware constraints eventually ease:

  • Quantization: FP8, INT4, and below reduce memory footprint with minimal accuracy loss (worked example after this list)
  • Sparsity: Structured and unstructured sparsity techniques reduce effective parameter counts
  • Architecture innovation: Linear attention, state-space models, and other alternatives that scale better
  • Speculative decoding: Using smaller models to reduce large model invocations
  • Caching and retrieval: External knowledge bases reduce the need for massive parameter counts
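
The quantization arithmetic is worth making explicit; a quick worked example with a hypothetical 70B-parameter model:

```python
# Weight footprint vs. numeric precision: bytes = parameters × bits / 8.
def weights_gb(params_billion: float, bits: int) -> float:
    return params_billion * bits / 8  # 1e9 params × (bits/8) bytes = GB

for bits in (16, 8, 4):
    print(f"70B params @ {bits}-bit: {weights_gb(70, bits):.0f} GB")
# FP16 -> 140 GB, 8-bit -> 70 GB, INT4 -> 35 GB: only the quantized variants
# leave headroom for KV cache on a 192 GB accelerator.
```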

These algorithmic advances, driven by hardware constraints, may ultimately prove more impactful than the hardware improvements themselves.

Conclusion: The Memory-Defined Era

We have entered a period where AI hardware progress is measured not in TFLOPS but in terabytes and TB/s. The memory wall is real, it is present, and it will shape the trajectory of artificial intelligence development for the remainder of this decade.

The industry’s response (HBM scaling, packaging innovation, alternative architectures, algorithmic efficiency) will determine whether AI capabilities continue their exponential trajectory or bend to a more constrained path. The companies that solve these challenges will define the next generation of AI infrastructure. The companies that fail to adapt will find their products bottlenecked by an increasingly expensive and scarce resource.

Memory is the new compute. HBM is the new gold. And the AI memory crisis is just beginning.
