AI computation isn’t just about crunching numbers – it’s also about moving data. A well-known challenge in computer architecture is the “memory wall”: processor speeds have increased much faster than memory speeds, so processors often sit idle waiting for data to arrive from memory. In AI workloads, this is exacerbated by the huge size of neural network models and the datasets they process. An AI chip might perform tens of trillions of operations per second, but if it can’t get data in and out of memory quickly enough, those ALUs (arithmetic logic units) sit starved for data. To mitigate this, the industry developed High-Bandwidth Memory (HBM), which stacks DRAM dies close to the processor, connects them with through-silicon vias (TSVs), and links the stack to the processor over a silicon interposer. This arrangement offers much higher bandwidth than DIMM modules plugged in farther away on the motherboard. HBM2 and HBM3, for instance, provide on the order of 1 TB/s of bandwidth to the GPU – a massive improvement over previous memory technology. However, even HBM has its limitations: the TSVs and interposer consume extra power and add latency, and the number of memory stacks you can physically place around a chip is limited (typically 4–8 stacks, due to space and wiring constraints). As AI models demand even more memory (think of giant recommendation systems or language models needing tens of GBs of fast memory), we are once again nearing a bottleneck.
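To make the “starved ALUs” point concrete, here is a minimal roofline-style sketch in Python. The peak-compute and bandwidth numbers are illustrative placeholders, not figures for any particular chip; the point is simply how little of the arithmetic hardware a low-arithmetic-intensity workload can keep busy.

```python
# Back-of-the-envelope roofline check: is a workload compute-bound or memory-bound?
# All numbers are illustrative placeholders, not specs for any particular chip.

PEAK_OPS = 100e12   # assume 100 TFLOP/s of peak compute
MEM_BW = 1e12       # assume 1 TB/s of memory bandwidth (HBM-class)

def attainable_ops(flops_per_byte: float) -> float:
    """Roofline model: throughput is capped by compute or by memory bandwidth."""
    return min(PEAK_OPS, MEM_BW * flops_per_byte)

# Large matrix-vector products (common in transformer inference) reuse each weight
# byte only a couple of times, so their arithmetic intensity is around 1 FLOP/byte.
for intensity in (1, 10, 100, 1000):
    frac = attainable_ops(intensity) / PEAK_OPS
    print(f"{intensity:5d} FLOP/byte -> {frac:6.1%} of peak compute usable")
```

At 1 FLOP/byte, only about 1% of the assumed peak compute is usable – the rest of the time the ALUs wait on memory.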
CDimension’s Memory-On-Logic Integration: Our solution is to take memory out of its separate package and embed it directly onto the AI chip. Using our monolithic 3D integration, we can fabricate memory layers on top of logic layers. Imagine an AI processor, but now instead of being topped with just metal interconnect layers, it’s topped with an entire layer of DRAM cells (built with 2D material transistors). Then possibly another layer, and so on. This effectively creates an AI Computing Stack where layer 1 could be logic, layer 2 memory, layer 3 logic, etc., all in one chip. The connections between these layers are simply part of the chip manufacturing – no TSVs, no interposer. This approach is often called computing-in-memory or logic-in-memory integration, and it’s like the Holy Grail for breaking the memory wall. The distance between logic and memory shrinks from centimeters (on a motherboard) or millimeters (with HBM on an interposer) down to microns (on a single 3D chip).
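One way to see why that distance reduction matters: the energy to move a bit scales with the capacitance of the wire being driven, which scales with its length. The sketch below uses rough textbook capacitance-per-length values and an assumed 1 V swing; the link lengths are illustrative, and TSV pads, I/O drivers, and termination are ignored (which would only add to the long-distance cost). It is a scaling illustration, not measured data.

```python
# Rough energy-per-bit comparison for driving a signal over different distances.
# Capacitance-per-length values are ballpark textbook figures, not measured data.

V = 1.0  # assumed signal swing in volts

links = {
    "PCB trace to a DIMM (~5 cm)":        5e-2 * 150e-12,  # ~150 pF/m board trace
    "interposer wiring to HBM (~5 mm)":   5e-3 * 200e-12,  # ~200 pF/m fine on-package wire
    "monolithic inter-tier via (~1 um)":  1e-6 * 200e-12,  # same pF/m, micron-scale length
}

for name, cap_farads in links.items():
    energy_fj = cap_farads * V**2 * 1e15  # dynamic switching energy, E ~ C*V^2, in fJ
    print(f"{name:36s} ~{energy_fj:10.3f} fJ/bit")
```

Going from centimeters to microns cuts the wire energy per bit by several orders of magnitude in this simple model.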
No TSVs = Big Wins: TSVs (through-silicon vias) are essentially vertical copper pillars that pass through silicon to connect stacked dies. While useful, they are relatively large (typically several micrometers wide), need significant keep-out spacing around them, and add capacitance and consume power every time a signal passes through. Because our memory and logic are fabricated together rather than as separate chips, we don’t need TSVs at all – and we immediately save that overhead. It’s reported that in some HBM stacks, a noticeable fraction of the power is spent just driving signals through TSVs and across the interposer; we avoid that entirely. TSVs can also hurt yield (they are essentially holes that can introduce stress or defects), so eliminating them can improve manufacturing yield, or at least simplify the process. Our 3D memory instead uses monolithic inter-tier vias (MIVs), which are analogous to the normal interconnects in a chip – just running between layers – and can be made far denser and more power-efficient than TSVs.

By integrating memory monolithically, without TSVs, CDimension’s 3D HBM puts memory right where it’s needed instead of in separate dies communicating over relatively long distances. In our design, memory arrays sit directly above compute units, connected by millions of small vertical vias. When a neural engine core needs to fetch weights or activations, those bits might be stored just micrometers above it, in a memory cell reachable through a direct vertical hop. The bandwidth of such a connection is immense: with essentially one bit per vertical via and millions of vias available, we far exceed the few thousand connections a typical HBM stack provides. The latency is also extremely low – it’s like having cache-level proximity, but for large memory.
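To get a feel for why millions of vertical connections matter, here is a crude bandwidth tally. The HBM numbers correspond to a public HBM3-class interface (a 1024-bit bus at roughly 6.4 Gb/s per pin); the via count and per-via rate on the monolithic side are assumptions for illustration, not CDimension specifications.

```python
# Crude aggregate-bandwidth estimate: (number of vertical links) x (data rate per link).

def aggregate_bw_tb_per_s(num_links: int, gbit_per_link: float) -> float:
    return num_links * gbit_per_link * 1e9 / 8 / 1e12  # bits/s -> terabytes/s

# ~1024-bit HBM3-class interface at ~6.4 Gb/s per pin
hbm_stack = aggregate_bw_tb_per_s(1024, 6.4)
# hypothetical monolithic stack: one million inter-tier vias at a modest 1 Gb/s each
monolithic = aggregate_bw_tb_per_s(1_000_000, 1.0)

print(f"HBM-style stack interface : ~{hbm_stack:7.2f} TB/s")
print(f"Monolithic 3D via array   : ~{monolithic:7.2f} TB/s")
```

Even at a modest per-via data rate, sheer connection count pushes the aggregate figure into a different regime.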
1000× Improvement Possibilities: In our R&D, we have demonstrated that using MoS₂-based memory devices in a 3D integration can lead to up to 1000× higher integration density and bandwidth. What does this number mean? It means you could store and access 1000 times more data in the same area compared to a conventional silicon DRAM on an interposer. Part of this comes from eliminating bulky TSV structures, part from the ability to stack multiple layers, and part from the inherently smaller device footprint of 2D-material memory transistors. Energy per bit can likewise drop by roughly three orders of magnitude because we’re not driving signals far – each bitline/wordline is extremely short and capacitances are low, and the operating voltage of our 2D memory can be comparable to or even lower than that of standard DRAM. These combined improvements attack the memory bottleneck from both sides: more bandwidth, less energy.
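One way to read a headline number like 1000× is as a product of smaller factors. The split below is purely illustrative – the individual factors are placeholders chosen to multiply to roughly 1000×, not measured CDimension results – but it shows how vertical-pitch, layer-count, and cell-footprint gains compose.

```python
# Decomposing a large density/bandwidth gain into multiplicative factors.
# The individual factors are illustrative placeholders, not measured results.

factors = {
    "no TSV keep-out area / finer vertical pitch": 10,
    "multiple stacked memory tiers":               10,
    "smaller 2D-material memory cell footprint":   10,
}

total = 1
for name, factor in factors.items():
    total *= factor
    print(f"{name:45s} x{factor}")
print(f"combined improvement: ~{total}x")
```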
For an AI example, consider a training accelerator for a language model: with current technology, you might have 40 GB of HBM delivering ~1 TB/s. With CDimension’s integrated memory, you could imagine 40 GB of on-chip memory delivering perhaps ~100 TB/s (a 100× jump) or more, and doing so at a fraction of the energy. In fact, because the memory is so close, you might not need as much cache on the processor, saving area and power there too. It changes the balance of the system: algorithms that were memory-bound can suddenly become compute-bound (a good problem to have if you have lots of compute), and it allows more parallelism – you can feed many more cores in parallel without starving them.
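The arithmetic behind that example is simple: divide the weight footprint by the bandwidth. A quick sketch, reusing the hypothetical 100 TB/s figure from the paragraph above:

```python
# How long does one full pass over the weights take at different bandwidths?
# The 100 TB/s figure mirrors the hypothetical example in the text, not a product spec.

WEIGHT_BYTES = 40e9  # 40 GB of model weights

for label, bw_bytes_per_s in [("HBM, ~1 TB/s", 1e12),
                              ("integrated 3D memory, ~100 TB/s", 100e12)]:
    sweep_ms = WEIGHT_BYTES / bw_bytes_per_s * 1e3
    print(f"{label:32s}: {sweep_ms:7.3f} ms per full weight sweep")
```

A full sweep over the weights drops from tens of milliseconds to well under a millisecond in this simple model, which is what turns a memory-bound kernel into a compute-bound one.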
Scaling and Flexibility: Another advantage is how this approach scales. If you need more memory, you don’t necessarily have to enlarge the die footprint or add more chips around it – you can add another memory layer to the 3D stack (assuming thermal and design considerations are managed). We essentially trade a little more height for a lot more memory. And because the memory is integrated, we can customize it per application: one AI chip might include a layer of SRAM for very low latency, while another includes a dense DRAM layer for larger capacity – or even both. The architecture blurs the line between “on-chip memory” and “off-chip memory,” because all memory is now effectively on-chip.
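As a sketch of that mix-and-match idea, one could describe a stack as data and tally its capacity; the tier types, sizes, and latencies below are hypothetical examples, not a CDimension product configuration.

```python
# Describing a heterogeneous 3D memory stack as data and summing its capacity.
# Tier types, sizes, and latencies are hypothetical examples.

from dataclasses import dataclass

@dataclass
class MemoryTier:
    kind: str            # e.g. "SRAM" (low latency) or "DRAM" (high density)
    capacity_gib: float  # capacity contributed by this tier
    latency_ns: float    # illustrative access latency

stack = [
    MemoryTier("SRAM", 0.25, 2.0),   # small, fast tier right above the logic
    MemoryTier("DRAM", 16.0, 30.0),  # dense tier for bulk capacity
    MemoryTier("DRAM", 16.0, 35.0),  # extra capacity by adding a tier, not a bigger die
]

total_gib = sum(tier.capacity_gib for tier in stack)
print(f"{len(stack)} memory tiers, {total_gib:.2f} GiB of on-chip capacity")
```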
Challenges and Our Progress: It’s worth noting that this approach is cutting-edge and comes with challenges. One challenge is heat: stacking memory on logic means the logic layer’s heat has to go through memory layers to get out (unless you have interspersed thermal conduits). We have been exploring thermal solutions like micro thermal vias or arranging high-heat and low-heat layers smartly. Another challenge is yield: a defect in any layer could ruin the whole stack, so manufacturing processes need to be very robust (or incorporate redundancy). The good news is that memory, especially DRAM, can tolerate redundancy (bad cells can be laser-fused out and replaced with spares, etc.), and we can apply similar redundancy in our 3D memory layers. Our development is staged – our Phase II product focuses on a 3D integrated HBM built on 2D materials, which will let us refine these techniques in a practical device. So far, results are very promising and indicate that these challenges are solvable, paving the way for volume-production 3D memory on logic.
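The redundancy argument can be made quantitative with a toy yield model. The sketch below uses the classic Poisson yield formula with made-up defect densities and a hypothetical repair budget; it is only meant to show why spare rows and columns make a stacked memory tier far more forgiving of defects.

```python
# Toy yield model: why redundancy makes stacked memory tiers manufacturable.
# Defect density, area, and repair budget are illustrative assumptions, not process data.

import math

def yield_no_repair(area_cm2: float, defects_per_cm2: float) -> float:
    """Classic Poisson yield model: Y = exp(-A * D0)."""
    return math.exp(-area_cm2 * defects_per_cm2)

def yield_with_spares(area_cm2: float, defects_per_cm2: float, max_repairable: int) -> float:
    """A tier is good if it contains at most `max_repairable` defects, all fixable by spares."""
    lam = area_cm2 * defects_per_cm2
    return sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(max_repairable + 1))

AREA, D0 = 1.0, 0.5  # 1 cm^2 memory tier, 0.5 defects per cm^2 (made-up numbers)
print(f"no redundancy      : {yield_no_repair(AREA, D0):.1%}")
print(f"up to 4 repairable : {yield_with_spares(AREA, D0, 4):.1%}")
```

Under these assumptions, a tier that would yield around 60% with no repair becomes nearly always usable once a handful of defects can be repaired with spares.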
AI Implications: Breaking the memory bottleneck means AI chips can keep their arithmetic units busy much more of the time. Training large models could be faster because GPUs/TPUs (or our future 3D AI ICs) won’t be bottlenecked by waiting on data from memory. Inference on big models (think running GPT-XYZ type models) could be done with lower latency because the weights can be accessed so quickly. It could also enable new algorithms – for instance, those that rely on very fast fine-grained memory access or even in-memory computing (performing certain computations directly where the data is stored). By essentially giving an AI chip an abundance of “feeding bandwidth,” we allow architects to design systems with more cores or more parallel processes without hitting the bandwidth ceiling.
In edge devices, having high-bandwidth memory on-chip means you can do powerful AI processing without relying on cloud offloading, because the device can handle large models internally. Imagine AR glasses or drones with supercomputer-level AI chips that don’t consume insane power – integrated memory is key for that vision.
Conclusion: Memory has long been the Achilles’ heel for computing performance scaling. With 3D monolithic integration of memory and logic, CDimension is turning the Achilles’ heel into a new strength. We’re effectively building AI chips that eat data for breakfast – providing them with massive, efficient on-chip data storage and access. As we perfect this technology, the concept of a memory bottleneck could become a thing of the past for AI systems, which means the only limit to performance will be how many compute units we can cram in – and we have plans for that too, as discussed in the next blog about fully 3D-integrated AI chips.