AI model using AMD GPUs for training hits milestone

Zyphra, AMD, and IBM spent a year testing whether AMD’s GPUs and platform can support large-scale AI model training, and the result is ZAYA1.

In partnership, the three companies trained ZAYA1 – described as the first major Mixture-of-Experts foundation model built entirely on AMD GPUs and networking – which they see as proof that the market doesn’t have to depend on NVIDIA to scale AI.

The model was trained on AMD’s Instinct MI300X chips, Pensando networking, and ROCm software, all running across IBM Cloud’s infrastructure. What’s notable is how conventional the setup looks. Instead of experimental hardware or obscure configurations, Zyphra built the system much like any enterprise cluster—just without NVIDIA’s components.

Zyphra says ZAYA1 performs on par with, and in some areas ahead of, well-established open models in reasoning, maths, and code. For businesses frustrated by supply constraints or spiralling GPU pricing, it amounts to something rare: a second option that doesn’t require compromising on capability.

How Zyphra used AMD GPUs to cut costs without gutting AI training performance

Most organisations follow the same logic when planning training budgets: memory capacity, communication speed, and predictable iteration times matter more than raw theoretical throughput. 

MI300X’s 192GB of high-bandwidth memory per GPU gives engineers some breathing room, allowing early training runs without immediately resorting to heavy parallelism. That tends to simplify projects that are otherwise fragile and time-consuming to tune.

Zyphra built each node with eight MI300X GPUs connected over InfinityFabric and paired each one with its own Pollara network card. A separate network handles dataset reads and checkpointing. It’s an unfussy design, but that seems to be the point; the simpler the wiring and network layout, the lower the switch costs and the easier it is to keep iteration times steady.

ZAYA1: An AI model that punches above its weight

ZAYA1-base activates 760 million parameters out of a total 8.3 billion and was trained on 12 trillion tokens in three stages. The architecture leans on compressed attention, a refined routing system to steer tokens to the right experts, and lighter-touch residual scaling to keep deeper layers stable.
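The sparse-activation idea behind those numbers can be illustrated with a small top-k routing sketch. This is a generic Mixture-of-Experts router, not ZAYA1's actual routing system; the shapes and the choice of k = 2 are illustrative assumptions.

```python
# Hedged sketch: top-k expert routing, the mechanism that lets an MoE model
# activate only a fraction of its parameters per token. Shapes and k are
# illustrative, not ZAYA1's configuration.
import numpy as np

def route_tokens(logits: np.ndarray, k: int = 2):
    """Pick the top-k experts per token and softmax-normalise their weights.

    logits: (num_tokens, num_experts) router scores.
    Returns (indices, weights), each of shape (num_tokens, k).
    """
    # Indices of the k largest router scores per token, descending.
    idx = np.argsort(logits, axis=-1)[:, ::-1][:, :k]
    top = np.take_along_axis(logits, idx, axis=-1)
    # Normalise only over the selected experts.
    exp = np.exp(top - top.max(axis=-1, keepdims=True))
    weights = exp / exp.sum(axis=-1, keepdims=True)
    return idx, weights

rng = np.random.default_rng(0)
idx, w = route_tokens(rng.normal(size=(4, 8)), k=2)
```

With 2 of 8 experts selected per token, only a quarter of the expert parameters participate in each forward pass, which is the same lever ZAYA1 uses at larger scale.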

The model was trained with a mix of the Muon and AdamW optimisers. To make Muon efficient on AMD hardware, Zyphra fused kernels and trimmed unnecessary memory traffic so the optimiser wouldn't dominate each iteration. Batch sizes were increased over time, a strategy that depends heavily on storage pipelines able to deliver tokens quickly enough.
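A hybrid optimiser setup usually starts with splitting parameters into groups. Muon-style updates are typically applied to matrix-shaped hidden weights, with AdamW handling embeddings, norms, and biases; whether ZAYA1 splits exactly this way is an assumption, so the heuristic below is only a sketch.

```python
# Hedged sketch: assigning parameters to either a Muon-style or an AdamW
# group. The rule (2-D non-embedding weights -> Muon, everything else ->
# AdamW) is a common convention, assumed here rather than confirmed.

def split_param_groups(named_ndims):
    """named_ndims: list of (parameter_name, ndim) pairs."""
    muon, adamw = [], []
    for name, ndim in named_ndims:
        # Matrix-shaped hidden weights go to Muon; vectors, biases, and
        # embedding tables go to AdamW.
        if ndim == 2 and "embed" not in name:
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw

muon, adamw = split_param_groups([
    ("layers.0.attn.wq", 2),
    ("layers.0.norm.weight", 1),
    ("embed.weight", 2),
])
```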

All of this leads to an AI model trained on AMD hardware that competes with larger peers such as Qwen3-4B, Gemma3-12B, Llama-3-8B, and OLMoE. One advantage of the MoE structure is that only a sliver of the model runs at once, which helps manage inference memory and reduces serving cost.

A bank, for example, could train a domain-specific model for investigations without needing convoluted parallelism early on. The MI300X’s memory headroom gives engineers space to iterate, while ZAYA1’s compressed attention cuts prefill time during evaluation.

Making ROCm behave with AMD GPUs

Zyphra didn’t hide the fact that moving a mature NVIDIA-based workflow onto ROCm took work. Instead of porting components blindly, the team spent time measuring how AMD hardware behaved and reshaping model dimensions, GEMM patterns, and microbatch sizes to suit MI300X’s preferred compute ranges.

InfinityFabric operates best when all eight GPUs in a node participate in collectives, and Pollara tends to reach peak throughput with larger messages, so Zyphra sized fusion buffers accordingly. Long-context training, from 4k up to 32k tokens, relied on ring attention for sharded sequences and tree attention during decoding to avoid bottlenecks.
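Sizing fusion buffers comes down to packing many small tensors into a few large messages before each collective fires. The greedy bucketing sketch below shows the idea; the 64 MiB threshold is an illustrative value, not Zyphra's actual setting.

```python
# Hedged sketch: greedily grouping tensors into fixed-size fusion buckets
# so each collective sends one large message instead of many small ones.
# The 64 MiB bucket size is an assumption for illustration.

def bucket_tensors(sizes_bytes, bucket_bytes=64 * 1024 * 1024):
    """Group tensor indices so each bucket's total stays under bucket_bytes."""
    buckets, current, current_size = [], [], 0
    for i, size in enumerate(sizes_bytes):
        if current and current_size + size > bucket_bytes:
            buckets.append(current)
            current, current_size = [], 0
        current.append(i)
        current_size += size
    if current:
        buckets.append(current)
    return buckets

MiB = 1024 * 1024
buckets = bucket_tensors([40 * MiB, 30 * MiB, 30 * MiB, 10 * MiB])
```

Larger buckets mean fewer, bigger messages, which matches the regime where a network card like Pollara reaches peak throughput.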

Storage considerations were equally practical. Smaller models hammer IOPS; larger ones need sustained bandwidth. Zyphra bundled dataset shards to reduce scattered reads and increased per-node page caches to speed checkpoint recovery, which is vital during long runs where rewinds are inevitable.

Keeping clusters on their feet

Training jobs that run for weeks rarely behave perfectly. Zyphra’s Aegis service monitors logs and system metrics, identifies failures such as NIC glitches or ECC blips, and takes straightforward corrective actions automatically. The team also increased RCCL timeouts to keep short network interruptions from killing entire jobs.
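A watchdog in that spirit can be reduced to matching failure signatures in logs against a table of corrective actions. The signatures and action names below are invented for illustration; Aegis's actual detection logic is not described in the report.

```python
# Hedged sketch of a log-scanning watchdog in the spirit of Aegis: map
# known failure signatures to corrective actions. All signatures and
# action names here are hypothetical placeholders.

FAILURE_ACTIONS = {
    "ECC error": "cordon_node",
    "NIC link down": "reset_nic",
    "collective timeout": "restart_job",
}

def diagnose(log_lines):
    """Return the corrective action for the first matching signature."""
    for line in log_lines:
        for signature, action in FAILURE_ACTIONS.items():
            if signature in line:
                return action
    return "none"

action = diagnose(["step 1200 ok", "RDMA NIC link down on port 1"])
```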

Checkpointing is distributed across all GPUs rather than forced through a single chokepoint. Zyphra reports more than ten-fold faster saves compared with naïve approaches, which directly improves uptime and cuts operator workload.
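The core of distributed checkpointing is that each rank writes only the slice of state it owns, so no single GPU funnels the whole save. The sketch below uses JSON files and round-robin key assignment purely for illustration; real systems write binary tensor shards.

```python
# Hedged sketch: sharded checkpoint save, where each rank persists only
# its own slice of the state. File format, naming, and the round-robin
# key assignment are illustrative assumptions.
import json
import os
import tempfile

def save_shard(state: dict, rank: int, world_size: int, out_dir: str):
    """Write the keys this rank owns to its own shard file."""
    keys = sorted(state)
    mine = {k: state[k] for i, k in enumerate(keys) if i % world_size == rank}
    path = os.path.join(out_dir, f"shard_{rank:05d}.json")
    with open(path, "w") as f:
        json.dump(mine, f)
    return path

# Simulate two ranks saving, then recover by merging all shards.
tmp = tempfile.mkdtemp()
state = {"w0": 1, "w1": 2, "w2": 3, "w3": 4}
paths = [save_shard(state, r, 2, tmp) for r in range(2)]
merged = {}
for p in paths:
    with open(p) as f:
        merged.update(json.load(f))
```

Because the shards are written in parallel and merged only on load, save time scales down with the number of ranks rather than being bound by one writer.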

What the ZAYA1 AMD training milestone means for AI procurement

The report draws a clean line between NVIDIA's ecosystem and AMD's equivalents: NVLink vs InfinityFabric, NCCL vs RCCL, cuBLASLt vs hipBLASLt, and so on. The authors argue the AMD stack is now mature enough for serious large-scale model development.

None of this suggests enterprises should tear out existing NVIDIA clusters. A more realistic path is to keep NVIDIA for production while using AMD for stages that benefit from the memory capacity of MI300X GPUs and ROCm’s openness. It spreads supplier risk and increases total training volume without major disruption.

This all leads us to a set of recommendations: treat model shape as adjustable, not fixed; design networks around the collective operations your training will actually use; build fault tolerance that protects GPU hours rather than merely logging failures; and modernise checkpointing so it no longer derails training rhythm.

It’s not a manifesto, just our practical takeaway from what Zyphra, AMD, and IBM learned by training a large MoE AI model on AMD GPUs. For organisations looking to expand AI capacity without relying solely on one vendor, it’s a potentially useful blueprint.

See also: Google commits to 1000x more AI infrastructure in next 4-5 years
