Huawei's new open source technique shrinks LLMs to make them run on less powerful, less expensive hardware

Huawei’s Computing Systems Lab in Zurich has introduced a new open-source quantization method for large language models (LLMs) aimed at reducing memory demands without sacrificing output quality.

The technique, called SINQ (Sinkhorn-Normalized Quantization), is designed to be fast, calibration-free, and easy to integrate into existing model workflows. The Huawei research team has released the code on GitHub and Hugging Face under a permissive, enterprise-friendly Apache 2.0 license, allowing organizations to use it, modify it, and deploy it commercially, all for free.

Across models of different sizes, SINQ cuts memory usage by 60–70%, depending on architecture and bit-width.

This enables models that would previously require >60 GB of memory to run on ~20 GB setups—a critical enabler for running large models on a single high-end GPU or even multi-GPU consumer-grade setups.
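As a rough back-of-the-envelope check (illustrative arithmetic, not figures from the paper), the reduction follows directly from the bit-width: a hypothetical 32-billion-parameter model stored in 16-bit floating point needs about 64 GB for its weights alone, while 4-bit weights plus a modest overhead for scale factors land near 20 GB.

```python
# Back-of-the-envelope memory estimate for a hypothetical 32B-parameter model.
# Illustrative arithmetic only; real footprints also include activations,
# KV cache, and framework overhead.
params = 32e9

fp16_gb = params * 2 / 1e9      # 2 bytes per parameter -> ~64 GB
int4_gb = params * 0.5 / 1e9    # 0.5 bytes per parameter -> ~16 GB
scales_gb = int4_gb * 0.10      # assume ~10% extra for scale-factor metadata

quantized_gb = int4_gb + scales_gb
print(f"FP16 weights:   {fp16_gb:.0f} GB")
print(f"4-bit + scales: {quantized_gb:.1f} GB")
print(f"Reduction:      {1 - quantized_gb / fp16_gb:.0%}")  # roughly 70%
```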

This makes it possible to run models that previously needed high-end enterprise GPUs on significantly more affordable hardware: a single Nvidia GeForce RTX 4090 (around $1,600) can stand in for an A100 80GB (roughly $19,000) or an H100, which can exceed $30,000.

For teams using cloud infrastructure, the savings are similarly tangible. A100-based instances often cost $3–4.50 per hour, while 24 GB GPUs like the RTX 4090 are available on many platforms for $1–1.50 per hour.

Over time, especially for extended inference workloads, this difference can add up to thousands of dollars in cost reductions, while also unlocking LLM deployment on smaller clusters, local workstations, or consumer-grade setups previously constrained by memory.

Tackling the Memory Challenge of LLMs

Running large models often requires compromises between performance and size.

In practice, neural networks use floating-point numbers to represent both weights and activations. A floating-point number can express a wide range of values (very small, very large, with fractional parts).

This flexibility is helpful because weights and activations can vary dramatically in scale during training and inference, and floating-point lets the model capture that range precisely. (For example, a weight could be 0.0023 or 123.45, and floating-point can represent both with decent precision.)

Quantization — a method that reduces the precision of model weights — offers a practical path to lower memory usage, but typically comes with trade-offs in model quality, especially at 4-bit precision and below.

When you convert those floating-point values into lower-precision formats (like 8-bit integers), you’re approximating them.

That means you store and compute with fewer bits, which is faster and more memory-efficient — but you risk losing fidelity (i.e. introducing small errors).

The trick is to do the conversion carefully so the model’s behavior stays nearly the same, even though internally it’s working with rougher approximations of those weights and activations.
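As a concrete illustration of that trade-off, here is a minimal sketch of plain round-to-nearest quantization to 8-bit integers (generic code, not SINQ itself), showing how a single shared scale factor introduces approximation error, especially for small weights sitting next to large outliers:

```python
import numpy as np

# Toy weight matrix with values at very different scales.
W = np.array([[0.0023, 123.45, -0.87],
              [4.20,   -56.7,   0.0001]], dtype=np.float32)

# Symmetric round-to-nearest quantization to signed 8-bit integers:
# one scale factor maps the largest magnitude onto 127.
scale = np.abs(W).max() / 127.0
W_q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)

# Dequantize and inspect the error the approximation introduced.
W_hat = W_q.astype(np.float32) * scale
print("max abs error:", np.abs(W - W_hat).max())
# The tiny weights (0.0023, 0.0001) collapse to zero under the shared scale:
# exactly the outlier-driven error that smarter scaling schemes try to avoid.
```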

SINQ addresses these pain points by introducing a plug-and-play solution that delivers strong performance even in low-precision settings—without requiring calibration data or inter-layer dependencies.

How SINQ Works

The SINQ approach introduces two main innovations:

Dual-Axis Scaling: Instead of using a single scale factor for quantizing a matrix, SINQ uses separate scaling vectors for rows and columns. This helps mitigate the effects of outliers and allows the quantization error to be distributed more flexibly across the matrix (see the sketch after this list).

Sinkhorn-Knopp-Style Normalization: A fast algorithm inspired by Sinkhorn iterations is used to normalize the standard deviations of rows and columns in a matrix. This helps minimize what the authors call “matrix imbalance,” a new proxy metric shown to be more effective than alternatives like kurtosis for improving quantization performance.
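
A simplified sketch of how these two ideas fit together is shown below. This is an illustrative approximation, assuming alternating row/column standard-deviation normalization and per-axis scale vectors; the authors' published algorithm and iteration details may differ.

```python
import numpy as np

def sinq_like_quantize(W, bits=4, iters=16, eps=1e-8):
    """Illustrative dual-axis (row/column) scaled quantization with a
    Sinkhorn-Knopp-style normalization loop. Not the official SINQ code."""
    W = W.astype(np.float64)
    row_scale = np.ones((W.shape[0], 1))
    col_scale = np.ones((1, W.shape[1]))

    # Alternately normalize row and column standard deviations so the
    # residual matrix becomes "balanced" (similar spread along both axes).
    M = W.copy()
    for _ in range(iters):
        r = M.std(axis=1, keepdims=True) + eps
        M /= r
        row_scale *= r
        c = M.std(axis=0, keepdims=True) + eps
        M /= c
        col_scale *= c

    # Quantize the balanced matrix with plain round-to-nearest.
    qmax = 2 ** (bits - 1) - 1
    step = np.abs(M).max() / qmax
    Q = np.clip(np.round(M / step), -qmax, qmax).astype(np.int8)

    # Dequantization reapplies the per-row and per-column scale vectors.
    W_hat = (Q * step) * row_scale * col_scale
    return Q, row_scale, col_scale, step, W_hat

W = np.random.randn(64, 64)
W[3, 7] = 25.0  # inject an outlier
_, _, _, _, W_hat = sinq_like_quantize(W)
print("relative error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```

The key property is that the per-row and per-column scales absorb outliers, so the remaining balanced matrix quantizes well even with a crude round-to-nearest step.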

The combination of these two features allows SINQ to outperform other calibration-free techniques such as Round-To-Nearest (RTN), HQQ, and Hadamard-based quantization across multiple benchmarks.

Performance and Compatibility

SINQ has been evaluated across a wide range of architectures and models, including the Qwen3 series, LLaMA, and DeepSeek.

On benchmarks like WikiText2 and C4, SINQ consistently reduces perplexity and flip rates compared to baseline methods, often approaching or matching the performance of calibrated solutions.

It also supports non-uniform quantization schemes such as NF4 and can be combined with calibration methods like AWQ, leading to the variant A-SINQ. In calibrated settings, A-SINQ further narrows the gap with full-precision models.

In terms of runtime efficiency, SINQ quantizes models roughly twice as fast as HQQ and over 30 times faster than AWQ. This makes it well-suited for both research and production environments where quantization time is a practical constraint.

Open Source and Easy to Use

Huawei has released SINQ as an open-source project under a permissive, enterprise-friendly Apache 2.0 license, with implementation instructions and reproducibility tools available in the GitHub repository.

The repository includes support for quantizing Hugging Face models with just a few lines of code, as well as tools for saving and reloading quantized weights. Default settings offer a balance between memory savings and accuracy, and users can customize parameters like bit-width, tiling strategy, and group size based on their needs.
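For readers who want a feel for that workflow, the sketch below shows what "a few lines of code" might look like. Only the Hugging Face Transformers loading calls are known APIs; the `sinq` import, `SinqConfig`, `quantize_model`, and their parameter names are hypothetical placeholders standing in for the repository's actual interface, and the bit-width, tiling, and group-size knobs simply mirror the options described above.

```python
# Hypothetical workflow sketch; the real entry points live in the SINQ repo.
# Only the Hugging Face loading calls below are known APIs; the `sinq` import,
# `SinqConfig`, and `quantize_model` names are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from sinq import SinqConfig, quantize_model  # assumed names; check the repo

model_id = "Qwen/Qwen3-8B"  # example model family evaluated in the paper
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Assumed knobs mirroring the parameters the article describes:
# bit-width, tiling strategy, and group size.
cfg = SinqConfig(bits=4, tiling="2D", group_size=64)
qmodel = quantize_model(model, cfg)  # calibration-free, so no dataset needed

qmodel.save_pretrained("qwen3-8b-sinq-4bit")  # assumed save/reload helper
```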

The authors also provide evaluation integration via the lm-eval library and plan to release pre-quantized models on the Hugging Face Hub in the near future.

Looking Ahead

With growing demand for running large models on consumer-grade hardware, quantization is becoming an essential tool. SINQ aims to lower the entry barrier for LLM deployment, enabling developers and researchers to efficiently shrink models without major trade-offs in quality or compatibility.

Further updates—including integration with Hugging Face Transformers and pre-quantized model releases—are planned, making this a project to watch in the quantization space.
