Diffusion models have demonstrated remarkable generative capabilities, yet deploying them efficiently remains a significant challenge in latency-sensitive environments where both speed and output quality must be carefully balanced.
Quantization offers a practical path to reducing computational cost, but existing post-training quantization (PTQ) methods typically rely on manual, architecture-specific rules or on runtime-dynamic heuristics that are fundamentally incompatible with modern static-graph compilers. This gap hinders automated, large-scale deployment.
We present SegQuant, a deployment-aware PTQ framework that derives its quantization strategy entirely from the model's static computation graph, requiring no hand-crafted rules and no dynamic profiling. At its core, SegLinear automatically identifies semantically distinct segments within weight matrices and quantizes them independently, capturing structural heterogeneity that uniform quantization overlooks. Complementing this, DualScale preserves the narrow yet semantically critical negative activations introduced by functions such as SiLU and GELU through a hardware-native dual-path computation that fully leverages standard Tensor Core operations without custom kernel overhead.
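To make the two mechanisms concrete, here is a minimal PyTorch sketch of segment-wise weight quantization in the spirit of SegLinear. It assumes symmetric int8 quantization and takes the segment boundaries as an explicit argument; the names (`segment_quantize`, `boundaries`) are illustrative only, since SegQuant derives the segmentation automatically from the static computation graph rather than accepting it as user input.

```python
import torch

def segment_quantize(weight: torch.Tensor, boundaries: list[int], n_bits: int = 8):
    """Quantize each row-segment of `weight` with its own symmetric scale.

    `boundaries` lists the interior split points along the output dimension,
    e.g. [d, 2*d] for a fused QKV projection with per-head width d. This is
    an illustrative helper, not the paper's actual API.
    """
    qmax = 2 ** (n_bits - 1) - 1                        # 127 for int8
    q_weight = torch.empty_like(weight, dtype=torch.int8)
    scales, start = [], 0
    for end in boundaries + [weight.shape[0]]:
        seg = weight[start:end]
        scale = seg.abs().max().clamp_min(1e-8) / qmax  # independent per-segment scale
        q_weight[start:end] = (seg / scale).round().clamp(-qmax - 1, qmax).to(torch.int8)
        scales.append(scale)
        start = end
    return q_weight, torch.stack(scales)

# A fused QKV weight: three semantically distinct 64-row segments.
w = torch.randn(192, 64)
q_w, seg_scales = segment_quantize(w, boundaries=[64, 128])
```

DualScale can be sketched in the same spirit: split the activation into its positive and negative parts, quantize each part with its own scale, run two ordinary GEMMs, and rescale-and-sum the results. The integer matrix multiplies are emulated in floating point below because PyTorch's `@` does not operate on int8 tensors; in an actual deployment each path maps to a standard int8 Tensor Core GEMM, which is what lets the dual-path design avoid custom kernels.

```python
def dualscale_linear(x: torch.Tensor, q_weight: torch.Tensor,
                     w_scale: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    """Dual-path int8 linear layer in the spirit of DualScale (illustrative).

    x:        float activations, shape (batch, in_features)
    q_weight: int8 weights, shape (out_features, in_features)
    w_scale:  weight scale(s), scalar or shape (out_features,)
    """
    qmax = 2 ** (n_bits - 1) - 1
    x_pos = x.clamp(min=0.0)                        # wide positive range
    x_neg = x.clamp(max=0.0)                        # narrow band, ~[-0.28, 0] after SiLU
    s_pos = x_pos.abs().max().clamp_min(1e-8) / qmax
    s_neg = x_neg.abs().max().clamp_min(1e-8) / qmax
    q_pos = (x_pos / s_pos).round().clamp(-qmax - 1, qmax)
    q_neg = (x_neg / s_neg).round().clamp(-qmax - 1, qmax)
    w = q_weight.float()                            # emulate the two int GEMMs in float
    return (q_pos @ w.T) * (s_pos * w_scale) + (q_neg @ w.T) * (s_neg * w_scale)
```

Giving the negative band its own scale matters because SiLU's minimum is only about -0.28; under a single shared scale sized for the much wider positive range, those values would collapse onto a handful of quantization levels.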
Together, these components enable SegQuant to achieve strong image quality under aggressive quantization settings, generalize seamlessly across both DiT-based and UNet-based architectures, and integrate naturally with mainstream deployment toolchains.
@misc{zhang2025segquantsemanticsawaregeneralizablequantization,
  title={SegQuant: A Semantics-Aware and Generalizable Quantization Framework for Diffusion Models},
  author={Jiaji Zhang and Ruichao Sun and Hailiang Zhao and Jiaju Wu and Peng Chen and Hao Li and Yuying Liu and Kingsum Chow and Gang Xiong and Shuiguang Deng},
  year={2025},
  eprint={2507.14811},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2507.14811},
}