Adaptive SM clocking for energy-efficient LLM serving
How CarbonForge can help lower GPU energy without touching your inference stack.
Jean-Maxime Larouche on Jun 1, 2026
LLM serving is moving from occasional jobs to persistent service infrastructure. At that scale, GPU power is no longer a background detail. It sets thermal limits, caps how much inference fits on a given site, and decides where new capacity can be deployed [15]. Integrated over time, that draw becomes energy, and energy is what drives operating cost [14].
There are several ways to cut the power and energy a serving deployment uses: model choice and compression [16], quantization [17], batching, scheduling, serving-engine changes [18][19], and infrastructure control [6][8][10]. Those are all valid levers, but they carry different rollout costs. For an operator already serving a model under latency, quality, and data-boundary constraints, the least disruptive place to start is below the inference stack: change the hardware policy while leaving the model, engine, request path, and user-facing behavior intact. One promising lever is dynamic clock control. The GPU does not need the same SM clock in every phase: prefill can be compute-bound and reward high clock, while decode often becomes memory-bandwidth-bound, where extra SM clock can turn into power rather than tokens.
This idea sits in a growing line of work on GPU-DVFS and energy-aware LLM serving [6][7][8][9][10]. In this post, we build the intuition from the hardware model, explain how default GPU clocking fits into the story, and look at what a serving-aware governor can recover that a general-purpose one cannot. We close with where this kind of control belongs as more serving signal becomes available.
TL;DR. One LLM serving workload moves between two hardware regimes. Prefill is compute-bound and rewards SM clock. Decode is often memory-bandwidth-bound, where extra clock turns into power, not tokens. The roofline says where that boundary sits; the super-linear power curve says why cutting clock below it is nearly free. NVIDIA's default autoboost reacts to headroom, not to which regime the GPU is in, so it leaves a measurable power gap on memory-bound decode. This post builds that argument from the hardware model, shows how a serving-aware governor closes the gap and how to measure it without fooling yourself, and ends with where the same control problem goes next.
One serving workload, two clock regimes
LLM inference has two phases with different hardware profiles. Prefill (processing the prompt) is dominated by large matrix multiplies and is usually compute-bound. In that regime, higher SM clock can reduce latency. Decode (generating tokens one at a time) is often memory-bound. The GPU spends much of the time waiting on memory, the math units are less fully occupied, and additional SM clock can become power rather than tokens.
The classical roofline gives the boundary precisely [1].
For a kernel with arithmetic intensity I (FLOPs per byte of memory traffic), achievable performance is
$$P(I) = \min\left(P_{\text{peak}},\; I \cdot B_{\text{peak}}\right)$$
where Ppeak is peak compute and Bpeak is peak memory bandwidth. Prefill operates above the ridge (I·Bpeak > Ppeak), so raising the clock raises Ppeak and translates directly into performance. Decode operates below the ridge: performance is pinned at I·Bpeak, which the SM clock does not move.
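To make the roofline argument concrete, here is a minimal Python sketch. The peak numbers are round, A100-class placeholders and the two intensities are illustrative; none of them are measurements from this post. The helper names (`attainable_flops`, `clock_sensitivity`) are hypothetical. The sketch assumes that scaling the SM clock scales peak compute but leaves memory bandwidth alone.

```python
# Placeholder hardware numbers (assumed, not measured).
PEAK_FLOPS = 300e12   # peak compute, FLOP/s
PEAK_BW = 2e12        # peak memory bandwidth, bytes/s


def attainable_flops(intensity, peak_flops=PEAK_FLOPS, peak_bw=PEAK_BW):
    """Roofline bound: min(peak compute, intensity * peak bandwidth)."""
    return min(peak_flops, intensity * peak_bw)


def ridge_intensity(peak_flops=PEAK_FLOPS, peak_bw=PEAK_BW):
    """Arithmetic intensity (FLOPs/byte) where the two roofs meet."""
    return peak_flops / peak_bw


def clock_sensitivity(intensity, clock_scale):
    """Relative performance after scaling the SM clock by clock_scale.

    Assumption: SM clock scales peak compute only, not memory bandwidth.
    """
    base = attainable_flops(intensity)
    scaled = attainable_flops(intensity, peak_flops=PEAK_FLOPS * clock_scale)
    return scaled / base


# Illustrative intensities: a decode-like kernel far below the ridge,
# a prefill-like kernel far above it.
decode_like = clock_sensitivity(intensity=2.0, clock_scale=0.8)
prefill_like = clock_sensitivity(intensity=400.0, clock_scale=0.8)
```

Under these assumptions, a 20% clock cut leaves the decode-like kernel's bound unchanged and lowers the prefill-like kernel's bound by the full 20%.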
The other half of the story is power. Dynamic GPU power scales roughly as
$$P_{\text{dyn}} \propto C\,V^{2} f$$
where C is the switched capacitance, V the supply voltage, and f the SM clock. Higher clocks generally need higher voltage, so over the operating range, Pdyn grows super-linearly in f. Cutting f by 20–25% on a memory-bound kernel can have little effect on throughput while reducing power by more than the clock reduction itself. That is the power side of the opportunity: when throughput has plateaued, lowering clock can improve tokens per watt without changing the model or inference engine.
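A short Python sketch of that scaling argument follows. It is an idealized toy model, not a fit to any measured GPU: the cubic exponent assumes voltage moves linearly with frequency, and `static_fraction` is a hypothetical knob for the share of power that does not scale with clock. Real voltage-frequency curves are flatter in places, so treat the outputs as intuition only.

```python
def relative_power(clock_scale, exponent=3.0, static_fraction=0.0):
    """Power relative to full clock under a toy scaling model.

    exponent=3.0 is the idealized case where V scales linearly with f,
    so dynamic power ~ V^2 * f ~ f^3. static_fraction is the share of
    full-clock power assumed not to depend on clock at all.
    """
    if not 0.0 < clock_scale <= 1.0:
        raise ValueError("clock_scale must be in (0, 1]")
    if not 0.0 <= static_fraction < 1.0:
        raise ValueError("static_fraction must be in [0, 1)")
    dynamic = (1.0 - static_fraction) * clock_scale ** exponent
    return static_fraction + dynamic


def efficiency_gain(clock_scale, throughput_scale=1.0, **model):
    """Relative change in tokens per watt for a given clock cut."""
    return throughput_scale / relative_power(clock_scale, **model) - 1.0


# Idealized: a 20% clock cut with no static power.
ideal = relative_power(0.8)
# Same cut when 30% of power is assumed to be clock-independent.
with_static = relative_power(0.8, static_fraction=0.3)
```

The static term is the reason measured savings are smaller than the pure cubic model suggests: only the clock-dependent share of power shrinks.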
Changing SM clock with feedback control
Put the two pieces together. The bottleneck changes inside one serving workload, and unused clock has a nonlinear power cost. Prefill can reward higher SM clock; decode can stop rewarding it after the memory-bandwidth knee. A clock policy that follows the bottleneck as it moves can capture that saving, returning to full clock the moment latency pressure rises so the saving is taken only where it does not put the SLA at risk.
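We call that clocking mechanism a governor: a reactive feedback controller that adjusts hardware state from measured signals. At step t, a governor observes hardware state zt, such as utilization, power, temperature, clock headroom, and sometimes memory traffic. It then chooses a clock ct from the available clock set 𝒞. A simple power-aware version of that policy is
$$c_t = \arg\min_{c \in \mathcal{C}} \hat{P}(z_t, c) \quad \text{subject to} \quad \hat{T}(z_t, c) \ge T_{\min}$$
Here P̂(zt, c) is the controller's estimate of power at clock c, T̂(zt, c) is the controller's estimate of throughput or latency safety at the current state, and Tmin is the operating constraint imposed by the SLA. Written as efficiency, the same greedy rule chooses the clock with the best current tokens per watt, while penalizing SLA risk:
$$c_t = \arg\max_{c \in \mathcal{C}} \left[ \frac{\hat{T}(z_t, c)}{\hat{P}(z_t, c)} - \lambda\, \hat{R}(z_t, c) \right]$$
where R̂(zt, c) is an estimate of SLA risk and λ ≥ 0 sets how heavily that risk is penalized.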
Energy-aware DVFS methods differ mostly in what zt contains and how T̂ is estimated. Published GPU-DVFS and LLM-serving control work makes the state estimate or policy more workload-aware [6][7][8][9][10]. One way to stay engine-independent is to keep the deployment boundary lower: infer the compute-bound versus memory-bound regime from hardware telemetry alone, then apply the lowest safe SM clock for the current regime. It is still reactive, but the reaction is serving-aware. That is the rule we measure below.
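The greedy rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the CarbonForge controller (which is proprietary): `toy_throughput` and `toy_power` are hypothetical stand-ins for the estimators, the state zt is folded into which estimator gets passed in, and the clock list and wattages are made-up values.

```python
def choose_clock(clocks_mhz, est_power_w, est_tokens_per_s, t_min):
    """Pick the lowest-power clock whose estimated throughput meets t_min.

    If no clock is estimated to be safe, fall back to the highest clock.
    """
    safe = [c for c in clocks_mhz if est_tokens_per_s(c) >= t_min]
    if not safe:
        return max(clocks_mhz)
    return min(safe, key=est_power_w)


def toy_throughput(knee_mhz, plateau=1000.0):
    """Hypothetical estimator: throughput rises with clock, then plateaus."""
    def estimate(clock_mhz):
        return plateau * min(1.0, clock_mhz / knee_mhz)
    return estimate


def toy_power(clock_mhz, idle_w=60.0, full_w=300.0, max_mhz=1410):
    """Hypothetical estimator: idle floor plus a cubic clock-dependent term."""
    return idle_w + (full_w - idle_w) * (clock_mhz / max_mhz) ** 3


CLOCKS = (810, 1005, 1200, 1410)  # illustrative clock set, MHz

# Memory-bound state: throughput plateaus early, so a low clock is safe.
decode_clock = choose_clock(CLOCKS, toy_power, toy_throughput(1000), t_min=990.0)
# Compute-bound state: throughput keeps scaling, so only full clock is safe.
prefill_clock = choose_clock(CLOCKS, toy_power, toy_throughput(1410), t_min=990.0)
```

The fallback to the highest clock is a design choice that mirrors the post's stated behavior: when the constraint cannot be shown to hold, give up the saving rather than risk the SLA.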
Is there a gap to recover, and can we trust the measurement?
The equations above are explanatory. The measured question is whether a serving-aware, hardware-only rule beats autoboost — and whether the measurement itself can be trusted. The first check is the frontier: if autoboost were already on the best throughput-power curve for this workload, there would be no equal-throughput, lower-power point to recover.
On a saturated decode workload (Qwen2.5-7B across 2×A100), we compared the default autoboost operating point with the lowest-power point on the same hardware. The default point is not on the throughput-vs-power frontier. In this regime, there is an operating point with both lower power and equal-or-higher throughput.
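Checking whether an operating point sits on the throughput-vs-power frontier is a dominance test. The Python sketch below shows one way to do it; the three points are made-up placeholders, not the measured values from these runs, and the names (`OperatingPoint`, `dominates`, `frontier`) are hypothetical.

```python
from typing import NamedTuple


class OperatingPoint(NamedTuple):
    label: str
    power_w: float
    tokens_per_s: float


def dominates(a, b):
    """True if a is no worse than b on both axes and better on at least one."""
    no_worse = a.power_w <= b.power_w and a.tokens_per_s >= b.tokens_per_s
    better = a.power_w < b.power_w or a.tokens_per_s > b.tokens_per_s
    return no_worse and better


def frontier(points):
    """Points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Placeholder data for illustration only.
points = [
    OperatingPoint("default", power_w=400.0, tokens_per_s=1000.0),
    OperatingPoint("locked_a", power_w=300.0, tokens_per_s=1010.0),
    OperatingPoint("locked_b", power_w=250.0, tokens_per_s=900.0),
]
on_frontier = [p.label for p in frontier(points)]
```

In this toy data set the default point is dominated by `locked_a` (lower power, higher throughput), which is the pattern the paragraph above describes.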
That frontier result identifies the opportunity. The workload benchmark then asks whether a controller preserves capacity on representative traffic under an MLPerf-style concurrency ladder [2]. Each configuration used a fresh inference engine for the measurement. That discipline matters: in our runs, the same baseline moved by ~2× depending on test ordering, a warmup confound that shared-engine benchmarks can hide.
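The fresh-engine discipline can be expressed as a small harness. The sketch below is a hypothetical outline, not the runner used for these results: `make_engine` is an assumed factory, and the engine interface (`warmup`, `measure`, `close`) is invented for illustration. `FakeEngine` exists only so the sketch runs without a GPU.

```python
def run_ladder(make_engine, configs, concurrency_levels):
    """Measure every config on its own freshly created engine.

    make_engine(config) must return a new engine each call, so no cache,
    allocator, or warmup state carries over between configurations.
    """
    results = {}
    for config in configs:
        engine = make_engine(config)
        try:
            engine.warmup()  # same warmup for every config
            results[config] = [engine.measure(n) for n in concurrency_levels]
        finally:
            engine.close()   # tear down even if a measurement fails
    return results


class FakeEngine:
    """Stand-in engine used only to exercise the harness."""
    created = 0

    def __init__(self, config):
        FakeEngine.created += 1
        self.config = config
        self.warm = False
        self.closed = False

    def warmup(self):
        self.warm = True

    def measure(self, concurrency):
        if not self.warm or self.closed:
            raise RuntimeError("engine not ready")
        return float(concurrency)  # placeholder metric

    def close(self):
        self.closed = True
```

With a real serving engine, the factory would start a new process per configuration; the harness only enforces the ordering of create, warm up, measure, and tear down.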
The pattern repeats across the workload classes we have tested under this methodology: capacity within ±1.5% of the default governor (that is, matched within noise), GPU power down 12–13%, tokens per watt up 13–14%. That is the measured band we report for general LLM serving on multi-GPU A100. We keep the broader per-workload breakdown, including prefill-heavy and decode-heavy traffic, under NDA.
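The power and tokens-per-watt figures are two views of the same ratio, and the conversion is simple arithmetic. A minimal Python sketch, with a hypothetical helper name:

```python
def tokens_per_watt_change(power_change, throughput_change=0.0):
    """Relative change in tokens per watt.

    Both arguments are fractional changes, e.g. -0.12 for a 12% reduction.
    """
    if power_change <= -1.0:
        raise ValueError("power_change must be greater than -1")
    return (1.0 + throughput_change) / (1.0 + power_change) - 1.0


# 12% less power at exactly matched throughput: 1 / 0.88 - 1, about 13.6%.
at_parity = tokens_per_watt_change(-0.12)
```

Because the ratio also moves with throughput, a small capacity difference inside the ±1.5% noise band shifts the efficiency figure by a similar amount.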
The claim is deliberately scoped: the public measurements here are multi-GPU A100 LLM-serving results. The −12 to −13% band is the representative parity-SLA result; the ~30% point shows the saturated decode frontier. CarbonForge targets a hardware regime (memory-bandwidth-bound GPU execution), not a model family. The parity-SLA saving holds from a 7B model up to a 32B model on the same 2×A100, with ≈9% lower power at matched capacity at 32B; the smaller figure is consistent with the roofline picture, since the larger model runs more compute-bound. New workload classes (including non-LLM GPU work such as rendering, where the same memory-bound regime may apply), CPU-compounded deployments, and unusual host controls start with a characterization run before activation. In production, a conservative lossless mode has been validated at −12% GPU power with zero throughput loss; standard mode is the parity-SLA setting once acceptance is complete. Both return to high clock when latency pressure rises or telemetry is ambiguous.
Where this kind of control belongs
Serving-aware clocking is not a single knob, and the reactive, hardware-only version in this post is the simplest version of it. A controller reads GPU telemetry, infers the regime, and sets the lowest safe clock, outside the inference stack and with no engine changes. That is the least intrusive layer, which is why it is the easiest to adopt. It is also not the only one.
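As an illustration of the read, infer, set loop, the Python sketch below classifies a telemetry sample and picks a clock. Everything in it is an assumption made for the example: the two activity fractions stand in for whatever counters a deployment exposes, the thresholds are arbitrary, and the rule itself is a simplification rather than the proprietary mechanism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Telemetry:
    tensor_active: float          # fraction of time math pipes are busy, 0..1
    dram_active: float            # fraction of time memory interface is busy, 0..1
    latency_pressure: bool = False


def infer_regime(sample, busy=0.7, idle=0.4):
    """Classify a sample; thresholds are illustrative, not tuned values."""
    if sample.dram_active >= busy and sample.tensor_active <= idle:
        return "memory-bound"
    if sample.tensor_active >= busy and sample.dram_active < busy:
        return "compute-bound"
    return "ambiguous"


def target_clock(sample, max_mhz, memory_bound_mhz):
    """Lower the clock only when the sample is clearly memory-bound.

    Latency pressure or an ambiguous reading returns to the high clock.
    """
    if sample.latency_pressure:
        return max_mhz
    if infer_regime(sample) == "memory-bound":
        return memory_bound_mhz
    return max_mhz
```

For example, `target_clock(Telemetry(0.2, 0.9), 1410, 1005)` selects the lower clock, while the same sample with `latency_pressure=True` stays at the high clock.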
When richer serving signals are available (phase mix, queue pressure, recent demand), a controller can reason about what is likely to happen next while keeping prompts and user content out of the loop [11][12]. That moves the problem from reactive control to short-horizon planning: choose a clock trajectory, not only a greedy one-step action. A simplified version looks like
$$\min_{c_t,\dots,c_{t+H-1}} \; \sum_{k=0}^{H-1} \left[ \hat{E}(z_{t+k}, c_{t+k}) + \lambda\, \hat{R}(z_{t+k}, c_{t+k}) + \mu\, \lvert c_{t+k} - c_{t+k-1} \rvert \right]$$
where the controller trades energy (Ê), latency risk (R̂, weighted by λ), and switching cost (weighted by μ) over a short horizon H. That is the shape of model-predictive control for inference [13]. Deeper still is the compile-and-serve path, where the model is lowered to executable kernels and the controller sees the most signal and holds the most control, at the cost of integration. Each layer trades adoption cost for headroom.
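A brute-force Python sketch shows the shape of that short-horizon trade-off. It is a toy under stated assumptions: the forecast is a list of phase labels, `toy_energy` and `toy_risk` are hypothetical cost functions, and exhaustive search is only practical for a few clocks and a short horizon. A real model-predictive controller would re-plan every step and apply only the first action.

```python
from itertools import product


def plan_clocks(clocks, forecast, prev_clock, energy, risk, lam=1.0, mu=0.0):
    """Return the clock trajectory with the lowest total cost.

    Cost per step = energy + lam * risk + mu * |clock change|.
    """
    best_cost, best_traj = float("inf"), None
    for traj in product(clocks, repeat=len(forecast)):
        cost, last = 0.0, prev_clock
        for state, clock in zip(forecast, traj):
            cost += energy(state, clock) + lam * risk(state, clock)
            cost += mu * abs(clock - last)
            last = clock
        if cost < best_cost:
            best_cost, best_traj = cost, traj
    return best_traj


def toy_energy(state, clock_mhz):
    """Hypothetical energy cost: grows with clock."""
    return clock_mhz / 1000.0


def toy_risk(state, clock_mhz):
    """Hypothetical SLA risk: a low clock during prefill is risky."""
    return 1.0 if state == "prefill" and clock_mhz < 1400 else 0.0


forecast = ["decode", "prefill", "decode"]
no_switch_cost = plan_clocks((1000, 1400), forecast, 1400,
                             toy_energy, toy_risk, lam=10.0, mu=0.0)
with_switch_cost = plan_clocks((1000, 1400), forecast, 1400,
                               toy_energy, toy_risk, lam=10.0, mu=0.002)
```

With no switching cost the plan drops the clock for both decode steps; with a switching penalty it holds the high clock throughout, because the brief saving no longer pays for the transitions.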
Where we land. Default GPU clocking is tuned for headroom, not for the regime a serving workload is in. On memory-bound decode, that leaves power on the table, and the saving is recoverable from hardware telemetry alone at matched throughput. The reactive hardware layer is the entry, not the destination. We build controllers across these layers at CarbonForge. If you run your own GPU fleet, apply to join our early-adopter program.
Methodology. Figures and numbers are from controlled runs on 2×NVIDIA A100 (tensor-parallel) with real vLLM serving [3] and NVML power telemetry [4]; the baseline is NVIDIA's production-default GPU clock governor unless stated. Capacity-under-SLA is measured with an MLPerf-style concurrency-ladder runner [2] using a fresh engine per configuration, which eliminates ordering bias. The control mechanism is proprietary. The detailed operating-envelope map, per-workload breakdowns, and reproduction methodology are available to qualified partners under NDA.
References
[1] Samuel Williams, Andrew Waterman, and David Patterson. Roofline: an insightful visual performance model for multicore architectures. Communications of the ACM, 2009. DOI 10.1145/1498765.1498785.
[2] MLCommons. MLPerf Inference Benchmark Suite, with the benchmark paper available as arXiv:1911.02549.
[3] vLLM Project. vLLM documentation.
[4] NVIDIA. NVML API Reference Guide; NVIDIA System Management Interface documentation.
[5] NVIDIA. Data Center GPU Manager documentation.
[6] Haoran Qiu et al. Power-aware Deep Learning Model Serving with µ-Serve. USENIX ATC 2024.
[7] Qunyou Liu, Darong Huang, Marina Zapater, and David Atienza. GreenLLM: SLO-Aware Dynamic Frequency Scaling for Energy-Efficient LLM Serving. arXiv:2508.16449, 2025.
[8] Andreas Kosmas Kakolyris et al. SLO-aware GPU Frequency Scaling for Energy Efficient LLM Inference Serving (throttLL'eM). arXiv:2408.05235, 2024.
[9] Jovan Stojkovic et al. DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency. HPCA 2025.
[10] Yuan Ma, Srinivasan Subramaniyan, and Xiaorui Wang. Power Capping of GPU Servers for Machine Learning Inference Optimization. ICPP 2025. DOI 10.1145/3754598.3754670.
[11] Amey Agrawal et al. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. OSDI 2024.
[12] Archit Patke et al. Hierarchical Autoscaling for Large Language Model Serving with Chiron. arXiv:2501.08090, 2025.
[13] James B. Rawlings, David Q. Mayne, and Moritz M. Diehl. Model Predictive Control: Theory, Computation, and Design. 2nd edition, Nob Hill Publishing, 2017.
[14] Arman Shehabi et al. 2024 United States Data Center Energy Usage Report. Lawrence Berkeley National Laboratory, 2024. DOI 10.71468/P1WC7Q.
[15] Geoff Blanford, Tom Wilson, John Bistline, and Nils Johnson. Powering Intelligence 2026: Updated Scenarios of U.S. Data Center Electricity Use and Power Strategies. Electric Power Research Institute, 2026.
[16] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv:1910.01108, 2019.
[17] Guangxuan Xiao et al. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. ICML 2023.
[18] Gyeong-In Yu et al. Orca: A Distributed Serving System for Transformer-Based Generative Models. OSDI 2022.
[19] Woosuk Kwon et al. Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023.