Your fleet has more throughput than your monitoring can see
We find and lock the operating point your stack can't reach
Runs alongside vLLM and TensorRT. Under your power envelope, with no rewrite of your stack.
Your stack can't see most of the optimization space
Every model update, every traffic shift, every GPU refresh moves your operating point. Your power envelope doesn't move.
Your monitoring polls GPU averages at coarse intervals. CarbonForge measures the same workload at sub-millisecond resolution. What your stack can tune is a fraction of what's there.
How the CarbonForge loop works
Power Telemetry
Sub-millisecond power and latency, with kernel-level attribution.
Optimization Engine
Searches for the operating point that captures what your monitoring misses.
Runtime Controller
Re-locks the operating point when models, traffic, or hardware change.
Re-locks continuously as models, traffic, and hardware change
Run the complete CarbonForge loop on your fleet. More tokens per GPU under the same power envelope.
Built at Mila
Leadership
Advisors
Compilers and GPU codegen
McGill, CIFAR AI Chair
Enterprise and infrastructure
Former Data Center CTO, Intel
Become an early adopter partner
A limited number of slots in 2026 for teams serving or operating inference at scale.