lneiman 6 hours ago
Author here. We were seeing high tail latency and low GPU utilization when serving SLMs via Triton.
I built a scrappy client-side router using Redis and Lua to track real-time GPU load. It boosted utilization by ~40% and improved latencies.
Happy to hear feedback on the implementation or thoughts on better ways to do this!
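
For anyone unfamiliar with the pattern, here is a rough sketch of least-loaded routing. It is not the implementation from the post. It assumes "GPU load" can be approximated by the number of in-flight requests per backend, and it uses an in-process dictionary plus a lock as a stand-in for Redis. In a Redis-backed version, the pick-and-increment step would presumably live in one Lua script so it stays atomic across clients. The class name, method names, and backend names (`gpu-0`, `gpu-1`) are all hypothetical.

```python
import threading
from contextlib import contextmanager


class LeastLoadedRouter:
    """Illustrative least-loaded router (hypothetical; not the author's code).

    The lock emulates the atomicity that a single Redis Lua script would
    give when several clients pick a backend at the same time.
    """

    def __init__(self, backends):
        # Assumption: load is approximated by in-flight request count.
        self._in_flight = {name: 0 for name in backends}
        self._lock = threading.Lock()

    def acquire(self):
        """Pick the backend with the fewest in-flight requests and reserve a slot."""
        with self._lock:
            # Ties go to the backend that was registered first.
            name = min(self._in_flight, key=self._in_flight.get)
            self._in_flight[name] += 1
            return name

    def release(self, name):
        """Give the slot back once the request finishes or fails."""
        with self._lock:
            self._in_flight[name] = max(0, self._in_flight[name] - 1)

    @contextmanager
    def route(self):
        """Context manager that always releases the slot, even on errors."""
        name = self.acquire()
        try:
            yield name
        finally:
            self.release(name)

    def snapshot(self):
        """Return a copy of the current in-flight counts."""
        with self._lock:
            return dict(self._in_flight)


if __name__ == "__main__":
    router = LeastLoadedRouter(["gpu-0", "gpu-1"])
    with router.route() as backend:
        # A real client would send the inference request to `backend` here.
        print("routing to", backend, router.snapshot())
```

Releasing in a `finally` block matters in a sketch like this: if a failed request never decrements its counter, that backend looks permanently busy and stops receiving traffic.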