Straggler-Aware Scheduler for Distributed Training
Optimizes collective completion time instead of per-flow fairness for gradient synchronization. Achieves a 45% reduction in iteration time under persistent stragglers, with <1% overhead under transient conditions.
The Problem
Distributed training iteration time = max flow time:
T_iter = max_{i ∈ [1,N]} T_flow_i
Traditional congestion control (DCTCP, TIMELY) optimizes per-flow fairness, which is counterproductive here: speeding up already-fast flows does nothing to reduce the max. Naive straggler detection causes oscillation under microbursts.
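A toy calculation makes the mismatch concrete (illustrative numbers, not from the evaluation):

```python
# Iteration time is gated by the slowest flow. Halving a fast flow's
# time leaves max() unchanged; only helping the straggler moves it.
flow_ms = [10, 11, 10, 30]            # worker 3 is the straggler
assert max(flow_ms) == 30             # T_iter
assert max([5, 11, 10, 30]) == 30     # sped up a fast flow: no change
assert max([10, 11, 10, 20]) == 20    # helped the straggler: T_iter drops
```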
What I Built
Persistence-Filtered Detection:
streak_i = streak_i + 1 if T_i > 1.2 × T_med else 0
confirmed_straggler = (streak_i ≥ 3)
- Only trigger after K=3 consecutive slow iterations
- Filters transient slowdowns (microbursts, CPU spikes)
- Median-based threshold adapts to background load
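A minimal sketch of the detection rule above, assuming per-iteration flow times are collected for all workers (function and variable names are illustrative):

```python
from statistics import median

THRESH = 1.2   # slow if > 1.2x the median flow time this iteration
K = 3          # consecutive slow iterations required to confirm

def update_streaks(times, streaks):
    """Apply the streak rule for one iteration and return the list of
    confirmed stragglers (streak >= K). `streaks` is mutated in place."""
    t_med = median(times)
    for i, t in enumerate(times):
        streaks[i] = streaks[i] + 1 if t > THRESH * t_med else 0
    return [i for i, s in enumerate(streaks) if s >= K]

# a single microburst on worker 2 never reaches K, so no reallocation
streaks = [0, 0, 0, 0]
update_streaks([10, 10, 50, 10], streaks)              # worker 2 streak -> 1
confirmed = update_streaks([10, 10, 10, 10], streaks)  # streak resets to 0
```

The median (rather than the mean) keeps the threshold stable even when one worker is wildly slow, since that worker barely shifts the median.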
Adaptive Rate Reallocation:
r'_i = r_i × (1 + 0.3) if straggler else r_i × (1 - 0.15 × |S|/(N-|S|))
- Asymmetric: help stragglers aggressively (α=0.3), penalize donors gently (β=0.15)
- α > 2β directly reduces T_iter
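A sketch of the reallocation step with these α/β values (the set-of-stragglers interface and names are illustrative):

```python
ALPHA, BETA = 0.30, 0.15   # straggler boost, per-donor penalty scale

def reallocate(rates, stragglers):
    """Boost each confirmed straggler's rate by ALPHA and shave every
    donor by BETA scaled by the straggler/donor ratio, so the donated
    capacity grows with the number of stragglers."""
    n, s = len(rates), len(stragglers)
    if s == 0 or s >= n:
        return list(rates)          # nothing to rebalance
    cut = BETA * s / (n - s)
    return [r * (1 + ALPHA) if i in stragglers else r * (1 - cut)
            for i, r in enumerate(rates)]

new = reallocate([1.0] * 4, {3})    # donors each give up 5% (0.15 * 1/3)
```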
Gradual Recovery:
r_{t+1} = 0.5 × r_t + 0.5 × r_base
- Exponential moving average prevents bounce-back oscillation
- Recovery slower than punishment
Cooldown:
- Wait 5 iterations after reallocation before next adjustment
- Prevents rapid oscillation
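The recovery EMA and cooldown gate can be sketched together; the 0.5 weight and 5-iteration window come from the values above, while the function names are illustrative:

```python
EMA_W = 0.5        # recovery weight toward the base rate
COOLDOWN = 5       # iterations to wait after a reallocation

def recover(rate, base_rate):
    """Pull an adjusted rate halfway back toward baseline each
    iteration; geometric decay avoids bounce-back oscillation."""
    return EMA_W * rate + (1 - EMA_W) * base_rate

def can_adjust(iteration, last_realloc):
    """Gate further reallocations until the cooldown has elapsed."""
    return iteration - last_realloc >= COOLDOWN

r = 1.3            # a boosted straggler rate, base rate 1.0
for _ in range(3):
    r = recover(r, 1.0)   # converges geometrically: 1.15, 1.075, 1.0375
```

Punishment (a single multiplicative cut) lands in one iteration, while recovery halves the gap each iteration, which is exactly the "recovery slower than punishment" asymmetry.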
Architecture
Custom Ring All-Reduce:
- Point-to-point send/recv for per-worker timing hooks
- Two phases of N-1 steps each: scatter-reduce, then all-gather
- Enables rate manipulation via delays
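A compact simulation of the two phases, assuming each worker's tensor is pre-split into N scalar chunks (real code replaces the list copies with point-to-point send/recv, which is where the per-worker timing hooks attach):

```python
def ring_allreduce(data):
    """Simulated ring all-reduce: `data[w][c]` is worker w's chunk c
    (scalars here for brevity). Phase 1 (scatter-reduce, N-1 steps):
    each worker sends one chunk to its right neighbor, which
    accumulates it. Phase 2 (all-gather, N-1 steps): the fully
    reduced chunks circulate around the ring."""
    n = len(data)
    data = [list(row) for row in data]
    for step in range(n - 1):                       # scatter-reduce
        sends = [(w, (w - step) % n, data[w][(w - step) % n])
                 for w in range(n)]
        for w, c, val in sends:
            data[(w + 1) % n][c] += val
    for step in range(n - 1):                       # all-gather
        sends = [(w, (w + 1 - step) % n, data[w][(w + 1 - step) % n])
                 for w in range(n)]
        for w, c, val in sends:
            data[(w + 1) % n][c] = val
    return data

# every worker ends with the elementwise sum across workers
out = ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```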
Network Model:
- Simulated via `delay_i = d_base / r_i` (blocking sleep)
- No real incast/drops (tests control logic, not transport)
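A minimal sketch of this delay model, assuming the per-worker rate is a unitless multiplier on a 10ms base (names are illustrative):

```python
import time

D_BASE = 0.010   # 10 ms base per-flow delay

def simulated_flow(rate):
    """Emulate a worker's gradient flow: completion time scales
    inversely with its assigned rate. A blocking sleep stands in for
    the network; no queues, incast, or packet drops are modeled."""
    delay = D_BASE / rate
    time.sleep(delay)
    return delay
```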
Four Profiles:
- Uniform: all 10ms
- Straggler: one 3× persistent
- Variable: N(d_base, σ²) per iteration
- Bursty: random 5× with 20% probability
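The four profiles can be generated as per-worker base delays; this sketch mirrors the parameters above, except σ for the variable profile, which the writeup leaves unspecified (0.2·d_base below is my placeholder):

```python
import random

D_BASE = 0.010   # 10 ms baseline per-flow delay

def profile_delays(profile, n, rng):
    """Per-worker base delays (seconds) for one iteration under the
    four evaluation profiles."""
    if profile == "uniform":
        return [D_BASE] * n
    if profile == "straggler":                 # worker 0 is 3x, always
        return [3 * D_BASE] + [D_BASE] * (n - 1)
    if profile == "variable":                  # Gaussian jitter; sigma assumed
        return [max(0.0, rng.gauss(D_BASE, 0.2 * D_BASE)) for _ in range(n)]
    if profile == "bursty":                    # 5x with 20% probability
        return [5 * D_BASE if rng.random() < 0.2 else D_BASE
                for _ in range(n)]
    raise ValueError(profile)
```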
Results
| Profile | Baseline | Ours | Improvement |
|-----------|----------|-------|-------------|
| Straggler | 1298ms | 717ms | 44.8% |
| Variable | 268ms | 268ms | <1% |
| Bursty | 747ms | 750ms | <1% |
Straggler Profile CDF:
- Median: 1314ms → 708ms (46% improvement)
- p99: 1365ms → 1001ms (27% improvement)
Ablation (K threshold on bursty):
- K=1: 878ms, 179 reallocations → +23% regression
- K=3: 750ms, 5 reallocations → <1% overhead
Implementation
Workload:
- Small CNN (201K params) on CIFAR-10
- Communication is ~70% of iteration time (realistic at large scale)
Statistical Testing:
- 5 runs × 250 iterations = 1,250 samples per config
- Welch's t-test: p < 0.0001 for straggler improvement
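Welch's t statistic and Welch-Satterthwaite degrees of freedom can be computed directly; in practice `scipy.stats.ttest_ind(a, b, equal_var=False)` does this and also returns the p-value, but a dependency-free sketch is:

```python
from math import sqrt

def welch_t(a, b):
    """Welch's unequal-variance t statistic and Welch-Satterthwaite
    degrees of freedom for two independent samples."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)   # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    se2 = va / na + vb / nb                          # squared std. error
    t = (ma - mb) / sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df
```

Unlike Student's t-test, Welch's makes no equal-variance assumption, which matters here because the straggler profile inflates the baseline's variance.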
Lessons Learned
Systems:
- Persistence filtering critical—instant reaction causes oscillation
- Asymmetric adjustment: help stragglers aggressively, penalize donors gently
- Cooldown prevents thrashing
- Median threshold adapts to background automatically
Distributed Training:
- Max-of-flows metric correct for barriers (not mean/sum)
- Per-flow fairness wrong for collective operations
Evaluation:
- Test failure modes (stragglers, bursts) not just normal case
- Ablation shows why K≥3 works
Limitations & Future Work
Current:
- Simulated network (sleep not congestion)
- Single machine (processes not nodes)
- Relative detection (all-slow undetected)
Future:
- Real transport (ECN/RTT signals → cwnd)
- Distributed cluster evaluation
- Distinguish compute vs. network stragglers
- Multi-job fairness