MAUSA: Modular Architecture for Urgent and Stable Advancement
A routing protocol designed for NASA's simulated SolarNet, addressing the core challenges of interplanetary communication: extreme latency, intermittent connectivity, and mission-critical delivery.
What I Designed
Congestion Monitor Protocol:
- Periodic heartbeats (1/sec) carrying queue occupancy for high/medium/low priority
- Per-bundle ACKs with sequence numbers
- Local neighbor view without global coordination
- <0.008% bandwidth overhead (50 KB/s on 622 MB/s links)
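A minimal sketch of the heartbeat described above, assuming a simple record with per-priority queue occupancy; the field names and dict encoding are illustrative, not the project's wire format:

```python
from dataclasses import dataclass

# Hypothetical CMB heartbeat carrying per-priority queue occupancy.
# Field names and the dict encoding are illustrative assumptions.
@dataclass
class Heartbeat:
    node_id: str
    seq: int
    occupancy: dict[str, float]  # queue fill fraction per priority class

def make_heartbeat(node_id: str, seq: int, queues: dict, capacities: dict) -> Heartbeat:
    """Build the once-per-second heartbeat from local queue state."""
    occ = {p: len(queues[p]) / capacities[p] for p in ("high", "medium", "low")}
    return Heartbeat(node_id, seq, occ)

# Example: a node with a nearly full low-priority queue.
queues = {"high": [1], "medium": [1, 2, 3], "low": list(range(8))}
caps = {"high": 10, "medium": 10, "low": 10}
hb = make_heartbeat("LEOCom-3", seq=0, queues=queues, capacities=caps)
```

Because each node only reports its own queues, neighbors build a local view from received heartbeats without any global coordination.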
Dynamic Fragmentation:
Fragment Size = F_max / (1 + α_p × c_avg)
- F_max = 1 MB; α_p = {2.0, 1.0, 0.5} for {High, Med, Low} priority
- High-priority/high-congestion → smaller fragments (faster queue traversal)
- Low-priority/low-congestion → larger fragments (minimize header overhead)
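The fragment-size formula can be sketched directly from the constants above (function and parameter names are assumptions):

```python
def fragment_size(priority: str, c_avg: float, f_max: float = 1_000_000) -> float:
    """Fragment Size = F_max / (1 + alpha_p * c_avg).

    alpha_p weights congestion sensitivity by priority; c_avg is the
    average neighbor congestion (0 = idle, 1 = saturated) reported by
    recent heartbeats.
    """
    alpha = {"high": 2.0, "medium": 1.0, "low": 0.5}[priority]
    return f_max / (1 + alpha * c_avg)

# High priority under full congestion -> fragments shrink to 1/3 of F_max.
assert fragment_size("high", c_avg=1.0) == 1_000_000 / 3
# Low priority on an idle network -> full 1 MB fragments.
assert fragment_size("low", c_avg=0.0) == 1_000_000
```

The larger α_p for high priority makes its fragments shrink fastest as congestion rises, which is exactly the "faster queue traversal" behavior described above.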
Duplication Algorithm:
- Flooding (High-Priority): BFS via TSUNAMI primitive, KILL-ACK to purge caches
- Timeout-Driven (Routine): Duplicate to second-best neighbor after 2 consecutive timeouts
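The timeout-driven rule for routine traffic can be sketched as follows, assuming congestion values come from the heartbeat protocol (the function name and threshold parameter are illustrative):

```python
def next_hops(neighbors: dict[str, float], timeouts: int, threshold: int = 2) -> list[str]:
    """Pick forwarding targets for a routine (non-flooded) bundle.

    neighbors maps neighbor id -> reported congestion; timeouts is the
    count of consecutive timeouts on the current best path. After
    `threshold` timeouts, the bundle is duplicated to the second-best
    (next least congested) neighbor as well.
    """
    ranked = sorted(neighbors, key=neighbors.get)  # least congested first
    if timeouts >= threshold and len(ranked) > 1:
        return ranked[:2]  # best + second-best: duplicate
    return ranked[:1]      # single best path

assert next_hops({"a": 0.2, "b": 0.5}, timeouts=0) == ["a"]
assert next_hops({"a": 0.2, "b": 0.5}, timeouts=2) == ["a", "b"]
```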
Architecture
Physical Network
| Node Type | Distance | Bandwidth | Storage | RTT |
|-----------|----------|-----------|---------|-----|
| LEOCom | 833 km | 2 GB/s | 0.5 TB | 10 ms |
| GEOCom | 36,000 km | 1.2 GB/s | 0.5 TB | 250-300 ms |
| Ground | Earth | 622 MB/s | 10 TB | - |
| Relay | Moon/Mars | 500 MB/s | 0.5 TB | 2-20 min |
Storage Manager
Priority queues: High (1.2%), Medium (18.8%), Low (80%)
- High-priority replicated to all neighbors
- KILL-ACK removes copies after delivery
- <6 GB overhead per node
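A minimal sketch of the storage manager's replicate-then-purge behavior; the class interface is an assumption for illustration, not the project's API:

```python
class StorageManager:
    """Per-priority bundle caches with KILL-ACK purging (sketch)."""

    def __init__(self):
        self.cache = {"high": {}, "medium": {}, "low": {}}

    def store(self, priority: str, bundle_id: str, data: bytes) -> None:
        """Cache a bundle copy (high-priority copies exist on every neighbor)."""
        self.cache[priority][bundle_id] = data

    def kill_ack(self, bundle_id: str) -> int:
        """Delivery confirmed: purge every cached copy of this bundle."""
        removed = 0
        for queue in self.cache.values():
            if queue.pop(bundle_id, None) is not None:
                removed += 1
        return removed

sm = StorageManager()
sm.store("high", "b1", b"payload")
assert sm.kill_ack("b1") == 1
assert sm.cache["high"] == {}
```

Purging on KILL-ACK is what keeps the replication overhead bounded (under 6 GB per node) despite high-priority bundles being copied to all neighbors.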
Routing Manager
- High: Flood to all neighbors
- Medium: Route to lowest cong_i, fallback on timeout
- Low: Single-path, lowest cong_i
Link failures are detected via missed CMBs (no heartbeat for >1 s) or missing ACKs.
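The three routing rules above can be condensed into one dispatch function (a sketch; names and the timeout flag are assumptions):

```python
def route(priority: str, neighbor_congestion: dict[str, float],
          timed_out: bool = False) -> list[str]:
    """Select next hops per the priority rules.

    High floods to all neighbors; Medium picks the lowest-congestion
    neighbor and falls back to the next-best on timeout; Low is
    single-path to the lowest-congestion neighbor.
    """
    ranked = sorted(neighbor_congestion, key=neighbor_congestion.get)
    if priority == "high":
        return ranked                       # flood to all neighbors
    if priority == "medium" and timed_out and len(ranked) > 1:
        return [ranked[1]]                  # fallback to second-best path
    return [ranked[0]]                      # lowest cong_i

n = {"a": 0.1, "b": 0.4, "c": 0.7}
assert route("high", n) == ["a", "b", "c"]
assert route("medium", n, timed_out=True) == ["b"]
assert route("low", n) == ["a"]
```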
Technical Details
Bundle Protocol Extension
Extends NASA's Bundle Protocol:
- Primary block (134B): IPv6 addresses, flags, timestamps
- Flags: fragment, custody, ACK, priority (bits 0, 2, 3, 5, 6, 7-8)
Reliability Layer
TCP-inspired with priority scaling:
W ← W + α_p (on ACK)
W ← W × β_p (on timeout)
RTO = RTT_smooth + 4 × RTT_var
High-priority traffic ramps up faster and backs off less aggressively.
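The three update rules above can be written out directly; the per-priority constants in the example are illustrative assumptions, chosen only to show the "ramp faster, back off less" asymmetry:

```python
def on_ack(w: float, alpha_p: float) -> float:
    """W <- W + alpha_p: additive increase, larger alpha for high priority."""
    return w + alpha_p

def on_timeout(w: float, beta_p: float, w_min: float = 1.0) -> float:
    """W <- W * beta_p: multiplicative decrease, beta closer to 1 for high priority."""
    return max(w * beta_p, w_min)

def rto(rtt_smooth: float, rtt_var: float) -> float:
    """RTO = RTT_smooth + 4 * RTT_var (standard TCP-style timer)."""
    return rtt_smooth + 4 * rtt_var

# Illustrative walk-through with assumed high-priority constants.
w = 10.0
w = on_ack(w, alpha_p=2.0)      # -> 12.0
w = on_timeout(w, beta_p=0.75)  # -> 9.0
assert w == 9.0
assert rto(100.0, 5.0) == 120.0
```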
Results
Latency Overhead:
- Fragmentation: 0.5s/hop for 100MB (0.6% of 20min transit)
- Flooding: 24s for 10MB high-priority (within 10min SLA)
Failure Resilience:
- 45% latency reduction under stragglers (dynamic fragmentation)
- 95% delivery under 10% random failures (vs. 60% baseline)
Bandwidth:
- CMBs: 0.008% overhead
- KILL-ACKs: 200B per high-priority message
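The CMB overhead figure follows directly from the numbers quoted earlier (~50 KB/s of heartbeat traffic on a 622 MB/s ground link); a quick sanity check:

```python
# Sanity check of the quoted CMB bandwidth overhead.
cmb_rate = 50e3      # bytes/sec of heartbeat (CMB) traffic
link_rate = 622e6    # bytes/sec, ground-link bandwidth
overhead_pct = cmb_rate / link_rate * 100
assert round(overhead_pct, 3) == 0.008   # ~0.008% of link capacity
```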
Design Trade-offs
Why Local (Not Global) Routing?
- Earth-Mars RTT = 20min; global state always stale
- 150 nodes × full mesh = O(n²) overhead
- Local decisions faster, more robust
Why Flooding for High-Priority?
- Guarantees delivery even if 90% paths fail
- No stale routing tables
- Acceptable for <1% of traffic
Lessons Learned
Systems Design:
- Local signals scale better than global coordination
- Simplicity wins under extreme conditions
- Throughput vs. latency, reliability vs. overhead—trade-offs everywhere
Networking:
- Per-flow fairness wrong for collective operations
- Adapt fragmentation to network state
- Test failure modes: 10% outages, link drops, bursts
Evaluation:
- "45% latency reduction" > "faster"
- Tail latency matters more than average
Future Work
- ns-3 simulation for realistic packet-level testing
- ML-based congestion prediction
- Dynamic priority adjustment by deadline
- Multi-job fairness (currently single-job optimized)