
Interplanetary Delay-Tolerant Network Protocol (MAUSA)

Routing protocol design for deep-space communication with extreme latency and intermittent connectivity.

2024-05-08
Systems Design, Networking, Protocol Design

MAUSA: Modular Architecture for Urgent and Stable Advancement

Routing protocol design for NASA's simulated SolarNet, addressing three interplanetary communication challenges: extreme latency, intermittent connectivity, and mission-critical delivery.

What I Designed

Congestion Monitor Protocol:

  • Periodic heartbeats (CMBs), one per second, carrying queue occupancy for the high, medium, and low priority queues
  • Per-bundle ACKs with sequence numbers
  • Local neighbor view without global coordination
  • ≈0.008% bandwidth overhead (50 KB/s on 622 MB/s links)
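A minimal Python sketch of the local neighbor view the congestion monitor builds from heartbeats. The message fields (`sender`, `seq`, `occupancy`), the neighbor name `GEOCom-1`, and the choice to summarize a neighbor's congestion as its mean queue occupancy are all assumptions for illustration, not the project's wire format. The last helper reproduces the overhead figure: 50 KB/s over 622 MB/s is about 0.008%.

```python
from dataclasses import dataclass, field


@dataclass
class Heartbeat:
    """One congestion monitor heartbeat (hypothetical field layout)."""
    sender: str
    seq: int
    occupancy: dict[str, float]  # queue occupancy in [0, 1] per priority class


@dataclass
class NeighborView:
    """Per-node table built only from direct neighbors' heartbeats."""
    latest: dict[str, Heartbeat] = field(default_factory=dict)

    def on_heartbeat(self, hb: Heartbeat) -> None:
        # Keep only the newest heartbeat per neighbor; ignore reordered ones.
        prev = self.latest.get(hb.sender)
        if prev is None or hb.seq > prev.seq:
            self.latest[hb.sender] = hb

    def congestion(self, neighbor: str) -> float:
        # Assumed summary: mean occupancy across the three priority queues.
        occ = self.latest[neighbor].occupancy
        return sum(occ.values()) / len(occ)


def overhead_percent(beacon_rate: float, link_rate: float) -> float:
    """Heartbeat traffic as a percentage of link capacity (same units)."""
    return 100.0 * beacon_rate / link_rate


view = NeighborView()
view.on_heartbeat(Heartbeat("GEOCom-1", 7, {"high": 0.1, "med": 0.4, "low": 0.7}))
view.on_heartbeat(Heartbeat("GEOCom-1", 6, {"high": 0.9, "med": 0.9, "low": 0.9}))  # older seq, ignored
print(round(view.congestion("GEOCom-1"), 3))  # 0.4
print(round(overhead_percent(50e3, 622e6), 4))  # 0.008
```

No global coordination is involved: each node's view is just the latest heartbeat from each neighbor it hears directly.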

Dynamic Fragmentation:

Fragment Size = F_max / (1 + α_p × c_avg)
  • F_max = 1 MB, α_p = {2.0, 1.0, 0.5} for {High, Med, Low} priority
  • High-priority/high-congestion → smaller fragments (faster queue traversal)
  • Low-priority/low-congestion → larger fragments (minimize header overhead)
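The fragmentation rule can be sketched directly from the formula. Two assumptions are not stated in the text: 1 MB is taken as 10^6 bytes, and c_avg is taken as a normalized congestion level in [0, 1].

```python
F_MAX = 1_000_000  # 1 MB, assumed to mean 10**6 bytes
ALPHA = {"high": 2.0, "med": 1.0, "low": 0.5}  # alpha_p per priority


def fragment_size(priority: str, c_avg: float) -> int:
    """Fragment Size = F_max / (1 + alpha_p * c_avg), truncated to whole bytes.

    c_avg is assumed to be a normalized congestion level in [0, 1].
    """
    if not 0.0 <= c_avg <= 1.0:
        raise ValueError("c_avg must be in [0, 1]")
    return int(F_MAX / (1.0 + ALPHA[priority] * c_avg))


for prio in ("high", "med", "low"):
    print(prio, fragment_size(prio, 0.0), fragment_size(prio, 1.0))
# high 1000000 333333
# med 1000000 500000
# low 1000000 666666
```

With no congestion every class uses the full F_max; under full congestion high-priority fragments shrink to a third of F_max while low-priority fragments only shrink to two thirds.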

Duplication Algorithm:

  • Flooding (high priority): breadth-first flood via the TSUNAMI primitive; KILL-ACKs purge cached copies after delivery
  • Timeout-driven (routine): after 2 consecutive timeouts, duplicate the bundle to the second-best neighbor
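The flooding path can be illustrated with a plain breadth-first search simulated over a whole topology. The TSUNAMI primitive's interface is not described here, so this sketch does not model it; the topology, node names, and helper names (`flood`, `kill_ack`) are hypothetical. Each node forwards a bundle at most once, and a KILL-ACK flooded back from the destination purges every cached copy.

```python
from collections import deque

Graph = dict[str, list[str]]


def flood(graph: Graph, src: str, bundle_id: str, caches: dict[str, set[str]]) -> dict[str, int]:
    """BFS flood: each node caches the bundle and forwards it to all neighbors once.

    Returns the hop count at which each node first received the bundle.
    """
    hops = {src: 0}  # doubles as every node's "already seen" record
    queue = deque([src])
    while queue:
        node = queue.popleft()
        caches.setdefault(node, set()).add(bundle_id)
        for nbr in graph.get(node, []):
            if nbr not in hops:  # duplicate suppression
                hops[nbr] = hops[node] + 1
                queue.append(nbr)
    return hops


def kill_ack(graph: Graph, dst: str, bundle_id: str, caches: dict[str, set[str]]) -> int:
    """Flood a KILL-ACK from the destination; returns how many copies were purged."""
    purged = 0
    seen = {dst}
    queue = deque([dst])
    while queue:
        node = queue.popleft()
        if bundle_id in caches.get(node, set()):
            caches[node].discard(bundle_id)
            purged += 1
        for nbr in graph.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return purged


net = {
    "ground": ["leo", "geo"],
    "leo": ["ground", "geo"],
    "geo": ["ground", "leo", "relay"],
    "relay": ["geo", "mars"],
    "mars": ["relay"],
}
caches: dict[str, set[str]] = {}
print(flood(net, "ground", "b1", caches))  # {'ground': 0, 'leo': 1, 'geo': 1, 'relay': 2, 'mars': 3}
print(kill_ack(net, "mars", "b1", caches))  # 5
```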

Architecture

Physical Network

| Node Type | Distance | Bandwidth | Storage | RTT |
|-----------|----------|-----------|---------|-----|
| LEOCom | 833 km | 2 GB/s | 0.5 TB | 10 ms |
| GEOCom | 36,000 km | 1.2 GB/s | 0.5 TB | 250-300 ms |
| Ground | Earth | 622 MB/s | 10 TB | - |
| Relay | Moon/Mars | 500 MB/s | 0.5 TB | 2-20 min |

Storage Manager

Priority queues (share of node storage): High (1.2%), Medium (18.8%), Low (80%)

  • High-priority replicated to all neighbors
  • KILL-ACK removes copies after delivery
  • <6 GB overhead per node
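A sketch of the storage split, assuming the percentages are shares of each node's storage. Shares are kept as integer basis points to avoid float rounding. Under that assumption the high-priority share of a 0.5 TB node is exactly 6 GB, which lines up with the per-node bound above. The class name and the strict no-borrowing admission rule are hypothetical.

```python
SHARE_BP = {"high": 120, "med": 1880, "low": 8000}  # basis points: 1.2%, 18.8%, 80%


class StorageManager:
    """Fixed per-priority storage partitions (hypothetical admission policy)."""

    def __init__(self, capacity_bytes: int) -> None:
        self.limit = {p: capacity_bytes * bp // 10_000 for p, bp in SHARE_BP.items()}
        self.used = {p: 0 for p in SHARE_BP}

    def admit(self, priority: str, size: int) -> bool:
        """Accept a bundle only if its own priority partition has room."""
        if self.used[priority] + size > self.limit[priority]:
            return False
        self.used[priority] += size
        return True

    def release(self, priority: str, size: int) -> None:
        """Free space, e.g. when a KILL-ACK purges a delivered copy."""
        self.used[priority] = max(0, self.used[priority] - size)


node = StorageManager(500 * 10**9)  # 0.5 TB node
print(node.limit["high"])  # 6000000000
```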

Routing Manager

  • High: flood to all neighbors
  • Medium: route to the neighbor with the lowest congestion cong_i; fall back to another neighbor on timeout
  • Low: single path via the neighbor with the lowest cong_i

Link failures are detected from missed CMBs (none for more than 1 s) or missed ACKs.
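The per-priority routing rules and the link-failure check can be sketched together. Assumptions: `cong` maps each neighbor to its latest cong_i, a neighbor counts as failed when its last CMB is more than 1 s old, and the "duplicate to the second-best neighbor after 2 consecutive timeouts" rule is applied to medium priority only, since low priority is single-path. `next_hops`, `CMB_TIMEOUT`, and `DUP_AFTER` are hypothetical names.

```python
CMB_TIMEOUT = 1.0  # seconds without a CMB before a link is presumed down
DUP_AFTER = 2      # consecutive timeouts before duplicating to the second-best neighbor


def live_neighbors(last_cmb: dict[str, float], now: float) -> list[str]:
    """Neighbors whose most recent CMB arrived within CMB_TIMEOUT."""
    return [n for n, t in last_cmb.items() if now - t <= CMB_TIMEOUT]


def next_hops(priority: str, cong: dict[str, float], last_cmb: dict[str, float],
              now: float, timeouts: int = 0) -> list[str]:
    """Choose forwarding targets for one bundle from local state only."""
    live = live_neighbors(last_cmb, now)
    if priority == "high":
        return live  # flood to every reachable neighbor
    ranked = sorted(live, key=lambda n: cong[n])  # lowest cong_i first
    if not ranked:
        return []  # no live link: keep the bundle in storage
    if priority == "med" and timeouts >= DUP_AFTER and len(ranked) > 1:
        return ranked[:2]  # keep the best path, add a copy on the second-best
    return ranked[:1]  # low priority, or medium before repeated timeouts


cong = {"leo": 0.7, "geo": 0.2, "relay": 0.5}
last = {"leo": 9.8, "geo": 9.5, "relay": 8.0}  # relay's last CMB is 2 s old at t=10
print(next_hops("high", cong, last, now=10.0))             # ['leo', 'geo']
print(next_hops("low", cong, last, now=10.0))              # ['geo']
print(next_hops("med", cong, last, now=10.0, timeouts=2))  # ['geo', 'leo']
```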

Technical Details

Bundle Protocol Extension

Extends NASA's Bundle Protocol:

  • Primary block (134B): IPv6 addresses, flags, timestamps
  • Flags: fragment, custody, ACK, priority (bits 0, 2, 3, 5, 6, 7-8)
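The bit positions listed line up with the bundle processing control flags of RFC 5050 (Bundle Protocol v6): bit 0 marks a fragment, bit 2 forbids fragmentation, bit 3 requests custody transfer, bit 5 requests application acknowledgement, and bits 7-8 carry the class of service (00 bulk, 01 normal, 10 expedited). The sketch below assumes MAUSA keeps those positions and maps Low/Medium/High onto bulk/normal/expedited. Bit 6 is reserved in RFC 5050 and is left untouched, since its MAUSA use is not specified here. The helper names are hypothetical.

```python
FRAGMENT = 1 << 0     # bundle is a fragment
NO_FRAGMENT = 1 << 2  # bundle must not be fragmented
CUSTODY = 1 << 3      # custody transfer requested
APP_ACK = 1 << 5      # application acknowledgement requested
PRIORITY_SHIFT = 7
PRIORITY_MASK = 0b11 << PRIORITY_SHIFT
PRIORITY_CODE = {"low": 0b00, "med": 0b01, "high": 0b10}  # bulk, normal, expedited (assumed mapping)


def encode_flags(priority: str, fragment: bool = False, custody: bool = False,
                 ack: bool = False, no_fragment: bool = False) -> int:
    """Pack priority and boolean flags into one integer (bit 0 = least significant)."""
    flags = PRIORITY_CODE[priority] << PRIORITY_SHIFT
    if fragment:
        flags |= FRAGMENT
    if no_fragment:
        flags |= NO_FRAGMENT
    if custody:
        flags |= CUSTODY
    if ack:
        flags |= APP_ACK
    return flags


def decode_priority(flags: int) -> str:
    """Extract the priority name from bits 7-8."""
    code = (flags & PRIORITY_MASK) >> PRIORITY_SHIFT
    for name, value in PRIORITY_CODE.items():
        if value == code:
            return name
    raise ValueError(f"reserved class-of-service value {code:#04b}")


f = encode_flags("high", fragment=True, custody=True, ack=True)
print(bin(f))              # 0b100101001
print(decode_priority(f))  # high
```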

Reliability Layer

TCP-inspired with priority scaling:

W ← W + α_p    (on ACK)
W ← W × β_p    (on timeout)
RTO = RTT_smooth + 4 × RTT_var

High-priority traffic ramps up faster and backs off less.
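A sketch of the reliability layer's two pieces. It rests on several assumptions. α_p is reused from the fragmentation rule, and the β_p values are hypothetical, chosen only so that higher priority backs off less. The text does not say how RTT_smooth and RTT_var are updated, so the sketch borrows TCP's RFC 6298 smoothing gains (1/8 and 1/4).

```python
ALPHA = {"high": 2.0, "med": 1.0, "low": 0.5}   # assumed: same alpha_p as fragmentation
BETA = {"high": 0.75, "med": 0.5, "low": 0.25}  # hypothetical: higher value = gentler back-off


class PriorityWindow:
    """Additive increase / multiplicative decrease scaled by priority."""

    def __init__(self, priority: str, w: float = 1.0) -> None:
        self.p = priority
        self.w = w

    def on_ack(self) -> None:
        self.w += ALPHA[self.p]  # W <- W + alpha_p

    def on_timeout(self) -> None:
        self.w = max(1.0, self.w * BETA[self.p])  # W <- W * beta_p, floored at 1


class RtoEstimator:
    """RTO = RTT_smooth + 4 * RTT_var, with RFC 6298-style smoothing."""

    def __init__(self) -> None:
        self.srtt = None
        self.rttvar = None

    def sample(self, rtt: float) -> float:
        if self.srtt is None:  # first measurement
            self.srtt, self.rttvar = rtt, rtt / 2
        else:  # update the variance before the mean, as RFC 6298 does
            self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - rtt)
            self.srtt = 0.875 * self.srtt + 0.125 * rtt
        return self.srtt + 4 * self.rttvar


w = PriorityWindow("high")
w.on_ack()
w.on_ack()
print(w.w)  # 5.0
w.on_timeout()
print(w.w)  # 3.75
est = RtoEstimator()
print(est.sample(10.0))  # 30.0
print(est.sample(10.0))  # 25.0
```

Because RTT_var shrinks as samples agree, the RTO tightens toward the smoothed RTT on stable links and widens again when delay becomes erratic.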

Results

Latency Overhead:

  • Fragmentation: 0.5s/hop for 100MB (0.6% of 20min transit)
  • Flooding: 24s for 10MB high-priority (within 10min SLA)

Failure Resilience:

  • 45% latency reduction under stragglers (dynamic fragmentation)
  • 95% delivery under 10% random failures (vs. 60% baseline)

Bandwidth:

  • CMBs: ≈0.008% overhead
  • KILL-ACKs: 200B per high-priority message

Design Trade-offs

Why Local (Not Global) Routing?

  • With an Earth-Mars RTT of ~20 min, any global state is stale by the time it arrives
  • Full-mesh state exchange across 150 nodes costs O(n²) messages
  • Local decisions are faster and more robust
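The O(n²) argument can be made concrete with a quick count. The sketch assumes global coordination means one state update per ordered node pair per round, and uses a hypothetical average of 4 neighbors per node for local heartbeats.

```python
def global_updates(n: int) -> int:
    """Directed state updates per round if every node informs every other node."""
    return n * (n - 1)


def local_updates(n: int, degree: int) -> int:
    """Heartbeats per round if each node only informs its direct neighbors."""
    return n * degree


print(global_updates(150))    # 22350
print(local_updates(150, 4))  # 600
```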

Why Flooding for High-Priority?

  • Delivers as long as any single path survives, even with 90% of paths failed
  • No dependence on stale routing tables
  • Overhead is acceptable because high-priority traffic is <1% of the total

Lessons Learned

Systems Design:

  • Local signals scale better than global coordination
  • Simplicity wins under extreme conditions
  • Throughput vs. latency, reliability vs. overhead: every design choice is a trade-off

Networking:

  • Per-flow fairness is the wrong goal for collective operations
  • Adapt fragmentation to network state
  • Test failure modes: 10% outages, link drops, bursts

Evaluation:

  • Quantify claims: "45% latency reduction" beats "faster"
  • Tail latency matters more than average

Future Work

  • ns-3 simulation for realistic packet-level testing
  • ML-based congestion prediction
  • Dynamic priority adjustment by deadline
  • Multi-job fairness (currently single-job optimized)