Problem
Network workflows that place multiple demands repeatedly perform redundant SPF (Shortest Path First) computations. Each SPF computes paths to ALL nodes, but we only use the path to one destination, and many demands share the same source.
Affected workflows:
- MaximumSupportedDemand (MSD) - Binary search evaluates ~25 alpha values, placing all demands at each step
- TrafficMatrixPlacement - Monte Carlo runs N iterations, each placing all demands under different failure scenarios
- demand_placement_analysis - Core function used by the above and available for direct use
Production scenario example (2,733 nodes, 720 demands from 77 sources):
| Workflow | Iterations | SPF Calls (current) | SPF Calls (with caching) |
|---|---|---|---|
| MSD | 25 alphas × 1 seed | 18,000 | ~2,000 |
| TrafficMatrixPlacement | 10 MC iterations | 7,200 | ~800 |
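The counts follow directly from iterations × demands today versus iterations × unique sources with caching; a quick sanity check using the scenario numbers above:
```python
# Quick sanity check of the table above, using the production scenario numbers.
demands, sources = 720, 77  # total demands and unique source nodes

for workflow, iterations in [("MSD", 25), ("TrafficMatrixPlacement", 10)]:
    current = iterations * demands   # one SPF per demand per iteration
    cached = iterations * sources    # one SPF per unique source per iteration
    print(f"{workflow}: {current} -> {cached} SPF calls")

# Output:
# MSD: 18000 -> 1925 SPF calls
# TrafficMatrixPlacement: 7200 -> 770 SPF calls
```
The 1,925 and 770 line up with the ~2,000 and ~800 figures in the table; the per-iteration reduction is the 720/77 ≈ 9.3x reuse factor discussed under Design Considerations below.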
Root Cause
```python
for demand in demands:
    # Each call computes SPF internally
    policy.place_demand(flow_graph, src_id, dst_id, priority, volume)
```
The place_demand() API encapsulates the SPF computation, which is convenient but prevents reuse. With 720 demands from 77 unique sources, we compute 720 SPFs when 77 would suffice.
Proposed Solution
Cache SPF results by source node within each placement batch:
```python
spf_cache: dict[int, PredDAG] = {}

for demand in demands:
    if src_id not in spf_cache:
        spf_cache[src_id] = algorithms.spf(src=src_id, dst=None, ...)  # Full DAG
    dag = spf_cache[src_id]
    placed = flow_graph.place(flow_idx, src_id, dst_id, dag, volume, ...)

    # Fallback if shortest paths saturated
    if placed < volume:
        spf_cache[src_id] = algorithms.spf(src=src_id, dst=None, ...)
        # retry...
```
Key insight: we already have the necessary primitives:
- algorithms.spf(dst=None) returns the full predecessor DAG to all nodes
- flow_graph.place() accepts a pre-computed DAG
Implementation Locations
Option A: Per-workflow (minimal change)
Apply caching in each workflow step individually:
- maximum_supported_demand_step.py → _evaluate_alpha_cached()
- exec/analysis/flow.py → demand_placement_analysis()
Pros: Isolated changes, easy to test independently
Cons: Duplicated caching logic
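For concreteness, a minimal sketch of what the per-workflow caching could look like inside a single step; the helper name comes from the list above, but its signature, the demand attributes, and the algorithms/flow_graph calls are assumptions based on the snippets in this issue:
```python
# Hypothetical helper for maximum_supported_demand_step.py; signatures are assumed.
def _evaluate_alpha_cached(algorithms, flow_graph, demands, alpha):
    """Place all demands scaled by alpha, computing at most one SPF per source."""
    spf_cache = {}  # source node id -> predecessor DAG, valid only for this batch
    total_placed = 0.0
    for d in demands:
        if d.src_id not in spf_cache:
            spf_cache[d.src_id] = algorithms.spf(src=d.src_id, dst=None)
        placed = flow_graph.place(d.flow_idx, d.src_id, d.dst_id,
                                  spf_cache[d.src_id], d.volume * alpha)
        # Fallback recomputation on saturation omitted; see the Proposed Solution loop.
        total_placed += placed
    return total_placed
```
The cache is created fresh for every alpha evaluation, so no state is carried across evaluations (consistent with the design considerations below).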
Option B: Shared utility (cleaner)
Create a CachedDemandPlacer class in exec/analysis/:
```python
class CachedDemandPlacer:
    def __init__(self, algorithms, handle, flow_graph, node_mask, edge_mask):
        self.spf_cache: dict[int, PredDAG] = {}
        ...

    def place(self, src_id, dst_id, volume, priority) -> float:
        # Caching logic here
        ...

    def clear_cache(self):
        self.spf_cache.clear()
```
Pros: Single implementation, consistent behavior
Cons: New abstraction to maintain
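A slightly fuller sketch of how Option B's place() could combine the cache lookup with the fallback recomputation. The parameter list is simplified (handle, masks, and priority dropped for brevity) and the body just restates the Proposed Solution loop; it is not the final API:
```python
# Sketch of the Option B internals; parameter list simplified, behavior assumed.
class CachedDemandPlacer:
    def __init__(self, algorithms, flow_graph):
        self.algorithms = algorithms
        self.flow_graph = flow_graph
        self.spf_cache = {}  # source node id -> predecessor DAG

    def place(self, flow_idx, src_id, dst_id, volume) -> float:
        if src_id not in self.spf_cache:
            self.spf_cache[src_id] = self.algorithms.spf(src=src_id, dst=None)
        placed = self.flow_graph.place(flow_idx, src_id, dst_id,
                                       self.spf_cache[src_id], volume)
        if placed < volume:
            # Shortest paths saturated: refresh the DAG and place the remainder once more.
            self.spf_cache[src_id] = self.algorithms.spf(src=src_id, dst=None)
            placed += self.flow_graph.place(flow_idx, src_id, dst_id,
                                            self.spf_cache[src_id], volume - placed)
        return placed

    def clear_cache(self) -> None:
        self.spf_cache.clear()
```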
Option C: Core-level caching
Push caching into NetGraph-Core's FlowPolicy class.
Pros: Automatic for all callers
Cons: More complex C++ changes, cache lifetime management, harder to tune
Recommendation: Start with Option A for MSD (highest impact), then refactor to Option B if TrafficMatrixPlacement also shows significant benefit.
Design Considerations
Why not cache across alpha evaluations (MSD) or MC iterations?
Each evaluation starts with fresh residual capacities, and the optimal paths depend on what has already been placed. Cross-evaluation caching would require tracking edge usage, and that complexity is not worth the benefit since within-evaluation reuse is already high (720 demands / 77 sources ≈ 9.3x).
Why not cache across failure scenarios (TrafficMatrixPlacement)?
Each failure scenario has different excluded nodes/links. The network topology changes, so cached paths may be invalid. This would need a cache keyed by (source, exclusion_set), which is possible but adds complexity.
Future enhancement: For failure scenarios with small exclusion sets, we could potentially reuse cached DAGs and just filter out excluded edges at placement time (similar to how stale DAGs work with
residuals). This needs investigation.
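If that enhancement is pursued, the filtering itself could be cheap. The sketch below assumes, purely for illustration, a predecessor DAG represented as a dict from node id to (predecessor, edge id) pairs; NetGraph-Core's actual PredDAG layout may differ, and the correctness caveat in the docstring is exactly what needs investigation:
```python
# Illustration only: assumes dag = {node_id: [(pred_node_id, edge_id), ...]}.
def filter_dag(dag: dict[int, list[tuple[int, int]]],
               excluded_nodes: set[int],
               excluded_edges: set[int]) -> dict[int, list[tuple[int, int]]]:
    """Drop excluded nodes/edges from a cached DAG instead of recomputing SPF.

    Only safe when the surviving entries still describe valid shortest paths
    (e.g. small exclusion sets that do not change path costs); otherwise the
    failure scenario needs a fresh SPF.
    """
    filtered = {}
    for node, preds in dag.items():
        if node in excluded_nodes:
            continue
        kept = [(p, e) for p, e in preds
                if p not in excluded_nodes and e not in excluded_edges]
        if kept:
            filtered[node] = kept
    return filtered
```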
Why the fallback recomputation is essential
When ALL shortest paths from a source to a destination become saturated, placement fails. The fallback discovers alternative (longer) paths that have become optimal due to shortest-path saturation. Without
it, demands would fail even when capacity exists via non-shortest routes.
Thread safety for TrafficMatrixPlacement
Monte Carlo runs parallel iterations via ThreadPoolExecutor. Each iteration operates on independent FlowGraph instances with independent residuals. The SPF cache should be per-iteration, not shared, to avoid
cross-contamination between failure scenarios.
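A sketch of that structure, with hypothetical build_flow_graph / place_all_demands helpers standing in for the real iteration body; the point is only that the SPF cache is a local created inside each iteration, never shared across workers:
```python
from concurrent.futures import ThreadPoolExecutor

def run_iteration(scenario, demands, algorithms):
    flow_graph = build_flow_graph(scenario)  # independent FlowGraph and residuals
    spf_cache = {}                           # per-iteration cache, keyed by source
    return place_all_demands(algorithms, flow_graph, demands, spf_cache)

def run_monte_carlo(scenarios, demands, algorithms, workers=8):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_iteration, s, demands, algorithms)
                   for s in scenarios]
        return [f.result() for f in futures]
```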
Expected Impact
| Workflow | Before | After | Speedup |
|---|---|---|---|
| MSD (25 alphas) | 106s | ~36s | ~3x |
| TrafficMatrixPlacement (10 iters) | ~10s | ~4s | ~2.5x (estimated) |
| Combined scenario | 118s | ~42s | ~2.8x |
Alternatives Considered
Bidirectional Dijkstra: Explored for multipath mode. For dense graphs (degree ~133), both search frontiers explore most nodes before meeting, making it 58% slower in practice. It only helps for sparse graphs or single-destination queries.
Destination-aware early termination: Dijkstra can stop once the destination is settled, but for ECMP/multipath we need ALL equal-cost paths, which requires fuller exploration. Caching the full DAG also provides more value than early termination toward a single destination.