
SPF Caching for Demand Placement Workflows #98

@networmix

Description

Problem

Network workflows that place multiple demands repeatedly perform redundant SPF (Shortest Path First) computations. Each SPF call computes paths to ALL nodes, yet only the path to one destination is used, and many demands share the same source.

Affected workflows:

  1. MaximumSupportedDemand (MSD) - Binary search evaluates ~25 alpha values, placing all demands at each step
  2. TrafficMatrixPlacement - Monte Carlo runs N iterations, each placing all demands under different failure scenarios
  3. demand_placement_analysis - Core function used by the above and available for direct use

Production scenario example (2,733 nodes, 720 demands from 77 sources):

Workflow                 Iterations           SPF Calls (current)   SPF Calls (with caching)
MSD                      25 alphas × 1 seed   18,000                ~2,000
TrafficMatrixPlacement   10 MC iterations     7,200                 ~800
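
These estimates follow directly from the demand counts: without caching, each of the 720 demands triggers its own SPF (25 × 720 = 18,000 for MSD; 10 × 720 = 7,200 for TrafficMatrixPlacement), whereas per-source caching needs about one SPF per unique source per iteration (25 × 77 ≈ 1,925; 10 × 77 = 770), i.e. the ~2,000 and ~800 shown above.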

Root Cause

for demand in demands:
    # Each call computes SPF internally
    policy.place_demand(flow_graph, src_id, dst_id, priority, volume)

The place_demand() API encapsulates SPF computation—convenient but prevents reuse. With 720 demands from 77 unique sources, we compute 720 SPFs when 77 would suffice.
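
For any demand batch, the potential saving can be estimated by comparing the number of demands against the number of unique sources (an illustrative sketch; the src_id attribute on demands is an assumed shape):

def estimate_spf_savings(demands) -> tuple[int, int]:
    # Illustrative only: SPF calls today (one per demand) versus with
    # per-source caching (one per unique source, ignoring fallbacks).
    unique_sources = {demand.src_id for demand in demands}
    return len(demands), len(unique_sources)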

Proposed Solution

Cache SPF results by source node within each placement batch:

spf_cache: dict[int, PredDAG] = {}

for demand in demands:
    if src_id not in spf_cache:
        spf_cache[src_id] = algorithms.spf(src=src_id, dst=None, ...)  # Full DAG

    dag = spf_cache[src_id]
    placed = flow_graph.place(flow_idx, src_id, dst_id, dag, volume, ...)

    # Fallback if shortest paths saturated
    if placed < volume:
        spf_cache[src_id] = algorithms.spf(src=src_id, dst=None, ...)
        # retry...

Key insight: We already have the primitives:

  • algorithms.spf(dst=None) returns full predecessor DAG to all nodes
  • flow_graph.place() accepts a pre-computed DAG

Implementation Locations

Option A: Per-workflow (minimal change)

Apply caching in each workflow step individually:

  • maximum_supported_demand_step.py → _evaluate_alpha_cached()
  • exec/analysis/flow.py → demand_placement_analysis()

Pros: Isolated changes, easy to test independently
Cons: Duplicated caching logic

Option B: Shared utility (cleaner)

Create a CachedDemandPlacer class in exec/analysis/:

class CachedDemandPlacer:
    def __init__(self, algorithms, handle, flow_graph, node_mask, edge_mask):
        self.spf_cache: dict[int, PredDAG] = {}
        ...

    def place(self, src_id, dst_id, volume, priority) -> float:
        # Caching logic here
        ...

    def clear_cache(self):
        self.spf_cache.clear()

Pros: Single implementation, consistent behavior
Cons: New abstraction to maintain
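
For illustration, a workflow step could then drive placement through the shared utility (a hedged usage sketch; the demand attribute names are assumptions):

placer = CachedDemandPlacer(algorithms, handle, flow_graph, node_mask, edge_mask)
for demand in demands:
    placer.place(demand.src_id, demand.dst_id, demand.volume, demand.priority)
placer.clear_cache()  # e.g. between alpha evaluations or MC iterations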

Option C: Core-level caching

Push caching into NetGraph-Core's FlowPolicy class.

Pros: Automatic for all callers
Cons: More complex C++ changes, cache lifetime management, harder to tune

Recommendation: Start with Option A for MSD (highest impact), then refactor to Option B if TrafficMatrixPlacement also shows significant benefit.

Design Considerations

Why not cache across alpha evaluations (MSD) or MC iterations?

Each evaluation starts with fresh residual capacities, and the optimal paths differ based on what has already been placed. Cross-evaluation caching would require tracking edge usage, and that complexity is not worth the benefit since the within-evaluation reuse factor is already high (720 demands / 77 sources ≈ 9.3x).

Why not cache across failure scenarios (TrafficMatrixPlacement)?

Each failure scenario has different excluded nodes/links. The network topology changes, so cached paths may be invalid. Cross-scenario caching would need a cache keyed by (source, exclusion_set), which is possible but adds complexity.
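
For example, such a key could look like this (a hedged sketch; the exclusion parameters are assumptions, and the spf call follows the shape quoted earlier with its remaining arguments elided):

def dag_for_scenario(spf_cache, algorithms, src_id, excluded_nodes, excluded_links):
    # Hypothetical: key the cache by (source, exclusion set) so DAGs computed
    # under different failure scenarios are never mixed up.
    key = (src_id, frozenset(excluded_nodes), frozenset(excluded_links))
    if key not in spf_cache:
        # SPF runs on the failure-filtered graph for this scenario;
        # other spf arguments are elided as in the snippets above.
        spf_cache[key] = algorithms.spf(src=src_id, dst=None)
    return spf_cache[key]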

Future enhancement: For failure scenarios with small exclusion sets, we could potentially reuse cached DAGs and just filter out excluded edges at placement time (similar to how stale DAGs work with
residuals). This needs investigation.

Why the fallback recomputation is essential

When ALL shortest paths from a source to a destination become saturated, placement fails. The fallback discovers alternative (longer) paths that have become optimal due to shortest-path saturation. Without
it, demands would fail even when capacity exists via non-shortest routes.
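
Written in the shape of the loop above, the fallback could look like this (a sketch only; the single-retry policy and the exact place/spf arguments are assumptions):

def place_with_fallback(algorithms, flow_graph, spf_cache, flow_idx,
                        src_id, dst_id, volume):
    # Sketch: place against the cached DAG, recomputing SPF at most once if the
    # cached shortest paths are saturated.
    if src_id not in spf_cache:
        spf_cache[src_id] = algorithms.spf(src=src_id, dst=None)
    placed = flow_graph.place(flow_idx, src_id, dst_id, spf_cache[src_id], volume)
    if placed < volume:
        # Recompute against current residuals so the DAG reflects the longer
        # paths that are now optimal, then retry the remaining volume.
        spf_cache[src_id] = algorithms.spf(src=src_id, dst=None)
        placed += flow_graph.place(flow_idx, src_id, dst_id, spf_cache[src_id],
                                   volume - placed)
    return placed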

Thread safety for TrafficMatrixPlacement

Monte Carlo runs parallel iterations via ThreadPoolExecutor. Each iteration operates on independent FlowGraph instances with independent residuals. The SPF cache should be per-iteration, not shared, to avoid
cross-contamination between failure scenarios.
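
A minimal sketch of that isolation, assuming hypothetical helpers for building the per-scenario FlowGraph and placing its demands:

from concurrent.futures import ThreadPoolExecutor

def run_iteration(scenario):
    flow_graph = build_flow_graph(scenario)  # hypothetical helper: independent FlowGraph per iteration
    spf_cache = {}                           # cache scoped to this iteration, never shared across threads
    return place_all_demands(flow_graph, spf_cache, scenario)  # hypothetical helper

def run_monte_carlo(scenarios):
    with ThreadPoolExecutor() as pool:
        return list(pool.map(run_iteration, scenarios))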

Expected Impact

Workflow                            Before   After   Speedup
MSD (25 alphas)                     106s     ~36s    ~3x
TrafficMatrixPlacement (10 iters)   ~10s     ~4s     ~2.5x (estimated)
Combined scenario                   118s     ~42s    ~2.8x

Alternatives Considered

Bidirectional Dijkstra: Explored for multipath mode. For dense graphs (degree ~133), both search frontiers explore most of the nodes before meeting, so it was actually 58% slower. It only helps for sparse graphs or single-destination queries.

Destination-aware early termination: Dijkstra can stop once the destination is settled, but ECMP/multipath needs ALL equal-cost paths, which requires fuller exploration. Caching the full DAG also provides more value than early termination for a single destination.
