We need to add an evaluation that tests the robustness of programs across multiple runs (different seeds) and across multiple values of K.
- A weak test can assess whether the runs capture similar overall information, without requiring individual programs to correspond one-to-one.
- A stronger test would match individual programs across runs and assess how consistent the matched programs are.
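
One possible way to make the two tests concrete is sketched below. This is an illustration under stated assumptions, not a settled design: it assumes each run can be summarized as a matrix of shape `(K, n_features)` with one row per program, and the function names (`subspace_similarity`, `match_programs`, `matched_consistency`) are hypothetical. The weak test is approximated by the overlap between the subspaces spanned by two runs (via principal angles), and the stronger test by greedy one-to-one matching on cosine similarity. Both metrics are choices made for the sketch, and an optimal assignment could be swapped in for the greedy matching.

```python
import numpy as np


def subspace_similarity(a, b):
    """Weak test: overlap between the spans of two runs' programs.

    a, b: arrays of shape (K_a, n_features) and (K_b, n_features).
    Returns the mean squared cosine of the principal angles, in [0, 1].
    Assumes the programs within each run are linearly independent.
    """
    # Orthonormal bases for the space spanned by each run's programs.
    qa, _ = np.linalg.qr(a.T)
    qb, _ = np.linalg.qr(b.T)
    # Singular values of qa^T qb are the cosines of the principal angles.
    cosines = np.linalg.svd(qa.T @ qb, compute_uv=False)
    cosines = np.clip(cosines, 0.0, 1.0)
    return float(np.mean(cosines ** 2))


def match_programs(a, b):
    """Stronger test, step 1: greedy one-to-one matching of programs.

    Returns a list of (index_in_a, index_in_b, cosine_similarity).
    If K_a != K_b, min(K_a, K_b) pairs are returned and the rest stay unmatched.
    """
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    sim = a_unit @ b_unit.T
    remaining = sim.copy()
    pairs = []
    for _ in range(min(sim.shape)):
        # Take the most similar pair that is still available.
        i, j = np.unravel_index(np.argmax(remaining), remaining.shape)
        pairs.append((int(i), int(j), float(sim[i, j])))
        # Remove the matched row and column from further consideration.
        remaining[i, :] = -np.inf
        remaining[:, j] = -np.inf
    return pairs


def matched_consistency(a, b):
    """Stronger test, step 2: mean cosine similarity over matched pairs."""
    return float(np.mean([s for _, _, s in match_programs(a, b)]))
```

For robustness across seeds, these scores could be computed for every pair of runs at a fixed K and then summarized (for example, by their mean or minimum). For robustness across K, the same functions accept runs with different numbers of programs; in that case the matching leaves the extra programs of the larger run unmatched, which is itself worth reporting.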