Add an evaluation for model robustness

We need to add an evaluation that tests how robust the programs are across multiple runs (different seeds) and across multiple K-values.

1. A weak test can assess the similarity of the overall information captured by each run.
2. A stronger test would compare programs across runs and assess their consistency.
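
A possible starting point for both tests is sketched below. It rests on assumptions that are not stated in this issue: each run is represented as a list of programs, each program is a weight vector over the same ordered feature set, and cosine similarity is an acceptable measure of agreement. The function names (`weak_similarity`, `strong_consistency`) are hypothetical and only illustrate the idea.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero program as matching nothing
    return dot / (norm_u * norm_v)


def weak_similarity(run_a, run_b):
    """Weak test (assumed form): symmetric mean best-match cosine.

    Each program is scored against its closest program in the other run.
    Matches need not be one-to-one, so the runs may use different K.
    """
    def one_way(src, dst):
        return sum(max(cosine(p, q) for q in dst) for p in src) / len(src)

    return 0.5 * (one_way(run_a, run_b) + one_way(run_b, run_a))


def strong_consistency(run_a, run_b):
    """Strong test (assumed form): best one-to-one matching of programs.

    Brute force over permutations, so only practical for small K.
    Returns (mean matched cosine, mapping), where mapping[i] is the index
    of the program in run_b matched to program i in run_a.
    """
    k = len(run_a)
    if len(run_b) != k:
        raise ValueError("one-to-one matching needs the same K in both runs")

    def total(perm):
        return sum(cosine(run_a[i], run_b[j]) for i, j in enumerate(perm))

    best = max(permutations(range(k)), key=total)
    return total(best) / k, list(best)


# Toy data: run_b holds the same two programs as run_a, reordered and rescaled.
run_a = [[1, 0, 0], [0, 1, 0]]
run_b = [[0, 2, 0], [3, 0, 0]]
run_c = [[1, 0, 0]]  # a run with a different K

weak_ab = weak_similarity(run_a, run_b)
weak_ac = weak_similarity(run_a, run_c)
strong_score, mapping = strong_consistency(run_a, run_b)
```

Because the weak score does not require a one-to-one match, it can also be applied across K-values, while the strong score only applies to runs with the same K. For larger K the brute-force search becomes infeasible; an assignment solver such as `scipy.optimize.linear_sum_assignment` would be the usual replacement.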