Why does median correctness vary across METRO evaluation runs?
Summary:
We are observing variation in median correctness across METRO evaluation runs for the same custom AI agent despite no configuration or dataset changes.
Across repeated evaluation runs on the same agent, the median correctness score varies, even though no intentional changes were made to:
• The agent configuration
• The test dataset / prompts
• The evaluation setup
For example, the median correctness differs between runs (e.g., 0.4 → 0.2 → 0), which affects the consistency of our evaluation results; a small illustrative sketch is included after the questions below. Could you please guide us on the following:
• What could be the possible reasons for variation in median correctness across evaluation runs?
• Is this expected behavior due to model randomness or evaluation methodology?
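For reference, here is a minimal sketch of how we understand the median to behave on a small test set. The per-case correctness scores below are illustrative assumptions, not actual METRO output or its API.

```python
from statistics import median

# Hypothetical per-test-case correctness scores from three repeated runs
# of the same agent on the same 5-case dataset. Changing a single case's
# score between consecutive runs shifts the median noticeably.
run_1 = [0.0, 0.2, 0.4, 0.8, 1.0]  # median = 0.4
run_2 = [0.0, 0.2, 0.0, 0.8, 1.0]  # one case drops from 0.4 to 0.0 -> median = 0.2
run_3 = [0.0, 0.0, 0.0, 0.8, 1.0]  # another case drops from 0.2 to 0.0 -> median = 0.0

for name, scores in [("run 1", run_1), ("run 2", run_2), ("run 3", run_3)]:
    print(f"{name}: median correctness = {median(scores)}")
```

If this roughly matches how METRO aggregates scores, then nondeterministic changes in only one or two responses could plausibly explain the swings we see on a small dataset, but we would appreciate confirmation.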