
Why does Median Correctness vary across METRO Evaluation runs?

Summary:

We are observing variation in median correctness across METRO evaluation runs for the same custom AI agent, despite no configuration or dataset changes. Across multiple executions on the same agent, the median correctness score varies between runs, even when there are no intentional changes to:

• The agent configuration
• Test dataset / prompts
• Evaluation setup
For example, the median correctness differs between runs (e.g., 0.4 → 0.2 → 0), which is impacting the consistency of our evaluation results. Could you guide us on:

• What could be the possible reasons for variation in median correctness across evaluation runs?
• Is this expected behavior due to model randomness or evaluation methodology?
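To make the question concrete, here is a minimal sketch of the effect we suspect: if the correctness judge is non-deterministic (e.g., an LLM scoring with a sampling temperature above zero), the same test cases can receive different scores on each run, so the median moves even with an unchanged agent and dataset. The `judge_correctness` function below is a hypothetical stand-in for such a judge, not the actual METRO scoring method:

```python
import random
import statistics

def judge_correctness(case_id: str, rng: random.Random) -> float:
    # Hypothetical stand-in for a non-deterministic LLM judge:
    # the same test case may get a different score on each call
    # when the judge samples with temperature > 0.
    return rng.choice([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])

def run_evaluation(cases: list[str], seed: int) -> float:
    # One evaluation run: score every case, report the median.
    rng = random.Random(seed)
    scores = [judge_correctness(c, rng) for c in cases]
    return statistics.median(scores)

# Same agent, same dataset, same setup -- only the judge's
# sampling differs between runs (modeled here by the seed).
cases = [f"case-{i}" for i in range(5)]
medians = [run_evaluation(cases, seed) for seed in (1, 2, 3)]
print(medians)
```

With a small dataset, the median is especially sensitive: changing a single case's score can move it by a full bucket, which would match drift like 0.4 → 0.2 → 0. Conversely, re-running with the judge's randomness pinned (a fixed seed here; temperature 0 for a real judge, where supported) reproduces the same median.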

