
Benchmark Explorer

All 500 LongMemEval-S questions, with c137's answer and the grading verdict for each.

How this was graded. Answers were judged by GPT-4o using the official LongMemEval evaluation templates from xiaowu0162/LongMemEval: one judge call per question, at temperature=0. Script: grade_official.py.
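A minimal sketch of what one such judge call looks like, assuming the OpenAI Python client; the prompt below is an illustrative placeholder, not the official LongMemEval template, and the function name is hypothetical (grade_official.py may be structured differently):

```python
# Sketch of a single GPT-4o judge call (illustrative; the real evaluation
# templates live in xiaowu0162/LongMemEval and vary by question type).
from openai import OpenAI

client = OpenAI()

def judge_answer(question: str, gold: str, hypothesis: str) -> bool:
    # Placeholder prompt: the actual script substitutes the official
    # LongMemEval evaluation template for the question's type.
    prompt = (
        f"Question: {question}\n"
        f"Gold answer: {gold}\n"
        f"Model answer: {hypothesis}\n"
        "Is the model answer correct? Reply yes or no."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # deterministic judging, one call per question
        messages=[{"role": "user", "content": prompt}],
    )
    verdict = response.choices[0].message.content.strip().lower()
    return verdict.startswith("yes")
```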

Per question you'll see: the question, the ground-truth answer, and c137's answer. Failures are tagged as a retrieval miss (the ground truth was not in the retrieved context) or a model error (the context was there, but the model still answered wrong); a sketch of that rule follows.
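A hypothetical sketch of the tagging rule, assuming a simple substring check for whether the ground truth appears in the retrieved context (the real check may be fuzzier; the function name is illustrative, not from grade_official.py):

```python
def tag_failure(gold: str, retrieved_context: str) -> str:
    # If the ground-truth answer never made it into the retrieved context,
    # the model could not have answered correctly: a retrieval miss.
    if gold.lower() not in retrieved_context.lower():
        return "retrieval miss"
    # Otherwise the evidence was present but the answer was still wrong.
    return "model error"
```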

Answerer model