YourBench: Easy Custom Evaluation Sets for Everyone

¹Hugging Face, ²University of Illinois at Urbana-Champaign
Figure 1: YourBench-generated MMLU questions maintain the same model ranking as the original benchmark while being more challenging.

YourBench automatically generates reliable, domain-specific evaluation sets directly from source documents.

Abstract

Evaluating large language models (LLMs) effectively remains a critical bottleneck: traditional static benchmarks suffer from saturation and contamination, while human evaluations are costly and slow. This hinders timely, domain-specific assessment, which is crucial for real-world applications.

We introduce YourBench, a novel, open-source framework that addresses these limitations by enabling dynamic, automated generation of reliable, up-to-date, and domain-tailored benchmarks cheaply and without manual annotation, directly from user-provided documents. We demonstrate its efficacy by replicating 7 diverse MMLU subsets using minimal source text, achieving this for under $15 in total inference costs while perfectly preserving the relative model performance rankings (Spearman ρ = 1) observed on the original benchmark.

To ensure that YourBench generates data grounded in the provided input rather than in models' pre-existing parametric knowledge, we also introduce Tempora-0325, a novel dataset of over 7K diverse documents published exclusively after March 2025. Our comprehensive analysis spans 26 state-of-the-art models from 7 major families across varying scales (3B to 671B parameters), validating the quality of the generated evaluations through rigorous algorithmic checks (e.g., citation grounding) and human assessments.

We release the YourBench library, the Tempora-0325 dataset, 150k+ question-answer pairs based on Tempora-0325, and all evaluation and inference traces to facilitate reproducible research and to empower the community to generate bespoke benchmarks on demand, fostering more relevant and trustworthy LLM evaluation.

Framework Overview

YourBench is a comprehensive framework for generating custom evaluation sets from any collection of documents. The pipeline consists of four main stages (a minimal code sketch follows the list):

Figure 2: The YourBench pipeline transforms source documents into high-quality evaluation sets through four key stages.
  1. Document Preprocessing: Standardizes diverse document formats and handles multimodal content
  2. Question Generation: Uses LLM ensembles to create diverse, contextually-grounded questions
  3. Quality Filtering: Ensures questions are valid and verifiably answerable from source material
  4. Evaluation: Provides tools for assessing model performance on the generated benchmark
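
To make the flow concrete, the sketch below strings the four stages together in plain Python. All function names and the chunking, generation, and filtering logic are illustrative stand-ins under our own assumptions; the actual YourBench library is configuration-driven, and stage 2 would call an LLM ensemble rather than the stub shown here.

```python
# A minimal, self-contained sketch of the four-stage flow described above.
# All helper logic here is illustrative, not the YourBench API.
from dataclasses import dataclass, field

@dataclass
class QAPair:
    question: str
    answer: str
    citations: list = field(default_factory=list)  # verbatim snippets from the source chunk

# 1. Document preprocessing: normalize and split raw text into chunks.
def preprocess(doc: str, max_chars: int = 1500) -> list:
    paragraphs = [p.strip() for p in doc.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for p in paragraphs:
        if current and len(current) + len(p) > max_chars:
            chunks.append(current)
            current = ""
        current = (current + "\n\n" + p).strip()
    if current:
        chunks.append(current)
    return chunks

# 2. Question generation: stand-in for an LLM-ensemble call that would return
#    contextually grounded QA pairs with supporting citations.
def generate_questions(chunk: str, model: str) -> list:
    first_sentence = chunk.split(".")[0].strip()  # fake "generation" for illustration
    return [QAPair(
        question=f"According to the document, what is stated about: '{first_sentence[:80]}'?",
        answer=first_sentence,
        citations=[first_sentence],
    )]

# 3. Quality filtering: keep only pairs whose citations appear in the source chunk.
def is_grounded(qa: QAPair, chunk: str) -> bool:
    return all(c in chunk for c in qa.citations)

def build_benchmark(raw_docs, generator_llms):
    benchmark = []
    for doc in raw_docs:
        for chunk in preprocess(doc):
            for llm in generator_llms:
                for qa in generate_questions(chunk, model=llm):
                    if is_grounded(qa, chunk):
                        benchmark.append(qa)
    # 4. Evaluation: `benchmark` would now be scored against the target models.
    return benchmark
```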

Key Features

YourBench addresses critical limitations in current LLM evaluation approaches:

  • Dynamic Generation: Create fresh benchmarks on demand, reducing contamination risk
  • Domain Specificity: Tailor evaluations to specialized fields and knowledge areas
  • Temporal Relevance: Generate evaluations from recent documents to test up-to-date knowledge
  • Cost Efficiency: Produce high-quality evaluations for under $15 in inference costs
  • Automation: Eliminate the need for manual annotation while maintaining quality
  • Verifiable Grounding: Ensure questions are answerable from source material through citation validation

The framework is designed to be accessible to researchers, practitioners, and organizations of all sizes, democratizing access to custom evaluation.

Validation Results

MMLU Benchmark Replication

We validated YourBench by replicating 7 diverse subsets of the MMLU benchmark. Using only a few relevant Wikipedia pages per domain as input, we generated new multiple-choice questions in the MMLU style. This process took less than 5 minutes and cost under $2 per domain, requiring no human annotation.

The results demonstrated two key findings:

  1. Perfect preservation of relative model performance rankings compared to the original MMLU (Spearman ρ = 1.00)
  2. Consistently more challenging questions (lower absolute scores), yielding a contamination-resistant evaluation

This confirms that YourBench can reliably generate evaluations that maintain the discriminative power of established benchmarks while offering fresh, uncontaminated test items.
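
The ranking-preservation claim can be checked with a few lines of Python: given per-model accuracies on an original MMLU subset and on its YourBench-generated counterpart, a Spearman rank correlation of 1.0 means the ordering of models is identical even when absolute scores drop. The accuracy values below are invented purely for illustration.

```python
# Illustrative check of rank preservation; the accuracy values are made up.
from scipy.stats import spearmanr

original_mmlu  = {"model-a": 0.82, "model-b": 0.74, "model-c": 0.69, "model-d": 0.55}
yourbench_mmlu = {"model-a": 0.61, "model-b": 0.52, "model-c": 0.47, "model-d": 0.33}

models = sorted(original_mmlu)  # fixed model order for both score lists
rho, p_value = spearmanr(
    [original_mmlu[m] for m in models],
    [yourbench_mmlu[m] for m in models],
)

print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# rho = 1.00: identical model ranking, even though every absolute score is lower.
```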

Generation Quality Metrics

Validity-Diversity Spectrum

Figure 3: Trade-off between question validity and semantic diversity across different LLM generators.

Our analysis reveals a trade-off between question validity and semantic diversity across generator models. Models like o3-mini excel in validity but exhibit low diversity, while models like Qwen2.5 32B achieve high diversity with slightly lower validity. Some models, such as DeepSeek V3, strike a strong balance, scoring well on both dimensions.
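
One simple way to quantify semantic diversity (not necessarily the exact metric used in the paper) is to embed the generated questions and take the mean pairwise cosine dissimilarity; validity, by contrast, is typically judged by a grader model or human annotators. A sketch under those assumptions, using the sentence-transformers package with the all-MiniLM-L6-v2 model as an arbitrary embedding choice:

```python
# Sketch of a diversity proxy: mean pairwise cosine distance between question
# embeddings. The embedding model choice is an assumption for illustration.
import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_diversity(questions: list) -> float:
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(questions, normalize_embeddings=True)  # unit-norm vectors
    sims = emb @ emb.T                                        # cosine similarities
    n = len(questions)
    off_diag = sims[~np.eye(n, dtype=bool)]                   # drop self-similarities
    return float(1.0 - off_diag.mean())                       # higher = more diverse

questions = [
    "What year was the treaty signed?",
    "Which enzyme catalyzes the first step of glycolysis?",
    "How does the document justify the budget increase?",
]
print(f"diversity ≈ {semantic_diversity(questions):.3f}")
```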

Citation Grounding

Figure 4: Citation generation and validity performance across different models.

Faithful attribution to source material via citations is crucial for verifying the grounding of generated answers. Our evaluation shows that leading models like Claude 3.7 Sonnet and several competitive open-weight models demonstrate strong citation generation capabilities, while models like Qwen2.5 32B achieve high citation validity at a fraction of the cost.
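
Citation validity can be scored algorithmically: each quoted citation attached to an answer should be traceable to the source chunk it claims to cite. The sketch below uses fuzzy partial matching via the rapidfuzz package as one plausible implementation; the paper's exact scoring procedure may differ, and the threshold of 85 is an arbitrary illustration.

```python
# Sketch of an algorithmic grounding check: every citation attached to an
# answer should (approximately) appear in the source chunk it cites.
# Uses rapidfuzz's partial_ratio; the threshold of 85 is an arbitrary choice.
from rapidfuzz import fuzz

def citation_validity(citations: list, source_chunk: str, threshold: float = 85.0) -> float:
    """Fraction of citations that fuzzily match some span of the source chunk."""
    if not citations:
        return 0.0
    hits = sum(fuzz.partial_ratio(c, source_chunk) >= threshold for c in citations)
    return hits / len(citations)

chunk = "The committee approved the proposal on 12 March, citing budget constraints."
good = ["approved the proposal on 12 March"]
bad = ["rejected the proposal outright"]

print(citation_validity(good, chunk))  # 1.0 -> grounded citation
print(citation_validity(bad, chunk))   # 0.0 -> likely hallucinated citation
```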

Tempora-0325 Dataset

To support robust evaluation, particularly concerning temporal knowledge, we release Tempora-0325, a dataset comprising 7,368 documents published exclusively after March 1, 2025. This dataset is designed to:

  • Mitigate contamination by using content published after model training cutoffs
  • Force reliance on provided context rather than parametric knowledge
  • Span diverse domains including government, corporate, legal, medical, sports, news, and blogs
  • Provide both an unbalanced full corpus reflecting real-world distributions and a balanced subset for controlled analysis

Tempora-0325 is publicly available and can be used with YourBench to create challenging, up-to-date evaluations.
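
The dataset is distributed via the Hugging Face Hub and can be pulled with the datasets library. The repository ID below is a hypothetical placeholder; consult the official YourBench release for the exact Hub identifier and available configurations (e.g., the balanced subset).

```python
# Loading Tempora-0325 with the Hugging Face `datasets` library.
# NOTE: the repository ID is assumed for illustration; check the official
# YourBench release for the exact Hub identifier and available configs.
from datasets import load_dataset

tempora = load_dataset("yourbench/tempora-0325", split="train")  # hypothetical repo ID
print(tempora)      # number of rows and column names
print(tempora[0])   # a single post-March-2025 document record
```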

Applications

YourBench is already being explored in several research initiatives:

  • Domain-Specific Knowledge Assessment: Evaluating LLMs on specialized, proprietary knowledge in fields like agriculture
  • Personalized Education: Generating tailored assessment questions based on individual student learning profiles
  • Advanced RAG Training Data: Creating challenging training corpora for retrieval-augmented generation systems

By providing a robust, scalable, and fast automated approach, YourBench facilitates more nuanced, timely, and targeted assessments of LLM capabilities at a low cost, making the process accessible to most researchers and practitioners.

BibTeX

@misc{shashidhar2025yourbencheasycustomevaluation,
      title={YourBench: Easy Custom Evaluation Sets for Everyone}, 
      author={Sumuk Shashidhar and Clémentine Fourrier and Alina Lozovskaya and Thomas Wolf and Gokhan Tur and Dilek Hakkani-Tür},
      year={2025},
      eprint={2504.01833},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.01833}, 
}