Evaluation Benchmarks
- 02xx.xxxxx_BLEU: a Method for Automatic Evaluation of Machine Translation
- 0401.xxxxx_ROUGE: A Package for Automatic Evaluation of Summaries
- 1803.01937_ROUGE2.0: Updated and Improved Measures for Evaluation of Summarization Tasks
- 1804.08771_SacreBLEU: A Call for Clarity in Reporting BLEU Scores
- 2306.05685_Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
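A minimal scoring sketch for the text-overlap metrics listed above, assuming the `sacrebleu` and `rouge-score` Python packages are installed; the example strings are illustrative only.

```python
import sacrebleu
from rouge_score import rouge_scorer

hypotheses = ["the cat sat on the mat"]           # system outputs (illustrative)
references = [["the cat is sitting on the mat"]]  # one reference stream per hypothesis

# SacreBLEU: corpus-level BLEU with standardized tokenization (1804.08771).
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")

# ROUGE-1 / ROUGE-L F-measure for the same pair, computed at sentence level.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(references[0][0], hypotheses[0])
print({name: round(s.fmeasure, 3) for name, s in scores.items()})
```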
Datasets - Agent
Datasets - QA
Datasets - Coding
- 2107.03374_HumanEval: Evaluating Large Language Models Trained on Code
- 2108.07732_MBPP: Program Synthesis with Large Language Models
- 2310.06770_SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
- 2402.16694_HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization
- 2403.07974_LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
- 2407.10499_CIBench: Evaluating Your LLMs with a Code Interpreter Plugin
- 2410.03859_SWE-bench-Multimodal: Do AI Systems Generalize to Visual Software Domains?
- 2410.06992_SWE-Bench+: Enhanced Coding Benchmark for LLMs
- 2501.01257_CodeForces: Benchmarking Competition-level Code Generation of LLMs on CodeForces
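HumanEval and MBPP report pass@k. The sketch below re-implements the unbiased estimator described in the HumanEval paper (2107.03374), pass@k = E[1 - C(n-c, k) / C(n, k)], using `math.comb`; the numbers in the usage line are illustrative.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct,
    given that c of the n generated samples pass the unit tests."""
    if n - c < k:  # every size-k subset must contain a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 37 passing -> pass@1, pass@10, pass@100.
print(pass_at_k(200, 37, 1), pass_at_k(200, 37, 10), pass_at_k(200, 37, 100))
```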
Datasets - Long Context
- 2402.05136_LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K
- 2402.17753_LoCoMo: Evaluating Very Long-Term Conversational Memory of LLM Agents
- 2404.06654_RULER: What’s the Real Context Size of Your Long-Context Language Models?
- 2407.11963_NeedleBench: Can LLMs Do Retrieval and Reasoning in Information-Dense Context
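RULER and NeedleBench build on needle-in-a-haystack style probes: a "needle" fact is hidden at a chosen depth inside filler text and the model must retrieve it. The sketch below only illustrates that general setup; the function and parameter names are hypothetical, not either benchmark's actual code.

```python
import random

def build_haystack_prompt(needle: str, filler_sentences: list[str],
                          context_len_tokens: int, depth: float) -> str:
    """Insert `needle` at relative position `depth` (0.0 = start, 1.0 = end)."""
    haystack, approx_tokens = [], 0
    while approx_tokens < context_len_tokens:
        sentence = random.choice(filler_sentences)
        haystack.append(sentence)
        approx_tokens += len(sentence.split())  # rough whitespace token count
    haystack.insert(int(len(haystack) * depth), needle)
    context = " ".join(haystack)
    return f"{context}\n\nQuestion: What is the secret passcode mentioned above?\nAnswer:"

prompt = build_haystack_prompt(
    needle="The secret passcode is 7391.",
    filler_sentences=["The sky was clear over the quiet harbor.",
                      "Traders exchanged goods along the old river road."],
    context_len_tokens=2000,
    depth=0.35,
)
```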
Datasets - Math
Datasets - Images
- 2306.13394_MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
- 2307.06281_MMBench: Is Your Multi-modal Model an All-around Player?
- 2307.16125_SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
- 2311.12793_ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
- 2506.18095_ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation
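MMBench (2307.06281) scores multiple-choice questions with CircularEval: each question is asked once per rotation of its options, and the model only gets credit if it is correct under every rotation. A hedged sketch of that scoring rule, with `ask_model` as a hypothetical stand-in for a real VLM call:

```python
def circular_eval(question: str, options: list[str], answer_idx: int, ask_model) -> bool:
    """Return True only if the model picks the correct option under all rotations."""
    n = len(options)
    for shift in range(n):
        rotated = options[shift:] + options[:shift]
        correct_pos = (answer_idx - shift) % n      # where the gold option lands
        predicted_pos = ask_model(question, rotated)  # hypothetical: returns an index
        if predicted_pos != correct_pos:
            return False
    return True
```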
Datasets
- General
    - 2009.03300_MMLU: Measuring Massive Multitask Language Understanding
    - 2305.08322_C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models
    - 2306.09212_CMMLU: Measuring massive multitask language understanding in Chinese
    - 2307.15020_SuperCLUE: A Comprehensive Chinese Large Language Model Benchmark
    - 2311.12983_GAIA: a benchmark for General AI Assistants
    - 2404.07972_OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
    - 2501.14249_HLE: Humanity’s Last Exam
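MMLU, C-Eval, and CMMLU are scored as multiple-choice accuracy over lettered options. A minimal sketch of the prompt format and scoring loop; `generate` is a hypothetical model call, not a real API.

```python
LETTERS = "ABCD"

def format_question(question: str, choices: list[str]) -> str:
    """Render one MMLU-style item with lettered options."""
    lines = [question] + [f"{LETTERS[i]}. {c}" for i, c in enumerate(choices)]
    return "\n".join(lines) + "\nAnswer:"

def accuracy(examples: list[dict], generate) -> float:
    """examples: [{"question": str, "choices": list[str], "answer": int}, ...]"""
    correct = 0
    for ex in examples:
        prompt = format_question(ex["question"], ex["choices"])
        pred = generate(prompt).strip()[:1].upper()  # first character of the completion
        correct += int(pred == LETTERS[ex["answer"]])
    return correct / len(examples)
```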