Papers¶
General¶
Agents¶
- 2210.03629_ReAct
- 2303.08268_Chat-with-the-Environment
- 2303.11366_Reflexion: Language Agents with Verbal Reinforcement Learning
- 2303.16434_TaskMatrix.AI
- 2304.03442_Generative-Agents
- 2307.07924_ChatDev: Communicative Agents for Software Development
- 2308.00352_MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework
- 2308.04026_AgentSims: An Open-Source Sandbox for Large Language Model Evaluation
- 2308.08155_AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
- 2308.10848_AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors
- 2310.06117_Step-Back: Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models
- 2402.18679_MetaGPT_DI: Data Interpreter: An LLM Agent For Data Science
- 2407.07061_IoA: Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence
- 2408.08435_ADAS: Automated Design of Agentic Systems
- 2410.17238_SELA: Tree-Search Enhanced LLM Agents for Automated Machine Learning
- 2410.10762_AFlow: Automating Agentic Workflow Generation
- 2410.21012_FACT: Examining the Effectiveness of Iterative Context Rewriting for Multi-fact Retrieval
- 2504.01990_Advances and Challenges in Foundation Agents
- 2506.12508_AgentOrchestra: A Hierarchical Multi-Agent Framework for General-Purpose Task Solving
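Most of the agent papers above build on the Thought → Action → Observation loop introduced by ReAct (2210.03629). A minimal sketch of that loop, with a scripted stand-in for the model (`fake_llm` and the single-tool `TOOLS` table are hypothetical, not from any listed paper):

```python
import re

# One hypothetical tool; a real agent would expose search, code execution, etc.
TOOLS = {"calculator": lambda expr: str(eval(expr, {"__builtins__": {}}))}

def fake_llm(transcript: str) -> str:
    # Stand-in for an LLM call: scripts a single two-step ReAct episode.
    if "Observation:" not in transcript:
        return "Thought: I should compute this.\nAction: calculator[17 * 23]"
    return "Thought: I have the result.\nFinal Answer: 391"

def react(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = fake_llm(transcript)
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[1].strip()
        m = re.search(r"Action: (\w+)\[(.+)\]", step)
        if m:  # run the requested tool and feed the observation back
            obs = TOOLS[m.group(1)](m.group(2))
            transcript += f"Observation: {obs}\n"
    return "no answer"

print(react("What is 17 * 23?"))  # -> 391
```

The multi-agent frameworks in this list (MetaGPT, AutoGen, AgentVerse) can be read as compositions of several such loops with structured message passing between them.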
Visual Agents & AIOS¶
- 2312.13771_AppAgent: Multimodal Agents as Smartphone Users
- 2402.07939_UFO: A UI-Focused Agent for Windows OS Interaction
- 2406.01014_Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration
- 2501.11733_Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
- 2501.12326_UI-TARS: Pioneering Automated GUI Interaction with Native Agents
- 2502.14282_PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC
- 2403.16971_AIOS: LLM Agent Operating System
- 2504.14603_UFO2: The Desktop AgentOS
LLM Tuning¶
- 2101.00190_Prefix-Tuning: Optimizing Continuous Prompts for Generation
- 2103.10385_p-tuning: GPT Understands, Too
- 2104.08691_Prompt Tuning: The Power of Scale for Parameter-Efficient Prompt Tuning
- 2106.09685_LoRA: Low-Rank Adaptation of Large Language Models
- 2401.01335_Self-Play: Fine-Tuning Converts Weak Language Models to Strong Language Models
- 2402.09353_DoRA: Weight-Decomposed Low-Rank Adaptation
- 2402.12354_LoRA+: Efficient Low Rank Adaptation of Large Models
- 2403.03507_GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
- 2403.13372_LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models
- 2203.02155_Training language models to follow instructions with human feedback(InstructGPT)
- 2305.20050_Let’s Verify Step by Step
- 2408.03314_Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
- 2412.14135_Scaling of Search and Learning: A Roadmap to Reproduce o1 from Reinforcement Learning Perspective
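Several entries above (LoRA, DoRA, LoRA+) share one core idea: freeze the pretrained weight W and learn a low-rank update scaled by alpha/r. A minimal sketch of the LoRA forward pass (shapes and init choices follow 2106.09685; the variable names are illustrative):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16, r=4):
    """LoRA: the frozen weight W is augmented with a trainable low-rank
    update, so the adapted layer computes W x + (alpha / r) * B (A x)
    while only A and B receive gradients during fine-tuning."""
    return W @ x + (alpha / r) * (B @ (A @ x))

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 8, 4
W = rng.standard_normal((d_out, d_in))   # pretrained, frozen
A = rng.standard_normal((r, d_in)) * 0.01  # small random init
B = np.zeros((d_out, r))                   # zero init: adapter starts as a no-op
x = rng.standard_normal(d_in)

# With B = 0 the update vanishes, so step 0 reproduces the base model exactly.
assert np.allclose(lora_forward(x, W, A, B), W @ x)
```

The zero-initialized B is what makes LoRA safe to bolt onto a trained model: the first forward pass is identical to the frozen baseline.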
Distributed Models¶
- General
- 1701.06538_MoE: Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
- 1806.03377_PipeDream: Fast and Efficient Pipeline Parallel DNN Training
- 1811.06965_GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
- 1909.08053_Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
- 19xx_PipeDream: Generalized Pipeline Parallelism for DNN Training
- 2006.15704_PyTorch Distributed: Experiences on Accelerating Data Parallel Training
- 2006.16668_GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
- 2006.09503_PipeDream-2BW: Memory-Efficient Pipeline-Parallel DNN Training
- 2104.04473_Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
- 2205.14135_FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- 2307.08691_FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
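The MoE entries above (1701.06538, GShard) rest on sparse top-k gating: all experts are scored, only the best k run, and their outputs are mixed by renormalized softmax weights, so parameter count scales with the number of experts while compute scales with k. A toy single-token sketch (names illustrative, no load balancing or expert parallelism):

```python
import numpy as np

def moe_forward(x, gate_W, experts, k=2):
    """Sparsely-gated MoE: score all experts, run only the top-k,
    mix their outputs by a softmax over the selected logits."""
    logits = gate_W @ x
    top = np.argsort(logits)[-k:]            # indices of the k best experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                             # softmax over selected experts only
    return sum(wi * experts[i](x) for wi, i in zip(w, top))

rng = np.random.default_rng(0)
d, n_experts = 4, 8
x = rng.standard_normal(d)
gate_W = rng.standard_normal((n_experts, d))
expert_W = [rng.standard_normal((d, d)) for _ in range(n_experts)]
experts = [lambda v, W=W: W @ v for W in expert_W]  # each expert: a linear map

y = moe_forward(x, gate_W, experts, k=2)  # only 2 of 8 experts executed
```

GShard's contribution is sharding these experts across devices and routing tokens to them automatically; the gating math stays the same.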
LLM NLP¶
- 18xx_GPT1: Improving Language Understanding by Generative Pre-Training
- 1810.04805_BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- 19xx_GPT2: Language Models are Unsupervised Multitask Learners
- 2012.00413_CPM: A Large-scale Generative Chinese Pre-trained Language Model
- 2302.13971_LLaMA: Open and Efficient Foundation Language Models
- 2307.09288_Llama 2: Open Foundation and Fine-Tuned Chat Models
- 2309.16609_Qwen Technical Report
- 2401.14196_DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence
- 2404.06395_MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
- 2405.04434_DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
- 2406.12793_ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools
- 2407.10671_Qwen2 Technical Report
- 2412.15115_Qwen2.5
- 2505.09388_Qwen3
LLM MoE¶
LLM Multimodal¶
- 2304.08485_LLaVA: Visual Instruction Tuning
- 2308.12966_Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
- 2310.03744_LLaVA-1.5: Improved Baselines with Visual Instruction Tuning
- 2312.07533_VILA: On Pre-training for Visual Language Models
- 2403.05525_DeepSeek-VL: Towards Real-World Vision-Language Understanding
- 2408.01800_MiniCPM-V: A GPT-4V Level MLLM on Your Phone
- 2409.17146_Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
- 2411.00774_Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM
- 2412.04468_NVILA: Efficient Frontier Visual Language Models
- 2502.13923_Qwen2.5-VL
- 2503.20215_Qwen2.5-Omni Technical Report
- 2506.13642_Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model
LLM Audio¶
- 2005.08100_Conformer: Convolution-augmented Transformer for Speech Recognition
- 2112.02418_YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone
- 2212.04356_whisper: Robust Speech Recognition via Large-Scale Weak Supervision
- 2301.02111_Vall-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
- 2303.03926_VALL-E_X: Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling
- 2406.05370_VALL-E2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers
- 2407.05407_CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens
- 2407.10759_Qwen2-Audio Technical Report
- 2410.00037_Moshi: a speech-text foundation model for real-time dialogue
- 2412.10117_CosyVoice2: Scalable Streaming Speech Synthesis with Large Language Models
- 2501.06282_MinMo: A Multimodal Large Language Model for Seamless Voice Interaction
- 2505.02707_Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play
- 2505.17589_CosyVoice3: Towards In-the-wild Speech Generation via Scaling-up and Post-training
LLM Reinforcement Learning¶
LLM Quantization¶
- General
- 2110.02861_bitsandbytes: 8-bit Optimizers via Block-wise Quantization
- 2206.01861_ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers
- 2206.09557_LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models
- 2208.07339_LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
- 2209.05433_FP8: FP8 Formats For Deep Learning
- 2210.17323_GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
- 2211.10438_SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
- 2305.14314_QLoRA: Efficient Finetuning of Quantized LLMs
- 2306.00978_AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
- 2309.05516_AutoRound: Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs
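The common baseline that the post-training methods above (LLM.int8(), GPTQ, AWQ, SmoothQuant) refine is symmetric absmax int8 quantization: map the largest-magnitude weight to 127 and round everything else onto the int8 grid. A round-trip sketch (illustrative, per-tensor scaling only; the listed papers add per-channel/group scaling and error compensation):

```python
import numpy as np

def quantize_absmax(w):
    """Symmetric absmax quantization: one fp scale per tensor,
    int8 codes in [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.2, 0.03, 2.54], dtype=np.float32)
q, scale = quantize_absmax(w)
w_hat = dequantize(q, scale)

# Round-trip error is bounded by half a quantization step.
assert np.all(np.abs(w - w_hat) <= scale / 2 + 1e-6)
```

The weakness this exposes, and which motivates LLM.int8() and SmoothQuant, is outliers: a single large value inflates `scale` and crushes the resolution left for the small weights.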
LLM Closed-Source Models¶
3D¶
- 2003.08934_NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
- 2203.08586_Deep vanishing point detection: Geometric priors make dataset variations vanish
- 2312.14132_DUSt3R: Geometric 3D Vision Made Easy
- 2406.09756_MASt3R: Grounding Image Matching in 3D with MASt3R
- 2412.09401_SLAM3R: Real-Time Dense Scene Reconstruction from Monocular RGB Videos
- 2412.12392_MASt3R-SLAM: Real-Time Dense SLAM with 3D Reconstruction Priors
- 2503.11651_VGGT: Visual Geometry Grounded Transformer
LLM Safety¶
Benchmarking¶
- General
- 2009.03300_MMLU: Measuring Massive Multitask Language Understanding
- 2103.03874_MATH: Measuring Mathematical Problem Solving With the MATH Dataset
- 2311.12022_GPQA: A Graduate-Level Google-Proof Q&A Benchmark
- 2311.12983_GAIA: a benchmark for General AI Assistants
- 2404.07972_OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
- 2411.04368_SimpleQA: Measuring short-form factuality in large language models
- 2501.14249_HLE: Humanity’s Last Exam
Datasets & Data Distillation¶
Framework¶
- 1712.05889_Ray: A Distributed Framework for Emerging AI Applications
- 1910.02054_DeepSpeed_ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
- 1912.01703_PyTorch: An Imperative Style, High-Performance Deep Learning Library
- 1910.03771_Transformers: State-of-the-Art Natural Language Processing
- 2210.XX_Ray v2 Architecture
- 2309.06180_Efficient Memory Management for Large Language Model Serving with PagedAttention
ML¶
- 2112.09332_WebGPT: Browser-assisted question-answering with human feedback
- 2203.11147_GopherCite: Teaching language models to support answers with verified quotes
- 2305.14251_FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation
- 2304.09848_Generative_Search: Evaluating Verifiability in Generative Search Engines
- 2307.02185_Citation: A Key to Building Responsible and Accountable Large Language Models
- 2307.16883_HAGRID: A Human-LLM Collaborative Dataset for Generative Information-Seeking with Attribution
- 2305.14627_ALCE: Enabling Large Language Models to Generate Text with Citations
ML Multimodal¶
- 2108.03353_Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning
- 2209.08199_ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots
- 2212.06817_RT-1: Robotics Transformer for Real-World Control at Scale
- 2401.10935_SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
- 2402.04615_ScreenAI: A Vision-Language Model for UI and Infographics Understanding
- 2411.02059_TableGPT2: A Large Multimodal Model with Tabular Data Integration
ML Vision¶
- 1506.02640_You Only Look Once: Unified, Real-Time Object Detection
- 1612.08242_YOLO9000: Better, Faster, Stronger
- 1804.02767_YOLOv3
- 2004.10934_YOLOv4: Optimal Speed and Accuracy of Object Detection
- 2205.00159_SVTR: Scene Text Recognition with a Single Visual Model
- 2207.02696_YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors
- 2303.05499_Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
- 2402.13616_YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information
- 2405.14458_YOLOv10: Real-Time End-to-End Object Detection
- 2411.15858_SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition
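Across the YOLO line above, the shared post-processing step is greedy non-maximum suppression over IoU, which is exactly what the "end-to-end" design of YOLOv10 aims to remove. A minimal sketch (boxes as `(x1, y1, x2, y2)`; illustrative, not any paper's reference implementation):

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep the best-scoring box, drop rivals that overlap it
    above thresh, repeat on the remainder."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # -> [0, 2]: the second box overlaps the first and is suppressed
```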
RAG¶
- 2005.11401_Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
- 2312.10997_Retrieval-Augmented Generation for Large Language Models: A Survey
- 2401.15884_CRAG: Corrective Retrieval Augmented Generation
- 2403.14403_Adaptive-RAG
- 2404.16130_From Local to Global: A Graph RAG Approach to Query-Focused Summarization
- 2405.16506_GRAG: Graph Retrieval-Augmented Generation
- GraphRAG official documentation
- 2406.13213_Multi-Meta-RAG: Improving RAG for Multi-Hop Queries using Database Filtering with LLM-Extracted Metadata
- 2410.10450_KBLaM: Knowledge Base augmented Language Model
- 2504.03137_LightPROF: A Lightweight Reasoning Framework for Large Language Model on Knowledge Graph
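Every RAG variant above shares the same skeleton from 2005.11401: score documents against the query, stuff the top-k into the prompt, generate. A dependency-free sketch with bag-of-words cosine retrieval (names and toy corpus are illustrative; real systems use dense embeddings or, as in GraphRAG, graph traversal):

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by similarity to the query; return the top k."""
    q = Counter(query.lower().split())
    ranked = sorted(docs, key=lambda d: cosine(q, Counter(d.lower().split())),
                    reverse=True)
    return ranked[:k]

docs = [
    "LoRA adapts large language models with low-rank matrices",
    "FlashAttention reduces memory traffic for exact attention",
    "GraphRAG builds a knowledge graph before answering queries",
]
context = retrieve("how does low-rank adaptation work", docs, k=1)
prompt = "Answer using the context:\n" + "\n".join(context) + \
         "\nQ: how does low-rank adaptation work"
```

CRAG and Adaptive-RAG both intervene between `retrieve` and generation: grading the retrieved context and deciding whether to re-retrieve or answer directly.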
Tools¶
AGI¶
others¶
Highlighting the top ML papers every week: https://github.com/dair-ai/ML-Papers-of-the-Week