LLM Models
NLP Models
- 1810.04805_BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- 18xx_GPT1: Improving Language Understanding by Generative Pre-Training
- 19xx_GPT2: Language Models are Unsupervised Multitask Learners
- 2006.03654_DeBERTa: Decoding-enhanced BERT with Disentangled Attention
- 2012.00413_CPM: A Large-scale Generative Chinese Pre-trained Language Model
- 2302.13971_LLaMA: Open and Efficient Foundation Language Models
- 2307.09288_Llama 2: Open Foundation and Fine-Tuned Chat Models
- 2309.16609_Qwen Technical Report
- 2310.19341_Skywork: A More Open Bilingual Foundation Model
- 2401.14196_DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence
- 2404.06395_MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
- 2405.04434_DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
- 2406.12793_ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools
- 2407.10671_Qwen2 Technical Report
- 2412.15115_Qwen2.5 Technical Report
- 2505.09388_Qwen3 Technical Report
- 2508.06471_GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
Multimodal Models
- 2112.15093_CTR: Benchmarking Chinese Text Recognition: Datasets, Baselines, and an Empirical Study
- 2304.08485_LLaVA: Visual Instruction Tuning
- 2308.12966_Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
- 2310.03744_LLaVA-1.5: Improved Baselines with Visual Instruction Tuning
- 2312.07533_VILA: On Pre-training for Visual Language Models
- 2403.05525_DeepSeek-VL: Towards Real-World Vision-Language Understanding
- 2408.01800_MiniCPM-V: A GPT-4V Level MLLM on Your Phone
- 2409.17146_Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
- 2410.13848_Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
- 2411.00774_Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM
- 2412.04468_NVILA: Efficient Frontier Visual Language Models
- 2502.13923_Qwen2.5-VL Technical Report
- 2505.14683_BAGEL: Emerging Properties in Unified Multimodal Pretraining
- 2506.13642_Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model
- 2507.05595_PaddleOCR 3.0 Technical Report
- 2510.14528_PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
Embedding Models
LLM Audio
- 2005.08100_Conformer: Convolution-augmented Transformer for Speech Recognition
- 2106.07447_HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
- 2112.02418_YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone
- 2212.04356_Whisper: Robust Speech Recognition via Large-Scale Weak Supervision
- 2301.02111_VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
- 2303.03926_VALL-E_X: Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling
- 2406.05370_VALL-E2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers
- 2407.05407_CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens
- 2407.10759_Qwen2-Audio Technical Report
- 2410.00037_Moshi: a speech-text foundation model for real-time dialogue
- 2412.10117_CosyVoice2: Scalable Streaming Speech Synthesis with Large Language Models
- 2501.06282_MinMo: A Multimodal Large Language Model for Seamless Voice Interaction
- 2505.02707_Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play
- 2505.17589_CosyVoice3: Towards In-the-wild Speech Generation via Scaling-up and Post-training
- 2512.20156_Fun-Audio-Chat Technical Report
LLM Video
LLM MoE
Commercial Models
- 2303.08774_GPT-4 Technical Report
- 2312.11805_Gemini: A Family of Highly Capable Multimodal Models
- 2403.05530_Gemini1.5: Unlocking multimodal understanding across millions of tokens of context
- 2406.02430_Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
- 2407.04675_Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition
- 2503.20020_Gemini Robotics: Bringing AI into the Physical World
- 2504.xxxxx_Seed-Thinking-v1.5: Advancing Superb Reasoning Models with Reinforcement Learning
- 2505.07062_Seed1.5-VL Technical Report