LLM Models
NLP Models
- 1810.04805_BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- 18xx_GPT1: Improving Language Understanding by Generative Pre-Training
- 19xx_GPT2: Language Models are Unsupervised Multitask Learners
- 2006.03654_DeBERTa: Decoding-enhanced BERT with Disentangled Attention
- 2012.00413_CPM: A Large-scale Generative Chinese Pre-trained Language Model
- 2302.13971_LLaMA: Open and Efficient Foundation Language Models
- 2307.09288_Llama 2: Open Foundation and Fine-Tuned Chat Models
- 2309.16609_Qwen Technical Report
- 2310.19341_Skywork: A More Open Bilingual Foundation Model
- 2401.14196_DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence
- 2404.06395_MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
- 2405.04434_DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
- 2406.12793_ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools
- 2407.10671_Qwen2 Technical Report
- 2412.15115_Qwen2.5 Technical Report
- 2505.09388_Qwen3 Technical Report
- 2508.06471_GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
Multimodal Models
- 2112.15093_CTR: Benchmarking Chinese Text Recognition: Datasets, Baselines, and an Empirical Study
- 2304.08485_LLaVA: Visual Instruction Tuning
- 2308.12966_Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
- 2310.03744_LLaVA-1.5: Improved Baselines with Visual Instruction Tuning
- 2312.07533_VILA: On Pre-training for Visual Language Models
- 2403.05525_DeepSeek-VL: Towards Real-World Vision-Language Understanding
- 2408.01800_MiniCPM-V: A GPT-4V Level MLLM on Your Phone
- 2409.17146_Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
- 2410.13848_Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
- 2411.00774_Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM
- 2412.04468_NVILA: Efficient Frontier Visual Language Models
- 2502.13923_Qwen2.5-VL Technical Report
- 2505.14683_BAGEL: Emerging Properties in Unified Multimodal Pretraining
- 2506.13642_Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model
- 2507.05595_PaddleOCR 3.0 Technical Report
- 2510.14528_PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
Embedding Models
LLM Audio
- 2005.08100_Conformer: Convolution-augmented Transformer for Speech Recognition
- 2106.07447_HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
- 2112.02418_YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone
- 2212.04356_Whisper: Robust Speech Recognition via Large-Scale Weak Supervision
- 2301.02111_VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
- 2303.03926_VALL-E_X: Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling
- 2406.05370_VALL-E2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers
- 2407.05407_CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens
- 2407.10759_Qwen2-Audio Technical Report
- 2410.00037_Moshi: a speech-text foundation model for real-time dialogue
- 2412.10117_CosyVoice2: Scalable Streaming Speech Synthesis with Large Language Models
- 2501.06282_MinMo: A Multimodal Large Language Model for Seamless Voice Interaction
- 2505.02707_Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play
- 2505.17589_CosyVoice3: Towards In-the-wild Speech Generation via Scaling-up and Post-training
- 2512.20156_Fun-Audio-Chat Technical Report
LLM Video
LLM MoE
Commercial Models
- 2303.08774_GPT-4 Technical Report
- 2312.11805_Gemini: A Family of Highly Capable Multimodal Models
- 2403.05530_Gemini1.5: Unlocking multimodal understanding across millions of tokens of context
- 2406.02430_Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
- 2407.04675_Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition
- 2503.20020_Gemini Robotics: Bringing AI into the Physical World
- 2504.xxxxx_Seed-Thinking-v1.5: Advancing Superb Reasoning Models with Reinforcement Learning
- 2505.07062_Seed1.5-VL Technical Report