LLM Ecosystem Technologies
Framework
- 1712.05889_Ray: A Distributed Framework for Emerging AI Applications (see the sketch after this list)
- 1910.02054_DeepSpeed_ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
- PyTorch: An Imperative Style, High-Performance Deep Learning Library
- Transformers: State-of-the-Art Natural Language Processing
- 2210.XX_Ray v2 Architecture
- 2309.06180_vLLM: Efficient Memory Management for Large Language Model Serving with PagedAttention
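Ray's core abstraction in the first paper above is turning plain Python functions into distributed tasks that return futures. A minimal sketch of that task API, assuming a local `pip install ray`; the squaring workload is invented purely for illustration:

```python
# Minimal sketch of Ray's remote-task API; the workload is made up for illustration.
import ray

ray.init()  # start a local Ray runtime

@ray.remote
def square(x: int) -> int:
    # Executes as a stateless task on whichever worker the scheduler picks.
    return x * x

# Calls return ObjectRef futures immediately; the tasks run in parallel.
futures = [square.remote(i) for i in range(8)]

# ray.get blocks until the results are available in the object store.
print(ray.get(futures))  # [0, 1, 4, 9, 16, 25, 36, 49]
```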
Large Model Fine-Tuning
- 2101.00190_Prefix-Tuning: Optimizing Continuous Prompts for Generation
- 2103.10385_p-tuning: GPT Understands, Too
- 2104.08691_Prompt Tuning: The Power of Scale for Parameter-Efficient Prompt Tuning
- 2106.09685_LoRA: Low-Rank Adaptation of Large Language Models (see the sketch after this list)
- 2401.01335_SPIN: Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
- 2402.09353_DoRA: Weight-Decomposed Low-Rank Adaptation
- 2402.12354_LoRA+: Efficient Low Rank Adaptation of Large Models
- 2403.03507_GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
- 2403.13372_LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models
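LoRA (2106.09685 above) freezes the pretrained weight `W` and trains only a low-rank update `ΔW = (α/r)·B·A`, so the trainable parameter count drops to `r·(d_in + d_out)` per adapted layer. A minimal PyTorch sketch of that idea; the `LoRALinear` wrapper, its initialization constants, and the example shapes are illustrative, not the reference implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update W + (alpha/r) * B @ A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # the pretrained weight stays frozen
        self.scale = alpha / r
        # A starts small and random, B starts at zero, so training begins at the base model.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Usage: swap in for a projection layer and train only A and B.
layer = LoRALinear(nn.Linear(768, 768), r=8, alpha=16)
print(layer(torch.randn(2, 10, 768)).shape)  # torch.Size([2, 10, 768])
```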
Distributed Models
- 1701.06538_MoE: Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (see the sketch after this list)
- 1806.03377_PipeDream: Fast and Efficient Pipeline Parallel DNN Training
- 1811.06965_GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
- 1909.08053_Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
- 19xx_PipeDream: Generalized Pipeline Parallelism for DNN Training
- 2006.09503_PipeDream-2BW: Memory-Efficient Pipeline-Parallel DNN Training
- 2006.15704_PyTorch Distributed: Experiences on Accelerating Data Parallel Training
- 2006.16668_GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
- 2104.04473_Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
- 2205.14135_FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- 2307.08691_FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
- General
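The sparsely-gated Mixture-of-Experts layer (1701.06538 above) grows parameter count without growing per-token compute by routing each token to only a few experts. A compact single-device sketch of top-k routing under that idea; real systems shard experts across devices (GShard) and batch tokens per expert rather than looping as done here:

```python
# Single-device sketch of a sparsely-gated MoE layer with top-k routing.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Route each token to its top-k experts and mix their outputs by gate weight."""

    def __init__(self, d_model: int, num_experts: int = 4, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                           nn.Linear(4 * d_model, d_model))
             for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        scores = self.gate(x)                                # (tokens, num_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)  # keep k experts per token
        weights = F.softmax(topk_scores, dim=-1)             # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e                # tokens whose slot-th choice is expert e
                if mask.any():
                    w = weights[mask][:, slot:slot + 1]      # (n_selected, 1)
                    out[mask] = out[mask] + w * expert(x[mask])
        return out

moe = TopKMoE(d_model=64)
print(moe(torch.randn(16, 64)).shape)  # torch.Size([16, 64])
```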
LLM Quantization
- General
- 2110.02861_bitsandbytes: 8-bit Optimizers via Block-wise Quantization (see the sketch after this list)
- 2206.01861_ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers
- 2206.09557_LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models
- 2208.07339_LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
- 2209.05433_FP8: FP8 Formats For Deep Learning
- 2210.17323_GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
- 2211.10438_SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
- 2305.14314_QLoRA: Efficient Finetuning of Quantized LLMs
- 2306.00978_AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
- 2309.05516_AutoRound: Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs
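Most of the weight-only methods above share a common building block: block-wise (group-wise) absmax quantization, where each block of weights gets its own scale so an outlier only degrades its own block (this is the block-wise scheme of the bitsandbytes entry; GPTQ and AWQ add calibration-based refinements on top). A minimal round-to-nearest sketch of that building block; the function names and the 128-column group size are illustrative choices:

```python
import torch

def quantize_groupwise(w: torch.Tensor, group_size: int = 128, bits: int = 8):
    """Symmetric absmax quantization of a 2-D weight with one scale per group of columns."""
    out_features, in_features = w.shape
    assert in_features % group_size == 0
    qmax = 2 ** (bits - 1) - 1                                  # 127 for int8, 7 for int4
    groups = w.reshape(out_features, in_features // group_size, group_size)
    scale = groups.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / qmax
    q = torch.clamp((groups / scale).round(), -qmax - 1, qmax).to(torch.int8)
    return q, scale                                             # int4 values also fit in int8 storage here

def dequantize_groupwise(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (q.to(torch.float32) * scale).reshape(q.shape[0], -1)

w = torch.randn(256, 512)
q, s = quantize_groupwise(w, group_size=128, bits=8)
err = (w - dequantize_groupwise(q, s)).abs().max().item()
print(q.dtype, s.shape, round(err, 5))   # torch.int8 torch.Size([256, 4, 1]) small round-off error
```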
LLM Safety
LLM Reinforcement Learning
Others
- 2203.02155_Training language models to follow instructions with human feedback (InstructGPT)
- 2305.20050_Let’s Verify Step by Step
- 2408.03314_Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (see the sketch after this list)
- 2412.14135_Scaling of Search and Learning: A Roadmap to Reproduce o1 from Reinforcement Learning Perspective
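One concrete form of the test-time-compute scaling discussed in 2408.03314 is best-of-N sampling: draw several candidate answers and keep the one a reward model scores highest (the scorer can be an outcome- or process-supervised verifier in the spirit of 2305.20050). A hedged sketch under those assumptions; `generate` and `reward` are hypothetical stand-ins, not a real API:

```python
import random
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              reward: Callable[[str, str], float],
              n: int = 8) -> str:
    """Sample n candidate answers and return the one the reward model scores highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: reward(prompt, answer))

# Toy stand-ins so the sketch runs end to end; a real setup would call an LLM sampler
# and a trained reward model here.
def toy_generate(prompt: str) -> str:
    return f"candidate-{random.randint(0, 999)}"

def toy_reward(prompt: str, answer: str) -> float:
    return (hash(answer) % 1000) / 1000.0   # pretend score in [0, 1)

print(best_of_n("What is 17 * 24?", toy_generate, toy_reward, n=8))
```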