10.8. 模型推理平台¶

10.8.1. Ollama¶

https://ollama.com/
GitHub 仓库: https://github.com/ollama/ollama
ollama(OpenAI compatibility): https://ollama.com/blog/openai-compatibility
支持的模型列表： https://ollama.com/library
Ollama(开源平台，用于在本地运行和管理大型语言模型 (LLM)。它旨在使研究人员和开发人员能够轻松使用 LLM 进行各种任务)
定位: Ollama 专注于提供大型语言模型的运行环境，如Llama3、Phi3等。
Ollama是一个创新的工具，旨在在本地运行像Llama 2和Mistral这样的开源LLM。这个开创性平台通过将模型权重、配置和数据集捆绑到一个由Model文件管理的统一的包中，简化了运行LLM的复杂过程。Ollama模型库提供了广泛的模型选择
Ollama因其与各种模型的广泛兼容性而脱颖而出，包括Llama 2、Mistral和WizardCoder等著名模型。这种兼容性确保用户可以方便地接触语言建模技术前沿。Ollama包容性的方法简化了探索和使用该领域最新进展的过程，使其成为那些热衷于保持在AI研究和开发前沿的用户的理想平台。
Ollama没有官方的WebUI，但有几个可用的WebUI选项可以使用。其中一个选项是Ollama WebUI

10.8.2. LLamaFactory¶

备注

详见 ui

10.8.3. Xinference(Xorbits Inference)¶

https://github.com/xorbitsai/inference
https://inference.readthedocs.io/
一个开源平台，用于简化各种 AI 模型的运行和集成。借助 Xinference，您可以使用任何开源 LLM、嵌入模型和多模态模型在云端或本地环境中运行推理，并创建强大的 AI 应用。
支持的模型列表: https://inference.readthedocs.io/en/latest/models/builtin/index.html

10.8.4. LMDeploy¶

GitHub: https://github.com/InternLM/lmdeploy
官网: lmdeploy.readthedocs.io/en/latest/
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
LMDeploy 由 MMDeploy 和 MMRazor 团队联合开发，是涵盖了 LLM 任务的全套轻量化、部署和服务解决方案。

10.8.5. localai¶

https://localai.io/
https://github.com/mudler/LocalAI
简介: The free, Open Source OpenAI alternative. Self-hosted, community-driven and local-first. Drop-in replacement for OpenAI running on consumer-grade hardware. No GPU required. Runs gguf, transformers, diffusers and many more models architectures. It allows to generate Text, Audio, Video, Images. Also with voice cloning capabilities.
定位: LocalAI 是一个开源的、本地运行的OpenAI替代品，提供与OpenAI API规范兼容的REST API。
GPU加速：它可以在没有GPU加速的情况下运行，但如果有的话可以利用它。利用GPU加速可以提高计算速度和能效。这种设置还可以适应大型LLM模型。
密集型模型管理：LocalAI处理大型语言模型的方法涉及一种手动、详细的方法论。用户需要直接与AutoGPTQ、RWKV、llama.cpp和vLLM等各种后端系统进行交互，这使得可以进行更多的定制和优化。这种管理风格要求细致的配置、定期更新和维护，因此需要更高的技术水平。它提供了对模型的增强控制，使用户可以精确地将其定制以满足特定需求并实现最佳性能。

特征:

* Local, OpenAI drop-in alternative REST API. You own your data.
* NO GPU required. NO Internet access is required either
    * Optional, GPU Acceleration is available.
* Supports multiple models
* 🏃 Once loaded the first time, it keep models loaded in memory for faster inference
* ⚡ Doesn’t shell-out, but uses bindings for a faster inference and better performance.

quick-start:

docker run -ti --name local-ai -p 8080:8080 localai/localai:latest-aio-cpu
# or, if you have an Nvidia GPU:
# docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-aio-gpu-nvidia-cuda-12

10.8.6. FastChat¶

https://github.com/lm-sys/FastChat
An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
支持的模型列表: https://github.com/lm-sys/FastChat/blob/main/docs/model_support.md

10.8.7. One API¶

https://github.com/songquanpeng/one-api

10.8.8. 其他¶

Jan¶

Jan(offline): https://jan.ai/
https://github.com/janhq/jan
定位: Jan AI 旨在将用户的计算机转变为AI计算机，提供本地和远程API连接的能力。
Jan支持Mac M1/ M2/ M3、Mac (Intel)、Windows、Linux系统。除了本地模型，它也支持OpenAI API。
linux环境需要使用 pwsh 命令

LM Studio¶

https://lmstudio.ai/
https://github.com/lmstudio-ai
Any model file in one of the supportedarchitectures converted to gguf will work in LMStudio.
是桌面版应用，非开源项目

区别¶

特性	Jan	Ollama
支持的 LLM 引擎	llama.cpp, TensorRT-LLM	llama.cpp,Triton, ONNX Runtime
性能	更快	在某些情况下可能更快
易用性	命令行和 Web 界面	主要命令行界面
灵活性	更灵活	更简单
社区	较小但活跃	较大且拥有更多资源

Jan AI 强调本地运行和隐私保护，同时提供远程API连接能力，具有跨平台特性和可定制性。
Ollama 似乎更专注于提供大型语言模型的运行环境，并且允许用户进行自定义。
LocalAI 是一个社区驱动的项目，旨在提供一个OpenAI的开源替代品，具有不需要GPU和互联网的独特优势，支持多种AI功能，并且对性能进行了优化。