6.3.8. Text Generation Inference¶
From: https://huggingface.co/docs/text-generation-inference/index
GitHub: https://github.com/huggingface/text-generation-inference/
Hugging Face also provides Text Generation Inference (TGI), a library dedicated to deploying and serving highly optimized LLMs for inference. It includes deployment-oriented optimization features not included in Transformers, such as continuous batching for increasing throughput and tensor parallelism for multi-GPU inference.
Note
In essence, this is an application that packages LLM inference for serving.
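The tensor parallelism mentioned above is configured at launch time. A minimal sketch, assuming a machine with two GPUs, using the launcher's --num-shard option to split the model across them (model ID reused from the quick tour below):

# Hypothetical multi-GPU launch: shard the model across 2 GPUs (tensor parallelism)
model=teknium/OpenHermes-2.5-Mistral-7B
volume=$PWD/data
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:2.4.0 \
    --model-id $model \
    --num-shard 2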
Getting started¶
Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and T5.
Quick Tour¶
Launching TGI:
model=teknium/OpenHermes-2.5-Mistral-7B
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:2.4.0 \
    --model-id $model
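Before sending requests, you can wait for the server to finish loading; TGI exposes a /health route that returns 200 once the model is ready. A quick sanity check:

# prints 200 when the model is loaded and the server is ready
curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:8080/health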
Consuming TGI:
import requests
headers = {
    "Content-Type": "application/json",
}
data = {
    'inputs': 'What is Deep Learning?',
    'parameters': {
        'max_new_tokens': 20,
    },
}
response = requests.post('http://127.0.0.1:8080/generate', headers=headers, json=data)
print(response.json())
# {'generated_text': '\n\nDeep Learning is a subset of Machine Learning that is concerned with the development of algorithms that can'}
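TGI also supports token streaming with the same payload shape via the /generate_stream route, which returns server-sent events. A minimal sketch with curl (-N disables output buffering so tokens print as they arrive):

curl -N http://127.0.0.1:8080/generate_stream \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 20}}'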
Installation from source¶
Install CLI:
git clone https://github.com/huggingface/text-generation-inference.git && cd text-generation-inference
make install
Note
protobuf and Rust must be installed first.
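A minimal sketch of installing those prerequisites on a Debian/Ubuntu machine (package names differ on other distributions); once make install completes, the server is started through the text-generation-launcher CLI:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh   # install Rust via rustup
sudo apt-get install -y protobuf-compiler                        # install protobuf
# after `make install`, launch a model directly:
text-generation-launcher --model-id teknium/OpenHermes-2.5-Mistral-7B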