Hugging Face Hub ################ * Hugging Face Hub 是一个拥有超过 120k 模型、20k 数据集和 50k 演示应用程序 (Spaces) 的平台 Repositories ============ * Models, Spaces, and Datasets are hosted on the Hugging Face Hub as Git repositories, which means that version control and collaboration are core elements of the Hub. Cloning repositories 克隆存储库:: git clone https://huggingface.co// git clone git@hf.co:/ Models ====== 常见的机器学习文档工具包括:: - 模型卡(Model Cards):自动生成描述模型性能、用途、局限性等信息的标准文档。 - 数据表(Data Sheets):自动生成数据集的元数据、组成、收集方法等信息的标准文档。 - 报告生成器(Reporting):根据模型试验结果自动生成试验报告。 - 文档extractors:从代码注释、函数签名等自动提取文档。 - 可视化工具(Visualization):通过关系图、流程图等可视化展示系统结构。 - 笔记本工具(Notebooks):像Jupyter等笔记本可以嵌入文档、代码、结果展示。 Integrated Libraries -------------------- HuggingFace Supported Libraries:: +----------------------------+---------------+---------+-------------------+-------------+ | Library | Inference API | Widgets | Download from Hub | Push to Hub | +============================+===============+=========+===================+=============+ | 🤗 Transformers | ✅ | ✅ | ✅ | ✅ | +----------------------------+---------------+---------+-------------------+-------------+ | 🤗 Diffusers | ❌ | ❌ | ✅ | ✅ | +----------------------------+---------------+---------+-------------------+-------------+ | Adapter Transformers | ❌ | ❌ | ✅ | ✅ | +----------------------------+---------------+---------+-------------------+-------------+ | AllenNLP | ✅ | ✅ | ✅ | ❌ | +----------------------------+---------------+---------+-------------------+-------------+ | Asteroid | ✅ | ✅ | ✅ | ❌ | +----------------------------+---------------+---------+-------------------+-------------+ | BERTopic | ✅ | ✅ | ✅ | ✅ | +----------------------------+---------------+---------+-------------------+-------------+ | docTR | ✅ | ✅ | ✅ | ❌ | +----------------------------+---------------+---------+-------------------+-------------+ | ESPnet | ✅ | ✅ | ✅ | ❌ | +----------------------------+---------------+---------+-------------------+-------------+ | fastai | ✅ | ✅ | ✅ | ✅ | +----------------------------+---------------+---------+-------------------+-------------+ | Keras | ❌ | ❌ | ✅ | ✅ | +----------------------------+---------------+---------+-------------------+-------------+ | Flair | ✅ | ✅ | ✅ | ❌ | +----------------------------+---------------+---------+-------------------+-------------+ | MBRL-Lib | ❌ | ❌ | ✅ | ✅ | +----------------------------+---------------+---------+-------------------+-------------+ | ML-Agents | ❌ | ❌ | ✅ | ✅ | +----------------------------+---------------+---------+-------------------+-------------+ | NeMo | ✅ | ✅ | ✅ | ❌ | +----------------------------+---------------+---------+-------------------+-------------+ | PaddleNLP | ✅ | ✅ | ✅ | ✅ | +----------------------------+---------------+---------+-------------------+-------------+ | Pyannote | ❌ | ❌ | ✅ | ❌ | +----------------------------+---------------+---------+-------------------+-------------+ | PyCTCDecode | ❌ | ❌ | ✅ | ❌ | +----------------------------+---------------+---------+-------------------+-------------+ | Pythae | ❌ | ❌ | ✅ | ✅ | +----------------------------+---------------+---------+-------------------+-------------+ | RL-Baselines3-Zoo | ❌ | ✅ | ✅ | ✅ | +----------------------------+---------------+---------+-------------------+-------------+ | Sample Factory | ❌ | ✅ | ✅ | ✅ | +----------------------------+---------------+---------+-------------------+-------------+ | Sentence Transformers | ✅ | ✅ | ✅ | ✅ | +----------------------------+---------------+---------+-------------------+-------------+ | spaCy | ✅ | ✅ | ✅ | ✅ | +----------------------------+---------------+---------+-------------------+-------------+ | SpanMarker | ✅ | ✅ | ✅ | ✅ | +----------------------------+---------------+---------+-------------------+-------------+ | Scikit Learn (using skops) | ✅ | ✅ | ✅ | ✅ | +----------------------------+---------------+---------+-------------------+-------------+ | Speechbrain | ✅ | ✅ | ✅ | ❌ | +----------------------------+---------------+---------+-------------------+-------------+ | Stable-Baselines3 | ❌ | ✅ | ✅ | ✅ | +----------------------------+---------------+---------+-------------------+-------------+ | TensorFlowTTS | ❌ | ❌ | ✅ | ❌ | +----------------------------+---------------+---------+-------------------+-------------+ | Timm | ✅ | ✅ | ✅ | ✅ | +----------------------------+---------------+---------+-------------------+-------------+ | Transformers.js | ❌ | ❌ | ✅ | ❌ | +----------------------------+---------------+---------+-------------------+-------------+ HuggingFace Supported Libraries Description `各lib库具体链接参见 `_ :: Library Description 🤗 Transformers State-of-the-art Natural Language Processing for Pytorch, TensorFlow, and JAX 🤗 Diffusers A modular toolbox for inference and training of diffusion models Adapter Transformers Extends 🤗Transformers with Adapters. AllenNLP An open-source NLP research library, built on PyTorch. Asteroid Pytorch-based audio source separation toolkit BERTopic BERTopic is a topic modeling library for text and images docTR Models and datasets for OCR-related tasks in PyTorch & TensorFlow ESPnet End-to-end speech processing toolkit (e.g. TTS) fastai Library to train fast and accurate models with state-of-the-art outputs. Keras Library that uses a consistent and simple API to build models leveraging TensorFlow and its ecosystem. Flair Very simple framework for state-of-the-art NLP. MBRL-Lib PyTorch implementations of MBRL Algorithms. ML-Agents Enables games and simulations made with Unity to serve as environments for training intelligent agents. NeMo Conversational AI toolkit built for researchers PaddleNLP Easy-to-use and powerful NLP library built on PaddlePaddle Pyannote Neural building blocks for speaker diarization. PyCTCDecode Language model supported CTC decoding for speech recognition Pythae Unifyed framework for Generative Autoencoders in Python RL-Baselines3-Zoo Training framework for Reinforcement Learning, using Stable Baselines3. Sample Factory Codebase for high throughput asynchronous reinforcement learning. Sentence Transformers Compute dense vector representations for sentences, paragraphs, and images. spaCy Advanced Natural Language Processing in Python and Cython. SpanMarker Familiar, simple and state-of-the-art Named Entity Recognition. ScikitLearn(using skops) Machine Learning in Python. Speechbrain A PyTorch Powered Speech Toolkit. Stable-Baselines3 Set of reliable implementations of deep reinforcement learning algorithms in PyTorch TensorFlowTTS Real-time state-of-the-art speech synthesis architectures. Timm Collection of image models, scripts, pretrained weights, etc. Transformers.js State-of-the-art Machine Learning for the web. Run 🤗 Transformers directly in your browser, with no need for a server! Transformers ^^^^^^^^^^^^ * transformers is a library with state-of-the-art Machine Learning for Pytorch, TensorFlow and JAX. It provides thousands of pretrained models to perform tasks on different modalities such as text, vision, and audio. You can find models for many different tasks:: 01. Extracting the answer from a context (question-answering). 02. Creating summaries from a large text (summarization). 03. Classify text (e.g. as spam or not spam, text-classification). 04. Generate a new text with models such as GPT (text-generation). 05. Identify parts of speech (verb, subject, etc.) or entities (country, organization, etc.) in a sentence (token-classification). 06. Transcribe audio files to text (automatic-speech-recognition). 07. Classify the speaker or language in an audio file (audio-classification). 08. Detect objects in an image (object-detection). 09. Segment an image (image-segmentation). 10. Do Reinforcement Learning (reinforcement-learning) Using existing models:: # With pipeline, just specify the task and the model id from the Hub. from transformers import pipeline pipe = pipeline("text-generation", model="distilgpt2") # If you want more control, you will need to define the tokenizer and model. from transformers import AutoTokenizer, AutoModelForCausalLM tokenizer = AutoTokenizer.from_pretrained("distilgpt2") model = AutoModelForCausalLM.from_pretrained("distilgpt2") * 其他资源 * Transformers library: https://github.com/huggingface/transformers * Transformers docs: https://huggingface.co/docs/transformers/index * Share a model guide: https://huggingface.co/docs/transformers/model_sharing Adapter Transformers ^^^^^^^^^^^^^^^^^^^^ 安装:: pip install -U adapter-transformers * ``adapter-transformers`` is a library that extends 🤗 transformers by allowing to integrate, train and use Adapters and other efficient fine-tuning methods. The library is fully compatible with 🤗 transformers. Adapters are small learnt layers inserted within each layer of a pre-trained model. You can learn more about this in the `original paper `_. * Adapter Hub repository: https://github.com/adapter-hub/hub * AdapterHub: https://adapterhub.ml/explore/ * official guide: https://docs.adapterhub.ml/index.html * Adapter Transformers library: https://github.com/adapter-hub/adapter-transformers * Integration with Hub docs: https://docs.adapterhub.ml/huggingface_hub.html 示例:: from transformers import AutoModelWithHeads model = AutoModelWithHeads.from_pretrained("bert-base-uncased") adapter_name = model.load_adapter("AdapterHub/bert-base-uncased-pf-emotion", source="hf") model.active_adapters = adapter_name find all Adapter Models programmatically:: from transformers import list_adapters # source can be "ah" (AdapterHub), "hf" (hf.co) or None (for both, default) adapter_infos = list_adapters(source="hf", model_name="bert-base-uncased") AllenNLP ^^^^^^^^ * allennlp is a NLP library for developing state-of-the-art models on different linguistic tasks. It provides high-level abstractions and APIs for common components and models in modern NLP. It also provides an extensible framework that makes it easy to run and manage NLP experiments. .. image:: https://img.zhaoweiguo.com/uPic/2023/07/cWfYz0.jpg :width: 50% :height: 300px * AllenNLP website: https://allenai.org/allennlp * AllenNLP repository: https://github.com/allenai/allennlp 示例:: import allennlp_models from allennlp.predictors.predictor import Predictor predictor = Predictor.from_path("hf://allenai/bidaf-elmo") predictor_input = { "passage": "My name is Wolfgang and I live in Berlin", "question": "Where do I live?" } predictions = predictor.predict_json(predictor_input) Asteroid ^^^^^^^^ * asteroid is a Pytorch toolkit for audio source separation. It enables fast experimentation on common datasets with support for a large range of datasets and recipes to reproduce papers. * models page: https://huggingface.co/models?filter=asteroid .. figure:: https://img.zhaoweiguo.com/uPic/2023/07/V6D7ii.jpg :width: 400px * Asteroid website: https://asteroid-team.github.io/ * Asteroid library: https://github.com/asteroid-team/asteroid * Integration docs: https://github.com/asteroid-team/asteroid/blob/master/docs/source/readmes/pretrained_models.md 示例:: from asteroid.models import ConvTasNet model = ConvTasNet.from_pretrained('mpariente/ConvTasNet_WHAM_sepclean') ESPnet ^^^^^^ * espnet is an end-to-end toolkit for speech processing, including automatic speech recognition, text to speech, speech enhancement, dirarization and other tasks. .. figure:: https://img.zhaoweiguo.com/uPic/2023/07/EPEBlv.jpg :width: 400px * ESPnet model zoo repository: https://github.com/espnet/espnet_model_zoo * ESPnet docs: https://espnet.github.io/espnet/index.html * Integration docs: https://github.com/asteroid-team/asteroid/blob/master/docs/source/readmes/pretrained_models.md 示例:: import soundfile from espnet2.bin.tts_inference import Text2Speech text2speech = Text2Speech.from_pretrained("model_name") speech = text2speech("foobar")["wav"] soundfile.write("out.wav", speech.numpy(), text2speech.fs, "PCM_16") fastai ^^^^^^ * fastai is an open-source Deep Learning library that leverages PyTorch and Python to provide high-level components to train fast and accurate neural networks with state-of-the-art outputs on text, vision, and tabular data. * models page: https://huggingface.co/models?library=fastai&sort=downloads 安装:: pip install huggingface_hub["fastai"] * fastai course: https://course.fast.ai/ * fastai website: https://www.fast.ai/ * Integration with Hub docs: https://docs.fast.ai/huggingface.html * Integration with Hub announcement: https://huggingface.co/blog/fastai 示例:: from huggingface_hub import from_pretrained_fastai learner = from_pretrained_fastai("espejelomar/identify-my-cat") _,_,probs = learner.predict(img) print(f"Probability it's a cat: {100*probs[1].item():.2f}%") # Probability it's a cat: 100.00% Keras ^^^^^ * keras is an open-source machine learning library that uses a consistent and simple API to build models leveraging TensorFlow and its ecosystem. * Keras Developer Guides: https://keras.io/guides/ * Keras examples: https://keras.io/examples/ * Keras examples on 🤗 Hub: https://huggingface.co/keras-io * Keras learning resources: https://keras.io/getting_started/learning_resources/#moocs 示例:: from huggingface_hub import from_pretrained_keras model = from_pretrained_keras("keras-io/mobile-vit-xxs") prediction = model.predict(image) prediction = tf.squeeze(tf.round(prediction)) print(f'The image is a {classes[(np.argmax(prediction))]}!') # The image is a sunflower! ML-Agents ^^^^^^^^^ * ml-agents is an open-source toolkit that enables games and simulations made with Unity to serve as environments for training intelligent agents. 安装:: # Clone the repository git clone https://github.com/Unity-Technologies/ml-agents # Go inside the repository and install the package cd ml-agents pip3 install -e ./ml-agents-envs pip3 install -e ./ml-agents * ML-Agents documentation: https://github.com/Unity-Technologies/ml-agents/blob/develop/docs/Hugging-Face-Integration.md * Official Unity ML-Agents Spaces demos: https://huggingface.co/unity 示例:: mlagents-load-from-hf --repo-id="Art-phys/poca-SoccerTwos_500M" --local-dir="./downloads" PaddleNLP ^^^^^^^^^ * Leveraging the PaddlePaddle framework, PaddleNLP is an easy-to-use and powerful NLP library with awesome pre-trained model zoo, supporting wide-range of NLP tasks from research to industrial applications. 安装:: pip install -U paddlenlp 示例:: from paddlenlp.transformers import AutoTokenizer, AutoModelForMaskedLM tokenizer = AutoTokenizer.from_pretrained("PaddlePaddle/ernie-1.0-base-zh", from_hf_hub=True) model = AutoModelForMaskedLM.from_pretrained("PaddlePaddle/ernie-1.0-base-zh", from_hf_hub=True) tokenizer.save_to_hf_hub(repo_id="/") model.save_to_hf_hub(repo_id="/") RL-Baselines3-Zoo ^^^^^^^^^^^^^^^^^ * rl-baselines3-zoo is a training framework for Reinforcement Learning using Stable Baselines3. * RL-Baselines3-Zoo official trained models: https://huggingface.co/sb3 * RL-Baselines3-Zoo documentation: https://github.com/DLR-RM/rl-baselines3-zoo 示例:: # Download ppo SpaceInvadersNoFrameskip-v4 model and save it into the logs/ folder python -m rl_zoo3.load_from_hub --algo dqn --env SpaceInvadersNoFrameskip-v4 -f logs/ -orga sb3 python enjoy.py --algo dqn --env SpaceInvadersNoFrameskip-v4 -f logs/ Sample Factory ^^^^^^^^^^^^^^ * sample-factory is a codebase for high throughput asynchronous reinforcement learning. It has integrations with the Hugging Face Hub to share models with evaluation results and training metrics. * Repository: https://github.com/alex-petrenko/sample-factory 安装:: pip install sample-factory 示例:: python -m sample_factory.huggingface.load_from_hub -r -d Sentence Transformers ^^^^^^^^^^^^^^^^^^^^^ * sentence-transformers is a library that provides easy methods to compute embeddings (dense vector representations) for sentences, paragraphs and images. Texts are embedded in a vector space such that similar text is close, which enables applications such as semantic search, clustering, and retrieval. .. figure:: https://img.zhaoweiguo.com/uPic/2023/07/WkkFtA.jpg :width: 400px * Sentence Transformers library: https://github.com/UKPLab/sentence-transformers * Sentence Transformers docs: https://www.sbert.net/ * Integration with Hub announcement: https://huggingface.co/blog/sentence-transformers-in-the-hub 示例:: from sentence_transformers import SentenceTransformer, util model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1') query_embedding = model.encode('How big is London') passage_embedding = model.encode(['London has 9,787,426 inhabitants at the 2011 census', 'London is known for its finacial district']) print("Similarity:", util.dot_score(query_embedding, passage_embedding)) spaCy ^^^^^ * spaCy is a popular library for advanced Natural Language Processing used widely across industry. spaCy makes it easy to use and train pipelines for tasks like named entity recognition, text classification, part of speech tagging and more, and lets you build powerful applications to process and analyze large volumes of text. 安装:: pip install spacy-huggingface-hub 示例:: # Using spacy.load(). import spacy nlp = spacy.load("en_core_web_sm") # Importing as module. import en_core_web_sm nlp = en_core_web_sm.load() SpanMarker ^^^^^^^^^^ * SpanMarker is a framework for training powerful Named Entity Recognition models using familiar encoders such as BERT, RoBERTa and DeBERTa. Tightly implemented on top of the 🤗 Transformers library, SpanMarker can take good advantage of it. As a result, SpanMarker will be intuitive to use for anyone familiar with Transformers. 安装:: pip install -U span_marker * SpanMarker repository: https://github.com/tomaarsen/SpanMarkerNER * SpanMarker docs: https://tomaarsen.github.io/SpanMarkerNER 示例:: from span_marker import SpanMarkerModel model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-fewnerd-fine-super") SpeechBrain ^^^^^^^^^^^ * speechbrain is an open-source and all-in-one conversational toolkit for audio/speech. The goal is to create a single, flexible, and user-friendly toolkit that can be used to easily develop state-of-the-art speech technologies, including systems for speech recognition, speaker recognition, speech enhancement, speech separation, language identification, multi-microphone signal processing, and many others. 示例:: import torchaudio from speechbrain.pretrained import EncoderClassifier classifier = EncoderClassifier.from_hparams( source="speechbrain/urbansound8k_ecapa" ) out_prob, score, index, text_lab = classifier.classify_file('speechbrain/urbansound8k_ecapa/dog_bark.wav') * SpeechBrain website: https://speechbrain.github.io/ * SpeechBrain docs: https://speechbrain.readthedocs.io/en/latest/index.html Stable-Baselines3 ^^^^^^^^^^^^^^^^^ * stable-baselines3 is a set of reliable implementations of reinforcement learning algorithms in PyTorch. 安装:: pip install stable-baselines3 pip install huggingface-sb3 * Hugging Face Stable-Baselines3 documentation: https://github.com/huggingface/huggingface_sb3#hugging-face--x-stable-baselines3-v20 * Stable-Baselines3 documentation: https://stable-baselines3.readthedocs.io/en/master/ Stanza ^^^^^^ * stanza is a collection of accurate and efficient tools for the linguistic analysis of many human languages. Starting from raw text to syntactic analysis and entity recognition, Stanza brings state-of-the-art NLP models to languages of your choosing. * stanza docs: https://stanfordnlp.github.io/stanza/ 示例:: import stanza nlp = stanza.Pipeline('en') # download th English model and initialize an English neural pipeline doc = nlp("Barack Obama was born in Hawaii.") # run annotation over a sentence TensorBoard ^^^^^^^^^^^ * TensorBoard provides tooling for tracking and visualizing metrics as well as visualizing models. All repositories that contain TensorBoard traces have an automatic tab with a hosted TensorBoard instance for anyone to check it out without any additional effort! * TensorBoard documentation: https://www.tensorflow.org/tensorboard timm ^^^^ * timm, also known as pytorch-image-models, is an open-source collection of state-of-the-art PyTorch image models, pretrained weights, and utility scripts for training, inference, and validation. .. figure:: https://img.zhaoweiguo.com/uPic/2023/07/neP2Ex.jpg :width: 400px 示例:: import timm # Loading https://huggingface.co/timm/eca_nfnet_l0 model = timm.create_model("hf-hub:timm/eca_nfnet_l0", pretrained=True) * timm (pytorch-image-models) GitHub Repo: https://github.com/rwightman/pytorch-image-models * timm documentation: https://huggingface.co/docs/timm Transformers.js ^^^^^^^^^^^^^^^ * Transformers.js is a JavaScript library for running 🤗 Transformers directly in your browser, with no need for a server! It is designed to be functionally equivalent to the original Python library, meaning you can run the same pretrained models using a very similar API. 安装:: npm i @xenova/transformers * Transformers.js repository: https://github.com/xenova/transformers.js * Transformers.js docs: https://huggingface.co/docs/transformers.js * Transformers.js demo: https://xenova.github.io/transformers.js/ 示例:: // Use a different model for sentiment-analysis let pipe = await pipeline('sentiment-analysis', 'nlptown/bert-base-multilingual-uncased-sentiment'); Widgets 部件 ------------ Widgets examples ^^^^^^^^^^^^^^^^ * `Named Entity Recognition `_ using `spaCy `_. * `Image Classification `_ using `🤗Transformers `_ * `Text to Speech `_ using `ESPnet `_. * `Sentence Similarity `_ using `Sentence Transformers `_. * 详细内容: https://huggingface.co/docs/hub/models-widgets#whats-a-widget some links to examples ^^^^^^^^^^^^^^^^^^^^^^ * text-classification, for instance `roberta-large-mnli `_ * token-classification, for instance `dbmdz/bert-large-cased-finetuned-conll03-english `_ * question-answering, for instance `distilbert-base-uncased-distilled-squad `_ * translation, for instance `t5-base `_ * summarization, for instance `facebook/bart-large-cnn `_ * conversational, for instance `facebook/blenderbot-400M-distill `_ * text-generation, for instance `gpt2 `_ * fill-mask, for instance `distilroberta-base `_ * zero-shot-classification (implemented on top of a nli text-classification model), for instance `facebook/bart-large-mnli `_ * table-question-answering, for instance `google/tapas-base-finetuned-wtq `_ * sentence-similarity, for instance `osanseviero/full-sentence-distillroberta2 `_ * 详细内容: https://huggingface.co/docs/hub/models-widgets#what-are-all-the-possible-taskwidget-types Datasets ======== * 🤗 Datasets documentation: https://huggingface.co/docs/datasets/index * dataset card: https://huggingface.co/docs/hub/datasets-cards * main datasets page: https://huggingface.co/datasets .. note:: 打开指定数据集时,可以点击按钮``Use in dataset library``查看使用方法 Using Datasets -------------- * Some datasets on the Hub contain a loading script, which allows you to easily load the dataset when you need it. * Many datasets however do not need to include a loading script, for instance when their data is stored directly in the repository in formats such as CSV, JSON and Parquet. 🤗 Datasets can load those kinds of datasets automatically without a loading script. * tutorials: https://huggingface.co/docs/datasets/tutorial * how-to guides: https://huggingface.co/docs/datasets/how_to * 更多详情参见: Datasets documentation Adding new datasets ------------------- * 三种方法: * Add files manually to the repository through the UI * Push files with the ``push_to_hub`` method from 🤗 Datasets * Use Git to commit and push your dataset files Spaces ====== * Hugging Face Spaces: https://huggingface.co/spaces