6.3.4. Datasets

  • Datasets is a library for easily accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP) tasks.

Install with pip:

pip install datasets
# check
python -c "from datasets import load_dataset; print(load_dataset('squad', split='train')[0])"

# Audio
pip install 'datasets[audio]'
# check
python -c "import soundfile; print(soundfile.__libsndfile_version__)"

# Vision
pip install 'datasets[vision]'

Install from source:

git clone https://github.com/huggingface/datasets.git
cd datasets
pip install -e .
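
To confirm the editable install is being picked up, a quick check of the installed version:

python -c "import datasets; print(datasets.__version__)"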

TUTORIALS

Load a dataset from the Hub

Load a dataset

load_dataset_builder():

>>> from datasets import load_dataset_builder
>>> ds_builder = load_dataset_builder("rotten_tomatoes")

# Inspect dataset description
>>> ds_builder.info.description
Movie Review Dataset.
This is a dataset of containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews.
This data was first used in Bo Pang and Lillian Lee,
        ``Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales.'', Proceedings of the ACL, 2005.


# Inspect dataset features
>>> ds_builder.info.features
{'label': ClassLabel(num_classes=2, names=['neg', 'pos'], id=None),
 'text': Value(dtype='string', id=None)}

load_dataset():

>>> from datasets import load_dataset
>>> dataset = load_dataset("rotten_tomatoes", split="train")
>>> dataset
Dataset({
    features: ['text', 'label'],
    num_rows: 8530
})

Splits

get_dataset_split_names():

>>> from datasets import get_dataset_split_names
>>> get_dataset_split_names("rotten_tomatoes")
['train', 'validation', 'test']

The load_dataset() call above specified a split; if no split is specified, a DatasetDict with all splits is returned:

>>> from datasets import load_dataset
>>> dataset = load_dataset("rotten_tomatoes")
>>> dataset
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})
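
A DatasetDict is indexed by split name first, then by row; a minimal check, assuming the dataset loaded above:

>>> dataset["train"][0]
{'label': 1, 'text': 'the rock is des...'}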

Configurations

get_dataset_config_names():

>>> from datasets import get_dataset_config_names
>>> configs = get_dataset_config_names("PolyAI/minds14")
>>> print(configs)
['cs-CZ', 'de-DE', 'en-AU' ...]

Then load the configuration you want:

from datasets import load_dataset
mindsFR = load_dataset("PolyAI/minds14", "fr-FR", split="train")

Know your dataset

Dataset

Load the data:

from datasets import load_dataset
dataset = load_dataset("rotten_tomatoes", split="train")

Indexing:

>>> dataset[0]
{'label': 1,
 'text': 'the rock is des...'}

# a list of all the values in the column
>>> dataset["text"]
[
        'the rock is desti...',
        'the gorgeously elabora...',
        ...,
        'things really ...'
]

Slicing:

>>> dataset[:3]
{'label': [1, 1, 1],
 'text':
        [
                'the rock is desti...',
                'the gorgeously elabora...',
                ...,
                'things really ...'
        ]
}
>>> dataset[3:6]
...

IterableDataset

An IterableDataset is loaded when you set the streaming parameter to True in load_dataset():

>>> from datasets import load_dataset
>>> iterable_dataset = load_dataset("food101", split="train", streaming=True)
>>> for example in iterable_dataset:
...    print(example)
...    break
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x7F0681F5C520>, 'label': 6}


>>> next(iter(iterable_dataset))
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x7F0681F59B50>,
 'label': 6}

>>> for example in iterable_dataset:
...    print(example)
...    break
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x7F7479DE82B0>, 'label': 6}

Use IterableDataset.take() to return a subset of the dataset with a specific number of examples:

# Get first three examples
>>> list(iterable_dataset.take(3))
[{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x7F7479DEE9D0>,
  'label': 6},
 {'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x512 at 0x7F7479DE8190>,
  'label': 6},
 {'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x383 at 0x7F7479DE8310>,
  'label': 6}]

Preprocess

Tokenize text

Start by loading the rotten_tomatoes dataset and the tokenizer corresponding to a pretrained BERT model:

>>> from transformers import AutoTokenizer
>>> from datasets import load_dataset

>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
>>> dataset = load_dataset("rotten_tomatoes", split="train")

Call your tokenizer on the first row of text in the dataset:

>>> tokenizer(dataset[0]["text"])
{
 'input_ids': [101, 1103, 2067 ...],
 'token_type_ids': [0, 0, 0, 0, 0 ...],
 'attention_mask': [1, 1, 1, 1, 1 ...]
}
Notes:
        input_ids: the numbers representing the tokens in the text.
        token_type_ids: indicates which sequence a token belongs to if there is more than one sequence.
        attention_mask: indicates whether a token should be masked or not.
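
To see what these fields encode, you can map the ids back to tokens; a small sketch, assuming the tokenizer and dataset loaded above (the variable name encoded is just for illustration):

encoded = tokenizer(dataset[0]["text"])
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))  # subword tokens, starting with [CLS]
print(tokenizer.decode(encoded["input_ids"]))                 # back to a readable string, special tokens included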

The fastest way to tokenize your entire dataset is to use the map() function:

def tokenization(example):
    return tokenizer(example["text"])

dataset = dataset.map(tokenization, batched=True)

Set the format of your dataset to be compatible with your machine learning framework:

# Rename "label" to "labels", the input name expected by transformers models
dataset = dataset.rename_column("label", "labels")

# Pytorch
dataset.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "labels"])
dataset.format['type']

# TensorFlow
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
tf_dataset = dataset.to_tf_dataset(
    columns=["input_ids", "token_type_ids", "attention_mask"],
    label_cols=["labels"],
    batch_size=2,
    collate_fn=data_collator,
    shuffle=True
)
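
With the format set to "torch", the dataset can be passed to a standard PyTorch DataLoader; a minimal sketch, assuming the tokenized dataset above, with DataCollatorWithPadding padding each batch (the batch_size of 8 is arbitrary):

from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="pt")
dataloader = DataLoader(dataset, batch_size=8, collate_fn=data_collator)
batch = next(iter(dataloader))
print(batch["input_ids"].shape)  # (8, longest sequence in the batch)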

Resample audio signals

Start by loading the MInDS-14 dataset:

from transformers import AutoFeatureExtractor
from datasets import load_dataset, Audio

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
dataset = load_dataset("PolyAI/minds14", "en-US", split="train")

When you call the audio column of the dataset, it is automatically decoded and resampled:

>>> dataset[0]["audio"]
{'array': array([ 0.        ,  0.00024414, -0.00024414, ..., -0.00024414,
         0.        ,  0.        ], dtype=float32),
 'path': '~/.cache/huggingface/datasets/downloads/extracted/xxxxx/en-US~JOINT_ACCOUNT/xxx.wav',
 'sampling_rate': 8000}

Use the cast_column() function and set the sampling_rate parameter in the Audio feature to upsample the audio signal. When you call the audio column now, it is decoded and resampled to 16kHz:

>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
>>> dataset[0]["audio"]
{'array': array([ 2.3443763e-05,  2.1729663e-04,  2.2145823e-04, ...,
         3.8356509e-05, -7.3497440e-06, -2.1754686e-05], dtype=float32),
 'path': '~/.cache/huggingface/datasets/downloads/extracted/xxx/en-US~JOINT_ACCOUNT/xxx.wav',
 'sampling_rate': 16000}

Use the map() function to resample the entire dataset to 16kHz:

def preprocess_function(examples):
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays, sampling_rate=feature_extractor.sampling_rate, max_length=16000, truncation=True
    )
    return inputs

dataset = dataset.map(preprocess_function, batched=True)

Apply data augmentations

  1. Start by loading the Beans dataset, the Image feature, and the feature extractor corresponding to a pretrained ViT model:

    from transformers import AutoFeatureExtractor
    from datasets import load_dataset, Image
    
    feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
    dataset = load_dataset("beans", split="train")
    
  2. When you call the image column of the dataset, the underlying PIL object is automatically decoded into an image:

    dataset[0]["image"]
    
  3. Apply some transforms to the image:

    from torchvision.transforms import RandomRotation
    
    rotate = RandomRotation(degrees=(0, 90))
    def transforms(examples):
        examples["pixel_values"] = [rotate(image.convert("RGB")) for image in examples["image"]]
        return examples
    
  4. Use the set_transform() function to apply the transform on-the-fly:

    dataset.set_transform(transforms)
    dataset[0]["pixel_values"]
    

Evaluate predictions

You can see what metrics are available with list_metrics():

from datasets import list_metrics
metrics_list = list_metrics()
len(metrics_list)
print(metrics_list)

Load metric:

# Load a metric from the Hub with load_metric()
from datasets import load_metric
metric = load_metric('glue', 'mrpc')

Metrics object:

>>> print(metric.inputs_description)
Compute GLUE evaluation metric associated to each GLUE dataset.
Args:
    predictions: list of predictions to score.
        Each translation should be tokenized into a list of tokens.
    references: list of lists of references for each translation.
        Each reference should be tokenized into a list of tokens.
Returns: depending on the GLUE subset, one or several of:
    "accuracy": Accuracy
    "f1": F1 score
    "pearson": Pearson Correlation
    "spearmanr": Spearman Correlation
    "matthews_correlation": Matthew Correlation

Examples:
        # 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
    >>> glue_metric = datasets.load_metric('glue', 'sst2')
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'accuracy': 1.0}

    >>> glue_metric = datasets.load_metric('glue', 'mrpc')  # 'mrpc' or 'qqp'
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'accuracy': 1.0, 'f1': 1.0}

    >>> glue_metric = datasets.load_metric('glue', 'stsb')
    >>> references = [0., 1., 2., 3., 4., 5.]
    >>> predictions = [0., 1., 2., 3., 4., 5.]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print({"pearson": round(results["pearson"], 2), "spearmanr": round(results["spearmanr"], 2)})
    {'pearson': 1.0, 'spearmanr': 1.0}

    >>> glue_metric = datasets.load_metric('glue', 'cola')
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'matthews_correlation': 1.0}

Compute metric:

model_predictions = model(model_inputs)
final_score = metric.compute(predictions=model_predictions, references=gold_references)
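
model, model_inputs and gold_references above are placeholders. As a concrete sketch with dummy values, reusing the 'glue'/'mrpc' metric loaded earlier: add_batch() accumulates predictions batch by batch, and compute() is called once at the end:

metric = load_metric("glue", "mrpc")
for preds, refs in [([1, 0], [1, 0]), ([1, 1], [1, 0])]:
    metric.add_batch(predictions=preds, references=refs)
print(metric.compute())  # accuracy and F1 over all accumulated examples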

Create a dataset

Folder-based builders

  • Folder-based builders for quickly creating an image or audio dataset

https://img.zhaoweiguo.com/uPic/2023/07/mxt9eY.jpg

This is how the folder-based builders generate a dataset:

from datasets import load_dataset

# ImageFolder
dataset = load_dataset("imagefolder", data_dir="/path/to/pokemon")

# AudioFolder
dataset = load_dataset("audiofolder", data_dir="/path/to/folder")

From local files

>>> from datasets import Dataset
>>> def gen():
...     yield {"pokemon": "bulbasaur", "type": "grass"}
...     yield {"pokemon": "squirtle", "type": "water"}
>>> ds = Dataset.from_generator(gen)
>>> ds[0]
{"pokemon": "bulbasaur", "type": "grass"}

>>> from datasets import IterableDataset
>>> ds = IterableDataset.from_generator(gen)
>>> for example in ds:
...    print(example)
{"pokemon": "bulbasaur", "type": "grass"}
{"pokemon": "squirtle", "type": "water"}

>>> from datasets import Dataset
>>> ds = Dataset.from_dict({"pokemon": ["bulbasaur", "squirtle"], "type": ["grass", "water"]})
>>> ds[0]
{"pokemon": "bulbasaur", "type": "grass"}

>>> from datasets import Dataset, Audio
>>> audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", ..., "path/to/audio_n"]}).cast_column("audio", Audio())

HOW-TO GUIDES

General usage

Load

Local loading script:

dataset = load_dataset("path/to/local/loading_script/loading_script.py", split="train")
# equivalent because the file has the same name as the directory
dataset = load_dataset("path/to/local/loading_script", split="train")

Local and remote files:

from datasets import load_dataset
dataset = load_dataset("csv", data_files="my_file.csv")

Example: JSON data (one object per line)
{"a": 1, "b": 2.0, "c": "foo", "d": false}
{"a": 4, "b": -5.5, "c": null, "d": true}
Code:
from datasets import load_dataset
dataset = load_dataset("json", data_files="my_file.json")


Example: JSON data (nested under a field)
{"version": "0.1.0",
 "data": [{"a": 1, "b": 2.0, "c": "foo", "d": false},
          {"a": 4, "b": -5.5, "c": null, "d": true}]
}
Code:
from datasets import load_dataset
dataset = load_dataset("json", data_files="my_file.json", field="data")


Example: remote JSON files
url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/"
dataset = load_dataset(
        "json",
        data_files={"train": url + "train-v1.json", "validation": url + "dev-v1.json"},
        field="data"
)

Process

GLUE dataset

Example:

from datasets import load_dataset
dataset = load_dataset("glue", "mrpc", split="train")

Usage:

# Sort
dataset["label"][:10]
sorted_dataset = dataset.sort("label")
sorted_dataset["label"][:10]
sorted_dataset["label"][-10:]

# Shuffle
shuffled_dataset = sorted_dataset.shuffle(seed=42)
shuffled_dataset["label"][:10]

iterable_dataset = dataset.to_iterable_dataset(num_shards=128)
shuffled_iterable_dataset = iterable_dataset.shuffle(seed=42, buffer_size=1000)


# Select
small_dataset = dataset.select([0, 10, 20, 30, 40, 50])
len(small_dataset)

# Filter
start_with_ar = dataset.filter(lambda example: example["sentence1"].startswith("Ar"))
len(start_with_ar)
start_with_ar["sentence1"]

even_dataset = dataset.filter(lambda example, idx: idx % 2 == 0, with_indices=True)
len(even_dataset)
len(dataset) / 2

# Map
def add_prefix(example):
    example["sentence1"] = 'My sentence: ' + example["sentence1"]
    return example
updated_dataset = small_dataset.map(add_prefix)

>>> updated_dataset = dataset.map(
...     lambda example: {"new_sentence": example["sentence1"]}, remove_columns=["sentence1"]
... )
>>> updated_dataset.column_names
['sentence2', 'label', 'idx', 'new_sentence']


# Split
>>> dataset.train_test_split(test_size=0.1)
DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3301
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 367
    })
})

# Cast
>>> dataset.features
{'sentence1': Value(dtype='string', id=None),
'sentence2': Value(dtype='string', id=None),
'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None),
'idx': Value(dtype='int32', id=None)}
>>> from datasets import ClassLabel, Value
>>> new_features = dataset.features.copy()
>>> new_features["label"] = ClassLabel(names=["negative", "positive"])
>>> new_features["idx"] = Value("int64")
>>> dataset = dataset.cast(new_features)
>>> dataset.features
{'sentence1': Value(dtype='string', id=None),
'sentence2': Value(dtype='string', id=None),
'label': ClassLabel(num_classes=2, names=['negative', 'positive'], names_file=None, id=None),
'idx': Value(dtype='int64', id=None)}


>>> dataset.features
{'audio': Audio(sampling_rate=44100, mono=True, id=None)}
>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
>>> dataset.features
{'audio': Audio(sampling_rate=16000, mono=True, id=None)}


# Rename
>>> dataset
Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx'],
    num_rows: 3668
})
>>> dataset = dataset.rename_column("sentence1", "sentenceA")
>>> dataset = dataset.rename_column("sentence2", "sentenceB")
>>> dataset
Dataset({
    features: ['sentenceA', 'sentenceB', 'label', 'idx'],
    num_rows: 3668
})

# Remove
>>> dataset = dataset.remove_columns("label")
>>> dataset
Dataset({
    features: ['sentence1', 'sentence2', 'idx'],
    num_rows: 3668
})
>>> dataset = dataset.remove_columns(["sentence1", "sentence2"])
>>> dataset
Dataset({
    features: ['idx'],
    num_rows: 3668
})

IMDB dataset

Example:

>>> from datasets import load_dataset
>>> dataset = load_dataset("imdb", split="train")
>>> print(dataset)
Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})

Usage:

# Shard
>>> dataset.shard(num_shards=4, index=0)
Dataset({
    features: ['text', 'label'],
    num_rows: 6250
})

SQuAD dataset

Example:

>>> from datasets import load_dataset
>>> dataset = load_dataset("squad", split="train")
>>> dataset.features
{'answers': Sequence(feature={
        'text': Value(dtype='string', id=None),
        'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None),
'context': Value(dtype='string', id=None),
'id': Value(dtype='string', id=None),
'question': Value(dtype='string', id=None),
'title': Value(dtype='string', id=None)}

Usage:

>>> flat_dataset = dataset.flatten()
>>> flat_dataset
Dataset({
    features: ['id', 'title', 'context', 'question', 'answers.text', 'answers.answer_start'],
 num_rows: 87599
})

Stream

https://img.zhaoweiguo.com/uPic/2023/07/7xFX0b.jpg

Dataset streaming lets you work with a dataset without downloading it entirely: useful when the dataset is larger than your disk, you don't want to wait for a full download, or you only want to skim a few samples.

Example:

>>> from datasets import load_dataset
>>> dataset = load_dataset('oscar-corpus/OSCAR-2201', 'en', split='train', streaming=True)
>>> print(next(iter(dataset)))
{'id': 0, 'text': 'Founded in 2015, ...', ...

To stream a local dataset made of hundreds of compressed JSONL files (such as oscar-corpus/OSCAR-2201):

>>> from datasets import load_dataset
>>> data_files = {'train': 'path/to/OSCAR-2201/compressed/en_meta/*.jsonl.gz'}
>>> dataset = load_dataset('json', data_files=data_files, split='train', streaming=True)
>>> print(next(iter(dataset)))
{'id': 0, 'text': 'Founded in 2015, ...', ...

Usage:

# Shuffle
>>> from datasets import load_dataset
>>> dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True)
>>> shuffled_dataset = dataset.shuffle(seed=42, buffer_size=10_000)

# Split dataset
dataset_head = dataset.take(2)          # get the first 2 examples
train_dataset = dataset.skip(1000)      # skip the first 1,000 examples; take() and skip() are often combined to split a stream

# Interleave
>>> from datasets import interleave_datasets
>>> en_dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True)
>>> fr_dataset = load_dataset('oscar', "unshuffled_deduplicated_fr", split='train', streaming=True)
>>> multilingual_dataset = interleave_datasets([en_dataset, fr_dataset])
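
interleave_datasets() also accepts sampling probabilities and a seed when you want an unbalanced mixture; a sketch, assuming the two streamed datasets above:

>>> mixed_dataset = interleave_datasets([en_dataset, fr_dataset], probabilities=[0.8, 0.2], seed=42)
# roughly 80% of the examples now come from en_dataset and 20% from fr_dataset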

Full example:

import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForMaskedLM, AutoTokenizer, DataCollatorForLanguageModeling
from datasets import load_dataset
from tqdm import tqdm

# Stream OSCAR, tokenize on the fly, and keep only the tokenizer outputs
dataset = load_dataset("oscar", "unshuffled_deduplicated_en", split="train", streaming=True)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
dataset = dataset.map(lambda examples: tokenizer(examples["text"], truncation=True), batched=True, remove_columns=["id", "text"])
dataset = dataset.with_format("torch")
dataloader = DataLoader(dataset, collate_fn=DataCollatorForLanguageModeling(tokenizer))
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")
model.train().to(device)
optimizer = torch.optim.AdamW(params=model.parameters(), lr=1e-5)
for epoch in range(3):
    dataset.set_epoch(epoch)
    for i, batch in enumerate(tqdm(dataloader, total=5)):
        if i == 5:
            break
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs[0]
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if i % 10 == 0:
            print(f"loss: {loss}")

Audio

Load audio data

Local files:

>>> from datasets import Dataset, Audio
>>> audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", ...]}).cast_column("audio", Audio())
>>> audio_dataset[0]["audio"]
{'array': array([ 0.        ,  0.00024414, -0.00024414, ..., -0.00024414,
         0.        ,  0.        ], dtype=float32),
 'path': 'path/to/audio_1',
 'sampling_rate': 16000}

AudioFolder:

from datasets import load_dataset

dataset = load_dataset("audiofolder", data_dir="/path/to/folder")
# OR by specifying the list of files
dataset = load_dataset("audiofolder", data_files=["path/to/audio_1", ...])

# load remote datasets from their URLs
dataset = load_dataset("audiofolder", data_files=["https://foo.bar/audio_1", ...]
# for example, pass SpeechCommands archive:
dataset = load_dataset("audiofolder", data_files="https://s3.amazonaws.com/test.tar.gz")

Process audio data

Cast:

from datasets import load_dataset, Audio
dataset = load_dataset("PolyAI/minds14", "en-US", split="train")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

https://img.zhaoweiguo.com/uPic/2023/07/72sLGK.jpg

Map:

from transformers import Wav2Vec2CTCTokenizer, AutoFeatureExtractor, Wav2Vec2Processor

model_checkpoint = "facebook/wav2vec2-large-xlsr-53"
tokenizer = Wav2Vec2CTCTokenizer("./v.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")
feature_extractor = AutoFeatureExtractor.from_pretrained(model_checkpoint)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

def prepare_dataset(batch):
    audio = batch["audio"]
    batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
    batch["input_length"] = len(batch["input_values"])
    with processor.as_target_processor():
        batch["labels"] = processor(batch["sentence"]).input_ids
    return batch
dataset = dataset.map(prepare_dataset, remove_columns=dataset.column_names)

Create an audio dataset

Local files

>>> from datasets import Dataset, Audio
>>> audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", ...]}).cast_column("audio", Audio())
# inspect
>>> audio_dataset[0]["audio"]
# push to the Hugging Face Hub
>>> audio_dataset.push_to_hub("<username>/my_dataset")

The resulting repository layout:

my_dataset/
├── README.md
└── data/
    └── train-00000-of-00001.parquet
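
After push_to_hub(), the dataset can be reloaded from the Hub by anyone with access to the repository; a minimal check, assuming the repository name used above:

>>> from datasets import load_dataset
>>> reloaded = load_dataset("<username>/my_dataset", split="train")
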
AudioFolder

Example:

my_dataset/
├── README.md
├── metadata.csv
├── data/
├───── first_audio_file.mp3
├───── second_audio_file.mp3
└───── third_audio_file.mp3

metadata.csv:

file_name,transcription
data/first_audio_file.mp3,znowu się duch z ciałem
data/second_audio_file.mp3,już u źwierzyńca podwojów
data/third_audio_file.mp3,pewnie kędyś w obłędzie

Usage:

>>> from datasets import load_dataset
>>> dataset = load_dataset("audiofolder", data_dir="/path/to/data")
>>> dataset["train"][0]
{'audio':
    {'path': '/path/to/extracted/audio/first_audio_file.mp3',
    'array': array([ 0.00088501,  0.0012207 ,  0.00131226, ..., -0.00045776], dtype=float32),
    'sampling_rate': 16000},
 'transcription': 'znowu się duch z ciałem'
}

Example:

data/train/electronic/01.mp3
data/train/punk/01.mp3

data/test/electronic/09.mp3
data/test/punk/09.mp3

Usage:

>>> from datasets import load_dataset
>>> dataset = load_dataset("audiofolder", data_dir="/path/to/data")
>>> dataset["train"][0]
{'audio':
    {'path': '/path/to/electronic/01.mp3',
     'array': array([ 3.9714024e-07,  7.3031038e-07, ..., -1.1244172e-01], dtype=float32),
     'sampling_rate': 44100},
 'label': 0  # "electronic"
}
>>> dataset["train"][-1]
{'audio':
    {'path': '/path/to/punk/01.mp3',
     'array': array([0.15237972, 0.13222949, ..., 0.33717662], dtype=float32),
     'sampling_rate': 44100},
 'label': 1  # "punk"
}
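
The integer labels come from a ClassLabel feature built from the folder names; to map them back to strings, a small sketch assuming the dataset loaded above:

>>> dataset["train"].features["label"].names
['electronic', 'punk']
>>> dataset["train"].features["label"].int2str(0)
'electronic'
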
Loading script

Example:

my_dataset/
├── README.md
├── my_dataset.py
└── data/
    ├── train.tar.gz
    ├── test.tar.gz
    └── metadata.csv

Usage:

from datasets import load_dataset
dataset = load_dataset("path/to/my_dataset")

Vision

Load image data

load:

from datasets import load_dataset, Image

dataset = load_dataset("beans", split="train")
dataset[0]["image"]

Local files:

>>> from datasets import Dataset, load_dataset, Image
>>> dataset = Dataset.from_dict({"image": ["path/to/image_1", ...]}).cast_column("image", Image())
>>> dataset[0]["image"]
<PIL.PngImagePlugin.PngImageFile image mode=RGBA size=1200x215 at 0x15E6D7160>


>>> dataset = load_dataset("beans", split="train").cast_column("image", Image(decode=False))
>>> dataset[0]["image"]

ImageFolder

Example:

folder/train/dog/golden_retriever.png
folder/train/dog/german_shepherd.png
folder/train/dog/chihuahua.png

folder/train/cat/maine_coon.png
folder/train/cat/bengal.png
folder/train/cat/birman.png

Usage:

>>> from datasets import load_dataset
>>> dataset = load_dataset("imagefolder", data_dir="/path/to/folder")
>>> dataset["train"][0]
{"image": <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=1200x215 at 0x15E6D7160>, "label": 0}

>>> dataset["train"][-1]
{"image": <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=1200x215 at 0x15E8DAD30>, "label": 1}

Load remote datasets from their URLs:

dataset = load_dataset("imagefolder", data_files="https://a.cn/name.zip", split="train")

Process image data

Map:

>>> def transforms(examples):
...    examples["pixel_values"] = [image.convert("RGB").resize((100,100)) for image in examples["image"]]
...    return examples

>>> dataset = dataset.map(transforms, remove_columns=["image"], batched=True)
>>> dataset[0]
{'label': 6,
 'pixel_values': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=100x100 at 0x7F058237BB10>}

Apply transforms:

from torchvision.transforms import Compose, ColorJitter, ToTensor
jitter = Compose(
    [
         ColorJitter(brightness=0.25, contrast=0.25, saturation=0.25, hue=0.7),
         ToTensor(),
    ]
)

def transforms(examples):
    examples["pixel_values"] = [jitter(image.convert("RGB")) for image in examples["image"]]
    return examples

dataset.set_transform(transforms)

Create an image dataset

Example:

folder/train/metadata.csv
folder/train/0001.png
folder/train/0002.png
folder/train/0003.png

Example with archives:

folder/metadata.csv
folder/train.zip
folder/test.zip
folder/valid.zip

Contents of metadata.csv:

file_name,additional_feature
0001.png,This is a first value of a text feature you added to your images
0002.png,This is a second value of a text feature you added to your images
0003.png,This is a third value of a text feature you added to your images

or using metadata.jsonl:

{"file_name": "0001.png", "additional_feature": "This is a first value of a text feature you added to your images"}
{"file_name": "0002.png", "additional_feature": "This is a second value of a text feature you added to your images"}
{"file_name": "0003.png", "additional_feature": "This is a third value of a text feature you added to your images"}

Example metadata.csv (image captioning):

file_name,text
0001.png,This is a golden retriever playing with a ball
0002.png,A german shepherd
0003.png,One chihuahua

Usage:

>>> dataset = load_dataset("imagefolder", data_dir="/path/to/folder", split="train")
>>> dataset[0]["text"]
"This is a golden retriever playing with a ball"

Upload dataset to the Hub:

from datasets import load_dataset
dataset = load_dataset("imagefolder", data_dir="/path/to/folder", split="train")
dataset.push_to_hub("stevhliu/my-image-captioning-dataset")

Depth estimation

train_dataset = load_dataset("sayakpaul/nyu_depth_v2", split="train")

Image classification

dataset = load_dataset("beans")

Semantic segmentation

Install:

pip install -U albumentations

Usage example:

train_dataset = load_dataset("sayakpaul/nyu_depth_v2", split="train")

Object detection

Install:

pip install -U albumentations opencv-python

Usage example:

ds = load_dataset("cppe-5")

Text

Load text data

from datasets import load_dataset
dataset = load_dataset("text",
        data_files={
                "train": ["my_text_1.txt", "my_text_2.txt"],
                "test": "my_test_file.txt"
        }
)
dataset = load_dataset("text",
        data_dir="path/to/text/dataset")

dataset = load_dataset("text",
        data_files={"train": "my_train_file.txt", "test": "my_test_file.txt"},
        sample_by="paragraph"
)
dataset = load_dataset("text",
        data_files={"train": "my_train_file.txt", "test": "my_test_file.txt"},
        sample_by="document"
)

Examples:

from datasets import load_dataset
c4_subset = load_dataset("allenai/c4", data_files="en/c4-train.0000*-of-01024.json.gz")


dataset = load_dataset("text", data_files="https://huggingface.co/datasets/lhoestq/test/resolve/main/some_text.txt")

Process text data

Map:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# assumes a dataset with a "text" column, e.g. dataset = load_dataset("rotten_tomatoes", split="train")
dataset = dataset.map(lambda examples: tokenizer(examples["text"]), batched=True)
dataset[0]
{'text': 'the rock is destined to be the 21st century...',
 'label': 1,
 'input_ids': [101, 1996, 2600, 2003, 16036, 2000, ...],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]}

Tabular

CSV files

from datasets import load_dataset
dataset = load_dataset("csv", data_files="my_file.csv")

# load multiple CSV files
dataset = load_dataset("csv", data_files=["my_file_1.csv", "my_file_2.csv", "my_file_3.csv"])

# train/test split
dataset = load_dataset("csv", data_files={
        "train": ["my_train_file_1.csv", "my_train_file_2.csv"],
        "test": "my_test_file.csv"}
)

Pandas DataFrames

from datasets import Dataset
import pandas as pd

df = pd.read_csv("https://huggingface.co/datasets/imodels/credit-card/raw/main/train.csv")
dataset = Dataset.from_pandas(df)
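
The reverse direction also works if you need to go back to pandas; a minimal round-trip sketch, assuming the dataset created above:

df_back = dataset.to_pandas()
print(df_back.head())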

Databases

SQLite

Create the database:

import sqlite3
import pandas as pd

conn = sqlite3.connect("us_covid_data.db")
df = pd.read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")
df.to_sql("states", conn, if_exists="replace")

load:

from datasets import Dataset

uri = "sqlite:///us_covid_data.db"
ds = Dataset.from_sql("states", uri)
# ds = Dataset.from_sql('SELECT * FROM states WHERE state="California";', uri)
ds

Usage:

ds.filter(lambda x: x["state"] == "California")
# ds.filter(lambda x: x["cases"] > 10000)

PostgreSQL

Dataset repository

Share

create:

$ huggingface-cli repo create your_dataset_name --type dataset

clone:

# Make sure you have git-lfs installed
# (https://git-lfs.github.com/)
git lfs install

git clone https://huggingface.co/datasets/namespace/your_dataset_name
  • From here on, it works like a normal git workflow: add your data files, commit, and push…

Create a dataset loading script

A dataset loading script should have the same name as a dataset repository or directory:

my_dataset/
├── README.md
└── my_dataset.py
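
As a rough sketch (not the official template), a minimal my_dataset.py built on datasets.GeneratorBasedBuilder could look like the following, assuming a single data/train.csv of text/label pairs shipped alongside the script:

# my_dataset.py: a minimal, hypothetical loading script
import csv
import datasets

class MyDataset(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            description="My toy dataset.",
            features=datasets.Features({
                "text": datasets.Value("string"),
                "label": datasets.ClassLabel(names=["neg", "pos"]),
            }),
        )

    def _split_generators(self, dl_manager):
        # data/train.csv is assumed to live next to this script in the repository
        path = dl_manager.download_and_extract("data/train.csv")
        return [datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"filepath": path})]

    def _generate_examples(self, filepath):
        with open(filepath, encoding="utf-8") as f:
            for idx, row in enumerate(csv.DictReader(f)):
                yield idx, {"text": row["text"], "label": row["label"]}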
