6.3.4. Datasets¶
Datasets is a library for easily accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP) tasks.
Install with pip:
pip install datasets
# check
python -c "from datasets import load_dataset; print(load_dataset('squad', split='train')[0])"
# Audio
pip install 'datasets[audio]'
# check
python -c "import soundfile; print(soundfile.__libsndfile_version__)"
# Vision
pip install 'datasets[vision]'
Install from source:
git clone https://github.com/huggingface/datasets.git
cd datasets
pip install -e .
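After a source install, a quick import check confirms the installed version (this only assumes the package imports cleanly):
python -c "import datasets; print(datasets.__version__)"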
TUTORIALS¶
Load a dataset from the Hub¶
This tutorial uses the rotten_tomatoes and MInDS-14 datasets.
Load a dataset¶
load_dataset_builder():
>>> from datasets import load_dataset_builder
>>> ds_builder = load_dataset_builder("rotten_tomatoes")
# Inspect dataset description
>>> ds_builder.info.description
Movie Review Dataset.
This is a dataset of containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews.
This data was first used in Bo Pang and Lillian Lee,
``Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales.'', Proceedings of the ACL, 2005.
# Inspect dataset features
>>> ds_builder.info.features
{'label': ClassLabel(num_classes=2, names=['neg', 'pos'], id=None),
'text': Value(dtype='string', id=None)}
load_dataset():
>>> from datasets import load_dataset
>>> dataset = load_dataset("rotten_tomatoes", split="train")
>>> dataset
Dataset({
features: ['text', 'label'],
num_rows: 8530
})
Splits¶
get_dataset_split_names():
>>> from datasets import get_dataset_split_names
>>> get_dataset_split_names("rotten_tomatoes")
['train', 'validation', 'test']
The load_dataset() call above specified a split. If no split is given, all splits are returned as a DatasetDict:
>>> from datasets import load_dataset
>>> dataset = load_dataset("rotten_tomatoes")
>>> dataset
DatasetDict({
train: Dataset({
features: ['text', 'label'],
num_rows: 8530
})
validation: Dataset({
features: ['text', 'label'],
num_rows: 1066
})
test: Dataset({
features: ['text', 'label'],
num_rows: 1066
})
})
Configurations¶
get_dataset_config_names():
>>> from datasets import get_dataset_config_names
>>> configs = get_dataset_config_names("PolyAI/minds14")
>>> print(configs)
['cs-CZ', 'de-DE', 'en-AU' ...]
Then load the configuration you want:
from datasets import load_dataset
mindsFR = load_dataset("PolyAI/minds14", "fr-FR", split="train")
Know your dataset¶
Dataset¶
Load the data:
from datasets import load_dataset
dataset = load_dataset("rotten_tomatoes", split="train")
Indexing:
>>> dataset[0]
{'label': 1,
'text': 'the rock is des...'}
# a list of all the values in the column
>>> dataset["text"]
[
'the rock is desti...',
'the gorgeously elabora...',
...,
'things really ...'
]
Slicing:
>>> dataset[:3]
{'label': [1, 1, 1],
'text':
[
'the rock is desti...',
'the gorgeously elabora...',
...,
'things really ...'
]
}
>>> dataset[3:6]
...
IterableDataset¶
An IterableDataset is loaded when you set the streaming parameter to True in load_dataset():
>>> from datasets import load_dataset
>>> iterable_dataset = load_dataset("food101", split="train", streaming=True)
>>> for example in iterable_dataset:
... print(example)
... break
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x7F0681F5C520>, 'label': 6}
>>> next(iter(iterable_dataset))
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x7F0681F59B50>,
'label': 6}
>>> for example in iterable_dataset:
... print(example)
... break
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x7F7479DE82B0>, 'label': 6}
Use IterableDataset.take() to return a subset with a specific number of examples:
# Get first three examples
>>> list(iterable_dataset.take(3))
[{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x7F7479DEE9D0>,
'label': 6},
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x512 at 0x7F7479DE8190>,
'label': 6},
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x383 at 0x7F7479DE8310>,
'label': 6}]
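IterableDataset.skip() is the complement of take(): it drops the first n examples and streams the rest. A small sketch on the same streaming dataset:
>>> # skip the first three examples and continue streaming from the fourth
>>> remaining = iterable_dataset.skip(3)
>>> next(iter(remaining))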
Preprocess¶
Tokenize text¶
Start by loading the rotten_tomatoes dataset and the tokenizer corresponding to a pretrained BERT model:
>>> from transformers import AutoTokenizer
>>> from datasets import load_dataset
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
>>> dataset = load_dataset("rotten_tomatoes", split="train")
Call your tokenizer on the first row of text in the dataset:
>>> tokenizer(dataset[0]["text"])
{
'input_ids': [101, 1103, 2067 ...],
'token_type_ids': [0, 0, 0, 0, 0 ...],
'attention_mask': [1, 1, 1, 1, 1 ...]
}
The returned fields are:
input_ids: the numbers representing the tokens in the text.
token_type_ids: indicates which sequence a token belongs to if there is more than one sequence.
attention_mask: indicates whether a token should be masked or not.
The fastest way to tokenize your entire dataset is to use the map() function:
def tokenization(example):
return tokenizer(example["text"])
dataset = dataset.map(tokenization, batched=True)
Set the format of your dataset to be compatible with your machine learning framework:
# Pytorch
dataset.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "labels"])
dataset.format['type']
# TensorFlow
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
tf_dataset = dataset.to_tf_dataset(
columns=["input_ids", "token_type_ids", "attention_mask"],
label_cols=["labels"],
batch_size=2,
collate_fn=data_collator,
shuffle=True
)
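For PyTorch, the formatted dataset can be fed straight to a DataLoader. A minimal sketch, assuming `dataset` is the tokenized dataset and `tokenizer` is the BERT tokenizer from above; `torch_dataset` and the batch size are illustrative:
from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

# keep only the model inputs so each batch converts cleanly to tensors
torch_dataset = dataset.with_format("torch", columns=["input_ids", "token_type_ids", "attention_mask", "label"])
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="pt")
dataloader = DataLoader(torch_dataset, batch_size=8, collate_fn=data_collator)
batch = next(iter(dataloader))  # dict of padded tensors: input_ids, token_type_ids, attention_mask, labels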
Resample audio signals¶
Start by loading the MInDS-14 dataset:
from transformers import AutoFeatureExtractor
from datasets import load_dataset, Audio
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
dataset = load_dataset("PolyAI/minds14", "en-US", split="train")
When you call the audio column of the dataset, it is automatically decoded and resampled:
>>> dataset[0]["audio"]
{'array': array([ 0. , 0.00024414, -0.00024414, ..., -0.00024414,
0. , 0. ], dtype=float32),
'path': '~/.cache/huggingface/datasets/downloads/extracted/xxxxx/en-US~JOINT_ACCOUNT/xxx.wav',
'sampling_rate': 8000}
Use the cast_column() function and set the sampling_rate parameter in the Audio feature to upsample the audio signal. When you call the audio column now, it is decoded and resampled to 16kHz:
>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
>>> dataset[0]["audio"]
{'array': array([ 2.3443763e-05, 2.1729663e-04, 2.2145823e-04, ...,
3.8356509e-05, -7.3497440e-06, -2.1754686e-05], dtype=float32),
'path': '~/.cache/huggingface/datasets/downloads/extracted/xxx/en-US~JOINT_ACCOUNT/xxx.wav',
'sampling_rate': 16000}
Use the map() function to resample the entire dataset to 16kHz:
def preprocess_function(examples):
audio_arrays = [x["array"] for x in examples["audio"]]
inputs = feature_extractor(
audio_arrays, sampling_rate=feature_extractor.sampling_rate, max_length=16000, truncation=True
)
return inputs
dataset = dataset.map(preprocess_function, batched=True)
Apply data augmentations¶
Start by loading the Beans dataset, the Image feature, and the feature extractor corresponding to a pretrained ViT model:
from transformers import AutoFeatureExtractor
from datasets import load_dataset, Image

feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
dataset = load_dataset("beans", split="train")
When you call the image column of the dataset, the underlying PIL object is automatically decoded into an image:
dataset[0]["image"]
Apply some transforms to the image:
from torchvision.transforms import RandomRotation

rotate = RandomRotation(degrees=(0, 90))

def transforms(examples):
    examples["pixel_values"] = [rotate(image.convert("RGB")) for image in examples["image"]]
    return examples
Use the set_transform() function to apply the transform on-the-fly:
dataset.set_transform(transforms)
dataset[0]["pixel_values"]
Evaluate predictions¶
You can see what metrics are available with list_metrics():
from datasets import list_metrics
metrics_list = list_metrics()
len(metrics_list)
print(metrics_list)
Load metric:
# Load a metric from the Hub with load_metric()
from datasets import load_metric
metric = load_metric('glue', 'mrpc')
Metrics object:
>>> print(metric.inputs_description)
Compute GLUE evaluation metric associated to each GLUE dataset.
Args:
predictions: list of predictions to score.
Each translation should be tokenized into a list of tokens.
references: list of lists of references for each translation.
Each reference should be tokenized into a list of tokens.
Returns: depending on the GLUE subset, one or several of:
"accuracy": Accuracy
"f1": F1 score
"pearson": Pearson Correlation
"spearmanr": Spearman Correlation
"matthews_correlation": Matthew Correlation
Examples:
# 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
>>> glue_metric = datasets.load_metric('glue', 'sst2')
>>> references = [0, 1]
>>> predictions = [0, 1]
>>> results = glue_metric.compute(predictions=predictions, references=references)
>>> print(results)
{'accuracy': 1.0}
>>> glue_metric = datasets.load_metric('glue', 'mrpc') # 'mrpc' or 'qqp'
>>> references = [0, 1]
>>> predictions = [0, 1]
>>> results = glue_metric.compute(predictions=predictions, references=references)
>>> print(results)
{'accuracy': 1.0, 'f1': 1.0}
>>> glue_metric = datasets.load_metric('glue', 'stsb')
>>> references = [0., 1., 2., 3., 4., 5.]
>>> predictions = [0., 1., 2., 3., 4., 5.]
>>> results = glue_metric.compute(predictions=predictions, references=references)
>>> print({"pearson": round(results["pearson"], 2), "spearmanr": round(results["spearmanr"], 2)})
{'pearson': 1.0, 'spearmanr': 1.0}
>>> glue_metric = datasets.load_metric('glue', 'cola')
>>> references = [0, 1]
>>> predictions = [0, 1]
>>> results = glue_metric.compute(predictions=predictions, references=references)
>>> print(results)
{'matthews_correlation': 1.0}
Compute metric:
model_predictions = model(model_inputs)
final_score = metric.compute(predictions=model_predictions, references=gold_references)
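For large evaluation sets, predictions can also be accumulated batch by batch with add_batch() and scored once at the end. A minimal sketch, assuming `model` and the evaluation loop structure come from your own pipeline:
for model_inputs, gold_references in evaluation_dataloader:   # hypothetical loop structure
    model_predictions = model(model_inputs)
    metric.add_batch(predictions=model_predictions, references=gold_references)
final_score = metric.compute()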
Create a dataset¶
Folder-based builders¶
Folder-based builders quickly create an image or audio dataset from a directory of files:
from datasets import load_dataset

# ImageFolder
dataset = load_dataset("imagefolder", data_dir="/path/to/pokemon")

# AudioFolder
dataset = load_dataset("audiofolder", data_dir="/path/to/folder")
From local files¶
>>> from datasets import Dataset
>>> def gen():
... yield {"pokemon": "bulbasaur", "type": "grass"}
... yield {"pokemon": "squirtle", "type": "water"}
>>> ds = Dataset.from_generator(gen)
>>> ds[0]
{"pokemon": "bulbasaur", "type": "grass"}
>>> from datasets import IterableDataset
>>> ds = IterableDataset.from_generator(gen)
>>> for example in ds:
... print(example)
{"pokemon": "bulbasaur", "type": "grass"}
{"pokemon": "squirtle", "type": "water"}
>>> from datasets import Dataset
>>> ds = Dataset.from_dict({"pokemon": ["bulbasaur", "squirtle"], "type": ["grass", "water"]})
>>> ds[0]
{"pokemon": "bulbasaur", "type": "grass"}
audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", ..., "path/to/audio_n"]}).cast_column("audio", Audio())
HOW-TO GUIDES¶
General usage¶
Load¶
Local loading script:
dataset = load_dataset("path/to/local/loading_script/loading_script.py", split="train")
# equivalent because the file has the same name as the directory
dataset = load_dataset("path/to/local/loading_script", split="train")
Local and remote files:
from datasets import load_dataset
dataset = load_dataset("csv", data_files="my_file.csv")
Example: JSON Lines data
{"a": 1, "b": 2.0, "c": "foo", "d": false}
{"a": 4, "b": -5.5, "c": null, "d": true}
Code:
from datasets import load_dataset
dataset = load_dataset("json", data_files="my_file.json")
Example: nested JSON data
{"version": "0.1.0",
"data": [{"a": 1, "b": 2.0, "c": "foo", "d": false},
{"a": 4, "b": -5.5, "c": null, "d": true}]
}
Code:
from datasets import load_dataset
dataset = load_dataset("json", data_files="my_file.json", field="data")
Example: remote JSON files
url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/"
dataset = load_dataset(
"json",
data_files={"train": url + "train-v1.json", "validation": url + "dev-v1.json"},
field="data"
)
Process¶
GLUE data¶
Example:
from datasets import load_dataset
dataset = load_dataset("glue", "mrpc", split="train")
Usage:
# Sort
dataset["label"][:10]
sorted_dataset = dataset.sort("label")
sorted_dataset["label"][:10]
sorted_dataset["label"][-10:]
# Shuffle
shuffled_dataset = sorted_dataset.shuffle(seed=42)
shuffled_dataset["label"][:10]
iterable_dataset = dataset.to_iterable_dataset(num_shards=128)
shuffled_iterable_dataset = iterable_dataset.shuffle(seed=42, buffer_size=1000)
# Select
small_dataset = dataset.select([0, 10, 20, 30, 40, 50])
len(small_dataset)
# Filter
start_with_ar = dataset.filter(lambda example: example["sentence1"].startswith("Ar"))
len(start_with_ar)
start_with_ar["sentence1"]
even_dataset = dataset.filter(lambda example, idx: idx % 2 == 0, with_indices=True)
len(even_dataset)
len(dataset) / 2
# Map
def add_prefix(example):
example["sentence1"] = 'My sentence: ' + example["sentence1"]
return example
updated_dataset = small_dataset.map(add_prefix)
>>> updated_dataset = dataset.map(
...     lambda example: {"new_sentence": example["sentence1"]}, remove_columns=["sentence1"]
... )
>>> updated_dataset.column_names
['sentence2', 'label', 'idx', 'new_sentence']
# Split
>>> dataset.train_test_split(test_size=0.1)
DatasetDict({
train: Dataset({
features: ['sentence1', 'sentence2', 'label', 'idx'],
num_rows: 3301
})
test: Dataset({
features: ['sentence1', 'sentence2', 'label', 'idx'],
num_rows: 367
})
})
# Cast
>>> dataset.features
{'sentence1': Value(dtype='string', id=None),
'sentence2': Value(dtype='string', id=None),
'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None),
'idx': Value(dtype='int32', id=None)}
>>> from datasets import ClassLabel, Value
>>> new_features = dataset.features.copy()
>>> new_features["label"] = ClassLabel(names=["negative", "positive"])
>>> new_features["idx"] = Value("int64")
>>> dataset = dataset.cast(new_features)
>>> dataset.features
{'sentence1': Value(dtype='string', id=None),
'sentence2': Value(dtype='string', id=None),
'label': ClassLabel(num_classes=2, names=['negative', 'positive'], names_file=None, id=None),
'idx': Value(dtype='int64', id=None)}
# cast_column() works the same way on a single column (shown here on a separate audio dataset)
>>> dataset.features
{'audio': Audio(sampling_rate=44100, mono=True, id=None)}
>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
>>> dataset.features
{'audio': Audio(sampling_rate=16000, mono=True, id=None)}
# Rename
>>> dataset
Dataset({
features: ['sentence1', 'sentence2', 'label', 'idx'],
num_rows: 3668
})
>>> dataset = dataset.rename_column("sentence1", "sentenceA")
>>> dataset = dataset.rename_column("sentence2", "sentenceB")
>>> dataset
Dataset({
features: ['sentenceA', 'sentenceB', 'label', 'idx'],
num_rows: 3668
})
# Remove
>>> dataset = dataset.remove_columns("label")
>>> dataset
Dataset({
features: ['sentence1', 'sentence2', 'idx'],
num_rows: 3668
})
>>> dataset = dataset.remove_columns(["sentence1", "sentence2"])
>>> dataset
Dataset({
features: ['idx'],
num_rows: 3668
})
IMDB data¶
Example:
>>> from datasets import load_dataset
>>> dataset = load_dataset("imdb", split="train")
>>> print(dataset)
Dataset({
features: ['text', 'label'],
num_rows: 25000
})
Usage:
# Shard
>>> dataset.shard(num_shards=4, index=0)
Dataset({
features: ['text', 'label'],
num_rows: 6250
})
SQuAD data¶
Example:
>>> from datasets import load_dataset
>>> dataset = load_dataset("squad", split="train")
>>> dataset.features
{'answers': Sequence(feature={
'text': Value(dtype='string', id=None),
'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None),
'context': Value(dtype='string', id=None),
'id': Value(dtype='string', id=None),
'question': Value(dtype='string', id=None),
'title': Value(dtype='string', id=None)}
Usage:
>>> flat_dataset = dataset.flatten()
>>> flat_dataset
Dataset({
features: ['id', 'title', 'context', 'question', 'answers.text', 'answers.answer_start'],
num_rows: 87599
})
Stream¶
Example:
>>> from datasets import load_dataset
>>> dataset = load_dataset('oscar-corpus/OSCAR-2201', 'en', split='train', streaming=True)
>>> print(next(iter(dataset)))
{'id': 0, 'text': 'Founded in 2015, ...', ...
To stream a local dataset made of hundreds of compressed JSONL files (such as oscar-corpus/OSCAR-2201):
>>> from datasets import load_dataset
>>> data_files = {'train': 'path/to/OSCAR-2201/compressed/en_meta/*.jsonl.gz'}
>>> dataset = load_dataset('json', data_files=data_files, split='train', streaming=True)
>>> print(next(iter(dataset)))
{'id': 0, 'text': 'Founded in 2015, ...', ...
Usage:
# Shuffle
>>> from datasets import load_dataset
>>> dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True)
>>> shuffled_dataset = dataset.shuffle(seed=42, buffer_size=10_000)
# Split dataset
dataset_head = dataset.take(2)  # get the first 2 examples
train_dataset = dataset.skip(1000)  # skip the first 1,000 examples; take() and skip() are used together to split a stream
# Interleave
>>> from datasets import interleave_datasets
>>> en_dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True)
>>> fr_dataset = load_dataset('oscar', "unshuffled_deduplicated_fr", split='train', streaming=True)
>>> multilingual_dataset = interleave_datasets([en_dataset, fr_dataset])
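interleave_datasets() also accepts sampling probabilities and a seed to control how the sources are mixed:
>>> multilingual_dataset_with_oversampling = interleave_datasets([en_dataset, fr_dataset], probabilities=[0.8, 0.2], seed=42)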
Full example:
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForMaskedLM, DataCollatorForLanguageModeling
from tqdm import tqdm
# `dataset` is a tokenized streaming dataset and `tokenizer` its matching tokenizer (see the sections above)
dataset = dataset.with_format("torch")
dataloader = DataLoader(dataset, collate_fn=DataCollatorForLanguageModeling(tokenizer))
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")
model.train().to(device)
optimizer = torch.optim.AdamW(params=model.parameters(), lr=1e-5)
for epoch in range(3):
dataset.set_epoch(epoch)
for i, batch in enumerate(tqdm(dataloader, total=5)):
if i == 5:
break
batch = {k: v.to(device) for k, v in batch.items()}
outputs = model(**batch)
loss = outputs[0]
loss.backward()
optimizer.step()
optimizer.zero_grad()
if i % 10 == 0:
print(f"loss: {loss}")
Audio¶
Load audio data¶
Local files:
>>> from datasets import Dataset, Audio
>>> audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", ...]}).cast_column("audio", Audio())
>>> audio_dataset[0]["audio"]
{'array': array([ 0. , 0.00024414, -0.00024414, ..., -0.00024414,
0. , 0. ], dtype=float32),
'path': 'path/to/audio_1',
'sampling_rate': 16000}
AudioFolder:
from datasets import load_dataset
dataset = load_dataset("audiofolder", data_dir="/path/to/folder")
# OR by specifying the list of files
dataset = load_dataset("audiofolder", data_files=["path/to/audio_1", ...])
# load remote datasets from their URLs
dataset = load_dataset("audiofolder", data_files=["https://foo.bar/audio_1", ...]
# for example, pass SpeechCommands archive:
dataset = load_dataset("audiofolder", data_files="https://s3.amazonaws.com/test.tar.gz")
Process audio data¶
Cast:
from datasets import load_dataset, Audio
dataset = load_dataset("PolyAI/minds14", "en-US", split="train")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
Map:
from transformers import Wav2Vec2CTCTokenizer, AutoFeatureExtractor, Wav2Vec2Processor
model_checkpoint = "facebook/wav2vec2-large-xlsr-53"
# build a CTC tokenizer from a local vocabulary file (AutoTokenizer cannot be instantiated directly)
tokenizer = Wav2Vec2CTCTokenizer("./v.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")
feature_extractor = AutoFeatureExtractor.from_pretrained(model_checkpoint)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)
def prepare_dataset(batch):
audio = batch["audio"]
batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
batch["input_length"] = len(batch["input_values"])
with processor.as_target_processor():
batch["labels"] = processor(batch["sentence"]).input_ids
return batch
dataset = dataset.map(prepare_dataset, remove_columns=dataset.column_names)
Create an audio dataset¶
Local files¶
>>> from datasets import Dataset, Audio
>>> audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", ...]}).cast_column("audio", Audio())
# inspect an example
>>> audio_dataset[0]["audio"]
# push to the Hugging Face Hub
>>> audio_dataset.push_to_hub("<username>/my_dataset")
The resulting repository structure:
my_dataset/
├── README.md
└── data/
└── train-00000-of-00001.parquet
AudioFolder¶
Example:
my_dataset/
├── README.md
├── metadata.csv
├── data/
├───── first_audio_file.mp3
├───── second_audio_file.mp3
└───── third_audio_file.mp3
metadata.csv:
file_name,transcription
data/first_audio_file.mp3,znowu się duch z ciałem
data/second_audio_file.mp3,już u źwierzyńca podwojów
data/third_audio_file.mp3,pewnie kędyś w obłędzie
Usage:
>>> from datasets import load_dataset
>>> dataset = load_dataset("audiofolder", data_dir="/path/to/data")
>>> dataset["train"][0]
{'audio':
{'path': '/path/to/extracted/audio/first_audio_file.mp3',
'array': array([ 0.00088501, 0.0012207 , 0.00131226, ..., -0.00045776], dtype=float32),
'sampling_rate': 16000},
'transcription': 'znowu się duch z ciałem'
}
Example with class labels taken from the directory names:
data/train/electronic/01.mp3
data/train/punk/01.mp3
data/test/electronic/09.mp3
data/test/punk/09.mp3
Usage:
>>> from datasets import load_dataset
>>> dataset = load_dataset("audiofolder", data_dir="/path/to/data")
>>> dataset["train"][0]
{'audio':
{'path': '/path/to/electronic/01.mp3',
'array': array([ 3.9714024e-07, 7.3031038e-07, ..., -1.1244172e-01], dtype=float32),
'sampling_rate': 44100},
'label': 0 # "electronic"
}
>>> dataset["train"][-1]
{'audio':
{'path': '/path/to/punk/01.mp3',
'array': array([0.15237972, 0.13222949, ..., 0.33717662], dtype=float32),
'sampling_rate': 44100},
'label': 1 # "punk"
}
Loading script¶
Example:
my_dataset/
├── README.md
├── my_dataset.py
└── data/
├── train.tar.gz
├── test.tar.gz
└── metadata.csv
Usage:
from datasets import load_dataset
dataset = load_dataset("path/to/my_dataset")
Vision¶
Load image data¶
load:
from datasets import load_dataset, Image
dataset = load_dataset("beans", split="train")
dataset[0]["image"]
Local files:
>>> from datasets import Dataset, load_dataset, Image
>>> dataset = Dataset.from_dict({"image": ["path/to/image_1", ...]}).cast_column("image", Image())
>>> dataset[0]["image"]
<PIL.PngImagePlugin.PngImageFile image mode=RGBA size=1200x215 at 0x15E6D7160>
>>> dataset = load_dataset("beans", split="train").cast_column("image", Image(decode=False))
>>> dataset[0]["image"]
ImageFolder¶
Example:
folder/train/dog/golden_retriever.png
folder/train/dog/german_shepherd.png
folder/train/dog/chihuahua.png
folder/train/cat/maine_coon.png
folder/train/cat/bengal.png
folder/train/cat/birman.png
Usage:
>>> from datasets import load_dataset
>>> dataset = load_dataset("imagefolder", data_dir="/path/to/folder")
>>> dataset["train"][0]
{"image": <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=1200x215 at 0x15E6D7160>, "label": 0}
>>> dataset["train"][-1]
{"image": <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=1200x215 at 0x15E8DAD30>, "label": 1}
Load remote datasets from their URLs:
dataset = load_dataset("imagefolder", data_files="https://a.cn/name.zip", split="train")
Process image data¶
Map¶
>>> def transforms(examples):
... examples["pixel_values"] = [image.convert("RGB").resize((100,100)) for image in examples["image"]]
... return examples
>>> dataset = dataset.map(transforms, remove_columns=["image"], batched=True)
>>> dataset[0]
{'label': 6,
'pixel_values': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=100x100 at 0x7F058237BB10>}
Apply transforms¶
Some data augmentation libraries:
torchvision: https://pytorch.org/vision/stable/index.html
Albumentations: https://albumentations.ai/docs/
Kornia: https://kornia.readthedocs.io/en/latest/
imgaug: https://imgaug.readthedocs.io/en/latest/
from torchvision.transforms import Compose, ColorJitter, ToTensor
jitter = Compose(
[
ColorJitter(brightness=0.25, contrast=0.25, saturation=0.25, hue=0.7),
ToTensor(),
]
)
def transforms(examples):
examples["pixel_values"] = [jitter(image.convert("RGB")) for image in examples["image"]]
return examples
dataset.set_transform(transforms)
Create an image dataset¶
Example:
folder/train/metadata.csv
folder/train/0001.png
folder/train/0002.png
folder/train/0003.png
Example with archived splits:
folder/metadata.csv
folder/train.zip
folder/test.zip
folder/valid.zip
metadata.csv contents:
file_name,additional_feature
0001.png,This is a first value of a text feature you added to your images
0002.png,This is a second value of a text feature you added to your images
0003.png,This is a third value of a text feature you added to your images
or using metadata.jsonl:
{"file_name": "0001.png", "additional_feature": "This is a first value of a text feature you added to your images"}
{"file_name": "0002.png", "additional_feature": "This is a second value of a text feature you added to your images"}
{"file_name": "0003.png", "additional_feature": "This is a third value of a text feature you added to your images"}
Example metadata.csv for image captioning:
file_name,text
0001.png,This is a golden retriever playing with a ball
0002.png,A german shepherd
0003.png,One chihuahua
Usage:
>>> dataset = load_dataset("imagefolder", data_dir="/path/to/folder", split="train")
>>> dataset[0]["text"]
"This is a golden retriever playing with a ball"
Upload dataset to the Hub:
from datasets import load_dataset
dataset = load_dataset("imagefolder", data_dir="/path/to/folder", split="train")
dataset.push_to_hub("stevhliu/my-image-captioning-dataset")
Depth estimation¶
train_dataset = load_dataset("sayakpaul/nyu_depth_v2", split="train")
Image classification¶
dataset = load_dataset("beans")
Semantic segmentation¶
Install:
pip install -U albumentations
Usage example:
train_dataset = load_dataset("scene_parse_150", split="train")
Object detection¶
Install:
pip install -U albumentations opencv-python
Usage example:
ds = load_dataset("cppe-5")
Text¶
Load text data¶
from datasets import load_dataset
dataset = load_dataset("text",
data_files={
"train": ["my_text_1.txt", "my_text_2.txt"],
"test": "my_test_file.txt"
}
)
dataset = load_dataset("text",
data_dir="path/to/text/dataset")
dataset = load_dataset("text",
data_files={"train": "my_train_file.txt", "test": "my_test_file.txt"},
sample_by="paragraph"
)
dataset = load_dataset("text",
data_files={"train": "my_train_file.txt", "test": "my_test_file.txt"},
sample_by="document"
)
Examples:
from datasets import load_dataset
c4_subset = load_dataset("allenai/c4", data_files="en/c4-train.0000*-of-01024.json.gz")
dataset = load_dataset("text", data_files="https://huggingface.co/datasets/lhoestq/test/resolve/main/some_text.txt")
Process text data¶
Map:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
dataset = dataset.map(lambda examples: tokenizer(examples["text"]), batched=True)
dataset[0]
{'text': 'the rock is destined to be the 21st century...',
'label': 1,
'input_ids': [101, 1996, 2600, 2003, 16036, 2000, ...],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]}
Tabular¶
CSV files¶
from datasets import load_dataset
dataset = load_dataset("csv", data_files="my_file.csv")
# load multiple CSV files
dataset = load_dataset("csv", data_files=["my_file_1.csv", "my_file_2.csv", "my_file_3.csv"])
# train/test split
dataset = load_dataset("csv", data_files={
"train": ["my_train_file_1.csv", "my_train_file_2.csv"],
"test": "my_test_file.csv"}
)
Pandas DataFrames¶
from datasets import Dataset
import pandas as pd
df = pd.read_csv("https://huggingface.co/datasets/imodels/credit-card/raw/main/train.csv")
dataset = Dataset.from_pandas(df)
Databases¶
SQLite¶
Create the database:
import sqlite3
import pandas as pd
conn = sqlite3.connect("us_covid_data.db")
df = pd.read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")
df.to_sql("states", conn, if_exists="replace")
load:
from datasets import Dataset
uri = "sqlite:///us_covid_data.db"
ds = Dataset.from_sql("states", uri)
# ds = Dataset.from_sql('SELECT * FROM states WHERE state="California";', uri)
ds
Usage:
ds.filter(lambda x: x["state"] == "California")
# ds.filter(lambda x: x["cases"] > 10000)
PostgreSQL¶
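Dataset.from_sql() accepts any SQLAlchemy connection URI, so the same pattern applies to PostgreSQL. A sketch, assuming a reachable server and the sqlalchemy/psycopg2 drivers are installed (the connection details are placeholders):
from datasets import Dataset
uri = "postgresql://username:password@localhost:5432/us_covid_data"
ds = Dataset.from_sql("states", uri)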
Dataset repository¶
Create a dataset loading script¶
A dataset loading script should have the same name as a dataset repository or directory:
my_dataset/
├── README.md
└── my_dataset.py
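A loading script subclasses GeneratorBasedBuilder and implements _info(), _split_generators(), and _generate_examples(). A minimal sketch of what my_dataset.py could contain, assuming a CSV-backed dataset; the file name, features, and label names are placeholders:
# my_dataset.py
import csv
import datasets

class MyDataset(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            description="My dataset.",
            features=datasets.Features(
                {"text": datasets.Value("string"), "label": datasets.ClassLabel(names=["neg", "pos"])}
            ),
        )

    def _split_generators(self, dl_manager):
        # download_and_extract works for local paths and URLs alike
        path = dl_manager.download_and_extract("data/train.csv")
        return [datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"filepath": path})]

    def _generate_examples(self, filepath):
        with open(filepath, encoding="utf-8") as f:
            for idx, row in enumerate(csv.DictReader(f)):
                yield idx, {"text": row["text"], "label": row["label"]}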