6.3.4. Datasets¶
Datasets is a library for easily accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP) tasks.
Install with pip:
pip install datasets
# check
python -c "from datasets import load_dataset; print(load_dataset('squad', split='train')[0])"
# Audio
pip install 'datasets[audio]'
# check
python -c "import soundfile; print(soundfile.__libsndfile_version__)"
# Vision
pip install 'datasets[vision]'
Install from source:
git clone https://github.com/huggingface/datasets.git
cd datasets
pip install -e .
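After a source install, a quick import check confirms the installed version (this only assumes the package imports cleanly):
python -c "import datasets; print(datasets.__version__)"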
TUTORIALS¶
Load a dataset from the Hub¶
This tutorial uses the rotten_tomatoes and MInDS-14 datasets.
Load a dataset¶
load_dataset_builder():
>>> from datasets import load_dataset_builder
>>> ds_builder = load_dataset_builder("rotten_tomatoes")
# Inspect dataset description
>>> ds_builder.info.description
Movie Review Dataset.
This is a dataset of containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews.
This data was first used in Bo Pang and Lillian Lee,
``Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales.'', Proceedings of the ACL, 2005.
# Inspect dataset features
>>> ds_builder.info.features
{'label': ClassLabel(num_classes=2, names=['neg', 'pos'], id=None),
'text': Value(dtype='string', id=None)}
load_dataset():
>>> from datasets import load_dataset
>>> dataset = load_dataset("rotten_tomatoes", split="train")
>>> dataset
Dataset({
features: ['text', 'label'],
num_rows: 8530
})
Splits¶
get_dataset_split_names():
>>> from datasets import get_dataset_split_names
>>> get_dataset_split_names("rotten_tomatoes")
['train', 'validation', 'test']
The load_dataset() call above specified a split. If no split is given, all splits are returned as a DatasetDict:
>>> from datasets import load_dataset
>>> dataset = load_dataset("rotten_tomatoes")
>>> dataset
DatasetDict({
train: Dataset({
features: ['text', 'label'],
num_rows: 8530
})
validation: Dataset({
features: ['text', 'label'],
num_rows: 1066
})
test: Dataset({
features: ['text', 'label'],
num_rows: 1066
})
})
Configurations¶
get_dataset_config_names():
>>> from datasets import get_dataset_config_names
>>> configs = get_dataset_config_names("PolyAI/minds14")
>>> print(configs)
['cs-CZ', 'de-DE', 'en-AU' ...]
Then load the configuration you want:
from datasets import load_dataset
mindsFR = load_dataset("PolyAI/minds14", "fr-FR", split="train")
Know your dataset¶
Dataset¶
Load the data:
from datasets import load_dataset
dataset = load_dataset("rotten_tomatoes", split="train")
Indexing:
>>> dataset[0]
{'label': 1,
'text': 'the rock is des...'}
# a list of all the values in the column
>>> dataset["text"]
[
'the rock is desti...',
'the gorgeously elabora...',
...,
'things really ...'
]
Slicing:
>>> dataset[:3]
{'label': [1, 1, 1],
'text':
[
'the rock is desti...',
'the gorgeously elabora...',
...,
'things really ...'
]
}
>>> dataset[3:6]
...
IterableDataset¶
An IterableDataset is loaded when you set the streaming parameter to True in load_dataset():
>>> from datasets import load_dataset
>>> iterable_dataset = load_dataset("food101", split="train", streaming=True)
>>> for example in iterable_dataset:
... print(example)
... break
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x7F0681F5C520>, 'label': 6}
>>> next(iter(iterable_dataset))
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x7F0681F59B50>,
'label': 6}
>>> for example in iterable_dataset:
... print(example)
... break
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x7F7479DE82B0>, 'label': 6}
Use IterableDataset.take() to return a subset with a specific number of examples:
# Get first three examples
>>> list(iterable_dataset.take(3))
[{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x7F7479DEE9D0>,
'label': 6},
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x512 at 0x7F7479DE8190>,
'label': 6},
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x383 at 0x7F7479DE8310>,
'label': 6}]
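IterableDataset.skip() is the complement of take(): it drops the first n examples and streams the rest. A small sketch on the same streaming dataset:
>>> # skip the first three examples and continue streaming from the fourth
>>> remaining = iterable_dataset.skip(3)
>>> next(iter(remaining))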
Preprocess¶
Tokenize text¶
Start by loading the rotten_tomatoes dataset and the tokenizer corresponding to a pretrained BERT model:
>>> from transformers import AutoTokenizer
>>> from datasets import load_dataset
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
>>> dataset = load_dataset("rotten_tomatoes", split="train")
Call your tokenizer on the first row of text in the dataset:
>>> tokenizer(dataset[0]["text"])
{
'input_ids': [101, 1103, 2067 ...],
'token_type_ids': [0, 0, 0, 0, 0 ...],
'attention_mask': [1, 1, 1, 1, 1 ...]
}
The returned fields are:
input_ids: the numbers representing the tokens in the text.
token_type_ids: indicates which sequence a token belongs to if there is more than one sequence.
attention_mask: indicates whether a token should be masked or not.
The fastest way to tokenize your entire dataset is to use the map() function:
def tokenization(example):
return tokenizer(example["text"])
dataset = dataset.map(tokenization, batched=True)
Set the format of your dataset to be compatible with your machine learning framework:
# Pytorch
dataset.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "labels"])
dataset.format['type']
# TensorFlow
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
tf_dataset = dataset.to_tf_dataset(
columns=["input_ids", "token_type_ids", "attention_mask"],
label_cols=["labels"],
batch_size=2,
collate_fn=data_collator,
shuffle=True
)
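For PyTorch, the formatted dataset can be fed straight to a DataLoader. A minimal sketch, assuming `dataset` is the tokenized dataset and `tokenizer` is the BERT tokenizer from above; `torch_dataset` and the batch size are illustrative:
from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

# keep only the model inputs so each batch converts cleanly to tensors
torch_dataset = dataset.with_format("torch", columns=["input_ids", "token_type_ids", "attention_mask", "label"])
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="pt")
dataloader = DataLoader(torch_dataset, batch_size=8, collate_fn=data_collator)
batch = next(iter(dataloader))  # dict of padded tensors: input_ids, token_type_ids, attention_mask, labels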
Resample audio signals¶
Start by loading the MInDS-14 dataset:
from transformers import AutoFeatureExtractor
from datasets import load_dataset, Audio
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
dataset = load_dataset("PolyAI/minds14", "en-US", split="train")
When you call the audio column of the dataset, it is automatically decoded and resampled:
>>> dataset[0]["audio"]
{'array': array([ 0. , 0.00024414, -0.00024414, ..., -0.00024414,
0. , 0. ], dtype=float32),
'path': '~/.cache/huggingface/datasets/downloads/extracted/xxxxx/en-US~JOINT_ACCOUNT/xxx.wav',
'sampling_rate': 8000}
Use the cast_column() function and set the sampling_rate parameter in the Audio feature to upsample the audio signal. When you call the audio column now, it is decoded and resampled to 16kHz:
>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
>>> dataset[0]["audio"]
{'array': array([ 2.3443763e-05, 2.1729663e-04, 2.2145823e-04, ...,
3.8356509e-05, -7.3497440e-06, -2.1754686e-05], dtype=float32),
'path': '~/.cache/huggingface/datasets/downloads/extracted/xxx/en-US~JOINT_ACCOUNT/xxx.wav',
'sampling_rate': 16000}
Use the map() function to resample the entire dataset to 16kHz:
def preprocess_function(examples):
audio_arrays = [x["array"] for x in examples["audio"]]
inputs = feature_extractor(
audio_arrays, sampling_rate=feature_extractor.sampling_rate, max_length=16000, truncation=True
)
return inputs
dataset = dataset.map(preprocess_function, batched=True)
Apply data augmentations¶
Start by loading the Beans dataset, the Image feature, and the feature extractor corresponding to a pretrained ViT model:
from transformers import AutoFeatureExtractor
from datasets import load_dataset, Image

feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
dataset = load_dataset("beans", split="train")
When you call the image column of the dataset, the underlying PIL object is automatically decoded into an image:
dataset[0]["image"]
Apply some transforms to the image:
from torchvision.transforms import RandomRotation

rotate = RandomRotation(degrees=(0, 90))

def transforms(examples):
    examples["pixel_values"] = [rotate(image.convert("RGB")) for image in examples["image"]]
    return examples
Use the set_transform() function to apply the transform on-the-fly:
dataset.set_transform(transforms)
dataset[0]["pixel_values"]
Evaluate predictions¶
You can see what metrics are available with list_metrics():
from datasets import list_metrics
metrics_list = list_metrics()
len(metrics_list)
print(metrics_list)
Load metric:
# Load a metric from the Hub with load_metric()
from datasets import load_metric
metric = load_metric('glue', 'mrpc')
Metrics object:
>>> print(metric.inputs_description)
Compute GLUE evaluation metric associated to each GLUE dataset.
Args:
predictions: list of predictions to score.
Each translation should be tokenized into a list of tokens.
references: list of lists of references for each translation.
Each reference should be tokenized into a list of tokens.
Returns: depending on the GLUE subset, one or several of:
"accuracy": Accuracy
"f1": F1 score
"pearson": Pearson Correlation
"spearmanr": Spearman Correlation
"matthews_correlation": Matthew Correlation
Examples:
# 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
>>> glue_metric = datasets.load_metric('glue', 'sst2')
>>> references = [0, 1]
>>> predictions = [0, 1]
>>> results = glue_metric.compute(predictions=predictions, references=references)
>>> print(results)
{'accuracy': 1.0}
>>> glue_metric = datasets.load_metric('glue', 'mrpc') # 'mrpc' or 'qqp'
>>> references = [0, 1]
>>> predictions = [0, 1]
>>> results = glue_metric.compute(predictions=predictions, references=references)
>>> print(results)
{'accuracy': 1.0, 'f1': 1.0}
>>> glue_metric = datasets.load_metric('glue', 'stsb')
>>> references = [0., 1., 2., 3., 4., 5.]
>>> predictions = [0., 1., 2., 3., 4., 5.]
>>> results = glue_metric.compute(predictions=predictions, references=references)
>>> print({"pearson": round(results["pearson"], 2), "spearmanr": round(results["spearmanr"], 2)})
{'pearson': 1.0, 'spearmanr': 1.0}
>>> glue_metric = datasets.load_metric('glue', 'cola')
>>> references = [0, 1]
>>> predictions = [0, 1]
>>> results = glue_metric.compute(predictions=predictions, references=references)
>>> print(results)
{'matthews_correlation': 1.0}
Compute metric:
model_predictions = model(model_inputs)
final_score = metric.compute(predictions=model_predictions, references=gold_references)
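For large evaluation sets, predictions can also be accumulated batch by batch with add_batch() and scored once at the end. A minimal sketch, assuming `model` and the evaluation loop structure come from your own pipeline:
for model_inputs, gold_references in evaluation_dataloader:   # hypothetical loop structure
    model_predictions = model(model_inputs)
    metric.add_batch(predictions=model_predictions, references=gold_references)
final_score = metric.compute()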
Create a dataset¶
Folder-based builders¶
Folder-based builders quickly create an image or audio dataset from a directory of files:
from datasets import load_dataset

# ImageFolder
dataset = load_dataset("imagefolder", data_dir="/path/to/pokemon")

# AudioFolder
dataset = load_dataset("audiofolder", data_dir="/path/to/folder")
From local files¶
>>> from datasets import Dataset
>>> def gen():
... yield {"pokemon": "bulbasaur", "type": "grass"}
... yield {"pokemon": "squirtle", "type": "water"}
>>> ds = Dataset.from_generator(gen)
>>> ds[0]
{"pokemon": "bulbasaur", "type": "grass"}
>>> from datasets import IterableDataset
>>> ds = IterableDataset.from_generator(gen)
>>> for example in ds:
... print(example)
{"pokemon": "bulbasaur", "type": "grass"}
{"pokemon": "squirtle", "type": "water"}
>>> from datasets import Dataset
>>> ds = Dataset.from_dict({"pokemon": ["bulbasaur", "squirtle"], "type": ["grass", "water"]})
>>> ds[0]
{"pokemon": "bulbasaur", "type": "grass"}
audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", ..., "path/to/audio_n"]}).cast_column("audio", Audio())
HOW-TO GUIDES¶
General usage¶
Load¶
Local loading script:
dataset = load_dataset("path/to/local/loading_script/loading_script.py", split="train")
# equivalent because the file has the same name as the directory
dataset = load_dataset("path/to/local/loading_script", split="train")
Local and remote files:
from datasets import load_dataset
dataset = load_dataset("csv", data_files="my_file.csv")
Example: JSON Lines data
{"a": 1, "b": 2.0, "c": "foo", "d": false}
{"a": 4, "b": -5.5, "c": null, "d": true}
Code:
from datasets import load_dataset
dataset = load_dataset("json", data_files="my_file.json")
Example: nested JSON data
{"version": "0.1.0",
"data": [{"a": 1, "b": 2.0, "c": "foo", "d": false},
{"a": 4, "b": -5.5, "c": null, "d": true}]
}
Code:
from datasets import load_dataset
dataset = load_dataset("json", data_files="my_file.json", field="data")
Example: remote JSON files
url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/"
dataset = load_dataset(
"json",
data_files={"train": url + "train-v1.json", "validation": url + "dev-v1.json"},
field="data"
)
Process¶
GLUE data¶
Example:
from datasets import load_dataset
dataset = load_dataset("glue", "mrpc", split="train")
Usage:
# Sort
dataset["label"][:10]
sorted_dataset = dataset.sort("label")
sorted_dataset["label"][:10]
sorted_dataset["label"][-10:]
# Shuffle
shuffled_dataset = sorted_dataset.shuffle(seed=42)
shuffled_dataset["label"][:10]
iterable_dataset = dataset.to_iterable_dataset(num_shards=128)
shuffled_iterable_dataset = iterable_dataset.shuffle(seed=42, buffer_size=1000)
# Select
small_dataset = dataset.select([0, 10, 20, 30, 40, 50])
len(small_dataset)
# Filter
start_with_ar = dataset.filter(lambda example: example["sentence1"].startswith("Ar"))
len(start_with_ar)
start_with_ar["sentence1"]
even_dataset = dataset.filter(lambda example, idx: idx % 2 == 0, with_indices=True)
len(even_dataset)
len(dataset) / 2
# Map
def add_prefix(example):
example["sentence1"] = 'My sentence: ' + example["sentence1"]
return example
updated_dataset = small_dataset.map(add_prefix)
>>> updated_dataset = dataset.map(
...     lambda example: {"new_sentence": example["sentence1"]}, remove_columns=["sentence1"]
... )
>>> updated_dataset.column_names
['sentence2', 'label', 'idx', 'new_sentence']
# Split
>>> dataset.train_test_split(test_size=0.1)
DatasetDict({
train: Dataset({
features: ['sentence1', 'sentence2', 'label', 'idx'],
num_rows: 3301
})
test: Dataset({
features: ['sentence1', 'sentence2', 'label', 'idx'],
num_rows: 367
})
})
# Cast
>>> dataset.features
{'sentence1': Value(dtype='string', id=None),
'sentence2': Value(dtype='string', id=None),
'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None),
'idx': Value(dtype='int32', id=None)}
>>> from datasets import ClassLabel, Value
>>> new_features = dataset.features.copy()
>>> new_features["label"] = ClassLabel(names=["negative", "positive"])
>>> new_features["idx"] = Value("int64")
>>> dataset = dataset.cast(new_features)
>>> dataset.features
{'sentence1': Value(dtype='string', id=None),
'sentence2': Value(dtype='string', id=None),
'label': ClassLabel(num_classes=2, names=['negative', 'positive'], names_file=None, id=None),
'idx': Value(dtype='int64', id=None)}
# cast_column() works the same way on a single column (shown here on a separate audio dataset)
>>> dataset.features
{'audio': Audio(sampling_rate=44100, mono=True, id=None)}
>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
>>> dataset.features
{'audio': Audio(sampling_rate=16000, mono=True, id=None)}
# Rename
>>> dataset
Dataset({
features: ['sentence1', 'sentence2', 'label', 'idx'],
num_rows: 3668
})
>>> dataset = dataset.rename_column("sentence1", "sentenceA")
>>> dataset = dataset.rename_column("sentence2", "sentenceB")
>>> dataset
Dataset({
features: ['sentenceA', 'sentenceB', 'label', 'idx'],
num_rows: 3668
})
# Remove
>>> dataset = dataset.remove_columns("label")
>>> dataset
Dataset({
features: ['sentence1', 'sentence2', 'idx'],
num_rows: 3668
})
>>> dataset = dataset.remove_columns(["sentence1", "sentence2"])
>>> dataset
Dataset({
features: ['idx'],
num_rows: 3668
})
IMDB data¶
Example:
>>> from datasets import load_dataset
>>> dataset = load_dataset("imdb", split="train")
>>> print(dataset)
Dataset({
features: ['text', 'label'],
num_rows: 25000
})
Usage:
# Shard
>>> dataset.shard(num_shards=4, index=0)
Dataset({
features: ['text', 'label'],
num_rows: 6250
})
SQuAD data¶
Example:
>>> from datasets import load_dataset
>>> dataset = load_dataset("squad", split="train")
>>> dataset.features
{'answers': Sequence(feature={
'text': Value(dtype='string', id=None),
'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None),
'context': Value(dtype='string', id=None),
'id': Value(dtype='string', id=None),
'question': Value(dtype='string', id=None),
'title': Value(dtype='string', id=None)}
Usage:
>>> flat_dataset = dataset.flatten()
>>> flat_dataset
Dataset({
features: ['id', 'title', 'context', 'question', 'answers.text', 'answers.answer_start'],
num_rows: 87599
})
Stream¶
Example:
>>> from datasets import load_dataset
>>> dataset = load_dataset('oscar-corpus/OSCAR-2201', 'en', split='train', streaming=True)
>>> print(next(iter(dataset)))
{'id': 0, 'text': 'Founded in 2015, ...', ...
To stream a local dataset made of hundreds of compressed JSONL files (such as oscar-corpus/OSCAR-2201):
>>> from datasets import load_dataset
>>> data_files = {'train': 'path/to/OSCAR-2201/compressed/en_meta/*.jsonl.gz'}
>>> dataset = load_dataset('json', data_files=data_files, split='train', streaming=True)
>>> print(next(iter(dataset)))
{'id': 0, 'text': 'Founded in 2015, ...', ...
Usage:
# Shuffle
>>> from datasets import load_dataset
>>> dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True)
>>> shuffled_dataset = dataset.shuffle(seed=42, buffer_size=10_000)
# Split dataset
dataset_head = dataset.take(2)  # get the first 2 examples
train_dataset = dataset.skip(1000)  # skip the first 1,000 examples; take() and skip() are used together to split a stream
# Interleave
>>> from datasets import interleave_datasets
>>> en_dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True)
>>> fr_dataset = load_dataset('oscar', "unshuffled_deduplicated_fr", split='train', streaming=True)
>>> multilingual_dataset = interleave_datasets([en_dataset, fr_dataset])
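interleave_datasets() also accepts sampling probabilities and a seed to control how the sources are mixed:
>>> multilingual_dataset_with_oversampling = interleave_datasets([en_dataset, fr_dataset], probabilities=[0.8, 0.2], seed=42)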
Full example:
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForMaskedLM, DataCollatorForLanguageModeling
from tqdm import tqdm
# `dataset` is a tokenized streaming dataset and `tokenizer` its matching tokenizer (see the sections above)
dataset = dataset.with_format("torch")
dataloader = DataLoader(dataset, collate_fn=DataCollatorForLanguageModeling(tokenizer))
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")
model.train().to(device)
optimizer = torch.optim.AdamW(params=model.parameters(), lr=1e-5)
for epoch in range(3):
dataset.set_epoch(epoch)
for i, batch in enumerate(tqdm(dataloader, total=5)):
if i == 5:
break
batch = {k: v.to(device) for k, v in batch.items()}
outputs = model(**batch)
loss = outputs[0]
loss.backward()
optimizer.step()
optimizer.zero_grad()
if i % 10 == 0:
print(f"loss: {loss}")
Audio¶
Load audio data¶
Local files:
>>> from datasets import Dataset, Audio
>>> audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", ...]}).cast_column("audio", Audio())
>>> audio_dataset[0]["audio"]
{'array': array([ 0. , 0.00024414, -0.00024414, ..., -0.00024414,
0. , 0. ], dtype=float32),
'path': 'path/to/audio_1',
'sampling_rate': 16000}
AudioFolder:
from datasets import load_dataset
dataset = load_dataset("audiofolder", data_dir="/path/to/folder")
# OR by specifying the list of files
dataset = load_dataset("audiofolder", data_files=["path/to/audio_1", ...])
# load remote datasets from their URLs
dataset = load_dataset("audiofolder", data_files=["https://foo.bar/audio_1", ...]
# for example, pass SpeechCommands archive:
dataset = load_dataset("audiofolder", data_files="https://s3.amazonaws.com/test.tar.gz")
Process audio data¶
Cast:
from datasets import load_dataset, Audio
dataset = load_dataset("PolyAI/minds14", "en-US", split="train")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
Map:
from transformers import Wav2Vec2CTCTokenizer, AutoFeatureExtractor, Wav2Vec2Processor
model_checkpoint = "facebook/wav2vec2-large-xlsr-53"
# build a CTC tokenizer from a local vocabulary file (AutoTokenizer cannot be instantiated directly)
tokenizer = Wav2Vec2CTCTokenizer("./v.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")
feature_extractor = AutoFeatureExtractor.from_pretrained(model_checkpoint)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)
def prepare_dataset(batch):
audio = batch["audio"]
batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
batch["input_length"] = len(batch["input_values"])
with processor.as_target_processor():
batch["labels"] = processor(batch["sentence"]).input_ids
return batch
dataset = dataset.map(prepare_dataset, remove_columns=dataset.column_names)
Create an audio dataset¶
Local files¶
>>> from datasets import Dataset, Audio
>>> audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", ...]}).cast_column("audio", Audio())
# inspect an example
>>> audio_dataset[0]["audio"]
# push to the Hugging Face Hub
>>> audio_dataset.push_to_hub("<username>/my_dataset")
The resulting repository structure:
my_dataset/
├── README.md
└── data/
└── train-00000-of-00001.parquet
AudioFolder¶
Example:
my_dataset/
├── README.md
├── metadata.csv
├── data/
├───── first_audio_file.mp3
├───── second_audio_file.mp3
└───── third_audio_file.mp3
metadata.csv:
file_name,transcription
data/first_audio_file.mp3,znowu się duch z ciałem
data/second_audio_file.mp3,już u źwierzyńca podwojów
data/third_audio_file.mp3,pewnie kędyś w obłędzie
Usage:
>>> from datasets import load_dataset
>>> dataset = load_dataset("audiofolder", data_dir="/path/to/data")
>>> dataset["train"][0]
{'audio':
{'path': '/path/to/extracted/audio/first_audio_file.mp3',
'array': array([ 0.00088501, 0.0012207 , 0.00131226, ..., -0.00045776], dtype=float32),
'sampling_rate': 16000},
'transcription': 'znowu się duch z ciałem'
}
Example with class labels taken from the directory names:
data/train/electronic/01.mp3
data/train/punk/01.mp3
data/test/electronic/09.mp3
data/test/punk/09.mp3
Usage:
>>> from datasets import load_dataset
>>> dataset = load_dataset("audiofolder", data_dir="/path/to/data")
>>> dataset["train"][0]
{'audio':
{'path': '/path/to/electronic/01.mp3',
'array': array([ 3.9714024e-07, 7.3031038e-07, ..., -1.1244172e-01], dtype=float32),
'sampling_rate': 44100},
'label': 0 # "electronic"
}
>>> dataset["train"][-1]
{'audio':
{'path': '/path/to/punk/01.mp3',
'array': array([0.15237972, 0.13222949, ..., 0.33717662], dtype=float32),
'sampling_rate': 44100},
'label': 1 # "punk"
}
Loading script¶
Example:
my_dataset/
├── README.md
├── my_dataset.py
└── data/
├── train.tar.gz
├── test.tar.gz
└── metadata.csv
Usage:
from datasets import load_dataset
dataset = load_dataset("path/to/my_dataset")
Vision¶
Load image data¶
load:
from datasets import load_dataset, Image
dataset = load_dataset("beans", split="train")
dataset[0]["image"]
Local files:
>>> from datasets import Dataset, load_dataset, Image
>>> dataset = Dataset.from_dict({"image": ["path/to/image_1", ...]}).cast_column("image", Image())
>>> dataset[0]["image"]
<PIL.PngImagePlugin.PngImageFile image mode=RGBA size=1200x215 at 0x15E6D7160>
>>> dataset = load_dataset("beans", split="train").cast_column("image", Image(decode=False))
>>> dataset[0]["image"]
ImageFolder¶
Example:
folder/train/dog/golden_retriever.png
folder/train/dog/german_shepherd.png
folder/train/dog/chihuahua.png
folder/train/cat/maine_coon.png
folder/train/cat/bengal.png
folder/train/cat/birman.png
Usage:
>>> from datasets import load_dataset
>>> dataset = load_dataset("imagefolder", data_dir="/path/to/folder")
>>> dataset["train"][0]
{"image": <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=1200x215 at 0x15E6D7160>, "label": 0}
>>> dataset["train"][-1]
{"image": <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=1200x215 at 0x15E8DAD30>, "label": 1}
Load remote datasets from their URLs:
dataset = load_dataset("imagefolder", data_files="https://a.cn/name.zip", split="train")
Process image data¶
Map¶
>>> def transforms(examples):
... examples["pixel_values"] = [image.convert("RGB").resize((100,100)) for image in examples["image"]]
... return examples
>>> dataset = dataset.map(transforms, remove_columns=["image"], batched=True)
>>> dataset[0]
{'label': 6,
'pixel_values': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=100x100 at 0x7F058237BB10>}
Apply transforms¶
Some data augmentation libraries:
torchvision: https://pytorch.org/vision/stable/index.html
Albumentations: https://albumentations.ai/docs/
Kornia: https://kornia.readthedocs.io/en/latest/
imgaug: https://imgaug.readthedocs.io/en/latest/
from torchvision.transforms import Compose, ColorJitter, ToTensor
jitter = Compose(
[
ColorJitter(brightness=0.25, contrast=0.25, saturation=0.25, hue=0.7),
ToTensor(),
]
)
def transforms(examples):
examples["pixel_values"] = [jitter(image.convert("RGB")) for image in examples["image"]]
return examples
dataset.set_transform(transforms)
Create an image dataset¶
Example:
folder/train/metadata.csv
folder/train/0001.png
folder/train/0002.png
folder/train/0003.png
Example with archived splits:
folder/metadata.csv
folder/train.zip
folder/test.zip
folder/valid.zip
metadata.csv contents:
file_name,additional_feature
0001.png,This is a first value of a text feature you added to your images
0002.png,This is a second value of a text feature you added to your images
0003.png,This is a third value of a text feature you added to your images
or using metadata.jsonl:
{"file_name": "0001.png", "additional_feature": "This is a first value of a text feature you added to your images"}
{"file_name": "0002.png", "additional_feature": "This is a second value of a text feature you added to your images"}
{"file_name": "0003.png", "additional_feature": "This is a third value of a text feature you added to your images"}
Example metadata.csv for image captioning:
file_name,text
0001.png,This is a golden retriever playing with a ball
0002.png,A german shepherd
0003.png,One chihuahua
Usage:
>>> dataset = load_dataset("imagefolder", data_dir="/path/to/folder", split="train")
>>> dataset[0]["text"]
"This is a golden retriever playing with a ball"
Upload dataset to the Hub:
from datasets import load_dataset
dataset = load_dataset("imagefolder", data_dir="/path/to/folder", split="train")
dataset.push_to_hub("stevhliu/my-image-captioning-dataset")
Depth estimation¶
train_dataset = load_dataset("sayakpaul/nyu_depth_v2", split="train")
Image classification¶
dataset = load_dataset("beans")
Semantic segmentation¶
Install:
pip install -U albumentations
Usage example:
train_dataset = load_dataset("scene_parse_150", split="train")
Object detection¶
Install:
pip install -U albumentations opencv-python
Usage example:
ds = load_dataset("cppe-5")
Text¶
Load text data¶
from datasets import load_dataset
dataset = load_dataset("text",
data_files={
"train": ["my_text_1.txt", "my_text_2.txt"],
"test": "my_test_file.txt"
}
)
dataset = load_dataset("text",
data_dir="path/to/text/dataset")
dataset = load_dataset("text",
data_files={"train": "my_train_file.txt", "test": "my_test_file.txt"},
sample_by="paragraph"
)
dataset = load_dataset("text",
data_files={"train": "my_train_file.txt", "test": "my_test_file.txt"},
sample_by="document"
)
Examples:
from datasets import load_dataset
c4_subset = load_dataset("allenai/c4", data_files="en/c4-train.0000*-of-01024.json.gz")
dataset = load_dataset("text", data_files="https://huggingface.co/datasets/lhoestq/test/resolve/main/some_text.txt")
Process text data¶
Map:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
dataset = dataset.map(lambda examples: tokenizer(examples["text"]), batched=True)
dataset[0]
{'text': 'the rock is destined to be the 21st century...',
'label': 1,
'input_ids': [101, 1996, 2600, 2003, 16036, 2000, ...],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]}
Tabular¶
CSV files¶
from datasets import load_dataset
dataset = load_dataset("csv", data_files="my_file.csv")
# load multiple CSV files
dataset = load_dataset("csv", data_files=["my_file_1.csv", "my_file_2.csv", "my_file_3.csv"])
# train/test split
dataset = load_dataset("csv", data_files={
"train": ["my_train_file_1.csv", "my_train_file_2.csv"],
"test": "my_test_file.csv"}
)
Pandas DataFrames¶
from datasets import Dataset
import pandas as pd
df = pd.read_csv("https://huggingface.co/datasets/imodels/credit-card/raw/main/train.csv")
dataset = Dataset.from_pandas(df)
Databases¶
SQLite¶
Create the database:
import sqlite3
import pandas as pd
conn = sqlite3.connect("us_covid_data.db")
df = pd.read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")
df.to_sql("states", conn, if_exists="replace")
load:
from datasets import Dataset
uri = "sqlite:///us_covid_data.db"
ds = Dataset.from_sql("states", uri)
# ds = Dataset.from_sql('SELECT * FROM states WHERE state="California";', uri)
ds
Usage:
ds.filter(lambda x: x["state"] == "California")
# ds.filter(lambda x: x["cases"] > 10000)
PostgreSQL¶
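Dataset.from_sql() accepts any SQLAlchemy connection URI, so the same pattern applies to PostgreSQL. A sketch, assuming a reachable server and the sqlalchemy/psycopg2 drivers are installed (the connection details are placeholders):
from datasets import Dataset
uri = "postgresql://username:password@localhost:5432/us_covid_data"
ds = Dataset.from_sql("states", uri)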
Dataset repository¶
Create a dataset loading script¶
A dataset loading script should have the same name as a dataset repository or directory:
my_dataset/
├── README.md
└── my_dataset.py
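A loading script subclasses GeneratorBasedBuilder and implements _info(), _split_generators(), and _generate_examples(). A minimal sketch of what my_dataset.py could contain, assuming a CSV-backed dataset; the file name, features, and label names are placeholders:
# my_dataset.py
import csv
import datasets

class MyDataset(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            description="My dataset.",
            features=datasets.Features(
                {"text": datasets.Value("string"), "label": datasets.ClassLabel(names=["neg", "pos"])}
            ),
        )

    def _split_generators(self, dl_manager):
        # download_and_extract works for local paths and URLs alike
        path = dl_manager.download_and_extract("data/train.csv")
        return [datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"filepath": path})]

    def _generate_examples(self, filepath):
        with open(filepath, encoding="utf-8") as f:
            for idx, row in enumerate(csv.DictReader(f)):
                yield idx, {"text": row["text"], "label": row["label"]}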