Datasets
########

* Datasets is a library for easily accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP) tasks.

Install with pip::

    pip install datasets
    # check
    python -c "from datasets import load_dataset; print(load_dataset('squad', split='train')[0])"

    # Audio
    pip install 'datasets[audio]'
    # check
    python -c "import soundfile; print(soundfile.__libsndfile_version__)"

    # Vision
    pip install 'datasets[vision]'

Install from source::

    git clone https://github.com/huggingface/datasets.git
    cd datasets
    pip install -e .

TUTORIALS
=========

Load a dataset from the Hub
---------------------------

* This tutorial uses the `rotten_tomatoes <https://huggingface.co/datasets/rotten_tomatoes>`_ and `MInDS-14 <https://huggingface.co/datasets/PolyAI/minds14>`_ datasets.

Load a dataset
^^^^^^^^^^^^^^

load_dataset_builder()::

    >>> from datasets import load_dataset_builder
    >>> ds_builder = load_dataset_builder("rotten_tomatoes")

    # Inspect dataset description
    >>> ds_builder.info.description
    Movie Review Dataset. This is a dataset of containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. This data was first used in Bo Pang and Lillian Lee, ``Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales.'', Proceedings of the ACL, 2005.

    # Inspect dataset features
    >>> ds_builder.info.features
    {'label': ClassLabel(num_classes=2, names=['neg', 'pos'], id=None),
     'text': Value(dtype='string', id=None)}

load_dataset()::

    >>> from datasets import load_dataset
    >>> dataset = load_dataset("rotten_tomatoes", split="train")
    >>> dataset
    Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })

Splits
^^^^^^

get_dataset_split_names()::

    >>> from datasets import get_dataset_split_names
    >>> get_dataset_split_names("rotten_tomatoes")
    ['train', 'validation', 'test']

The load_dataset() call above specified a split; if you don't, the whole DatasetDict is returned::

    >>> from datasets import load_dataset
    >>> dataset = load_dataset("rotten_tomatoes")
    >>> dataset
    DatasetDict({
        train: Dataset({
            features: ['text', 'label'],
            num_rows: 8530
        })
        validation: Dataset({
            features: ['text', 'label'],
            num_rows: 1066
        })
        test: Dataset({
            features: ['text', 'label'],
            num_rows: 1066
        })
    })

Configurations
^^^^^^^^^^^^^^

get_dataset_config_names()::

    >>> from datasets import get_dataset_config_names
    >>> configs = get_dataset_config_names("PolyAI/minds14")
    >>> print(configs)
    ['cs-CZ', 'de-DE', 'en-AU' ...]

Then load the configuration you want::

    from datasets import load_dataset
    mindsFR = load_dataset("PolyAI/minds14", "fr-FR", split="train")

Know your dataset
-----------------

Dataset
^^^^^^^

Load the data::

    from datasets import load_dataset
    dataset = load_dataset("rotten_tomatoes", split="train")

Indexing::

    >>> dataset[0]
    {'label': 1, 'text': 'the rock is des...'}

    # a list of all the values in the column
    >>> dataset["text"]
    ['the rock is desti...', 'the gorgeously elabora...', ..., 'things really ...']

Slicing::

    >>> dataset[:3]
    {'label': [1, 1, 1],
     'text': ['the rock is desti...', 'the gorgeously elabora...', ..., 'things really ...']}

    >>> dataset[3:6]
    ...

IterableDataset
^^^^^^^^^^^^^^^

An IterableDataset is loaded when you set the streaming parameter to True in load_dataset()::

    >>> from datasets import load_dataset
    >>> iterable_dataset = load_dataset("food101", split="train", streaming=True)
    >>> for example in iterable_dataset:
    ...     print(example)
    ...     break
    {'image': <PIL.JpegImagePlugin.JpegImageFile ...>, 'label': 6}

    >>> next(iter(iterable_dataset))
    {'image': <PIL.JpegImagePlugin.JpegImageFile ...>, 'label': 6}
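A regular Dataset can also be converted into an IterableDataset without re-downloading anything — a small sketch reusing the rotten_tomatoes ``dataset`` loaded above (``iterable_rt`` is just an illustrative name, output illustrative)::

    >>> iterable_rt = dataset.to_iterable_dataset()
    >>> next(iter(iterable_rt))
    {'text': 'the rock is des...', 'label': 1}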
Use IterableDataset.take() to get a subset of the dataset with a specific number of examples::

    # Get first three examples
    >>> list(iterable_dataset.take(3))
    [{'image': <PIL.JpegImagePlugin.JpegImageFile ...>, 'label': 6},
     {'image': <PIL.JpegImagePlugin.JpegImageFile ...>, 'label': 6},
     {'image': <PIL.JpegImagePlugin.JpegImageFile ...>, 'label': 6}]

Preprocess
^^^^^^^^^^

Tokenize text
"""""""""""""

Start by loading the rotten_tomatoes dataset and the tokenizer corresponding to a pretrained BERT model::

    >>> from transformers import AutoTokenizer
    >>> from datasets import load_dataset
    >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    >>> dataset = load_dataset("rotten_tomatoes", split="train")

Call your tokenizer on the first row of text in the dataset::

    >>> tokenizer(dataset[0]["text"])
    {'input_ids': [101, 1103, 2067 ...],
     'token_type_ids': [0, 0, 0, 0, 0 ...],
     'attention_mask': [1, 1, 1, 1, 1 ...]}

Field descriptions:

* input_ids: the numbers representing the tokens in the text.
* token_type_ids: indicates which sequence a token belongs to if there is more than one sequence.
* attention_mask: indicates whether a token should be masked or not.

The fastest way to tokenize your entire dataset is to use the map() function::

    def tokenization(example):
        return tokenizer(example["text"])

    dataset = dataset.map(tokenization, batched=True)

Set the format of your dataset to be compatible with your machine learning framework::

    # PyTorch
    dataset.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "labels"])
    dataset.format['type']

    # TensorFlow
    from transformers import DataCollatorWithPadding

    data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
    tf_dataset = dataset.to_tf_dataset(
        columns=["input_ids", "token_type_ids", "attention_mask"],
        label_cols=["labels"],
        batch_size=2,
        collate_fn=data_collator,
        shuffle=True
    )

Resample audio signals
""""""""""""""""""""""

Start by loading the MInDS-14 dataset::

    from transformers import AutoFeatureExtractor
    from datasets import load_dataset, Audio

    feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
    dataset = load_dataset("PolyAI/minds14", "en-US", split="train")

When you call the audio column of the dataset, it is automatically decoded and resampled::

    >>> dataset[0]["audio"]
    {'array': array([ 0.        ,  0.00024414, -0.00024414, ..., -0.00024414,
             0.        ,  0.        ], dtype=float32),
     'path': '~/.cache/huggingface/datasets/downloads/extracted/xxxxx/en-US~JOINT_ACCOUNT/xxx.wav',
     'sampling_rate': 8000}

Use the cast_column() function and set the sampling_rate parameter in the Audio feature to upsample the audio signal. When you call the audio column now, it is decoded and resampled to 16kHz::

    >>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
    >>> dataset[0]["audio"]
    {'array': array([ 2.3443763e-05,  2.1729663e-04,  2.2145823e-04, ...,
             3.8356509e-05, -7.3497440e-06, -2.1754686e-05], dtype=float32),
     'path': '~/.cache/huggingface/datasets/downloads/extracted/xxx/en-US~JOINT_ACCOUNT/xxx.wav',
     'sampling_rate': 16000}

Use the map() function to resample the entire dataset to 16kHz::

    def preprocess_function(examples):
        audio_arrays = [x["array"] for x in examples["audio"]]
        inputs = feature_extractor(
            audio_arrays,
            sampling_rate=feature_extractor.sampling_rate,
            max_length=16000,
            truncation=True
        )
        return inputs

    dataset = dataset.map(preprocess_function, batched=True)
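As a quick sanity check (a sketch; output illustrative), the extracted ``input_values`` are capped at 16,000 samples because of ``max_length`` and ``truncation``::

    >>> len(dataset[0]["input_values"])
    16000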
Apply data augmentations
""""""""""""""""""""""""

1. Start by loading the Beans dataset, the Image feature, and the feature extractor corresponding to a pretrained ViT model::

    from transformers import AutoFeatureExtractor
    from datasets import load_dataset, Image

    feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
    dataset = load_dataset("beans", split="train")

2. When you call the image column of the dataset, the underlying PIL object is automatically decoded into an image::

    dataset[0]["image"]

3. Apply some transforms to the image::

    from torchvision.transforms import RandomRotation

    rotate = RandomRotation(degrees=(0, 90))

    def transforms(examples):
        examples["pixel_values"] = [rotate(image.convert("RGB")) for image in examples["image"]]
        return examples

4. Use the set_transform() function to apply the transform on-the-fly::

    dataset.set_transform(transforms)
    dataset[0]["pixel_values"]

Evaluate predictions
^^^^^^^^^^^^^^^^^^^^

You can see what metrics are available with list_metrics()::

    from datasets import list_metrics

    metrics_list = list_metrics()
    len(metrics_list)
    print(metrics_list)

Load metric::

    # Load a metric from the Hub with load_metric()
    from datasets import load_metric

    metric = load_metric('glue', 'mrpc')

Metrics object::

    >>> print(metric.inputs_description)
    Compute GLUE evaluation metric associated to each GLUE dataset.
    Args:
        predictions: list of predictions to score. Each translation should be tokenized into a list of tokens.
        references: list of lists of references for each translation. Each reference should be tokenized into a list of tokens.
    Returns: depending on the GLUE subset, one or several of:
        "accuracy": Accuracy
        "f1": F1 score
        "pearson": Pearson Correlation
        "spearmanr": Spearman Correlation
        "matthews_correlation": Matthew Correlation
    Examples:
        # 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
        >>> glue_metric = datasets.load_metric('glue', 'sst2')
        >>> references = [0, 1]
        >>> predictions = [0, 1]
        >>> results = glue_metric.compute(predictions=predictions, references=references)
        >>> print(results)
        {'accuracy': 1.0}

        >>> glue_metric = datasets.load_metric('glue', 'mrpc')  # 'mrpc' or 'qqp'
        >>> references = [0, 1]
        >>> predictions = [0, 1]
        >>> results = glue_metric.compute(predictions=predictions, references=references)
        >>> print(results)
        {'accuracy': 1.0, 'f1': 1.0}

        >>> glue_metric = datasets.load_metric('glue', 'stsb')
        >>> references = [0., 1., 2., 3., 4., 5.]
        >>> predictions = [0., 1., 2., 3., 4., 5.]
        >>> results = glue_metric.compute(predictions=predictions, references=references)
        >>> print({"pearson": round(results["pearson"], 2), "spearmanr": round(results["spearmanr"], 2)})
        {'pearson': 1.0, 'spearmanr': 1.0}

        >>> glue_metric = datasets.load_metric('glue', 'cola')
        >>> references = [0, 1]
        >>> predictions = [0, 1]
        >>> results = glue_metric.compute(predictions=predictions, references=references)
        >>> print(results)
        {'matthews_correlation': 1.0}

Compute metric::

    model_predictions = model(model_inputs)
    final_score = metric.compute(predictions=model_predictions, references=gold_references)
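For batched evaluation you can accumulate predictions first and compute the score once at the end — a sketch, assuming a ``dataloader`` that yields model inputs together with gold ``references``::

    for batch, references in dataloader:
        predictions = model(batch)
        metric.add_batch(predictions=predictions, references=references)
    final_score = metric.compute()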
Create a dataset
^^^^^^^^^^^^^^^^

Folder-based builders
"""""""""""""""""""""

* Folder-based builders are for quickly creating an image or audio dataset.

.. figure:: https://img.zhaoweiguo.com/uPic/2023/07/mxt9eY.jpg

This is how the folder-based builders generate a dataset::

    from datasets import load_dataset

    dataset = load_dataset("imagefolder", data_dir="/path/to/pokemon")

    dataset = load_dataset("audiofolder", data_dir="/path/to/folder")

From local files
""""""""""""""""

::

    >>> from datasets import Dataset
    >>> def gen():
    ...     yield {"pokemon": "bulbasaur", "type": "grass"}
    ...     yield {"pokemon": "squirtle", "type": "water"}
    >>> ds = Dataset.from_generator(gen)
    >>> ds[0]
    {"pokemon": "bulbasaur", "type": "grass"}

::

    >>> from datasets import IterableDataset
    >>> ds = IterableDataset.from_generator(gen)
    >>> for example in ds:
    ...     print(example)
    {"pokemon": "bulbasaur", "type": "grass"}
    {"pokemon": "squirtle", "type": "water"}

::

    >>> from datasets import Dataset
    >>> ds = Dataset.from_dict({"pokemon": ["bulbasaur", "squirtle"], "type": ["grass", "water"]})
    >>> ds[0]
    {"pokemon": "bulbasaur", "type": "grass"}

::

    audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", ..., "path/to/audio_n"]}).cast_column("audio", Audio())

HOW-TO GUIDES
=============

General usage
-------------

Load
^^^^

Local loading script::

    dataset = load_dataset("path/to/local/loading_script/loading_script.py", split="train")
    # equivalent because the file has the same name as the directory
    dataset = load_dataset("path/to/local/loading_script", split="train")

Local and remote files::

    from datasets import load_dataset
    dataset = load_dataset("csv", data_files="my_file.csv")

Example JSON data (one object per line)::

    {"a": 1, "b": 2.0, "c": "foo", "d": false}
    {"a": 4, "b": -5.5, "c": null, "d": true}

Code::

    from datasets import load_dataset
    dataset = load_dataset("json", data_files="my_file.json")

Example JSON data (nested under a field)::

    {"version": "0.1.0",
     "data": [{"a": 1, "b": 2.0, "c": "foo", "d": false},
              {"a": 4, "b": -5.5, "c": null, "d": true}]
    }

Code::

    from datasets import load_dataset
    dataset = load_dataset("json", data_files="my_file.json", field="data")

Example with remote JSON files::

    url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/"
    dataset = load_dataset(
        "json",
        data_files={"train": url + "train-v1.json", "validation": url + "dev-v1.json"},
        field="data"
    )

Process
^^^^^^^

GLUE dataset
""""""""""""

Example::

    from datasets import load_dataset
    dataset = load_dataset("glue", "mrpc", split="train")

Usage::

    # Sort
    dataset["label"][:10]
    sorted_dataset = dataset.sort("label")
    sorted_dataset["label"][:10]
    sorted_dataset["label"][-10:]

    # Shuffle
    shuffled_dataset = sorted_dataset.shuffle(seed=42)
    shuffled_dataset["label"][:10]
    iterable_dataset = dataset.to_iterable_dataset(num_shards=128)
    shuffled_iterable_dataset = iterable_dataset.shuffle(seed=42, buffer_size=1000)

    # Select
    small_dataset = dataset.select([0, 10, 20, 30, 40, 50])
    len(small_dataset)

    # Filter
    start_with_ar = dataset.filter(lambda example: example["sentence1"].startswith("Ar"))
    len(start_with_ar)
    start_with_ar["sentence1"]
    even_dataset = dataset.filter(lambda example, idx: idx % 2 == 0, with_indices=True)
    len(even_dataset)
    len(dataset) / 2

    # Map
    def add_prefix(example):
        example["sentence1"] = 'My sentence: ' + example["sentence1"]
        return example

    updated_dataset = small_dataset.map(add_prefix)

    >>> updated_dataset = dataset.map(
    ...     lambda example: {"new_sentence": example["sentence1"]},
    ...     remove_columns=["sentence1"]
    ... )
    >>> updated_dataset.column_names
    ['sentence2', 'label', 'idx', 'new_sentence']
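    # A sketch: map() also accepts num_proc to run the mapping function in
    # several processes (add_prefix as defined above)
    updated_dataset = dataset.map(add_prefix, num_proc=4)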
    # Split
    >>> dataset.train_test_split(test_size=0.1)
    DatasetDict({
        train: Dataset({
            features: ['sentence1', 'sentence2', 'label', 'idx'],
            num_rows: 3301
        })
        test: Dataset({
            features: ['sentence1', 'sentence2', 'label', 'idx'],
            num_rows: 367
        })
    })

    # Cast
    >>> dataset.features
    {'sentence1': Value(dtype='string', id=None),
     'sentence2': Value(dtype='string', id=None),
     'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None),
     'idx': Value(dtype='int32', id=None)}

    >>> from datasets import ClassLabel, Value
    >>> new_features = dataset.features.copy()
    >>> new_features["label"] = ClassLabel(names=["negative", "positive"])
    >>> new_features["idx"] = Value("int64")
    >>> dataset = dataset.cast(new_features)
    >>> dataset.features
    {'sentence1': Value(dtype='string', id=None),
     'sentence2': Value(dtype='string', id=None),
     'label': ClassLabel(num_classes=2, names=['negative', 'positive'], names_file=None, id=None),
     'idx': Value(dtype='int64', id=None)}

    >>> dataset.features
    {'audio': Audio(sampling_rate=44100, mono=True, id=None)}
    >>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
    >>> dataset.features
    {'audio': Audio(sampling_rate=16000, mono=True, id=None)}

    # Rename
    >>> dataset
    Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    >>> dataset = dataset.rename_column("sentence1", "sentenceA")
    >>> dataset = dataset.rename_column("sentence2", "sentenceB")
    >>> dataset
    Dataset({
        features: ['sentenceA', 'sentenceB', 'label', 'idx'],
        num_rows: 3668
    })

    # Remove
    >>> dataset = dataset.remove_columns("label")
    >>> dataset
    Dataset({
        features: ['sentence1', 'sentence2', 'idx'],
        num_rows: 3668
    })
    >>> dataset = dataset.remove_columns(["sentence1", "sentence2"])
    >>> dataset
    Dataset({
        features: ['idx'],
        num_rows: 3668
    })

IMDB dataset
""""""""""""

Example::

    >>> from datasets import load_dataset
    >>> dataset = load_dataset("imdb", split="train")
    >>> print(dataset)
    Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })

Usage::

    # Shard
    >>> dataset.shard(num_shards=4, index=0)
    Dataset({
        features: ['text', 'label'],
        num_rows: 6250
    })

SQuAD dataset
"""""""""""""

Example::

    >>> from datasets import load_dataset
    >>> dataset = load_dataset("squad", split="train")
    >>> dataset.features
    {'answers': Sequence(feature={'text': Value(dtype='string', id=None),
                                  'answer_start': Value(dtype='int32', id=None)},
                         length=-1, id=None),
     'context': Value(dtype='string', id=None),
     'id': Value(dtype='string', id=None),
     'question': Value(dtype='string', id=None),
     'title': Value(dtype='string', id=None)}

Usage::

    >>> flat_dataset = dataset.flatten()
    >>> flat_dataset
    Dataset({
        features: ['id', 'title', 'context', 'question', 'answers.text', 'answers.answer_start'],
        num_rows: 87599
    })

Stream
^^^^^^

.. figure:: https://img.zhaoweiguo.com/uPic/2023/07/7xFX0b.jpg

Dataset streaming lets you work with very large datasets when you don't have enough disk space, don't want to wait for a full download, or just want to skim a few samples.

Example::

    >>> from datasets import load_dataset
    >>> dataset = load_dataset('oscar-corpus/OSCAR-2201', 'en', split='train', streaming=True)
    >>> print(next(iter(dataset)))
    {'id': 0, 'text': 'Founded in 2015, ...', ...

Stream a local dataset of hundreds of compressed JSONL files (such as oscar-corpus/OSCAR-2201)::

    >>> from datasets import load_dataset
    >>> data_files = {'train': 'path/to/OSCAR-2201/compressed/en_meta/*.jsonl.gz'}
    >>> dataset = load_dataset('json', data_files=data_files, split='train', streaming=True)
    >>> print(next(iter(dataset)))
    {'id': 0, 'text': 'Founded in 2015, ...', ...
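Map transforms are applied lazily on a streamed dataset — a small sketch (``text`` is the column from the OSCAR example above; ``n_chars`` is just an illustrative new field)::

    >>> lengths = dataset.map(lambda x: {"n_chars": len(x["text"])})
    >>> print(next(iter(lengths))["n_chars"])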
Usage::

    # Shuffle
    >>> from datasets import load_dataset
    >>> dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True)
    >>> shuffled_dataset = dataset.shuffle(seed=42, buffer_size=10_000)

    # Split dataset
    dataset_head = dataset.take(2)       # get the first 2 examples
    train_dataset = dataset.skip(1000)   # take() and skip() can be combined

    # Interleave
    >>> from datasets import interleave_datasets
    >>> en_dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True)
    >>> fr_dataset = load_dataset('oscar', "unshuffled_deduplicated_fr", split='train', streaming=True)
    >>> multilingual_dataset = interleave_datasets([en_dataset, fr_dataset])

Complete example::

    import torch
    from torch.utils.data import DataLoader
    from transformers import AutoModelForMaskedLM, DataCollatorForLanguageModeling
    from tqdm import tqdm

    dataset = dataset.with_format("torch")
    dataloader = DataLoader(dataset, collate_fn=DataCollatorForLanguageModeling(tokenizer))
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")
    model.train().to(device)
    optimizer = torch.optim.AdamW(params=model.parameters(), lr=1e-5)

    for epoch in range(3):
        dataset.set_epoch(epoch)
        for i, batch in enumerate(tqdm(dataloader, total=5)):
            if i == 5:
                break
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            loss = outputs[0]
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            if i % 10 == 0:
                print(f"loss: {loss}")

Audio
-----

Load audio data
^^^^^^^^^^^^^^^

Local files::

    audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", ...]}).cast_column("audio", Audio())
    audio_dataset[0]["audio"]

AudioFolder::

    from datasets import load_dataset

    dataset = load_dataset("audiofolder", data_dir="/path/to/folder")

    # OR by specifying the list of files
    dataset = load_dataset("audiofolder", data_files=["path/to/audio_1", ...])

    # load remote datasets from their URLs
    dataset = load_dataset("audiofolder", data_files=["https://foo.bar/audio_1", ...])

    # for example, pass a SpeechCommands-style archive:
    dataset = load_dataset("audiofolder", data_files="https://s3.amazonaws.com/test.tar.gz")

Process audio data
^^^^^^^^^^^^^^^^^^

Cast::

    from datasets import load_dataset, Audio

    dataset = load_dataset("PolyAI/minds14", "en-US", split="train")
    dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
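A quick way to confirm the cast took effect (output illustrative)::

    >>> dataset.features["audio"].sampling_rate
    16000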
.. figure:: https://img.zhaoweiguo.com/uPic/2023/07/72sLGK.jpg

Map::

    from transformers import AutoTokenizer, AutoFeatureExtractor, AutoProcessor

    model_checkpoint = "facebook/wav2vec2-large-xlsr-53"
    tokenizer = AutoTokenizer("./v.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")
    feature_extractor = AutoFeatureExtractor.from_pretrained(model_checkpoint)
    processor = AutoProcessor.from_pretrained(feature_extractor=feature_extractor, tokenizer=tokenizer)

    def prepare_dataset(batch):
        audio = batch["audio"]
        batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
        batch["input_length"] = len(batch["input_values"])
        with processor.as_target_processor():
            batch["labels"] = processor(batch["sentence"]).input_ids
        return batch

    dataset = dataset.map(prepare_dataset, remove_columns=dataset.column_names)

Create an audio dataset
^^^^^^^^^^^^^^^^^^^^^^^

Local files
"""""""""""

::

    >>> audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", ...]}).cast_column("audio", Audio())
    # inspect
    >>> audio_dataset[0]["audio"]
    # push to the Hugging Face Hub
    >>> audio_dataset.push_to_hub("<namespace>/my_dataset")

Resulting repository layout::

    my_dataset/
    ├── README.md
    └── data/
        └── train-00000-of-00001.parquet

AudioFolder
"""""""""""

Example layout::

    my_dataset/
    ├── README.md
    ├── metadata.csv
    └── data/
        ├── first_audio_file.mp3
        ├── second_audio_file.mp3
        └── third_audio_file.mp3

metadata.csv::

    file_name,transcription
    data/first_audio_file.mp3,znowu się duch z ciałem
    data/second_audio_file.mp3,już u źwierzyńca podwojów
    data/third_audio_file.mp3,pewnie kędyś w obłędzie

Usage::

    >>> from datasets import load_dataset
    >>> dataset = load_dataset("audiofolder", data_dir="/path/to/data")
    >>> dataset["train"][0]
    {'audio': {'path': '/path/to/extracted/audio/first_audio_file.mp3',
               'array': array([ 0.00088501,  0.0012207 ,  0.00131226, ..., -0.00045776], dtype=float32),
               'sampling_rate': 16000},
     'transcription': 'znowu się duch z ciałem'}

Example layout (class folders)::

    data/train/electronic/01.mp3
    data/train/punk/01.mp3
    data/test/electronic/09.mp3
    data/test/punk/09.mp3

Usage::

    >>> from datasets import load_dataset
    >>> dataset = load_dataset("audiofolder", data_dir="/path/to/data")
    >>> dataset["train"][0]
    {'audio': {'path': '/path/to/electronic/01.mp3',
               'array': array([ 3.9714024e-07,  7.3031038e-07, ..., -1.1244172e-01], dtype=float32),
               'sampling_rate': 44100},
     'label': 0  # "electronic"
    }
    >>> dataset["train"][-1]
    {'audio': {'path': '/path/to/punk/01.mp3',
               'array': array([0.15237972, 0.13222949, ..., 0.33717662], dtype=float32),
               'sampling_rate': 44100},
     'label': 1  # "punk"
    }

Loading script
""""""""""""""

Example layout::

    my_dataset/
    ├── README.md
    ├── my_dataset.py
    └── data/
        ├── train.tar.gz
        ├── test.tar.gz
        └── metadata.csv

Usage::

    from datasets import load_dataset
    dataset = load_dataset("path/to/my_dataset")

Vision
------

Load image data
^^^^^^^^^^^^^^^

Load::

    from datasets import load_dataset, Image

    dataset = load_dataset("beans", split="train")
    dataset[0]["image"]

Local files::

    >>> from datasets import Dataset, load_dataset, Image
    >>> dataset = Dataset.from_dict({"image": ["path/to/image_1", ...]}).cast_column("image", Image())
    >>> dataset[0]["image"]

    >>> dataset = load_dataset("beans", split="train").cast_column("image", Image(decode=False))
    >>> dataset[0]["image"]
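With ``decode=False`` the column returns the underlying storage rather than a decoded PIL image — a sketch of what to expect (output illustrative)::

    >>> dataset[0]["image"].keys()
    dict_keys(['bytes', 'path'])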
ImageFolder
"""""""""""

Example layout::

    folder/train/dog/golden_retriever.png
    folder/train/dog/german_shepherd.png
    folder/train/dog/chihuahua.png
    folder/train/cat/maine_coon.png
    folder/train/cat/bengal.png
    folder/train/cat/birman.png

Usage::

    >>> from datasets import load_dataset
    >>> dataset = load_dataset("imagefolder", data_dir="/path/to/folder")
    >>> dataset["train"][0]
    {"image": <PIL.PngImagePlugin.PngImageFile ...>, "label": 0}
    >>> dataset["train"][-1]
    {"image": <PIL.PngImagePlugin.PngImageFile ...>, "label": 1}

Load remote datasets from their URLs::

    dataset = load_dataset("imagefolder", data_files="https://a.cn/name.zip", split="train")

Process image data
^^^^^^^^^^^^^^^^^^

Map
"""

::

    >>> def transforms(examples):
    ...     examples["pixel_values"] = [image.convert("RGB").resize((100, 100)) for image in examples["image"]]
    ...     return examples
    >>> dataset = dataset.map(transforms, remove_columns=["image"], batched=True)
    >>> dataset[0]
    {'label': 6, 'pixel_values': <PIL.Image.Image ...>}

Apply transforms
""""""""""""""""

* data augmentation libraries

  * torchvision: https://pytorch.org/vision/stable/index.html
  * Albumentations: https://albumentations.ai/docs/
  * Kornia: https://kornia.readthedocs.io/en/latest/
  * imgaug: https://imgaug.readthedocs.io/en/latest/

::

    from torchvision.transforms import Compose, ColorJitter, ToTensor

    jitter = Compose(
        [
            ColorJitter(brightness=0.25, contrast=0.25, saturation=0.25, hue=0.7),
            ToTensor(),
        ]
    )

    def transforms(examples):
        examples["pixel_values"] = [jitter(image.convert("RGB")) for image in examples["image"]]
        return examples

    dataset.set_transform(transforms)

Create an image dataset
^^^^^^^^^^^^^^^^^^^^^^^

Example layout::

    folder/train/metadata.csv
    folder/train/0001.png
    folder/train/0002.png
    folder/train/0003.png

Example layout with archives::

    folder/metadata.csv
    folder/train.zip
    folder/test.zip
    folder/valid.zip

``metadata.csv`` contents::

    file_name,additional_feature
    0001.png,This is a first value of a text feature you added to your images
    0002.png,This is a second value of a text feature you added to your images
    0003.png,This is a third value of a text feature you added to your images

or using ``metadata.jsonl``::

    {"file_name": "0001.png", "additional_feature": "This is a first value of a text feature you added to your images"}
    {"file_name": "0002.png", "additional_feature": "This is a second value of a text feature you added to your images"}
    {"file_name": "0003.png", "additional_feature": "This is a third value of a text feature you added to your images"}

Example metadata.csv for image captioning::

    file_name,text
    0001.png,This is a golden retriever playing with a ball
    0002.png,A german shepherd
    0003.png,One chihuahua

Usage::

    >>> dataset = load_dataset("imagefolder", data_dir="/path/to/folder", split="train")
    >>> dataset[0]["text"]
    "This is a golden retriever playing with a ball"

Upload the dataset to the Hub::

    from datasets import load_dataset

    dataset = load_dataset("imagefolder", data_dir="/path/to/folder", split="train")
    dataset.push_to_hub("stevhliu/my-image-captioning-dataset")

Depth estimation
^^^^^^^^^^^^^^^^

::

    train_dataset = load_dataset("sayakpaul/nyu_depth_v2", split="train")

Image classification
^^^^^^^^^^^^^^^^^^^^

::

    dataset = load_dataset("beans")

Semantic segmentation
^^^^^^^^^^^^^^^^^^^^^

Install::

    pip install -U albumentations

Usage example::

    train_dataset = load_dataset("scene_parse_150", split="train")

Object detection
^^^^^^^^^^^^^^^^

Install::

    pip install -U albumentations opencv-python

Usage example::

    ds = load_dataset("cppe-5")

Text
----

Load text data
^^^^^^^^^^^^^^

::

    from datasets import load_dataset

    dataset = load_dataset("text", data_files={
        "train": ["my_text_1.txt", "my_text_2.txt"],
        "test": "my_test_file.txt"
    })

    dataset = load_dataset("text", data_dir="path/to/text/dataset")

    dataset = load_dataset("text",
        data_files={"train": "my_train_file.txt", "test": "my_test_file.txt"},
        sample_by="paragraph"
    )

    dataset = load_dataset("text",
        data_files={"train": "my_train_file.txt", "test": "my_test_file.txt"},
        sample_by="document"
    )
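By default the ``text`` builder yields one example per line; ``sample_by="paragraph"`` splits on blank lines, and ``sample_by="document"`` keeps each file as a single example. A quick sketch with a hypothetical ``notes.txt``::

    >>> ds = load_dataset("text", data_files={"train": "notes.txt"}, sample_by="paragraph")
    >>> ds["train"][0]["text"]   # the first blank-line-separated paragraph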
Example::

    from datasets import load_dataset

    c4_subset = load_dataset("allenai/c4", data_files="en/c4-train.0000*-of-01024.json.gz")
    dataset = load_dataset("text", data_files="https://huggingface.co/datasets/lhoestq/test/resolve/main/some_text.txt")

Process text data
^^^^^^^^^^^^^^^^^

Map::

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    dataset = dataset.map(lambda examples: tokenizer(examples["text"]), batched=True)
    dataset[0]
    {'text': 'the rock is destined to be the 21st century...',
     'label': 1,
     'input_ids': [101, 1996, 2600, 2003, 16036, 2000, ...],
     'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...],
     'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]}

Tabular
-------

CSV files
^^^^^^^^^

::

    from datasets import load_dataset

    dataset = load_dataset("csv", data_files="my_file.csv")

    # load multiple CSV files
    dataset = load_dataset("csv", data_files=["my_file_1.csv", "my_file_2.csv", "my_file_3.csv"])

    # train/test split
    dataset = load_dataset("csv", data_files={
        "train": ["my_train_file_1.csv", "my_train_file_2.csv"],
        "test": "my_test_file.csv"
    })

Pandas DataFrames
^^^^^^^^^^^^^^^^^

::

    from datasets import Dataset
    import pandas as pd

    df = pd.read_csv("https://huggingface.co/datasets/imodels/credit-card/raw/main/train.csv")
    dataset = Dataset.from_pandas(df)

Databases
^^^^^^^^^

SQLite
""""""

Create the database::

    import sqlite3
    import pandas as pd

    conn = sqlite3.connect("us_covid_data.db")
    df = pd.read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")
    df.to_sql("states", conn, if_exists="replace")

Load::

    from datasets import Dataset

    uri = "sqlite:///us_covid_data.db"
    ds = Dataset.from_sql("states", uri)
    # ds = Dataset.from_sql('SELECT * FROM states WHERE state="California";', uri)
    ds

Usage::

    ds.filter(lambda x: x["state"] == "California")
    # ds.filter(lambda x: x["cases"] > 10000)

PostgreSQL
""""""""""

* Reference: https://huggingface.co/docs/datasets/v2.13.1/en/package_reference/main_classes#datasets.Dataset.from_sql

Dataset repository
------------------

Share
^^^^^

Create::

    $ huggingface-cli repo create your_dataset_name --type dataset

Clone::

    # Make sure you have git-lfs installed
    # (https://git-lfs.github.com/)
    git lfs install
    git clone https://huggingface.co/datasets/namespace/your_dataset_name

* The rest works much like a normal git repository.

Create a dataset loading script
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A dataset loading script should have the same name as the dataset repository or directory::

    my_dataset/
    ├── README.md
    └── my_dataset.py
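For reference, here is a minimal sketch of what ``my_dataset.py`` could look like, assuming a hypothetical ``data/train.csv`` with ``text`` and ``label`` columns (the real script must describe your own data and features)::

    # my_dataset.py -- minimal illustrative loading script (field names and the
    # CSV path are hypothetical)
    import csv

    import datasets


    class MyDataset(datasets.GeneratorBasedBuilder):
        def _info(self):
            # Declare the features the dataset exposes
            return datasets.DatasetInfo(
                features=datasets.Features(
                    {
                        "text": datasets.Value("string"),
                        "label": datasets.ClassLabel(names=["neg", "pos"]),
                    }
                )
            )

        def _split_generators(self, dl_manager):
            # dl_manager.download_and_extract() can fetch remote archives if needed
            return [
                datasets.SplitGenerator(
                    name=datasets.Split.TRAIN,
                    gen_kwargs={"filepath": "data/train.csv"},
                )
            ]

        def _generate_examples(self, filepath):
            # Yield (key, example) pairs matching the declared features
            with open(filepath, encoding="utf-8") as f:
                for idx, row in enumerate(csv.DictReader(f)):
                    yield idx, {"text": row["text"], "label": row["label"]}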