Datasets
########

* Datasets is a library for easily accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP) tasks.

Install with pip::

    pip install datasets
    # check
    python -c "from datasets import load_dataset; print(load_dataset('squad', split='train')[0])"

    # Audio
    pip install 'datasets[audio]'
    # check
    python -c "import soundfile; print(soundfile.__libsndfile_version__)"

    # Vision
    pip install 'datasets[vision]'

Install from source::

    git clone https://github.com/huggingface/datasets.git
    cd datasets
    pip install -e .

TUTORIALS
=========

Load a dataset from the Hub
---------------------------

* This tutorial uses the `rotten_tomatoes <https://huggingface.co/datasets/rotten_tomatoes>`_ and `MInDS-14 <https://huggingface.co/datasets/PolyAI/minds14>`_ datasets.

Load a dataset
^^^^^^^^^^^^^^

load_dataset_builder()::

    >>> from datasets import load_dataset_builder
    >>> ds_builder = load_dataset_builder("rotten_tomatoes")

    # Inspect dataset description
    >>> ds_builder.info.description
    Movie Review Dataset. This is a dataset of containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. This data was first used in Bo Pang and Lillian Lee, ``Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales.'', Proceedings of the ACL, 2005.

    # Inspect dataset features
    >>> ds_builder.info.features
    {'label': ClassLabel(num_classes=2, names=['neg', 'pos'], id=None),
     'text': Value(dtype='string', id=None)}

load_dataset()::

    >>> from datasets import load_dataset
    >>> dataset = load_dataset("rotten_tomatoes", split="train")
    >>> dataset
    Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })

Splits
^^^^^^

get_dataset_split_names()::

    >>> from datasets import get_dataset_split_names
    >>> get_dataset_split_names("rotten_tomatoes")
    ['train', 'validation', 'test']

The load_dataset() call above specified a split; if you don't, the whole DatasetDict is returned::

    >>> from datasets import load_dataset
    >>> dataset = load_dataset("rotten_tomatoes")
    >>> dataset
    DatasetDict({
        train: Dataset({
            features: ['text', 'label'],
            num_rows: 8530
        })
        validation: Dataset({
            features: ['text', 'label'],
            num_rows: 1066
        })
        test: Dataset({
            features: ['text', 'label'],
            num_rows: 1066
        })
    })

Configurations
^^^^^^^^^^^^^^

get_dataset_config_names()::

    >>> from datasets import get_dataset_config_names
    >>> configs = get_dataset_config_names("PolyAI/minds14")
    >>> print(configs)
    ['cs-CZ', 'de-DE', 'en-AU' ...]

Then load the configuration you want::

    from datasets import load_dataset
    mindsFR = load_dataset("PolyAI/minds14", "fr-FR", split="train")

Know your dataset
-----------------

Dataset
^^^^^^^

Load the data::

    from datasets import load_dataset
    dataset = load_dataset("rotten_tomatoes", split="train")

Indexing::

    >>> dataset[0]
    {'label': 1, 'text': 'the rock is des...'}

    # a list of all the values in the column
    >>> dataset["text"]
    ['the rock is desti...', 'the gorgeously elabora...', ..., 'things really ...']

Slicing::

    >>> dataset[:3]
    {'label': [1, 1, 1],
     'text': ['the rock is desti...', 'the gorgeously elabora...', ..., 'things really ...']}

    >>> dataset[3:6]
    ...

IterableDataset
^^^^^^^^^^^^^^^

An IterableDataset is loaded when you set the streaming parameter to True in load_dataset()::

    >>> from datasets import load_dataset
    >>> iterable_dataset = load_dataset("food101", split="train", streaming=True)
    >>> for example in iterable_dataset:
    ...     print(example)
    ...     break
    {'image': <PIL.JpegImagePlugin.JpegImageFile ...>, 'label': 6}

    >>> next(iter(iterable_dataset))
    {'image': <PIL.JpegImagePlugin.JpegImageFile ...>, 'label': 6}
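A regular Dataset can also be converted into an IterableDataset without re-downloading anything — a small sketch reusing the rotten_tomatoes ``dataset`` loaded above (``iterable_rt`` is just an illustrative name, output illustrative)::

    >>> iterable_rt = dataset.to_iterable_dataset()
    >>> next(iter(iterable_rt))
    {'text': 'the rock is des...', 'label': 1}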
Use IterableDataset.take() to get a subset of the dataset with a specific number of examples::

    # Get first three examples
    >>> list(iterable_dataset.take(3))
    [{'image': <PIL.JpegImagePlugin.JpegImageFile ...>, 'label': 6},
     {'image': <PIL.JpegImagePlugin.JpegImageFile ...>, 'label': 6},
     {'image': <PIL.JpegImagePlugin.JpegImageFile ...>, 'label': 6}]

Preprocess
^^^^^^^^^^

Tokenize text
"""""""""""""

Start by loading the rotten_tomatoes dataset and the tokenizer corresponding to a pretrained BERT model::

    >>> from transformers import AutoTokenizer
    >>> from datasets import load_dataset
    >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    >>> dataset = load_dataset("rotten_tomatoes", split="train")

Call your tokenizer on the first row of text in the dataset::

    >>> tokenizer(dataset[0]["text"])
    {'input_ids': [101, 1103, 2067 ...],
     'token_type_ids': [0, 0, 0, 0, 0 ...],
     'attention_mask': [1, 1, 1, 1, 1 ...]}

Field descriptions:

* input_ids: the numbers representing the tokens in the text.
* token_type_ids: indicates which sequence a token belongs to if there is more than one sequence.
* attention_mask: indicates whether a token should be masked or not.

The fastest way to tokenize your entire dataset is to use the map() function::

    def tokenization(example):
        return tokenizer(example["text"])

    dataset = dataset.map(tokenization, batched=True)

Set the format of your dataset to be compatible with your machine learning framework::

    # PyTorch
    dataset.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "labels"])
    dataset.format['type']

    # TensorFlow
    from transformers import DataCollatorWithPadding

    data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
    tf_dataset = dataset.to_tf_dataset(
        columns=["input_ids", "token_type_ids", "attention_mask"],
        label_cols=["labels"],
        batch_size=2,
        collate_fn=data_collator,
        shuffle=True
    )

Resample audio signals
""""""""""""""""""""""

Start by loading the MInDS-14 dataset::

    from transformers import AutoFeatureExtractor
    from datasets import load_dataset, Audio

    feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
    dataset = load_dataset("PolyAI/minds14", "en-US", split="train")

When you call the audio column of the dataset, it is automatically decoded and resampled::

    >>> dataset[0]["audio"]
    {'array': array([ 0.        ,  0.00024414, -0.00024414, ..., -0.00024414,
             0.        ,  0.        ], dtype=float32),
     'path': '~/.cache/huggingface/datasets/downloads/extracted/xxxxx/en-US~JOINT_ACCOUNT/xxx.wav',
     'sampling_rate': 8000}

Use the cast_column() function and set the sampling_rate parameter in the Audio feature to upsample the audio signal. When you call the audio column now, it is decoded and resampled to 16kHz::

    >>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
    >>> dataset[0]["audio"]
    {'array': array([ 2.3443763e-05,  2.1729663e-04,  2.2145823e-04, ...,
             3.8356509e-05, -7.3497440e-06, -2.1754686e-05], dtype=float32),
     'path': '~/.cache/huggingface/datasets/downloads/extracted/xxx/en-US~JOINT_ACCOUNT/xxx.wav',
     'sampling_rate': 16000}

Use the map() function to resample the entire dataset to 16kHz::

    def preprocess_function(examples):
        audio_arrays = [x["array"] for x in examples["audio"]]
        inputs = feature_extractor(
            audio_arrays,
            sampling_rate=feature_extractor.sampling_rate,
            max_length=16000,
            truncation=True
        )
        return inputs

    dataset = dataset.map(preprocess_function, batched=True)
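As a quick sanity check (a sketch; output illustrative), the extracted ``input_values`` are capped at 16,000 samples because of ``max_length`` and ``truncation``::

    >>> len(dataset[0]["input_values"])
    16000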
Apply data augmentations
""""""""""""""""""""""""

1. Start by loading the Beans dataset, the Image feature, and the feature extractor corresponding to a pretrained ViT model::

    from transformers import AutoFeatureExtractor
    from datasets import load_dataset, Image

    feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
    dataset = load_dataset("beans", split="train")

2. When you call the image column of the dataset, the underlying PIL object is automatically decoded into an image::

    dataset[0]["image"]

3. Apply some transforms to the image::

    from torchvision.transforms import RandomRotation

    rotate = RandomRotation(degrees=(0, 90))

    def transforms(examples):
        examples["pixel_values"] = [rotate(image.convert("RGB")) for image in examples["image"]]
        return examples

4. Use the set_transform() function to apply the transform on-the-fly::

    dataset.set_transform(transforms)
    dataset[0]["pixel_values"]

Evaluate predictions
^^^^^^^^^^^^^^^^^^^^

You can see what metrics are available with list_metrics()::

    from datasets import list_metrics

    metrics_list = list_metrics()
    len(metrics_list)
    print(metrics_list)

Load metric::

    # Load a metric from the Hub with load_metric()
    from datasets import load_metric

    metric = load_metric('glue', 'mrpc')

Metrics object::

    >>> print(metric.inputs_description)
    Compute GLUE evaluation metric associated to each GLUE dataset.
    Args:
        predictions: list of predictions to score. Each translation should be tokenized into a list of tokens.
        references: list of lists of references for each translation. Each reference should be tokenized into a list of tokens.
    Returns: depending on the GLUE subset, one or several of:
        "accuracy": Accuracy
        "f1": F1 score
        "pearson": Pearson Correlation
        "spearmanr": Spearman Correlation
        "matthews_correlation": Matthew Correlation
    Examples:
        # 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
        >>> glue_metric = datasets.load_metric('glue', 'sst2')
        >>> references = [0, 1]
        >>> predictions = [0, 1]
        >>> results = glue_metric.compute(predictions=predictions, references=references)
        >>> print(results)
        {'accuracy': 1.0}

        >>> glue_metric = datasets.load_metric('glue', 'mrpc')  # 'mrpc' or 'qqp'
        >>> references = [0, 1]
        >>> predictions = [0, 1]
        >>> results = glue_metric.compute(predictions=predictions, references=references)
        >>> print(results)
        {'accuracy': 1.0, 'f1': 1.0}

        >>> glue_metric = datasets.load_metric('glue', 'stsb')
        >>> references = [0., 1., 2., 3., 4., 5.]
        >>> predictions = [0., 1., 2., 3., 4., 5.]
        >>> results = glue_metric.compute(predictions=predictions, references=references)
        >>> print({"pearson": round(results["pearson"], 2), "spearmanr": round(results["spearmanr"], 2)})
        {'pearson': 1.0, 'spearmanr': 1.0}

        >>> glue_metric = datasets.load_metric('glue', 'cola')
        >>> references = [0, 1]
        >>> predictions = [0, 1]
        >>> results = glue_metric.compute(predictions=predictions, references=references)
        >>> print(results)
        {'matthews_correlation': 1.0}

Compute metric::

    model_predictions = model(model_inputs)
    final_score = metric.compute(predictions=model_predictions, references=gold_references)
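For batched evaluation you can accumulate predictions first and compute the score once at the end — a sketch, assuming a ``dataloader`` that yields model inputs together with gold ``references``::

    for batch, references in dataloader:
        predictions = model(batch)
        metric.add_batch(predictions=predictions, references=references)
    final_score = metric.compute()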
Create a dataset
^^^^^^^^^^^^^^^^

Folder-based builders
"""""""""""""""""""""

* Folder-based builders are for quickly creating an image or audio dataset.

.. figure:: https://img.zhaoweiguo.com/uPic/2023/07/mxt9eY.jpg

This is how the folder-based builders generate a dataset::

    from datasets import load_dataset

    dataset = load_dataset("imagefolder", data_dir="/path/to/pokemon")

    dataset = load_dataset("audiofolder", data_dir="/path/to/folder")

From local files
""""""""""""""""

::

    >>> from datasets import Dataset
    >>> def gen():
    ...     yield {"pokemon": "bulbasaur", "type": "grass"}
    ...     yield {"pokemon": "squirtle", "type": "water"}
    >>> ds = Dataset.from_generator(gen)
    >>> ds[0]
    {"pokemon": "bulbasaur", "type": "grass"}

::

    >>> from datasets import IterableDataset
    >>> ds = IterableDataset.from_generator(gen)
    >>> for example in ds:
    ...     print(example)
    {"pokemon": "bulbasaur", "type": "grass"}
    {"pokemon": "squirtle", "type": "water"}

::

    >>> from datasets import Dataset
    >>> ds = Dataset.from_dict({"pokemon": ["bulbasaur", "squirtle"], "type": ["grass", "water"]})
    >>> ds[0]
    {"pokemon": "bulbasaur", "type": "grass"}

::

    audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", ..., "path/to/audio_n"]}).cast_column("audio", Audio())

HOW-TO GUIDES
=============

General usage
-------------

Load
^^^^

Local loading script::

    dataset = load_dataset("path/to/local/loading_script/loading_script.py", split="train")
    # equivalent because the file has the same name as the directory
    dataset = load_dataset("path/to/local/loading_script", split="train")

Local and remote files::

    from datasets import load_dataset
    dataset = load_dataset("csv", data_files="my_file.csv")

Example JSON data (one object per line)::

    {"a": 1, "b": 2.0, "c": "foo", "d": false}
    {"a": 4, "b": -5.5, "c": null, "d": true}

Code::

    from datasets import load_dataset
    dataset = load_dataset("json", data_files="my_file.json")

Example JSON data (nested under a field)::

    {"version": "0.1.0",
     "data": [{"a": 1, "b": 2.0, "c": "foo", "d": false},
              {"a": 4, "b": -5.5, "c": null, "d": true}]
    }

Code::

    from datasets import load_dataset
    dataset = load_dataset("json", data_files="my_file.json", field="data")

Example with remote JSON files::

    url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/"
    dataset = load_dataset(
        "json",
        data_files={"train": url + "train-v1.json", "validation": url + "dev-v1.json"},
        field="data"
    )

Process
^^^^^^^

GLUE dataset
""""""""""""

Example::

    from datasets import load_dataset
    dataset = load_dataset("glue", "mrpc", split="train")

Usage::

    # Sort
    dataset["label"][:10]
    sorted_dataset = dataset.sort("label")
    sorted_dataset["label"][:10]
    sorted_dataset["label"][-10:]

    # Shuffle
    shuffled_dataset = sorted_dataset.shuffle(seed=42)
    shuffled_dataset["label"][:10]
    iterable_dataset = dataset.to_iterable_dataset(num_shards=128)
    shuffled_iterable_dataset = iterable_dataset.shuffle(seed=42, buffer_size=1000)

    # Select
    small_dataset = dataset.select([0, 10, 20, 30, 40, 50])
    len(small_dataset)

    # Filter
    start_with_ar = dataset.filter(lambda example: example["sentence1"].startswith("Ar"))
    len(start_with_ar)
    start_with_ar["sentence1"]
    even_dataset = dataset.filter(lambda example, idx: idx % 2 == 0, with_indices=True)
    len(even_dataset)
    len(dataset) / 2

    # Map
    def add_prefix(example):
        example["sentence1"] = 'My sentence: ' + example["sentence1"]
        return example

    updated_dataset = small_dataset.map(add_prefix)

    >>> updated_dataset = dataset.map(
    ...     lambda example: {"new_sentence": example["sentence1"]},
    ...     remove_columns=["sentence1"]
    ... )
    >>> updated_dataset.column_names
    ['sentence2', 'label', 'idx', 'new_sentence']
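    # A sketch: map() also accepts num_proc to run the mapping function in
    # several processes (add_prefix as defined above)
    updated_dataset = dataset.map(add_prefix, num_proc=4)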
    # Split
    >>> dataset.train_test_split(test_size=0.1)
    DatasetDict({
        train: Dataset({
            features: ['sentence1', 'sentence2', 'label', 'idx'],
            num_rows: 3301
        })
        test: Dataset({
            features: ['sentence1', 'sentence2', 'label', 'idx'],
            num_rows: 367
        })
    })

    # Cast
    >>> dataset.features
    {'sentence1': Value(dtype='string', id=None),
     'sentence2': Value(dtype='string', id=None),
     'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None),
     'idx': Value(dtype='int32', id=None)}

    >>> from datasets import ClassLabel, Value
    >>> new_features = dataset.features.copy()
    >>> new_features["label"] = ClassLabel(names=["negative", "positive"])
    >>> new_features["idx"] = Value("int64")
    >>> dataset = dataset.cast(new_features)
    >>> dataset.features
    {'sentence1': Value(dtype='string', id=None),
     'sentence2': Value(dtype='string', id=None),
     'label': ClassLabel(num_classes=2, names=['negative', 'positive'], names_file=None, id=None),
     'idx': Value(dtype='int64', id=None)}

    >>> dataset.features
    {'audio': Audio(sampling_rate=44100, mono=True, id=None)}
    >>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
    >>> dataset.features
    {'audio': Audio(sampling_rate=16000, mono=True, id=None)}

    # Rename
    >>> dataset
    Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    >>> dataset = dataset.rename_column("sentence1", "sentenceA")
    >>> dataset = dataset.rename_column("sentence2", "sentenceB")
    >>> dataset
    Dataset({
        features: ['sentenceA', 'sentenceB', 'label', 'idx'],
        num_rows: 3668
    })

    # Remove
    >>> dataset = dataset.remove_columns("label")
    >>> dataset
    Dataset({
        features: ['sentence1', 'sentence2', 'idx'],
        num_rows: 3668
    })
    >>> dataset = dataset.remove_columns(["sentence1", "sentence2"])
    >>> dataset
    Dataset({
        features: ['idx'],
        num_rows: 3668
    })

IMDB dataset
""""""""""""

Example::

    >>> from datasets import load_dataset
    >>> dataset = load_dataset("imdb", split="train")
    >>> print(dataset)
    Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })

Usage::

    # Shard
    >>> dataset.shard(num_shards=4, index=0)
    Dataset({
        features: ['text', 'label'],
        num_rows: 6250
    })

SQuAD dataset
"""""""""""""

Example::

    >>> from datasets import load_dataset
    >>> dataset = load_dataset("squad", split="train")
    >>> dataset.features
    {'answers': Sequence(feature={'text': Value(dtype='string', id=None),
                                  'answer_start': Value(dtype='int32', id=None)},
                         length=-1, id=None),
     'context': Value(dtype='string', id=None),
     'id': Value(dtype='string', id=None),
     'question': Value(dtype='string', id=None),
     'title': Value(dtype='string', id=None)}

Usage::

    >>> flat_dataset = dataset.flatten()
    >>> flat_dataset
    Dataset({
        features: ['id', 'title', 'context', 'question', 'answers.text', 'answers.answer_start'],
        num_rows: 87599
    })

Stream
^^^^^^

.. figure:: https://img.zhaoweiguo.com/uPic/2023/07/7xFX0b.jpg

Dataset streaming lets you work with very large datasets when you don't have enough disk space, don't want to wait for a full download, or just want to skim a few samples.

Example::

    >>> from datasets import load_dataset
    >>> dataset = load_dataset('oscar-corpus/OSCAR-2201', 'en', split='train', streaming=True)
    >>> print(next(iter(dataset)))
    {'id': 0, 'text': 'Founded in 2015, ...', ...

Stream a local dataset of hundreds of compressed JSONL files (such as oscar-corpus/OSCAR-2201)::

    >>> from datasets import load_dataset
    >>> data_files = {'train': 'path/to/OSCAR-2201/compressed/en_meta/*.jsonl.gz'}
    >>> dataset = load_dataset('json', data_files=data_files, split='train', streaming=True)
    >>> print(next(iter(dataset)))
    {'id': 0, 'text': 'Founded in 2015, ...', ...
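Map transforms are applied lazily on a streamed dataset — a small sketch (``text`` is the column from the OSCAR example above; ``n_chars`` is just an illustrative new field)::

    >>> lengths = dataset.map(lambda x: {"n_chars": len(x["text"])})
    >>> print(next(iter(lengths))["n_chars"])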
Usage::

    # Shuffle
    >>> from datasets import load_dataset
    >>> dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True)
    >>> shuffled_dataset = dataset.shuffle(seed=42, buffer_size=10_000)

    # Split dataset
    dataset_head = dataset.take(2)       # get the first 2 examples
    train_dataset = dataset.skip(1000)   # take() and skip() can be combined

    # Interleave
    >>> from datasets import interleave_datasets
    >>> en_dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True)
    >>> fr_dataset = load_dataset('oscar', "unshuffled_deduplicated_fr", split='train', streaming=True)
    >>> multilingual_dataset = interleave_datasets([en_dataset, fr_dataset])

Complete example::

    import torch
    from torch.utils.data import DataLoader
    from transformers import AutoModelForMaskedLM, DataCollatorForLanguageModeling
    from tqdm import tqdm

    dataset = dataset.with_format("torch")
    dataloader = DataLoader(dataset, collate_fn=DataCollatorForLanguageModeling(tokenizer))
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")
    model.train().to(device)
    optimizer = torch.optim.AdamW(params=model.parameters(), lr=1e-5)

    for epoch in range(3):
        dataset.set_epoch(epoch)
        for i, batch in enumerate(tqdm(dataloader, total=5)):
            if i == 5:
                break
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            loss = outputs[0]
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            if i % 10 == 0:
                print(f"loss: {loss}")

Audio
-----

Load audio data
^^^^^^^^^^^^^^^

Local files::

    audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", ...]}).cast_column("audio", Audio())
    audio_dataset[0]["audio"]

AudioFolder::

    from datasets import load_dataset

    dataset = load_dataset("audiofolder", data_dir="/path/to/folder")

    # OR by specifying the list of files
    dataset = load_dataset("audiofolder", data_files=["path/to/audio_1", ...])

    # load remote datasets from their URLs
    dataset = load_dataset("audiofolder", data_files=["https://foo.bar/audio_1", ...])

    # for example, pass a SpeechCommands-style archive:
    dataset = load_dataset("audiofolder", data_files="https://s3.amazonaws.com/test.tar.gz")

Process audio data
^^^^^^^^^^^^^^^^^^

Cast::

    from datasets import load_dataset, Audio

    dataset = load_dataset("PolyAI/minds14", "en-US", split="train")
    dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
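A quick way to confirm the cast took effect (output illustrative)::

    >>> dataset.features["audio"].sampling_rate
    16000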
.. figure:: https://img.zhaoweiguo.com/uPic/2023/07/72sLGK.jpg

Map::

    from transformers import AutoTokenizer, AutoFeatureExtractor, AutoProcessor

    model_checkpoint = "facebook/wav2vec2-large-xlsr-53"
    tokenizer = AutoTokenizer("./v.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")
    feature_extractor = AutoFeatureExtractor.from_pretrained(model_checkpoint)
    processor = AutoProcessor.from_pretrained(feature_extractor=feature_extractor, tokenizer=tokenizer)

    def prepare_dataset(batch):
        audio = batch["audio"]
        batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
        batch["input_length"] = len(batch["input_values"])
        with processor.as_target_processor():
            batch["labels"] = processor(batch["sentence"]).input_ids
        return batch

    dataset = dataset.map(prepare_dataset, remove_columns=dataset.column_names)

Create an audio dataset
^^^^^^^^^^^^^^^^^^^^^^^

Local files
"""""""""""

::

    >>> audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", ...]}).cast_column("audio", Audio())
    # inspect
    >>> audio_dataset[0]["audio"]
    # push to the Hugging Face Hub
    >>> audio_dataset.push_to_hub("<namespace>/my_dataset")

Resulting repository layout::

    my_dataset/
    ├── README.md
    └── data/
        └── train-00000-of-00001.parquet

AudioFolder
"""""""""""

Example layout::

    my_dataset/
    ├── README.md
    ├── metadata.csv
    └── data/
        ├── first_audio_file.mp3
        ├── second_audio_file.mp3
        └── third_audio_file.mp3

metadata.csv::

    file_name,transcription
    data/first_audio_file.mp3,znowu się duch z ciałem
    data/second_audio_file.mp3,już u źwierzyńca podwojów
    data/third_audio_file.mp3,pewnie kędyś w obłędzie

Usage::

    >>> from datasets import load_dataset
    >>> dataset = load_dataset("audiofolder", data_dir="/path/to/data")
    >>> dataset["train"][0]
    {'audio': {'path': '/path/to/extracted/audio/first_audio_file.mp3',
               'array': array([ 0.00088501,  0.0012207 ,  0.00131226, ..., -0.00045776], dtype=float32),
               'sampling_rate': 16000},
     'transcription': 'znowu się duch z ciałem'}

Example layout (class folders)::

    data/train/electronic/01.mp3
    data/train/punk/01.mp3
    data/test/electronic/09.mp3
    data/test/punk/09.mp3

Usage::

    >>> from datasets import load_dataset
    >>> dataset = load_dataset("audiofolder", data_dir="/path/to/data")
    >>> dataset["train"][0]
    {'audio': {'path': '/path/to/electronic/01.mp3',
               'array': array([ 3.9714024e-07,  7.3031038e-07, ..., -1.1244172e-01], dtype=float32),
               'sampling_rate': 44100},
     'label': 0  # "electronic"
    }
    >>> dataset["train"][-1]
    {'audio': {'path': '/path/to/punk/01.mp3',
               'array': array([0.15237972, 0.13222949, ..., 0.33717662], dtype=float32),
               'sampling_rate': 44100},
     'label': 1  # "punk"
    }

Loading script
""""""""""""""

Example layout::

    my_dataset/
    ├── README.md
    ├── my_dataset.py
    └── data/
        ├── train.tar.gz
        ├── test.tar.gz
        └── metadata.csv

Usage::

    from datasets import load_dataset
    dataset = load_dataset("path/to/my_dataset")

Vision
------

Load image data
^^^^^^^^^^^^^^^

Load::

    from datasets import load_dataset, Image

    dataset = load_dataset("beans", split="train")
    dataset[0]["image"]

Local files::

    >>> from datasets import Dataset, load_dataset, Image
    >>> dataset = Dataset.from_dict({"image": ["path/to/image_1", ...]}).cast_column("image", Image())
    >>> dataset[0]["image"]

    >>> dataset = load_dataset("beans", split="train").cast_column("image", Image(decode=False))
    >>> dataset[0]["image"]
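With ``decode=False`` the column returns the underlying storage rather than a decoded PIL image — a sketch of what to expect (output illustrative)::

    >>> dataset[0]["image"].keys()
    dict_keys(['bytes', 'path'])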
ImageFolder
"""""""""""

Example layout::

    folder/train/dog/golden_retriever.png
    folder/train/dog/german_shepherd.png
    folder/train/dog/chihuahua.png
    folder/train/cat/maine_coon.png
    folder/train/cat/bengal.png
    folder/train/cat/birman.png

Usage::

    >>> from datasets import load_dataset
    >>> dataset = load_dataset("imagefolder", data_dir="/path/to/folder")
    >>> dataset["train"][0]
    {"image": <PIL.PngImagePlugin.PngImageFile ...>, "label": 0}
    >>> dataset["train"][-1]
    {"image": <PIL.PngImagePlugin.PngImageFile ...>, "label": 1}

Load remote datasets from their URLs::

    dataset = load_dataset("imagefolder", data_files="https://a.cn/name.zip", split="train")

Process image data
^^^^^^^^^^^^^^^^^^

Map
"""

::

    >>> def transforms(examples):
    ...     examples["pixel_values"] = [image.convert("RGB").resize((100, 100)) for image in examples["image"]]
    ...     return examples
    >>> dataset = dataset.map(transforms, remove_columns=["image"], batched=True)
    >>> dataset[0]
    {'label': 6, 'pixel_values': <PIL.Image.Image ...>}

Apply transforms
""""""""""""""""

* data augmentation libraries

  * torchvision: https://pytorch.org/vision/stable/index.html
  * Albumentations: https://albumentations.ai/docs/
  * Kornia: https://kornia.readthedocs.io/en/latest/
  * imgaug: https://imgaug.readthedocs.io/en/latest/

::

    from torchvision.transforms import Compose, ColorJitter, ToTensor

    jitter = Compose(
        [
            ColorJitter(brightness=0.25, contrast=0.25, saturation=0.25, hue=0.7),
            ToTensor(),
        ]
    )

    def transforms(examples):
        examples["pixel_values"] = [jitter(image.convert("RGB")) for image in examples["image"]]
        return examples

    dataset.set_transform(transforms)

Create an image dataset
^^^^^^^^^^^^^^^^^^^^^^^

Example layout::

    folder/train/metadata.csv
    folder/train/0001.png
    folder/train/0002.png
    folder/train/0003.png

Example layout with archives::

    folder/metadata.csv
    folder/train.zip
    folder/test.zip
    folder/valid.zip

``metadata.csv`` contents::

    file_name,additional_feature
    0001.png,This is a first value of a text feature you added to your images
    0002.png,This is a second value of a text feature you added to your images
    0003.png,This is a third value of a text feature you added to your images

or using ``metadata.jsonl``::

    {"file_name": "0001.png", "additional_feature": "This is a first value of a text feature you added to your images"}
    {"file_name": "0002.png", "additional_feature": "This is a second value of a text feature you added to your images"}
    {"file_name": "0003.png", "additional_feature": "This is a third value of a text feature you added to your images"}

Example metadata.csv for image captioning::

    file_name,text
    0001.png,This is a golden retriever playing with a ball
    0002.png,A german shepherd
    0003.png,One chihuahua

Usage::

    >>> dataset = load_dataset("imagefolder", data_dir="/path/to/folder", split="train")
    >>> dataset[0]["text"]
    "This is a golden retriever playing with a ball"

Upload the dataset to the Hub::

    from datasets import load_dataset

    dataset = load_dataset("imagefolder", data_dir="/path/to/folder", split="train")
    dataset.push_to_hub("stevhliu/my-image-captioning-dataset")

Depth estimation
^^^^^^^^^^^^^^^^

::

    train_dataset = load_dataset("sayakpaul/nyu_depth_v2", split="train")

Image classification
^^^^^^^^^^^^^^^^^^^^

::

    dataset = load_dataset("beans")

Semantic segmentation
^^^^^^^^^^^^^^^^^^^^^

Install::

    pip install -U albumentations

Usage example::

    train_dataset = load_dataset("scene_parse_150", split="train")

Object detection
^^^^^^^^^^^^^^^^

Install::

    pip install -U albumentations opencv-python

Usage example::

    ds = load_dataset("cppe-5")

Text
----

Load text data
^^^^^^^^^^^^^^

::

    from datasets import load_dataset

    dataset = load_dataset("text", data_files={
        "train": ["my_text_1.txt", "my_text_2.txt"],
        "test": "my_test_file.txt"
    })

    dataset = load_dataset("text", data_dir="path/to/text/dataset")

    dataset = load_dataset("text",
        data_files={"train": "my_train_file.txt", "test": "my_test_file.txt"},
        sample_by="paragraph"
    )

    dataset = load_dataset("text",
        data_files={"train": "my_train_file.txt", "test": "my_test_file.txt"},
        sample_by="document"
    )
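By default the ``text`` builder yields one example per line; ``sample_by="paragraph"`` splits on blank lines, and ``sample_by="document"`` keeps each file as a single example. A quick sketch with a hypothetical ``notes.txt``::

    >>> ds = load_dataset("text", data_files={"train": "notes.txt"}, sample_by="paragraph")
    >>> ds["train"][0]["text"]   # the first blank-line-separated paragraph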
Example::

    from datasets import load_dataset

    c4_subset = load_dataset("allenai/c4", data_files="en/c4-train.0000*-of-01024.json.gz")
    dataset = load_dataset("text", data_files="https://huggingface.co/datasets/lhoestq/test/resolve/main/some_text.txt")

Process text data
^^^^^^^^^^^^^^^^^

Map::

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    dataset = dataset.map(lambda examples: tokenizer(examples["text"]), batched=True)
    dataset[0]
    {'text': 'the rock is destined to be the 21st century...',
     'label': 1,
     'input_ids': [101, 1996, 2600, 2003, 16036, 2000, ...],
     'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...],
     'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]}

Tabular
-------

CSV files
^^^^^^^^^

::

    from datasets import load_dataset

    dataset = load_dataset("csv", data_files="my_file.csv")

    # load multiple CSV files
    dataset = load_dataset("csv", data_files=["my_file_1.csv", "my_file_2.csv", "my_file_3.csv"])

    # train/test split
    dataset = load_dataset("csv", data_files={
        "train": ["my_train_file_1.csv", "my_train_file_2.csv"],
        "test": "my_test_file.csv"
    })

Pandas DataFrames
^^^^^^^^^^^^^^^^^

::

    from datasets import Dataset
    import pandas as pd

    df = pd.read_csv("https://huggingface.co/datasets/imodels/credit-card/raw/main/train.csv")
    dataset = Dataset.from_pandas(df)

Databases
^^^^^^^^^

SQLite
""""""

Create the database::

    import sqlite3
    import pandas as pd

    conn = sqlite3.connect("us_covid_data.db")
    df = pd.read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")
    df.to_sql("states", conn, if_exists="replace")

Load::

    from datasets import Dataset

    uri = "sqlite:///us_covid_data.db"
    ds = Dataset.from_sql("states", uri)
    # ds = Dataset.from_sql('SELECT * FROM states WHERE state="California";', uri)
    ds

Usage::

    ds.filter(lambda x: x["state"] == "California")
    # ds.filter(lambda x: x["cases"] > 10000)

PostgreSQL
""""""""""

* Reference: https://huggingface.co/docs/datasets/v2.13.1/en/package_reference/main_classes#datasets.Dataset.from_sql

Dataset repository
------------------

Share
^^^^^

Create::

    $ huggingface-cli repo create your_dataset_name --type dataset

Clone::

    # Make sure you have git-lfs installed
    # (https://git-lfs.github.com/)
    git lfs install
    git clone https://huggingface.co/datasets/namespace/your_dataset_name

* The rest works much like a normal git repository.

Create a dataset loading script
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A dataset loading script should have the same name as the dataset repository or directory::

    my_dataset/
    ├── README.md
    └── my_dataset.py
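For reference, here is a minimal sketch of what ``my_dataset.py`` could look like, assuming a hypothetical ``data/train.csv`` with ``text`` and ``label`` columns (the real script must describe your own data and features)::

    # my_dataset.py -- minimal illustrative loading script (field names and the
    # CSV path are hypothetical)
    import csv

    import datasets


    class MyDataset(datasets.GeneratorBasedBuilder):
        def _info(self):
            # Declare the features the dataset exposes
            return datasets.DatasetInfo(
                features=datasets.Features(
                    {
                        "text": datasets.Value("string"),
                        "label": datasets.ClassLabel(names=["neg", "pos"]),
                    }
                )
            )

        def _split_generators(self, dl_manager):
            # dl_manager.download_and_extract() can fetch remote archives if needed
            return [
                datasets.SplitGenerator(
                    name=datasets.Split.TRAIN,
                    gen_kwargs={"filepath": "data/train.csv"},
                )
            ]

        def _generate_examples(self, filepath):
            # Yield (key, example) pairs matching the declared features
            with open(filepath, encoding="utf-8") as f:
                for idx, row in enumerate(csv.DictReader(f)):
                    yield idx, {"text": row["text"], "label": row["label"]}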