chapter five

5 Exploring, tokenizing, and visualizing Hugging Face datasets

This chapter covers

What Hugging Face datasets are
How to download datasets programmatically
How to apply tokenization to datasets
How to perform data visualization on datasets

Hugging Face is an AI platform for developing, training, sharing, and deploying open-source machine learning models. In addition to its hub of trained models, Hugging Face hosts a wide array of datasets (available at https://huggingface.co/datasets) that you can use in your own projects.

This chapter guides you through accessing datasets from Hugging Face and shows you how to download them programmatically to your local machine. You will gain a deeper understanding of tokenization, including how to tokenize datasets and prepare your data for fine-tuning (covered in chapter 6). Finally, you will explore how to visualize datasets hosted on Hugging Face, using one text dataset and one image dataset as examples.
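To make the idea of tokenization concrete before the detailed sections, here is a minimal sketch in plain Python. It is a toy stand-in and not the tokenizer that any Hugging Face model uses: the function names (word_tokenize, char_tokenize, build_vocab, encode) and the two sample sentences are hypothetical and exist only for illustration. The sketch splits text into tokens and then maps each token to an integer ID, which is the general shape of what a tokenizer produces.

```python
import re


def word_tokenize(text):
    """Split text into lowercase word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def char_tokenize(text):
    """Split text into individual characters (character-level tokenization)."""
    return list(text)


def build_vocab(token_lists):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Replace each token with its integer ID from the vocabulary."""
    return [vocab[token] for token in tokens]


# Hypothetical sample sentences, made up for this illustration.
samples = ["Stocks rallied today.", "Bond yields fell today."]

tokenized = [word_tokenize(s) for s in samples]
vocab = build_vocab(tokenized)
encoded = [encode(tokens, vocab) for tokens in tokenized]

print(tokenized[0])  # ['stocks', 'rallied', 'today', '.']
print(encoded[1])    # [4, 5, 6, 2, 3]
```

Note that the tokens "today" and "." receive the same IDs in both sentences, because the vocabulary is shared across the whole set of samples. Real tokenizers typically use subword units and a fixed, pretrained vocabulary rather than one built on the fly like this.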

5.1 What are Hugging Face datasets?

5.1.1 Getting the list of datasets available

5.1.2 Validating the availability of a dataset

5.1.3 Downloading a dataset

5.1.4 Shuffling a dataset

5.1.5 Streaming a dataset

5.1.6 Getting the Parquet files of a dataset

5.2 Tokenization in NLP

5.2.1 Types of tokenization methods

5.2.2 Tokenizing datasets

5.3 Visualizing datasets

5.3.1 Using the twitter-financial-news-topic dataset

5.3.2 Using the CIFAR-10 dataset

Summary