Concept: test set (category: R)

This is an excerpt from Manning's book Machine Learning with R, the tidyverse, and mlr.
Figure 3.12. Holdout CV. The data is randomly split into a training set and test set. The training set is used to train the model, which is then used to make predictions on the test set. The similarity of the predictions to the true values of the test set is used to evaluate model performance.
When following this approach, you need to decide what proportion of the data to use as the test set. The larger the test set is, the smaller your training set will be. Here’s the confusing part: performance estimation by CV is also subject to error and the bias-variance trade-off. If your test set is too small, the estimate of performance will have high variance; but if the training set is too small, the estimate of performance will have high bias. A common choice is to use two-thirds of the data for training and the remaining one-third as the test set, but this depends on the number of cases in the data, among other things.
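As a concrete illustration, here is a minimal sketch of a two-thirds/one-third holdout split in base R; the built-in iris data is used purely as a stand-in for your own dataset:

```r
# Minimal holdout-split sketch: two-thirds for training, one-third for testing.
# iris is only a placeholder dataset.
set.seed(42)                                   # make the random split reproducible
n         <- nrow(iris)
train_idx <- sample(n, size = round(2/3 * n))
train_set <- iris[train_idx, ]                 # used to train the model
test_set  <- iris[-train_idx, ]                # held out to estimate performance
```

If you are working with mlr, the same scheme can be requested as a resampling description with `makeResampleDesc("Holdout", split = 2/3)`.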
Because the test set is only a single observation, leave-one-out CV tends to give quite variable estimates of model performance (because the performance estimate of each iteration depends on correctly labeling that single test case). But it can give less variable estimates of model performance than k-fold CV when your dataset is small. When you have a small dataset, splitting it up into k folds will leave you with a very small training set. The variance of a model trained on a small dataset tends to be higher because it will be more influenced by sampling error/unusual cases. Therefore, leave-one-out CV is useful for small datasets where splitting the data into k folds would give variable results. It is also computationally less expensive than repeated k-fold CV.
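The mechanics are easy to see in a short base R sketch: each case takes one turn as the single test observation, and the per-case errors are averaged. The built-in mtcars data and an ordinary linear model are used purely for illustration:

```r
# Leave-one-out CV sketch: every row is held out once as the test set.
dat    <- mtcars                                     # placeholder dataset
errors <- numeric(nrow(dat))
for (i in seq_len(nrow(dat))) {
  fit       <- lm(mpg ~ wt + hp, data = dat[-i, ])   # train on all other cases
  pred      <- predict(fit, newdata = dat[i, ])      # predict the held-out case
  errors[i] <- (dat$mpg[i] - pred)^2                 # squared error for this case
}
mean(errors)   # leave-one-out estimate of mean squared error
```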
Cross-validation is a set of techniques for evaluating model performance by splitting the data into training and test sets. Three common types of cross-validation are holdout, where a single split is used; k-fold, where the data is split into k chunks and each chunk takes a turn as the test set; and leave-one-out, where the test set is a single case.
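In mlr, the package this excerpt's book is built around, each of these schemes corresponds to a resample description. A brief sketch, assuming the mlr package is installed:

```r
library(mlr)

makeResampleDesc("Holdout", split = 2/3)   # single train/test split
makeResampleDesc("CV", iters = 10)         # 10-fold cross-validation
makeResampleDesc("LOO")                    # leave-one-out
```

Any of these descriptions can then be passed to `resample()` together with a task and a learner to obtain a performance estimate.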

This is an excerpt from Manning's book Deep Learning with R.
train_images and train_labels form the training set: the data from which the model will learn. The model will then be tested on the test set: test_images and test_labels. The images are encoded as 3D arrays, and the labels are a 1D array of digits, ranging from 0 to 9. The images and labels have a one-to-one correspondence.
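For reference, here is a hedged sketch of loading these arrays with the keras R interface; `dataset_mnist()` and the `$train`/`$test` structure are assumed to match the book's setup:

```r
library(keras)

mnist <- dataset_mnist()
train_images <- mnist$train$x   # 3D array of images: 60000 x 28 x 28
train_labels <- mnist$train$y   # 1D array of digit labels, 0 to 9
test_images  <- mnist$test$x
test_labels  <- mnist$test$y

str(train_images)   # inspect the array dimensions
```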
In the three examples presented in chapter 3, we split the data into a training set, a validation set, and a test set. The reason not to evaluate the models on the same data they were trained on quickly became evident: after just a few epochs, all three models began to overfit. That is, their performance on never-before-seen data started stalling (or worsening) compared to their performance on the training data—which always improves as training progresses.
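One way to make such a split concrete, sketched here under the assumption that the MNIST arrays above are in memory (not the book's chapter 3 code), is to carve a validation set off the training data and pass it to `fit()`:

```r
# Hedged sketch: hold out 10,000 training cases as a validation set.
val_idx <- 1:10000
x_val   <- train_images[val_idx, , ]
y_val   <- train_labels[val_idx]
x_train <- train_images[-val_idx, , ]
y_train <- train_labels[-val_idx]

# With a compiled Keras model (not defined here), the validation data would be
# supplied during training, e.g.:
# model %>% fit(x_train, y_train, epochs = 20, batch_size = 128,
#               validation_data = list(x_val, y_val))
```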
You may ask, why not have two sets: a training set and a test set? You’d train on the training data and evaluate on the test data. Much simpler!