
10 Best Practices in Developing NLP Applications


This chapter covers

  • Making neural network inference more efficient by sorting, padding, and masking tokens
  • Applying character-based and BPE tokenization for splitting text into tokens
  • Avoiding overfitting via regularization
  • Dealing with imbalanced datasets with upsampling, downsampling, and loss weighting
  • Optimizing hyperparameters

We’ve covered a lot of ground so far, including deep neural network models such as RNNs, CNNs, and the Transformer, and modern NLP frameworks such as AllenNLP and HuggingFace Transformers. However, we’ve paid little attention to the details of training and inference. For example, how do you train and make predictions efficiently? How do you keep your model from overfitting? How do you optimize hyperparameters? These factors can have a huge impact on the final performance and generalizability of your model. This chapter covers the important topics you need to consider to build robust and accurate NLP applications that perform well in the real world.

10.1  Batching instances

In chapter 2, we briefly mentioned batching, a machine learning technique in which instances are grouped together to form batches and sent to the processor (a CPU or, more often, a GPU). Batching is almost always necessary when training large neural networks; it is critical for efficient and stable training. In this section, we’ll dive into further techniques and considerations related to batching.
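
To make this concrete, here is a minimal sketch that groups instances into batches with PyTorch’s DataLoader; the toy features and labels are made up for illustration.

import torch
from torch.utils.data import DataLoader, TensorDataset

# A toy dataset of 8 instances, each with 4 features (values are illustrative).
features = torch.randn(8, 4)
labels = torch.randint(0, 2, (8,))
dataset = TensorDataset(features, labels)

# Group instances into batches of 4 and iterate over them.
loader = DataLoader(dataset, batch_size=4, shuffle=True)
for batch_features, batch_labels in loader:
    print(batch_features.shape)  # torch.Size([4, 4])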

10.1.1    Padding
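
Sequences in a batch rarely have the same length, so shorter ones are padded with a special token ID until every sequence matches the longest one. Here is a minimal sketch using PyTorch’s pad_sequence; the token IDs are illustrative, and 0 is assumed to be the padding ID.

import torch
from torch.nn.utils.rnn import pad_sequence

# Three token ID sequences of different lengths.
sequences = [
    torch.tensor([4, 7, 2]),
    torch.tensor([5, 1]),
    torch.tensor([9, 3, 6, 8]),
]

# Pad every sequence to the length of the longest one (4) with the ID 0.
batch = pad_sequence(sequences, batch_first=True, padding_value=0)
print(batch)
# tensor([[4, 7, 2, 0],
#         [5, 1, 0, 0],
#         [9, 3, 6, 8]])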

10.1.2    Sorting
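
Sorting instances by length before forming batches groups sequences of similar lengths together, which minimizes the number of padding tokens per batch and the computation wasted on them. A minimal sketch with illustrative sequences:

# Token ID sequences of varying lengths.
sequences = [[4, 7, 2], [5], [9, 3, 6, 8], [1, 2], [7, 7, 7]]

# Sort by length so each batch contains sequences of similar lengths.
sequences.sort(key=len)

# Slice the sorted list into batches of two.
batch_size = 2
batches = [sequences[i:i + batch_size]
           for i in range(0, len(sequences), batch_size)]
print(batches)  # [[[5], [1, 2]], [[4, 7, 2], [7, 7, 7]], [[9, 3, 6, 8]]]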

10.1.3    Masking
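
Padded positions carry no information and should not influence the model’s computations, so a Boolean mask is used to tell real tokens apart from padding. The following minimal sketch computes a masked average over token embeddings, assuming 0 is the padding ID; all tensor values are illustrative.

import torch

# A padded batch of token IDs, where 0 is the padding ID.
batch = torch.tensor([[4, 7, 2, 0],
                      [5, 1, 0, 0]])

# A Boolean mask: True for real tokens, False for padding.
mask = batch != 0

# A masked average over embeddings that ignores padded positions.
embeddings = torch.randn(2, 4, 8)         # (batch, seq_len, hidden)
masked = embeddings * mask.unsqueeze(-1)  # zero out padded positions
lengths = mask.sum(dim=1, keepdim=True)   # number of real tokens per sequence
mean = masked.sum(dim=1) / lengths        # (batch, hidden)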

10.2  Tokenization for neural models

10.2.1    Unknown words
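
Words that never appeared in the training data are conventionally mapped to a special unknown token such as <unk>. A minimal sketch with a toy vocabulary:

# A toy vocabulary; any word not in it maps to the special <unk> token.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}

def encode(tokens):
    """Map tokens to IDs, falling back to <unk> for out-of-vocabulary words."""
    return [vocab.get(token, vocab["<unk>"]) for token in tokens]

print(encode(["the", "aardvark", "sat"]))  # [1, 0, 3]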

10.2.2    Character models
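
Character models sidestep the unknown word problem entirely by treating each character as a token; the vocabulary stays tiny, at the cost of much longer sequences. A minimal sketch:

text = "The quick brown fox"

# Character tokenization: every character (including spaces) becomes a token.
tokens = list(text)
print(tokens[:8])  # ['T', 'h', 'e', ' ', 'q', 'u', 'i', 'c']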

10.2.3    Subword models
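
Subword tokenizers such as BPE offer a middle ground: frequent words stay intact, while rare words are split into more frequent subword units learned from corpus statistics. A minimal sketch using the pretrained byte-level BPE tokenizer of GPT-2 from HuggingFace Transformers; the exact split depends on the learned merges.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# A rare word is split into subword units, e.g., something like
# ['token', 'ization'], rather than mapped to an unknown token.
print(tokenizer.tokenize("tokenization"))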

10.3  Avoiding overfitting

10.3.1    Regularization
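
Two of the most common regularization techniques for neural networks are dropout, which randomly zeroes activations during training, and L2 regularization, which penalizes large weights. A minimal PyTorch sketch; the layer sizes and coefficients are illustrative.

import torch
from torch import nn

# A small network with a dropout layer between the hidden and output layers.
model = nn.Sequential(
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # randomly zero 50% of activations during training
    nn.Linear(128, 2),
)

# L2 regularization is applied via the optimizer's weight_decay parameter.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)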

10.3.2    Early stopping
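
Early stopping halts training once the validation loss has stopped improving for a fixed number of epochs (the patience). A minimal sketch in which validate() and save_checkpoint() are hypothetical placeholders for your own validation and checkpointing code:

best_loss = float("inf")
patience, patience_left = 3, 3

for epoch in range(100):
    val_loss = validate()  # hypothetical: returns this epoch's validation loss

    if val_loss < best_loss:
        best_loss = val_loss
        patience_left = patience  # improvement: reset the patience counter
        save_checkpoint()         # hypothetical: save the best model so far
    else:
        patience_left -= 1
        if patience_left == 0:
            break  # no improvement for `patience` epochs: stop training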

10.3.3    Cross validation
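
In k-fold cross validation, the dataset is split into k folds, and each fold serves as the validation set exactly once while the remaining folds are used for training. A minimal sketch using scikit-learn's KFold on a stand-in dataset of ten instances:

from sklearn.model_selection import KFold

data = list(range(10))  # stand-in for a dataset of 10 instances

# 5-fold cross validation: each instance is validated on exactly once.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in kfold.split(data):
    print(len(train_idx), len(val_idx))  # 8 2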

10.4  Dealing with imbalanced datasets

10.4.1    Using appropriate evaluation metrics
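
On an imbalanced dataset, accuracy can look deceptively good: a classifier that always predicts the majority class reaches 90% accuracy on a 90/10 split while being useless on the minority class. Metrics such as precision, recall, and F1 expose this. A minimal sketch using scikit-learn:

from sklearn.metrics import accuracy_score, f1_score

# Nine majority-class (0) instances, one minority-class (1) instance.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
# A degenerate classifier that always predicts the majority class.
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

print(accuracy_score(y_true, y_pred))             # 0.9
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0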

10.4.2    Upsampling and downsampling
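
Upsampling duplicates minority-class instances by sampling with replacement, while downsampling discards majority-class instances, until the two classes are balanced. A minimal sketch with illustrative class sizes:

import random

random.seed(0)
majority = [("some text", 0)] * 90  # 90 majority-class instances
minority = [("some text", 1)] * 10  # 10 minority-class instances

# Upsampling: sample the minority class with replacement to match the majority.
upsampled = majority + random.choices(minority, k=len(majority))

# Downsampling: keep only as many majority instances as there are minority ones.
downsampled = random.sample(majority, k=len(minority)) + minority

print(len(upsampled), len(downsampled))  # 180 20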

10.4.3    Weighting losses
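
Instead of changing the data, you can weight the loss function so that mistakes on minority-class instances cost more. A minimal PyTorch sketch; the 1:10 weight ratio and the batch contents are illustrative.

import torch
from torch import nn

# Give the rare class (index 1) ten times the weight of the common class (0).
class_weights = torch.tensor([1.0, 10.0])
loss_fn = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(4, 2)            # a batch of 4 predictions over 2 classes
targets = torch.tensor([0, 1, 0, 1])  # gold labels
loss = loss_fn(logits, targets)       # minority errors dominate the loss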

10.5  Hyperparameter tuning

10.5.1    Examples of hyperparameters

10.5.2    Grid search vs. random search
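
Grid search exhaustively tries every combination in the search space, whereas random search samples a fixed number of random combinations, which often finds good settings with far fewer trials when only a few hyperparameters really matter. A minimal sketch with an illustrative two-parameter space:

import itertools
import random

space = {"lr": [1e-4, 1e-3, 1e-2], "dropout": [0.1, 0.3, 0.5]}

# Grid search: every combination of values (3 x 3 = 9 trials).
grid_trials = list(itertools.product(*space.values()))

# Random search: a fixed budget of 5 randomly sampled combinations.
random_trials = [tuple(random.choice(values) for values in space.values())
                 for _ in range(5)]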

10.5.3    Hyperparameter tuning with Optuna
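
Optuna explores the search space by repeatedly calling an objective function, which samples hyperparameters from a trial object and returns a score to minimize. A minimal sketch in which train_and_validate() is a hypothetical placeholder for your own training code:

import optuna

def objective(trial):
    # Sample hyperparameters from the search space.
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    dropout = trial.suggest_float("dropout", 0.0, 0.5)
    # hypothetical: train a model with these values and return validation loss
    return train_and_validate(lr, dropout)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)
print(study.best_params)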

10.6  Summary