Most machine learning tasks are more complicated than labeling an entire image or document. Imagine that you need to generate subtitles for movies in a creative way. Creating transcriptions of spoken and signed language is a language generation task. If you want to emphasize angry language with bold text, that task is an additional sequence labeling task. If you want to display the transcriptions like the speech bubbles of text in comics, you could use object detection to make sure that the speech bubble comes from the right person, and you could also use semantic segmentation to ensure that the speech bubble is placed over background elements in the scene. You might also want to predict what a given person might rate the film as part of a recommendation system or feed the content into a search engine that can find matches for abstract phrases such as motivational speeches.