Chapter 8. What’s in a file?

 

This chapter covers

By now, your Tika-fu is strong, and you’re feeling like there’s not much that you can’t do with your favorite tool for file detection, metadata extraction, and language identification. Believe it or not, there’s plenty more to learn!

One thing we’ve purposefully stayed away from is telling you what’s in those files that Tika makes sense of.[1] That’s because files are a source of rich information, recording not only text or metadata, but also things like detailed descriptions of scenery, such as a bright image of a soccer ball on a grass field; waveforms representing music recorded in stereo sound; all the way to geolocated and time-referenced observations recorded by a Fourier Transform Spectrometer (FTS) instrument on a spacecraft. The short of the matter is that their intricacies and complexities deserve treatment in their own right.

1 We covered some parts of the file contents, for example, we discussed BOM markers in chapter 4 while talking about file detection. In chapter 5, we discussed methods for dealing with file reading via InputStreams. In both cases, we stayed away from the actual contents of files in particular, since it would receive full treatment in this chapter.

8.1. Types of content

8.2. How Tika extracts content

8.3. Summary