Chapter 6. Understanding metadata

 

Now that you've conquered your fear of extracting text from files, and done it in a few lines of Java code, Tika has hopefully earned a spot on your personal must-have list. The ease with which Tika can turn an afternoon’s parsing work into a smorgasbord of content handler plugins and event-based text processing is likely fresh in your mind. If not, head back to chapter 5 and relive the memories.

Looking ahead, you can sometimes weed out the files you’re not interested in before you’ve even extracted their text. A few simple criteria are often enough to make that call, and applying them up front saves you a good deal of time (and processing power).

Take, for example, the use case presented in figure 6.1.

Figure 6.1. The search engine process and metadata. Metadata about a page, including its title, a short description, and its link, are used to determine whether to “click” the link and obtain the content.
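To make the idea of weeding out files by metadata a little more concrete, here is a minimal sketch in plain Java. It doesn't use Tika at all; it relies only on the filesystem metadata the JDK exposes (name, size, and last-modified time). The class name `MetadataPrefilter`, the method `worthParsing`, and the criteria themselves (PDF or HTML extension, a size cap, a date cutoff) are assumptions made up for illustration, not anything prescribed by Tika.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.time.Instant;
import java.util.Locale;

// Hypothetical helper: decides from metadata alone whether a file is
// worth handing to a (potentially expensive) text extraction step.
class MetadataPrefilter {

    // Pure decision function over already-known metadata values.
    // The accepted extensions are an illustrative assumption.
    static boolean worthParsing(String fileName, long sizeBytes,
                                Instant lastModified,
                                long maxBytes, Instant cutoff) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        boolean wantedType = lower.endsWith(".pdf") || lower.endsWith(".html");
        boolean sizeOk = sizeBytes > 0 && sizeBytes <= maxBytes;
        boolean freshEnough = !lastModified.isBefore(cutoff);
        return wantedType && sizeOk && freshEnough;
    }

    // Convenience overload: reads the metadata from the filesystem
    // without opening the file's content.
    static boolean worthParsing(Path file, long maxBytes, Instant cutoff)
            throws IOException {
        BasicFileAttributes attrs =
                Files.readAttributes(file, BasicFileAttributes.class);
        return attrs.isRegularFile()
                && worthParsing(file.getFileName().toString(),
                                attrs.size(),
                                attrs.lastModifiedTime().toInstant(),
                                maxBytes, cutoff);
    }
}
```

A caller might loop over a directory, keep only the paths for which `worthParsing` returns true, and pass just those on for full text extraction. A file extension is a weak signal on its own, so treat this as a cheap first pass rather than a reliable type check.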

6.1. The standards of metadata

6.2. Metadata quality

6.3. Metadata in Tika

6.4. Practical uses of metadata

6.5. Summary
