Chapter 13. Content management with Apache Jackrabbit
This chapter covers
- The Apache Jackrabbit Content Repository
- The use of Tika in Jackrabbit
- File detection and parsing for Jackrabbit WebDAV
Apache Jackrabbit, http://jackrabbit.apache.org, is a content repository that provides a rich storage layer on which to build content and document management systems like the ones we discussed earlier in chapter 9. Full-text search and WebDAV integration are two key features of a content repository. In this case study we’ll learn how Jackrabbit uses Tika to help implement these features.
We’ll start by briefly describing the key features of Apache Jackrabbit and the Content Repository for Java Technology API (JCR, http://www.jcp.org/en/jsr/detail?id=170) that it implements. Armed with this background, we’ll then look deeper into how Jackrabbit’s search feature uses a pool of Tika threads to create the illusion of indexing arbitrarily large documents in near real time. We’ll also look at how Tika’s type detection feature is used to add smarts to Jackrabbit’s WebDAV integration layer. We’ll end this case study with a brief summary.
Apache Jackrabbit is an implementation of a new, special kind of database called a content repository. Defined in Java Specification Requests (JSRs) 170 and 283, a content repository is a hierarchically organized storage engine that combines features from advanced file systems and relational databases.