
Foreword
I’m a big fan of search engines and Java, so early in the year 2004 I was looking for a good Java-based open source project on search engines. I quickly discovered Nutch. Nutch is an open source search engine project from the Apache Software Foundation. It was initiated by Doug Cutting, the well-known father of Lucene.
With my new toy on my laptop, I tested and tried to evaluate it. Even if Nutch was in its early stages, it was a promising project—exactly what I was looking for. I proposed my first patches to Nutch relating to language identification in early 2005. Then, in the middle of 2005 I become a Nutch committer and increased my number of contributions relating to language identification, content-type guessing, and document analysis. Looking more deeply at Lucene, I discovered a wide set of projects around it: Nutch, Solr, and what would eventually become Mahout. Lucene provides its own analysis tools, as do Nutch and Solr, and each one employs some “proprietary” interfaces to deal with analysis engines.
So I consulted with Chris Mattmann, another Nutch committer with whom I had worked, about the potential for refactoring all these disparate tools in a common and standardized project. The concept of Tika was born.
Chris began to advocate for Tika as a standalone project in 2006. Then Jukka Zitting came into the picture and took the lead on the Tika project; after a lot of refactoring and enhancements, Tika became a Lucene top-level project.