Chapter 12. Case study 1: Krugle

 

Krugle: Searching source code

Contributed by KEN KRUGLER and GRANT GLOUSER

Krugle.org provides an amazing service: it’s a source-code search engine that continuously catalogs 4,000+ open source projects (including Lucene and its sister projects under the Apache Lucene umbrella), letting you search both the source code itself and the developers’ comments recorded in the source code control system. A search for lucene turns up matches not only from Lucene’s own source code, but also from the many open source projects that use Lucene.

Krugle is built with Lucene, but some fun challenges emerge when your documents are primarily source code. For example, a search for deletion policy must match tokens like DeletionPolicy in the source code. Punctuation like = and (, which in any other domain would be quickly discarded during analysis, must instead be carefully preserved so that a search like for(int x=0 produces the expected results. In natural-language text, frequent terms are classified as stop words and discarded; Krugle, by contrast, must keep every token from the source code.
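To make these requirements concrete, here is a minimal stand-alone sketch in Python of a tokenizer with the properties described above: it splits mixed-case identifiers into lowercase parts, keeps each punctuation character as its own token, and drops nothing as a stop word. It is not Krugle's analyzer and does not use Lucene's API; the function name code_tokens and both regular expressions are assumptions made purely for illustration.

```python
import re

# Hypothetical patterns, chosen for illustration only.
# An identifier, a run of digits, or any single non-space character
# (so punctuation such as "=" and "(" survives as a token).
_TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|\d+|\S")
# Parts of an identifier: an acronym run, a capitalized or lowercase
# word, or a run of digits ("DeletionPolicy" -> "Deletion", "Policy").
_CAMEL = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def code_tokens(text):
    """Tokenize source code or a query without discarding any token."""
    tokens = []
    for match in _TOKEN.finditer(text):
        raw = match.group()
        if raw[0].isalpha() or raw[0] == "_":
            # Split the identifier into its parts and lowercase them so
            # that "deletion policy" and "DeletionPolicy" line up.
            parts = _CAMEL.findall(raw) or [raw]
            tokens.extend(part.lower() for part in parts)
        else:
            # Digits and punctuation are kept exactly as written.
            tokens.append(raw)
    return tokens


if __name__ == "__main__":
    print(code_tokens("DeletionPolicy"))
    print(code_tokens("for(int x=0"))
```

Applying the same function to both the indexed code and the query is what lets the query deletion policy line up with the identifier DeletionPolicy, while for(int x=0 keeps its punctuation tokens in order.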

12.1. Introducing Krugle

12.2. Appliance architecture

12.3. Search performance

12.4. Parsing source code

12.5. Substring searching

12.6. Query vs. search

12.7. Future improvements

12.8. Summary