Chapter 13. Case study 2: SIREn

 

Searching semistructured documents with SIREn

Contributed by RENAUD DELBRU, NICKOLAI TOUPIKOV, MICHELLE CATASTA, ROBERT FULLER, and GIOVANNI TUMMARELLO

In this case study, the crew from the Digital Enterprise Research Institute (DERI; http://www.deri.ie) describes how they created the Semantic Information Retrieval Engine (SIREn) using Lucene. SIREn, which is open source and available at http://siren.sindice.com, searches the semantic web, also known as Web 3.0 or the “Web of Data”: a quickly growing collection of semistructured documents published by web pages that adopt the Resource Description Framework (RDF)[1] standard. With RDF, publicly available web pages encode structural relationships between arbitrary entities and objects via predicates. Although the standard has been defined for some time, only recently have websites begun adopting it in earnest.

A publicly accessible demonstration of SIREn is running at http://sindice.com. It covers more than 50 million crawled structured documents, which together yield over 1 billion entity, predicate, and object triples. SIREn is a powerful alternative to the more common RDF triplestores, which are typically backed by relational databases and are thus often limited when it comes to full-text search.
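
As a rough illustration of the entity, predicate, and object triples mentioned above, the following Python sketch models a few statements as plain tuples and groups them by the entity they describe. The URIs, the predicate names, and the `group_by_entity` helper are all invented for this example; this isn't SIREn's data model or API, only a way to picture the kind of data it searches.

```python
from collections import defaultdict

# Hypothetical triples in (entity, predicate, object) form.
# Every identifier here is made up for illustration.
triples = [
    ("http://example.org/person/alice", "name", "Alice"),
    ("http://example.org/person/alice", "knows", "http://example.org/person/bob"),
    ("http://example.org/person/bob", "name", "Bob"),
]


def group_by_entity(triples):
    """Collect the (predicate, object) pairs that describe each entity."""
    entities = defaultdict(list)
    for entity, predicate, obj in triples:
        entities[entity].append((predicate, obj))
    return dict(entities)


entities = group_by_entity(triples)
for entity, statements in entities.items():
    print(entity, statements)
```

Grouping statements by entity turns a flat list of triples into one small semistructured description per entity, which is the sort of unit an entity search has to deal with.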

13.1. Introducing SIREn

13.2. SIREn’s benefits

13.3. Indexing entities with SIREn

13.4. Searching entities with SIREn

13.5. Integrating SIREn in Solr

13.6. Benchmark

13.7. Summary