Chapter 7. Language detection

Imagine you’re in charge of developing a searchable document database for a multilingual organization like the European Union, an international corporation, or a local restaurant that wants to publish its menus in more than one language. Typically no single user of such a database knows all the languages used in the stored documents, so the system should be able to categorize and retrieve documents by language in order to present users with information that they can understand. To make things more challenging, most of the documents added to the database don’t come with reliable metadata about the language they’re written in.

To implement such a multilingual document database, you need a language detection tool like the one shown in figure 7.1. This tool would act like a pipeline that takes incoming documents with no language metadata and automatically annotates them with the languages they’re written in. The annotation should take the form of a Metadata.LANGUAGE entry in the document metadata. Our task in this chapter is to find out how this can be done.

Figure 7.1. The language detection pipeline. Incoming documents with no language metadata are analyzed to determine the language they’re written in. The resulting language information is associated with the documents as an extra piece of metadata.
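To make the shape of this pipeline concrete, here is a minimal, self-contained sketch of the annotation step. It is not Tika's API: the class LanguagePipelineSketch, the tiny stop-word lists, and the "language" key (standing in for the Metadata.LANGUAGE entry) are all assumptions made for illustration, and a plain Map plays the role of the document metadata. The detection technique here is naive stop-word counting, chosen only because it fits in a few lines.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class LanguagePipelineSketch {

    // Assumed metadata key; stands in for the Metadata.LANGUAGE entry.
    public static final String LANGUAGE_KEY = "language";

    // Toy stop-word lists, for illustration only. A real detector needs
    // far richer language profiles than a handful of common words.
    private static final Map<String, Set<String>> STOP_WORDS = new LinkedHashMap<>();
    static {
        STOP_WORDS.put("en", new HashSet<>(Arrays.asList("the", "and", "of", "is", "in")));
        STOP_WORDS.put("fr", new HashSet<>(Arrays.asList("le", "la", "et", "les", "est")));
        STOP_WORDS.put("de", new HashSet<>(Arrays.asList("der", "die", "und", "ist", "das")));
    }

    // Returns the language code whose stop words occur most often in the
    // text, or null if no known stop word occurs at all.
    public static String detect(String text) {
        String[] words = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = null;
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> profile : STOP_WORDS.entrySet()) {
            int hits = 0;
            for (String word : words) {
                if (profile.getValue().contains(word)) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = profile.getKey();
            }
        }
        return best;
    }

    // The pipeline step: annotate a document that has no language metadata.
    // Existing language metadata is trusted and left untouched.
    public static void annotate(String text, Map<String, String> metadata) {
        if (metadata.containsKey(LANGUAGE_KEY)) {
            return;
        }
        String language = detect(text);
        if (language != null) {
            metadata.put(LANGUAGE_KEY, language);
        }
    }
}
```

Calling annotate with the text of a document and an empty metadata map leaves the map with a single language entry when the text matches one of the toy profiles, and leaves it empty otherwise. Leaving the entry out when detection fails is a design choice in this sketch: a missing annotation is easier to handle downstream than a wrong one.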

7.1. The most translated document in the world

7.2. Sounds Greek to me—theory of language detection

7.3. Language detection in Tika

7.4. Summary