8 NED with open LLMs and domain ontologies
This chapter covers
- Understanding the limitations of traditional Named Entity Disambiguation (NED) tools
- Combining general-purpose LLMs and domain ontologies for the NED task
- Introducing a novel, multi-step approach for disambiguation, including shortest-path detection, path-to-text translation, and textual path summarization
Chapter 7 focused on Named Entity Disambiguation (NED) and highlighted the role of ScispaCy, a specialized natural language processing (NLP) tool built on the spaCy framework. ScispaCy is designed for processing biomedical documents and publications, and it provides pre-trained models for that domain.
ScispaCy incorporates specific vocabularies and ontologies, such as the Unified Medical Language System (UMLS), which provides canonical entities useful for disambiguating mentions in the text.
However, this approach presents some limitations:
- It is specifically designed for a particular application domain, namely the biomedical field.
- It makes it difficult to expand and update the reference knowledge base with new entities and terminology (e.g., additional aliases for existing entities).
- It fails to fully leverage the extensive information available within the knowledge base.
More specifically, ScispaCy does not leverage the existing relationships and paths between entities for the disambiguation task. To understand the impact of this last point, let’s recap the example we discussed at the beginning of chapter 7: