Indexing infrastructure for semantics full-text search
University of New Brunswick
The increasing effectiveness and widespread use of automated entity linking platforms have enabled search techniques to adopt semantic-enabled methods such as sense disambiguation, intent determination, and instance identification within the search process. Researchers have already explored integrating semantic information into practical search engines, a paradigm known as semantic full-text search. However, the practical and efficient incorporation of semantic information within search indices is still an open challenge. In this thesis, we propose two indexing approaches for building efficient and effective semantic full-text indices. In the first approach, we remain faithful to the traditional form of building search indices, where the index key is guaranteed to be present in each of the indexed documents. As such, we assume that the documents associated with each keyword, semantic entity, or semantic type do in fact explicitly contain that information. For this reason, the first proposed indexing mechanism is referred to as the Explicit Semantic Full-text Index. We propose various data structures for representing this index, together with strategies for integrating them effectively. Furthermore, we introduce algorithms for performing query processing tasks such as Boolean and ranked union and intersection on the proposed indices. In the second approach, we relax the traditional condition of search indices and allow documents associated with an index key to be semantically similar to the key as opposed to explicitly including it. We refer to this indexing strategy as the Implicit Semantic Full-text Index. We propose a mechanism to embed keyword, semantic entity, and semantic type information within a homogeneous representation space so that all three can be indexed in the same indexing data structure.
Based on our experiments, we find that when neural embeddings are used to build inverted indices, thereby relaxing the requirement to explicitly observe the posting list key in the indexed document, (a) retrieval efficiency improves compared to a standard inverted index, as reflected in a smaller index size and shorter query processing time, and at the same time (b) retrieval effectiveness remains competitive with the baseline in terms of retrieving a reasonable number of relevant documents from the indexed corpus.
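To make the contrast between the two indexing strategies concrete, the following is a minimal Python sketch. It is an illustration only, not the data structures or algorithms developed in the thesis. The toy corpus, the two-dimensional vectors, the top-k cutoff, and all function names are hypothetical assumptions. In the explicit index, a document appears in a posting list only if it contains the key; in the implicit index, a document appears if its embedding is among the k most similar to the key's embedding.

```python
import math
from collections import defaultdict

# Hypothetical toy corpus: each document lists its keywords,
# linked entities, and entity types (all values are made up).
DOCS = {
    1: {"keyword": ["search", "index"], "entity": ["Inverted_index"], "type": ["DataStructure"]},
    2: {"keyword": ["search", "engine"], "entity": ["Web_search_engine"], "type": ["Software"]},
    3: {"keyword": ["index", "tree"], "entity": ["B-tree"], "type": ["DataStructure"]},
}

def build_explicit_index(docs):
    """Map (kind, value) keys to sorted posting lists of documents that contain the key."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for kind, values in docs[doc_id].items():
            for value in set(values):
                index[(kind, value)].append(doc_id)
    return index

def intersect(a, b):
    """Boolean AND of two sorted posting lists via a two-pointer merge."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    """Boolean OR of two sorted posting lists via a two-pointer merge."""
    i, j, out = 0, 0, []
    while i < len(a) or j < len(b):
        if j == len(b) or (i < len(a) and a[i] < b[j]):
            out.append(a[i])
            i += 1
        elif i == len(a) or b[j] < a[i]:
            out.append(b[j])
            j += 1
        else:  # same document in both lists
            out.append(a[i])
            i += 1
            j += 1
    return out

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def build_implicit_index(key_vecs, doc_vecs, k):
    """Map each key to the k documents whose embeddings are most similar to it."""
    index = {}
    for key, key_vec in key_vecs.items():
        ranked = sorted(doc_vecs, key=lambda d: cosine(key_vec, doc_vecs[d]), reverse=True)
        index[key] = sorted(ranked[:k])
    return index

# Hypothetical embeddings placing keys and documents in one shared space.
KEY_VECS = {("keyword", "index"): (1.0, 0.0), ("type", "Software"): (0.0, 1.0)}
DOC_VECS = {1: (0.9, 0.1), 2: (0.1, 0.9), 3: (0.8, 0.3)}

explicit = build_explicit_index(DOCS)
implicit = build_implicit_index(KEY_VECS, DOC_VECS, k=2)
```

In this sketch, keywords, entities, and types share one index by using a `(kind, value)` pair as the key. The implicit index caps every posting list at k entries, which is one simple way a similarity-based index could end up smaller than an explicit one; the actual mechanism in the thesis may differ.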