Efficient text search with spatial constraints
University of New Brunswick
This thesis presents a search engine called TexSpaSearch that can search text documents having associated positions in space. We defined three query types Q 1 ( t) , Q2( t, r) and Q3(p, r) that can search documents with queries containing text t, position p and radius r components. We indexed a sample herbarium database of 40,791 records using a novel R*-tree and suffix tree data structure to achieve efficient text search with spatial data constraints. Significant preprocessing was performed to transform the herbarium database to a form usable by TexSpaSearch. We built unique data structures used to index text with spatial attributes that simultaneously support Ql, Q2 and Q3 queries. This thesis presents a novel approach for simplifying polygon boundaries for packing into R*-tree leaf nodes. Search results are ranked by a novel modified Lucene algorithm that supports free form text indexing in a general way. A novel ranking of Q2 search results combines the text and spatial scores. The architecture of a prototype Java based web application that invokes TexSpaSearch is described. A theoretical analysis shows that TexSpaSearch requires O(A 2lbl) average time for Ql search, where A is the number of single words in the query string t, and llbl is the average length of a subphrase in t. Q2 and Q3 require O ( A 2 Tbf + Z logM Vn + y) and O(logM Vn + y), respectively, where Z is the number of point records in the list P of text search results, Vn is the number of data objects indexed in the R*=tree for n records, M is the maximum number of entries of an internal node in the R*-tree, and y is the number of leaf nodes found in range in a Q3 query. Testing was performed with 20 test Ql queries to compare TexSpaSearch to a Google Search Appliance (GSA) for text search. Results indicate that the GSA is about 45.5 times faster than TexSpaSearch. The same 20 test queries were combined with a Q2 query radius r = 5, 50 and 500m. Results indicate Q2 queries are 22.8 times slower than Ql queries. For testing Q3 queries, 15 points were chosen in 15 different N .B. counties. The average Tc, T 8 and Te values of 191.5ms, 3603.2ms and 4183.9ms are given in the Q3 test, respectively, and the average value of Npt + Npl is 1313.4.