Efficient text search with spatial constraints

dc.contributor.advisorNickerson, Bradford
dc.contributor.authorHan, Dan
dc.description.abstractThis thesis presents a search engine called TexSpaSearch that can search text documents having associated positions in space. We defined three query types Q 1 ( t) , Q2( t, r) and Q3(p, r) that can search documents with queries containing text t, position p and radius r components. We indexed a sample herbarium database of 40,791 records using a novel R*-tree and suffix tree data structure to achieve efficient text search with spatial data constraints. Significant preprocessing was performed to transform the herbarium database to a form usable by TexSpaSearch. We built unique data structures used to index text with spatial attributes that simultaneously support Ql, Q2 and Q3 queries. This thesis presents a novel approach for simplifying polygon boundaries for packing into R*-tree leaf nodes. Search results are ranked by a novel modified Lucene algorithm that supports free form text indexing in a general way. A novel ranking of Q2 search results combines the text and spatial scores. The architecture of a prototype Java based web application that invokes TexSpaSearch is described. A theoretical analysis shows that TexSpaSearch requires O(A 2lbl) average time for Ql search, where A is the number of single words in the query string t, and llbl is the average length of a subphrase in t. Q2 and Q3 require O ( A 2 Tbf + Z logM Vn + y) and O(logM Vn + y), respectively, where Z is the number of point records in the list P of text search results, Vn is the number of data objects indexed in the R*=tree for n records, M is the maximum number of entries of an internal node in the R*-tree, and y is the number of leaf nodes found in range in a Q3 query. Testing was performed with 20 test Ql queries to compare TexSpaSearch to a Google Search Appliance (GSA) for text search. Results indicate that the GSA is about 45.5 times faster than TexSpaSearch. The same 20 test queries were combined with a Q2 query radius r = 5, 50 and 500m. Results indicate Q2 queries are 22.8 times slower than Ql queries. For testing Q3 queries, 15 points were chosen in 15 different N .B. counties. The average Tc, T 8 and Te values of 191.5ms, 3603.2ms and 4183.9ms are given in the Q3 test, respectively, and the average value of Npt + Npl is 1313.4.
dc.description.copyright© Dan Han, 2014
dc.format.extentxvi, 130 pages
dc.publisherUniversity of New Brunswick
dc.subject.disciplineComputer Science
dc.titleEfficient text search with spatial constraints
dc.typemaster thesis
thesis.degree.disciplineComputer Science
thesis.degree.fullnameMaster of Computer Science
thesis.degree.grantorUniversity of New Brunswick
Original bundle
Now showing 1 - 1 of 1
Thumbnail Image
24.39 MB
Adobe Portable Document Format