Efficient text search with spatial constraints

dc.contributor.advisorNickerson, Bradford
dc.contributor.authorHan, Dan
dc.date.accessioned2023-03-01T16:21:54Z
dc.date.available2023-03-01T16:21:54Z
dc.date.issued2014
dc.date.updated2023-03-01T15:01:51Z
dc.description.abstractThis thesis presents a search engine called TexSpaSearch that can search text documents having associated positions in space. We defined three query types Q 1 ( t) , Q2( t, r) and Q3(p, r) that can search documents with queries containing text t, position p and radius r components. We indexed a sample herbarium database of 40,791 records using a novel R*-tree and suffix tree data structure to achieve efficient text search with spatial data constraints. Significant preprocessing was performed to transform the herbarium database to a form usable by TexSpaSearch. We built unique data structures used to index text with spatial attributes that simultaneously support Ql, Q2 and Q3 queries. This thesis presents a novel approach for simplifying polygon boundaries for packing into R*-tree leaf nodes. Search results are ranked by a novel modified Lucene algorithm that supports free form text indexing in a general way. A novel ranking of Q2 search results combines the text and spatial scores. The architecture of a prototype Java based web application that invokes TexSpaSearch is described. A theoretical analysis shows that TexSpaSearch requires O(A 2lbl) average time for Ql search, where A is the number of single words in the query string t, and llbl is the average length of a subphrase in t. Q2 and Q3 require O ( A 2 Tbf + Z logM Vn + y) and O(logM Vn + y), respectively, where Z is the number of point records in the list P of text search results, Vn is the number of data objects indexed in the R*=tree for n records, M is the maximum number of entries of an internal node in the R*-tree, and y is the number of leaf nodes found in range in a Q3 query. Testing was performed with 20 test Ql queries to compare TexSpaSearch to a Google Search Appliance (GSA) for text search. Results indicate that the GSA is about 45.5 times faster than TexSpaSearch. The same 20 test queries were combined with a Q2 query radius r = 5, 50 and 500m. Results indicate Q2 queries are 22.8 times slower than Ql queries. For testing Q3 queries, 15 points were chosen in 15 different N .B. counties. The average Tc, T 8 and Te values of 191.5ms, 3603.2ms and 4183.9ms are given in the Q3 test, respectively, and the average value of Npt + Npl is 1313.4.
dc.description.copyright© Dan Han, 2014
dc.formattext/xml
dc.format.extentxvi, 130 pages
dc.format.mediumelectronic
dc.identifier.urihttps://unbscholar.lib.unb.ca/handle/1882/13649
dc.language.isoen_CA
dc.publisherUniversity of New Brunswick
dc.rightshttp://purl.org/coar/access_right/c_abf2
dc.subject.disciplineComputer Science
dc.titleEfficient text search with spatial constraints
dc.typemaster thesis
thesis.degree.disciplineComputer Science
thesis.degree.fullnameMaster of Computer Science
thesis.degree.grantorUniversity of New Brunswick
thesis.degree.levelmasters
thesis.degree.nameM.C.S.

Files

Original bundle
Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
item.pdf
Size:
24.39 MB
Format:
Adobe Portable Document Format