Efficient text search with spatial constraints

Han, Dan

Efficient text search with spatial constraints

dc.contributor.advisor	Nickerson, Bradford
dc.contributor.author	Han, Dan
dc.date.accessioned	2023-03-01T16:21:54Z
dc.date.available	2023-03-01T16:21:54Z
dc.date.issued	2014
dc.date.updated	2023-03-01T15:01:51Z
dc.description.abstract	This thesis presents a search engine called TexSpaSearch that can search text documents having associated positions in space. We defined three query types Q 1 ( t) , Q2( t, r) and Q3(p, r) that can search documents with queries containing text t, position p and radius r components. We indexed a sample herbarium database of 40,791 records using a novel R-tree and suffix tree data structure to achieve efficient text search with spatial data constraints. Significant preprocessing was performed to transform the herbarium database to a form usable by TexSpaSearch. We built unique data structures used to index text with spatial attributes that simultaneously support Ql, Q2 and Q3 queries. This thesis presents a novel approach for simplifying polygon boundaries for packing into R-tree leaf nodes. Search results are ranked by a novel modified Lucene algorithm that supports free form text indexing in a general way. A novel ranking of Q2 search results combines the text and spatial scores. The architecture of a prototype Java based web application that invokes TexSpaSearch is described. A theoretical analysis shows that TexSpaSearch requires O(A 2lbl) average time for Ql search, where A is the number of single words in the query string t, and llbl is the average length of a subphrase in t. Q2 and Q3 require O ( A 2 Tbf + Z logM Vn + y) and O(logM Vn + y), respectively, where Z is the number of point records in the list P of text search results, Vn is the number of data objects indexed in the R=tree for n records, M is the maximum number of entries of an internal node in the R-tree, and y is the number of leaf nodes found in range in a Q3 query. Testing was performed with 20 test Ql queries to compare TexSpaSearch to a Google Search Appliance (GSA) for text search. Results indicate that the GSA is about 45.5 times faster than TexSpaSearch. The same 20 test queries were combined with a Q2 query radius r = 5, 50 and 500m. Results indicate Q2 queries are 22.8 times slower than Ql queries. For testing Q3 queries, 15 points were chosen in 15 different N .B. counties. The average Tc, T 8 and Te values of 191.5ms, 3603.2ms and 4183.9ms are given in the Q3 test, respectively, and the average value of Npt + Npl is 1313.4.
dc.description.copyright	© Dan Han, 2014
dc.format	text/xml
dc.format.extent	xvi, 130 pages
dc.format.medium	electronic
dc.identifier.uri	https://unbscholar.lib.unb.ca/handle/1882/13649
dc.language.iso	en_CA
dc.publisher	University of New Brunswick
dc.rights	http://purl.org/coar/access_right/c_abf2
dc.subject.discipline	Computer Science
dc.title	Efficient text search with spatial constraints
dc.type	master thesis
thesis.degree.discipline	Computer Science
thesis.degree.fullname	Master of Computer Science
thesis.degree.grantor	University of New Brunswick
thesis.degree.level	masters
thesis.degree.name	M.C.S.

Files

Original bundle

Now showing 1 - 1 of 1

Name:: item.pdf
Size:: 24.39 MB
Format:: Adobe Portable Document Format

Download

Collections

Open Theses & Dissertations