Efficient text search with spatial constraints
dc.contributor.advisor | Nickerson, Bradford | |
dc.contributor.author | Han, Dan | |
dc.date.accessioned | 2023-03-01T16:21:54Z | |
dc.date.available | 2023-03-01T16:21:54Z | |
dc.date.issued | 2014 | |
dc.date.updated | 2023-03-01T15:01:51Z | |
dc.description.abstract | This thesis presents a search engine called TexSpaSearch that can search text documents having associated positions in space. We defined three query types Q 1 ( t) , Q2( t, r) and Q3(p, r) that can search documents with queries containing text t, position p and radius r components. We indexed a sample herbarium database of 40,791 records using a novel R*-tree and suffix tree data structure to achieve efficient text search with spatial data constraints. Significant preprocessing was performed to transform the herbarium database to a form usable by TexSpaSearch. We built unique data structures used to index text with spatial attributes that simultaneously support Ql, Q2 and Q3 queries. This thesis presents a novel approach for simplifying polygon boundaries for packing into R*-tree leaf nodes. Search results are ranked by a novel modified Lucene algorithm that supports free form text indexing in a general way. A novel ranking of Q2 search results combines the text and spatial scores. The architecture of a prototype Java based web application that invokes TexSpaSearch is described. A theoretical analysis shows that TexSpaSearch requires O(A 2lbl) average time for Ql search, where A is the number of single words in the query string t, and llbl is the average length of a subphrase in t. Q2 and Q3 require O ( A 2 Tbf + Z logM Vn + y) and O(logM Vn + y), respectively, where Z is the number of point records in the list P of text search results, Vn is the number of data objects indexed in the R*=tree for n records, M is the maximum number of entries of an internal node in the R*-tree, and y is the number of leaf nodes found in range in a Q3 query. Testing was performed with 20 test Ql queries to compare TexSpaSearch to a Google Search Appliance (GSA) for text search. Results indicate that the GSA is about 45.5 times faster than TexSpaSearch. The same 20 test queries were combined with a Q2 query radius r = 5, 50 and 500m. Results indicate Q2 queries are 22.8 times slower than Ql queries. For testing Q3 queries, 15 points were chosen in 15 different N .B. counties. The average Tc, T 8 and Te values of 191.5ms, 3603.2ms and 4183.9ms are given in the Q3 test, respectively, and the average value of Npt + Npl is 1313.4. | |
dc.description.copyright | © Dan Han, 2014 | |
dc.format | text/xml | |
dc.format.extent | xvi, 130 pages | |
dc.format.medium | electronic | |
dc.identifier.uri | https://unbscholar.lib.unb.ca/handle/1882/13649 | |
dc.language.iso | en_CA | |
dc.publisher | University of New Brunswick | |
dc.rights | http://purl.org/coar/access_right/c_abf2 | |
dc.subject.discipline | Computer Science | |
dc.title | Efficient text search with spatial constraints | |
dc.type | master thesis | |
thesis.degree.discipline | Computer Science | |
thesis.degree.fullname | Master of Computer Science | |
thesis.degree.grantor | University of New Brunswick | |
thesis.degree.level | masters | |
thesis.degree.name | M.C.S. |
Files
Original bundle
1 - 1 of 1