Efficient text search with spatial constraints
Loading...
Files
Date
2014
Authors
Journal Title
Journal ISSN
Volume Title
Publisher
University of New Brunswick
Abstract
This thesis presents a search engine called TexSpaSearch that can search text
documents having associated positions in space. We defined three query types Q 1 ( t)
, Q2( t, r) and Q3(p, r) that can search documents with queries containing text t,
position p and radius r components. We indexed a sample herbarium database
of 40,791 records using a novel R*-tree and suffix tree data structure to achieve
efficient text search with spatial data constraints. Significant preprocessing was
performed to transform the herbarium database to a form usable by TexSpaSearch.
We built unique data structures used to index text with spatial attributes that
simultaneously support Ql, Q2 and Q3 queries.
This thesis presents a novel approach for simplifying polygon boundaries for packing
into R*-tree leaf nodes. Search results are ranked by a novel modified Lucene
algorithm that supports free form text indexing in a general way. A novel ranking
of Q2 search results combines the text and spatial scores. The architecture of a
prototype Java based web application that invokes TexSpaSearch is described. A
theoretical analysis shows that TexSpaSearch requires O(A 2lbl) average time for Ql
search, where A is the number of single words in the query string t, and llbl is the
average length of a subphrase in t. Q2 and Q3 require O ( A 2 Tbf + Z logM Vn + y) and
O(logM Vn + y), respectively, where Z is the number of point records in the list P
of text search results, Vn is the number of data objects indexed in the R*=tree for n records, M is the maximum number of entries of an internal node in the R*-tree,
and y is the number of leaf nodes found in range in a Q3 query.
Testing was performed with 20 test Ql queries to compare TexSpaSearch to a Google
Search Appliance (GSA) for text search. Results indicate that the GSA is about 45.5
times faster than TexSpaSearch. The same 20 test queries were combined with a Q2
query radius r = 5, 50 and 500m. Results indicate Q2 queries are 22.8 times slower
than Ql queries. For testing Q3 queries, 15 points were chosen in 15 different N .B.
counties. The average Tc, T 8 and Te values of 191.5ms, 3603.2ms and 4183.9ms are
given in the Q3 test, respectively, and the average value of Npt + Npl is 1313.4.