Similarity of weighted trees
University of New Brunswick
Similarity computation and matching has become an important topic in information retrieval, case based reasoning, data clustering, database integration, ontology alignment, image processing, natural language processing, schema matching, and e-Business/e-Learning. Given an object in an application domain, finding similar objects helps in obtaining solutions to problems in all of these areas. Objects can be represented by a set of key words/phrases. However, key words/phrases have limitations representing complex object attributes. This thesis proposes node-labeled, arc-labeled, and arc-weighted tree representations for applications such as product/service descriptions in e-Business/e-Learning. Arc labels represent attributes of products/services. The targets of arcs can be leaf nodes or arbitrarily complex sub-trees representing product/service partonomies. Arc weights indicate relative importance/preference values on product/service attributes. We propose a tree similarity algorithm to compute the similarity of weighted trees, consisting of syntactic and semantic components. The syntactic component of the tree similarity algorithm traverses a pair of trees top-down and aggregates similarity values bottom-up. We also propose a tree simplicity measure to compute the simplicity value of a single (sub-)tree. Tree simplicity is computed for each sub-tree missing in the other tree, and the simplicity value is used for similarity computation, where a simpler (less complex) sub-tree leads to a higher tree similarity. Weights are averaged to embody preferences from a pair of trees, e.g., buyer and seller trees in e-Business. We complement the syntactic component by a semantic global similarity, i.e. taxonomic class similarity for inner nodes, and local similarity, i.e. typed similarity for leaf nodes. Our tree similarity algorithm was applied to the Teclantic and eduSource projects for e-Business and e-Learning, respectively. Teclantic matches companies or investors in Atlantic Canada with researchers in various disciplines to share technologies. In eduSource, we match learning objects (i.e. courses) and learning object providers (i.e. course providers). Both projects provided a ranked list of search results based on a user’s requests. We carried out computational experiments on systematic variations of trees to analyze tree simplicity and similarity properties, and to compare our tree similarity algorithm with other tree similarity/distance algorithms in the literature. When compared to other algorithms, our tree similarity algorithm shows advantages on aggregating sub-tree similarity values to obtain the overall tree similarity value. Our tree simplicity algorithm, beyond its usage for missing sub-trees in our tree similarity algorithm, can also be used as a standalone algorithm to calculate the simplicity of a single tree.