Semantic enrichment and similarity approximation for biomedical sequence images
University of New Brunswick
Scientific publications are considered the most up-to-date resource on ongoing research activities and scientific knowledge. Efficient practices for accessing biomedical publications are key to the timely transfer of information from the scientific research community to peer investigators and healthcare practitioners. Biomedical sequence images published in the literature play a central role in life science discoveries. Whereas advanced text-mining pipelines for information retrieval and knowledge extraction are now commonplace for processing documents, the challenges of knowledge management and utility operations unique to biomedical image data are only recently gaining recognition. Sequence images depicting the key findings of research papers contain rich information derived from a wide range of biomedical experiments. Searching for relevant sequence images is, however, error-prone because images remain opaque to information retrieval and knowledge extraction engines. Specifically, there is no explicit description or annotation of the sequence image content. Moreover, traditional biomedical search engines, which search only image captions for relevant keywords, offer syntactic search mechanisms without regard for the exact meaning of the query. The solution proposed in this thesis, semantic enrichment of biomedical sequence images, combines several technologies to harness the comprehensive information associated with, and contained in, these images. Information extracted from sequence images is used as seed data to aggregate and harvest new annotations from heterogeneous online biomedical resources. Comprehensive semantic enrichment of biomedical images incorporates a variety of knowledge infrastructure components and services, including image feature extraction, semantic web data services, linked open data, and crowd annotation.
Together, these resources make it possible to automatically or semi-automatically discover and semantically interlink new information in a way that supports semantic search for sequence images. The resulting enriched sequence images are readily reusable based on their semantic annotations and can be made available for ad hoc data integration activities. Furthermore, to support image reuse, this thesis introduces a mechanism for identifying similar sequence images, based on fuzzy inference and cosine similarity techniques, that retrieves and classifies related sequence images according to their semantic annotations. The outcomes of this research will be relevant to a variety of user groups, ranging from clinicians to researchers, who search with sequence image data.
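
To make the similarity idea concrete, the following is a minimal sketch of cosine similarity computed over the semantic annotations of sequence images. It is an illustration only, not the mechanism developed in the thesis: the function names, the bag-of-terms representation, and the image identifiers and annotation terms are all assumptions made for the example, and the fuzzy inference component is not shown.

```python
import math
from collections import Counter


def cosine_similarity(terms_a, terms_b):
    """Cosine similarity between two bags of annotation terms.

    Each image is represented (as an assumption for this sketch) by the
    list of annotation terms attached to it; term counts form the vector.
    """
    vec_a, vec_b = Counter(terms_a), Counter(terms_b)
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an image without annotations matches nothing
    return dot / (norm_a * norm_b)


def rank_similar(query_terms, annotated_images):
    """Return (image_id, score) pairs, most similar to the query first."""
    scored = [
        (image_id, cosine_similarity(query_terms, terms))
        for image_id, terms in annotated_images.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical annotations, invented purely for illustration.
images = {
    "image_1": ["BRCA1", "Homo sapiens", "protein sequence"],
    "image_2": ["BRCA1", "Mus musculus", "nucleotide sequence"],
    "image_3": ["hemoglobin", "Danio rerio"],
}
ranking = rank_similar(["BRCA1", "Homo sapiens"], images)
```

In this sketch, images that share more annotation terms with the query rank higher, and images with no shared terms score zero. A weighted scheme, for example one that gives rarer annotation terms more influence, could replace the raw counts without changing the overall structure.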