Phrase based pattern matching framework for topic discovery and clustering
University of New Brunswick
In text mining, a major challenge is to discover understandable topics of discussion while also producing a statistically valid grouping of the underlying documents. Word order and word co-occurrence information are crucial to understanding the meaning of a document. Vector space and bag-of-words models, which discard word order, are therefore poor candidates for clustering algorithms focused on topic discovery. Phrase-based models have proven promising for extracting meaningful topics from a given set of documents. In this thesis, a new framework is proposed that performs topic discovery and clustering simultaneously in linear time. The core of the framework is a new document model and an algorithm that performs efficient exact, prefix, postfix, and infix matching of phrases in linear time. The document model uses concepts from graph theory and automata theory to efficiently and intelligently match, index, track, and analyze interesting patterns. The generic nature of the framework supports a variety of text mining applications, such as query enhancement, keyword extraction, and indexing. The primary focus is on discovering meaningful topics in a set of documents and building a story or context around them. The model can also track previously discovered topics, and it is efficient enough to capture the essence of the present data and to link past and future data. To capture the natural language in the text, the framework goes beyond matching words or terms and also uses phrases, entities, and word sense enrichment techniques. With these, we were able to capture the essence of the topic discussed in a document even when the document contained no exact string match. The idea of story building is new in this work. The concepts of a "Knowledge Graph" and "more than just keyword" search are also introduced.
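The four matching modes named above can be illustrated with a minimal sketch. It is not the thesis's document model or its linear-time algorithm: it compares token lists naively, and the function names and the exact meaning given to each mode (a query phrase matched against a longer stored phrase) are assumptions made for illustration only.

```python
from typing import List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def match_kinds(query: str, phrase: str) -> Set[str]:
    """Return which of the four modes relate `query` to `phrase`.

    Assumed definitions (hypothetical, for this sketch only):
      exact   - the two token sequences are identical
      prefix  - `phrase` starts with `query`
      postfix - `phrase` ends with `query`
      infix   - `query` occurs strictly inside `phrase`
    """
    q, p = tokenize(query), tokenize(phrase)
    kinds: Set[str] = set()
    if not q or len(q) > len(p):
        return kinds
    if q == p:
        kinds.add("exact")
        return kinds
    n = len(q)
    if p[:n] == q:
        kinds.add("prefix")
    if p[-n:] == q:
        kinds.add("postfix")
    # Start positions 1 .. len(p)-n-1 keep the match away from both ends.
    if any(p[i:i + n] == q for i in range(1, len(p) - n)):
        kinds.add("infix")
    return kinds


if __name__ == "__main__":
    stored = "New York stock exchange opens"
    for query in ("new york", "stock exchange", "exchange opens", stored):
        print(query, "->", sorted(match_kinds(query, stored)))
```

This naive version rescans the stored phrase for every query, so it is quadratic in the worst case; it only shows what the four modes distinguish, not how the framework achieves linear time.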
In the experiments conducted, scalability and space and time performance are compared against benchmark phrase-based document models and industry standards. F-measure, entropy, and human evaluation are used to validate the topics and stories obtained. The results are promising and encouraging.
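As a rough illustration of the two quantitative measures, the sketch below computes entropy and F-measure for a clustering against reference class labels. It assumes the definitions commonly used in document clustering evaluation (size-weighted cluster entropy, and the class-weighted best F-score over clusters); the abstract does not state which variants the thesis uses, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def clustering_entropy(labels: Sequence[Hashable],
                       clusters: Sequence[Hashable]) -> float:
    """Size-weighted average of per-cluster class entropy (lower is better)."""
    n = len(labels)
    total = 0.0
    for c in set(clusters):
        members = [lab for lab, k in zip(labels, clusters) if k == c]
        counts = Counter(members)
        e = -sum((m / len(members)) * math.log2(m / len(members))
                 for m in counts.values())
        total += len(members) / n * e
    return total


def clustering_f_measure(labels: Sequence[Hashable],
                         clusters: Sequence[Hashable]) -> float:
    """Class-weighted best F-score over all clusters (higher is better)."""
    n = len(labels)
    class_sizes = Counter(labels)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(labels, clusters))
    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for clu, n_j in cluster_sizes.items():
            n_ij = joint[(cls, clu)]
            if n_ij == 0:
                continue
            precision = n_ij / n_j
            recall = n_ij / n_i
            best = max(best, 2 * precision * recall / (precision + recall))
        total += n_i / n * best
    return total


if __name__ == "__main__":
    labels = ["a", "a", "b", "b"]  # made-up reference classes
    print(clustering_entropy(labels, [0, 0, 1, 1]),
          clustering_f_measure(labels, [0, 0, 1, 1]))
```

A clustering that reproduces the reference classes exactly gets entropy 0 and F-measure 1; merging everything into one cluster raises the entropy and lowers the F-measure.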