Phrase based pattern matching framework for topic discovery and clustering

dc.contributor.advisor: Ghorbani, Ali
dc.contributor.author: Singh, Ramanpreet
dc.date.accessioned: 2023-03-01T16:23:59Z
dc.date.available: 2023-03-01T16:23:59Z
dc.date.issued: 2013
dc.date.updated: 2016-10-26T00:00:00Z
dc.description.abstract: In text mining, one of the major challenges is to discover understandable topics of discussion while also producing a statistically valid underlying document grouping. Word order and word co-occurrence information are crucial to understanding the meaning of a document. Vector space and bag-of-words models are poor candidates for clustering algorithms focused on topic discovery. Phrase based models have proven promising for extracting meaningful topics from a given set of documents. This thesis proposes a new framework that simultaneously performs topic discovery and clustering in linear time. The core of this framework is a new document model and algorithm that perform efficient exact, prefix, postfix, and infix matching of phrases in linear time. The document model uses concepts from graph theory and the theory of automata to efficiently and intelligently match, index, track, and analyze interesting patterns. The generic nature of the framework supports various text mining applications, such as query enhancement, keyword extraction, and indexing. The primary focus has been on discovering meaningful topics in a set of documents and building a story or context around them. The model is also capable of tracking already discovered topics. The proposed model is efficient enough to capture the essence of the present data and make a link between past and future data. Rather than just matching words or terms, the framework also uses phrases, entities, and word sense enrichment techniques to capture the natural language in the text. With this, we were able to capture the essence of the topic discussed in a document even when there was no exact string match. The idea of story building is new in this work. The concepts of a "Knowledge Graph" and "more than just keyword" search are also introduced.
In the experiments conducted, scalability, space, and time performance are compared with benchmark phrase based document models and industry standards. The F-Measure, entropy, and human evaluation are used to validate the topics and stories obtained. The results are promising and highly encouraging.
dc.description.copyright: © Ramanpreet Singh, 2014
dc.description.note: Electronic Only. (UNB thesis number) Thesis 9345. (OCoLC) 961215606.
dc.description.note: M.C.S., University of New Brunswick, Faculty of Computer Science, 2014.
dc.format: text/xml
dc.format.extent: xiv, 131 pages
dc.format.medium: electronic
dc.identifier.oclc: (OCoLC) 961215606
dc.identifier.other: Thesis 9345
dc.identifier.uri: https://unbscholar.lib.unb.ca/handle/1882/13739
dc.language.iso: en_CA
dc.publisher: University of New Brunswick
dc.rights: http://purl.org/coar/access_right/c_abf2
dc.subject.discipline: Computer Science
dc.subject.lcsh: Data mining.
dc.subject.lcsh: Text processing (Computer science)
dc.subject.lcsh: Document clustering.
dc.subject.lcsh: Machine theory.
dc.title: Phrase based pattern matching framework for topic discovery and clustering
dc.type: master thesis
thesis.degree.discipline: Computer Science
thesis.degree.fullname: Master of Computer Science
thesis.degree.grantor: University of New Brunswick
thesis.degree.level: masters
thesis.degree.name: M.C.S.

Files

Original bundle
Name: item.pdf
Size: 5.17 MB
Format: Adobe Portable Document Format