Phrase based pattern matching framework for topic discovery and clustering

Singh, Ramanpreet

dc.contributor.advisor	Ghorbani, Ali
dc.contributor.author	Singh, Ramanpreet
dc.date.accessioned	2023-03-01T16:23:59Z
dc.date.available	2023-03-01T16:23:59Z
dc.date.issued	2013
dc.date.updated	2016-10-26T00:00:00Z
dc.description.abstract	In text mining, one of the major challenges is to discover understandable topics of discussion while also producing a statistically valid grouping of the underlying documents. Word order and word co-occurrence information are crucial to understanding the meaning of a document. Vector space and bag-of-words models are poor candidates for clustering algorithms focused on topic discovery. Phrase-based models have proven promising for extracting meaningful topics from a given set of documents. In this thesis, a new framework is proposed that performs topic discovery and clustering simultaneously in linear time. The core of this framework is a new document model and an algorithm that performs efficient exact, prefix, postfix, and infix matching of phrases in linear time. The document model uses concepts from graph theory and automata theory to efficiently and intelligently match, index, track, and analyze interesting patterns. The generic nature of the framework makes it applicable to various text mining tasks such as query enhancement, keyword extraction, and indexing, to name a few. The primary focus has been on discovering meaningful topics in a set of documents and building a story or context around them. The model is also capable of tracking already discovered topics. The proposed model is efficient enough to capture the essence of the present data and to link past and future data. To capture the natural language in the text, phrases, entities, and word sense enrichment techniques are used instead of matching only words or terms. With this, we were able to get the essence of the topic discussed in a document even if it did not have an exact string match. The idea of story building is new in this work. The concepts of a "Knowledge Graph" and "more than just keyword" search are also introduced.
In the experiments conducted, scalability, space, and time performance are compared with benchmark phrase-based document models and industry standards. F-measure, entropy, and human evaluation are used to validate the topics and stories obtained. The results are promising and highly encouraging.
dc.description.copyright	© Ramanpreet Singh, 2014
dc.description.note	Electronic Only. (UNB thesis number) Thesis 9345. (OCoLC) 961215606.
dc.description.note	M.C.S., University of New Brunswick, Faculty of Computer Science, 2014.
dc.format	text/xml
dc.format.extent	xiv, 131 pages
dc.format.medium	electronic
dc.identifier.oclc	(OCoLC) 961215606
dc.identifier.other	Thesis 9345
dc.identifier.uri	https://unbscholar.lib.unb.ca/handle/1882/13739
dc.language.iso	en_CA
dc.publisher	University of New Brunswick
dc.rights	http://purl.org/coar/access_right/c_abf2
dc.subject.discipline	Computer Science
dc.subject.lcsh	Data mining.
dc.subject.lcsh	Text processing (Computer science)
dc.subject.lcsh	Document clustering.
dc.subject.lcsh	Machine theory.
dc.title	Phrase based pattern matching framework for topic discovery and clustering
dc.type	master thesis
thesis.degree.discipline	Computer Science
thesis.degree.fullname	Master of Computer Science
thesis.degree.grantor	University of New Brunswick
thesis.degree.level	masters
thesis.degree.name	M.C.S.

Files

Original bundle

Name:: item.pdf
Size:: 5.17 MB
Format:: Adobe Portable Document Format

Collections

Open Theses & Dissertations
