Semi-supervised Bayesian learning
University of New Brunswick
In traditional supervised learning, classifiers are learned from training examples, each of which carries a label indicating the category it belongs to. Bayesian network classifiers, which represent the joint probability distribution and the conditional independencies among a set of random variables, have been widely used in supervised classification. Naive Bayes, a special case of Bayesian networks, is also popular in applications such as text classification. However, in many real-world machine learning applications it is expensive or time-consuming to obtain sufficient labeled data, because labeling requires human effort. Unlabeled data, on the other hand, is much easier to collect in large amounts. Classifiers learned from only a small amount of labeled data may perform poorly. Semi-supervised learning addresses this problem by using the unlabeled data to help learn better classifiers. Although various semi-supervised methods have been studied, several problems in semi-supervised learning remain open. The first is the performance of different Bayesian network classifiers in the semi-supervised learning scenario: an extensive study is needed to obtain a general picture of how these methods perform. The second is their application to cost-sensitive learning, where misclassifying different classes incurs unequal costs. For example, misclassifying a person with cancer as healthy can be more serious than misclassifying a healthy person as having cancer. In particular, classification performance may degrade when the datasets have skewed class distributions. The scarcity of labeled data makes it even more difficult to build a good Bayesian classifier that distinguishes classes with different misclassification costs. Specific techniques are required to make optimal cost-sensitive classification decisions.
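To make the notion of a cost-sensitive decision concrete, the following minimal sketch applies the standard minimum-expected-cost rule: given class posteriors, predict the class whose expected misclassification cost is lowest. The function name, the posterior values, and the cost matrix are hypothetical numbers chosen for illustration; they are not taken from the thesis.

```python
def min_expected_cost_class(posteriors, cost):
    """Return the prediction with the lowest expected cost.

    posteriors: dict mapping class -> P(class | x)
    cost: nested dict, cost[true][pred] = cost of predicting `pred`
          when the true class is `true` (illustrative values only)
    """
    classes = list(posteriors)
    expected = {
        pred: sum(posteriors[true] * cost[true][pred] for true in classes)
        for pred in classes
    }
    return min(expected, key=expected.get), expected


# Hypothetical example: missing a cancer case is assumed to cost
# ten times as much as a false alarm.
posteriors = {"cancer": 0.2, "healthy": 0.8}
cost = {
    "cancer": {"cancer": 0.0, "healthy": 10.0},
    "healthy": {"cancer": 1.0, "healthy": 0.0},
}
decision, expected = min_expected_cost_class(posteriors, cost)
print(decision, expected)
```

With these assumed costs, the rule predicts "cancer" even though "healthy" is the more probable class, because the expected cost of a missed cancer case outweighs that of a false alarm.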
In supervised learning, many techniques have been applied to the cost-sensitive problem. However, little work has been done in semi-supervised learning, and many questions in this area remain open. The third problem is learning Bayesian network structures in the semi-supervised learning scenario. Learning a good Bayesian network structure requires sufficient training examples, so it is a major challenge when labeled data is scarce. This thesis first systematically studies the performance of several commonly used semi-supervised learning methods when different Bayesian network classifiers are used as the underlying classifiers. The experiments and performance analysis yield several novel and interesting observations. Motivated by these observations, an instance selection method is presented to improve classification accuracy when Naive Bayes is used in self-training and co-training. The presented method helps prevent errors from being added to the training data and thus guards against degradation of accuracy. Then, three cost-sensitive semi-supervised learning methods are presented to improve the performance of Naive Bayes when labeled data is scarce and different misclassification errors carry different costs. Misclassification costs are incorporated into the three methods to generate cost-sensitive classifiers. Experimental results show that the new methods obtain lower total misclassification costs than their counterparts. Finally, a new method is designed to learn Bayesian network classifiers with more complex graph structures for classification in semi-supervised learning. The experiments show that the proposed method achieves higher accuracy while obtaining better Bayesian network structures to represent the relationships among the variables.
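For readers unfamiliar with self-training, the sketch below shows the generic loop with a categorical Naive Bayes base classifier: unlabeled examples are pseudo-labeled and added to the training set only when the predicted class probability reaches a threshold. This is not the instance selection method proposed in the thesis, whose details are not given here; the confidence threshold is a common stand-in, and all names, parameters, and the toy data are assumptions.

```python
import math
from collections import Counter


def train_nb(X, y, alpha=1.0):
    """Fit a categorical Naive Bayes model with Laplace smoothing."""
    classes = sorted(set(y))
    n_features = len(X[0])
    # value_counts[c][j][v]: class-c examples whose feature j equals v
    value_counts = {c: [Counter() for _ in range(n_features)] for c in classes}
    values = [set() for _ in range(n_features)]
    for xi, yi in zip(X, y):
        for j, v in enumerate(xi):
            value_counts[yi][j][v] += 1
            values[j].add(v)
    return {"classes": classes, "class_counts": Counter(y),
            "value_counts": value_counts, "values": values,
            "alpha": alpha, "n": len(y)}


def predict_proba(model, x):
    """Return P(class | x) for every class, computed in log space."""
    a = model["alpha"]
    log_scores = {}
    for c in model["classes"]:
        nc = model["class_counts"][c]
        s = math.log((nc + a) / (model["n"] + a * len(model["classes"])))
        for j, v in enumerate(x):
            k = len(model["values"][j] | {v})  # count an unseen value too
            s += math.log((model["value_counts"][c][j][v] + a) / (nc + a * k))
        log_scores[c] = s
    m = max(log_scores.values())
    exp = {c: math.exp(s - m) for c, s in log_scores.items()}
    z = sum(exp.values())
    return {c: e / z for c, e in exp.items()}


def self_train(X_lab, y_lab, X_unlab, threshold=0.85, max_iter=10):
    """Generic self-training: add only confidently pseudo-labeled examples."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    model = train_nb(X_lab, y_lab)
    for _ in range(max_iter):
        keep, added = [], 0
        for x in pool:
            proba = predict_proba(model, x)
            label = max(proba, key=proba.get)
            if proba[label] >= threshold:
                X_lab.append(x)
                y_lab.append(label)
                added += 1
            else:
                keep.append(x)  # too uncertain; leave it unlabeled
        pool = keep
        if added == 0:
            break
        model = train_nb(X_lab, y_lab)  # retrain on the enlarged set
    return model, pool


# Toy data (hypothetical): two clear-cut patterns and one ambiguous example.
X_lab = [("a", "a"), ("a", "a"), ("b", "b"), ("b", "b")]
y_lab = ["pos", "pos", "neg", "neg"]
X_unlab = [("a", "a"), ("b", "b"), ("a", "b")]
model, remaining = self_train(X_lab, y_lab, X_unlab)
print(model["n"], remaining)
```

In this toy run the two unambiguous unlabeled examples are absorbed into the training set, while the ambiguous one stays in the pool. Leaving it out is the point of thresholding: a wrong pseudo-label added to the training data would be reinforced in later iterations.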