Resource type
Thesis type
(Thesis) M.Sc.
Date created
2005
Authors/Contributors
Author: Wang, Ping
Abstract
Building classification models from databases is an active area of data mining research. In many classification tasks, only a small set of labelled training data is given, and these data are not sufficient for good classification. More data must be sampled and labelled as training data to improve performance. However, labelling data is time-consuming and costly. The challenge is to effectively select the most representative data for labelling. While most active learning methods for this problem follow the incremental query learning paradigm, in which the classifier is retrained after each newly labelled query, we present a distance-based method that samples the top-k representative data simultaneously and can be applied to any distance-based classifier. Redundancy reduction makes classifier retraining unnecessary and helps the method select examples that are more balanced with respect to the class distribution in the database. Experimental results on two data sets with two classifiers demonstrate the advantages of our method.
Copyright statement
Copyright is held by the author.
Scholarly level
Language
English
Member of collection
Download file | Size |
---|---|
etd1523.pdf | 647.54 KB |