Sampling the Top-K representative data for classification

Resource type
Thesis type
(Thesis) M.Sc.
Date created
2005
Authors/Contributors
Author: Wang, Ping
Abstract
Building classification models based on databases is an exciting area in data mining research. In many classification tasks, only a small set of labelled training data is given, and these data are not sufficient for good classification. We need to sample and label more data as training data for better performance. However, labelling data is time-consuming and costly. The challenge is to effectively select the most representative data for labelling. While most active learning methods for this problem follow the incremental query learning paradigm, in which the classifier is retrained upon each newly labelled query, we present a distance-based method that samples the top-k representative data simultaneously and can be applied to any distance-based classifier. Redundancy reduction makes classifier retraining unnecessary and lets the method find examples that are more balanced with regard to the class distribution in the database. Experimental results from two data sets and two classifiers demonstrate the advantages of our method.
Copyright statement
Copyright is held by the author.
Permissions
The author has not granted permission for the file to be printed nor for the text to be copied and pasted. If you would like a printable copy of this thesis, please contact summit-permissions@sfu.ca.
Language
English
Download file Size
etd1523.pdf 647.54 KB
