Gao, Fei

Resource type

Thesis

Thesis type

(Thesis) M.Sc.

Date created

2011-12-05

Authors/Contributors

Author: Gao, Fei

Abstract

Many kernel-based clustering algorithms do not scale up to high-dimensional large datasets. The similarity matrix, on which these algorithms rely, calls for O(N^2) complexity in both time and space. In this thesis, we present the design of an approximation algorithm to cluster high-dimensional large datasets. The proposed design enables great reduction of the similarity matrix’s computing time as well as its space requirements without significantly impacting the accuracy of the clustering. The proposed design is modular and self-contained. Therefore, several kernel-based clustering algorithms could also benefit from the proposed design to improve their performance. We implemented the proposed algorithm in the MapReduce distributed programming framework and experimented with synthetic datasets as well as a real dataset from Wikipedia that has more than three million documents. Our results demonstrate the high accuracy and the significant time and memory savings that can be achieved by our algorithm.
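
The quadratic cost mentioned in the abstract can be illustrated with a minimal sketch. It shows only the exact baseline that the thesis seeks to avoid, not the thesis's approximation algorithm. The Gaussian kernel, the function names, the 8-byte entry size, and the round figure of 3,000,000 points are all assumptions made for illustration.

```python
import math


def gaussian_similarity_matrix(points, sigma=1.0):
    """Build the full N x N similarity matrix with a Gaussian kernel.

    Assumption: the Gaussian kernel is used here only as a common example
    of a kernel; the abstract does not name a specific one.
    """
    n = len(points)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Every pair (i, j) is visited, so the work grows as O(N^2).
        for j in range(i, n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            value = math.exp(-sq_dist / (2.0 * sigma ** 2))
            sim[i][j] = value
            sim[j][i] = value  # the matrix is symmetric
    return sim


def dense_matrix_bytes(n_points, bytes_per_entry=8):
    """Storage for a dense N x N matrix (assumes 8-byte floats)."""
    return n_points * n_points * bytes_per_entry


# Tiny hypothetical dataset: three points in two dimensions.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
sim = gaussian_similarity_matrix(points)

# Hypothetical round figure in the range of the Wikipedia dataset size.
bytes_needed = dense_matrix_bytes(3_000_000)
print(f"{bytes_needed / 1e12:.0f} TB for a dense 3,000,000 x 3,000,000 matrix")
```

Under these assumptions a dense matrix for three million points would need about 72 TB, which suggests why an exact similarity matrix is impractical at that scale.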

Keywords

Identifier

etd6945

Copyright statement

Copyright is held by the author.

Permissions

The author granted permission for the file to be printed and for the text to be copied and pasted.

Scholarly level

Graduate student (Masters)

Supervisor or Senior Supervisor

Thesis advisor: Hefeeda, Mohamed

Member of collection

Computing Science Theses

Download file	Size
etd6945_FGao.pdf	360.3 KB

Distributed approximate spectral clustering for large-scale datasets
