Distributed approximate spectral clustering for large-scale datasets

Resource type
Thesis type
((Thesis)) M.Sc.
Date created
Author: Gao, Fei
Many kernel-based clustering algorithms do not scale up to high-dimensional large datasets. The similarity matrix, on which these algorithms rely, calls for O(N2) complexity in both time and space. In this thesis, we present the design of an approximation algorithm to cluster high-dimensional large datasets. The proposed design enables great reduction of the similarity matrix’s computing time as well as its space requirements without significantly impacting the accuracy of the clustering. The proposed design is modular and self-contained. Therefore, several kernel-based clustering algorithms could also benefit from the proposed design to improve their performance. We implemented the proposed algorithm in the MapReduce distributed programming framework and experimented with synthetic datasets as well as a real dataset from Wikipedia that has more than three million documents. Our results demonstrate the high accuracy and the significant time and memory savings that can be achieved by our algorithm.
Copyright statement
Copyright is held by the author.
The author granted permission for the file to be printed and for the text to be copied and pasted.
Scholarly level
Supervisor or Senior Supervisor
Thesis advisor: Hefeeda, Mohamed
Member of collection
Download file Size
etd6945_FGao.pdf 360.3 KB

Views & downloads - as of June 2023

Views: 0
Downloads: 0