Resource type
Thesis type
(Thesis) M.Sc.
Date created
2013-08-06
Authors/Contributors
Author: Du, Heng
Abstract
Text-based web content categorization is an important area of web data mining. Categorizing a web site by its text content can be time- and bandwidth-consuming. At the same time, 30%-40% of the web sites on newly registered domains each day are hosted for advertising purposes; these are known as domain parking web sites. It is more resource-efficient to exclude domain parking web sites before a web content categorization algorithm is applied. However, our study shows that existing web content categorization methods do not work well for recognizing domain parking web sites. In this thesis, we propose a new domain parking recognizer (DPR) to find domain parking web sites. Our DPR evolves from text-based web content categorization algorithms and has two key components: key features of domain parking web sites and a tailor-made algorithm. The experimental results show that our DPR performs much better than the well-known web site categorization methods Naive Bayes and Support Vector Machine at recognizing domain parking web sites. Our DPR is also time-efficient.
Document
Identifier
etd7929
Copyright statement
Copyright is held by the author.
Scholarly level
Supervisor or Senior Supervisor
Thesis advisor: Gu, Qianping
Member of collection
Download file | Size |
---|---|
etd7929_HDu.pdf | 1.95 MB |