Piaseczny, Wojciech

Resource type

Thesis

Thesis type

(Thesis) M.A.Sc.

Date created

2011-04-20

Authors/Contributors

Author: Piaseczny, Wojciech

Abstract

Vertical search engines attempt to aggregate all available online data for a specific vertical into a normalized and structured data model. There are two common strategies for aggregating data: 1) data feeds, and 2) web crawling. Data feeds use source-specific translation rules to collect structured data, but require the source to explicitly expose the data. Web crawling collects data through the same interface that users see, which requires additional work to identify and extract the relevant content from unstructured or semi-structured text. Generalizing these tasks across many websites is difficult because each website presents content in its own arbitrary way. This thesis proposes a strategy for identifying relevant content across many websites with improved accuracy. Many well-known statistical document classification algorithms can distinguish between classes of documents with high accuracy. These algorithms fail when test data differs significantly from training data, as is often the case in the vertical search context. This thesis adaptively builds website-specific document classifiers to avoid common classification failure conditions. Training data is selected dynamically by exploiting common user interface patterns. The results obtained here demonstrate that using adaptive document classifiers improves accuracy with minimal performance costs.

Identifier

etd6675

Copyright statement

Copyright is held by the author.

Permissions

The author granted permission for the file to be printed and for the text to be copied and pasted.

Scholarly level

Graduate student (Masters)

Supervisor or Senior Supervisor

Thesis advisor: Kaminska, Bozena

Member of collection

Engineering Science Theses

Download file	Size
etd6675_WPiaseczny.pdf	1.11 MB

Adaptive document discovery for vertical search engines
