
Adaptive document discovery for vertical search engines

Resource type
Thesis type
(Thesis) M.A.Sc.
Date created
2011-04-20
Authors/Contributors
Abstract
Vertical search engines attempt to aggregate all available online data for a specific vertical into a normalized and structured data model. There are two common strategies for aggregating data: 1) data feeds, and 2) web crawling. Data feeds use source-specific translation rules to collect structured data, but require the source to explicitly expose the data. Web crawling collects data through the same interface that users see, which requires additional work to identify and extract the relevant content from unstructured or semi-structured text. Generalizing these tasks across many websites is difficult because each website presents content in its own arbitrary way. This thesis proposes a strategy for identifying relevant content across many websites with improved accuracy. Many well-known statistical document classification algorithms can distinguish between classes of documents with high accuracy. These algorithms fail when test data differs significantly from training data, as is often the case in the vertical search context. This thesis adaptively builds website-specific document classifiers to avoid these common classification failure conditions. Training data is selected dynamically by exploiting common user interface patterns. The results demonstrate that adaptive document classifiers improve accuracy at minimal performance cost.
Document
Identifier
etd6675
Copyright statement
Copyright is held by the author.
Permissions
The author granted permission for the file to be printed and for the text to be copied and pasted.
Scholarly level
Supervisor or Senior Supervisor
Thesis advisor: Kaminska, Bozena
Member of collection
Download file
etd6675_WPiaseczny.pdf
Size
1.11 MB
