Resource type
Thesis type
(Thesis) M.Sc.
Date created
2024-04-12
Authors/Contributors
Author: Khani, Aliasghar
Abstract
Deep learning has revolutionized computer vision, specifically image classification and segmentation, through the use of over-parameterized models. However, the use of over-parameterized models presents its own challenges. A fundamental issue when using large models for image classification is the risk of over-fitting to spurious features rather than learning meaningful data representations. Segmentation, on the other hand, faces the challenge of requiring dense annotations, which are both costly and difficult to obtain. In this thesis, we propose two solutions, each addressing one of these challenges. First, we introduce a masking strategy named MaskTune, designed to mitigate the over-reliance of classification models on spurious features. This strategy forces the model to explore new features during fine-tuning through strategic masking. Second, we present SLiMe, a segmentation approach that frames the problem as a one-shot optimization task. By leveraging a novel concept called the weighted accumulated self-attention map, together with the cross-attention map from the UNet of text-conditioned Stable Diffusion (SD), we optimize text embeddings to highlight areas corresponding to segmentation mask foregrounds, enabling segmentation of unseen images. Furthermore, utilizing additional annotated data, especially in a few-shot scenario, enhances SLiMe's performance. Our evaluations on various datasets, including CelebA, Waterbirds, ImageNet-9, CIFAR-10, SVHN, PASCAL-Part, and CelebAMask-HQ, demonstrate the superiority of both MaskTune and SLiMe compared to state-of-the-art methods.
Document
Extent
67 pages.
Identifier
etd22997
Copyright statement
Copyright is held by the author(s).
Supervisor or Senior Supervisor
Thesis advisor: Hamarneh, Ghassan
Language
English
Member of collection
| Download file | Size |
|---|---|
| etd22997.pdf | 35.23 MB |