Investigating data representations on efficacy of bioinformatics algorithms on biological data

Resource type
Thesis type
(Thesis) Ph.D.
Date created
During my PhD studies I worked on two projects. The first concerns speeding up a recursive algorithm, known as FK-B, that certifies whether two given monotone Boolean functions, one in Conjunctive Normal Form (CNF) and the other in Disjunctive Normal Form (DNF), are dual. If they are not dual, it returns a conflicting assignment (CA), i.e. an assignment that makes one of the given Boolean functions True and the other False. The FK-B algorithm is the core of the dualization procedure, which generates the dual of a given monotone Boolean function. We propose six improvements/techniques applicable to the FK-B algorithm as well as to the dualization process. Although these improvements do not reduce the time complexity, they considerably reduce the running time in practice, which is important because of the wide range of applications of the FK-B algorithm and the dualization procedure. To evaluate how effective the proposed improvements are, we apply them in metabolic network analysis, where we find the minimal cut sets given the elementary flux modes. The results show a considerable speed-up in comparison with the original dualization procedure.
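To make the notion of a conflicting assignment concrete, the following is a minimal brute-force sketch in Python. It is not the FK-B algorithm and not the thesis's implementation: it enumerates every assignment, which is exponential in the number of variables. The function names and the set-based encoding of clauses and terms are assumptions made for illustration only.

```python
from itertools import product

def eval_cnf(clauses, true_vars):
    # Monotone CNF: every clause needs at least one variable set to True.
    return all(clause & true_vars for clause in clauses)

def eval_dnf(terms, true_vars):
    # Monotone DNF: at least one term has all of its variables set to True.
    return any(term <= true_vars for term in terms)

def find_conflicting_assignment(clauses, terms):
    """Return the set of True variables of an assignment on which the CNF
    and the DNF disagree, or None if they agree everywhere.
    Hypothetical helper: exhaustive search, not FK-B."""
    variables = sorted(set().union(*clauses, *terms))
    for bits in product([False, True], repeat=len(variables)):
        true_vars = {v for v, bit in zip(variables, bits) if bit}
        if eval_cnf(clauses, true_vars) != eval_dnf(terms, true_vars):
            return true_vars
    return None

# Toy input (illustrative only): (a OR b) AND (a OR c) versus a OR (b AND c).
cnf = [frozenset("ab"), frozenset("ac")]
dnf_matching = [frozenset("a"), frozenset("bc")]
dnf_mismatch = [frozenset("a"), frozenset("b")]

print(find_conflicting_assignment(cnf, dnf_matching))  # no conflict exists
print(find_conflicting_assignment(cnf, dnf_mismatch))  # b=True, a=c=False
```

In the second call, setting only `b` to True makes the DNF True while the clause `(a OR c)` makes the CNF False, which is exactly the kind of certificate described above.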
In the second project, we investigate different data representations for predicting drug resistance in tuberculosis (TB). TB is an airborne disease that mostly affects the lungs. It is treated with antibiotics; however, some TB strains have become resistant to the drugs. Drug resistance in TB is usually diagnosed using time-consuming and expensive laboratory experiments that are not always available. Mutations are now known to be mostly responsible for the emergence of drug resistance. Considering this, a machine learning model that is faster and cheaper than laboratory techniques can be designed to predict drug resistance based on the detected mutations. In our study, we use deep neural networks to predict drug resistance in TB. To this end, we first detect the Single Nucleotide Polymorphisms (SNPs) in TB isolates. Then, we reconstruct gene and protein sequences and two other related data types. We design and experiment with several neural networks with different inputs and settings to gain insight into the efficacy of each data type. The results show that protein sequence data and SNP data are the most informative data sources for predicting drug resistance. However, it is notable that processing sequence data requires substantial computational resources.
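As a rough illustration of two of the data representations mentioned above, the sketch below applies a set of SNPs to a reference sequence to reconstruct an isolate's gene sequence, and also encodes the same SNPs as a binary feature vector. This is a toy example under stated assumptions, not the thesis's pipeline: the function names, the 0-based positions, the `(ref_base, alt_base)` encoding, and the sequences are all hypothetical.

```python
def apply_snps(reference, snps):
    """Reconstruct a sequence by substituting SNPs into a reference.
    snps maps a 0-based position to a (ref_base, alt_base) pair
    (an assumed encoding, chosen for illustration)."""
    seq = list(reference)
    for pos, (ref_base, alt_base) in snps.items():
        if seq[pos] != ref_base:
            raise ValueError(f"reference mismatch at position {pos}")
        seq[pos] = alt_base
    return "".join(seq)

def snp_feature_vector(snps, known_positions):
    # Binary presence/absence encoding over a fixed list of positions.
    return [1 if pos in snps else 0 for pos in known_positions]

# Toy data (illustrative only).
reference_gene = "ATGGCC"
isolate_snps = {2: ("G", "A")}

print(apply_snps(reference_gene, isolate_snps))    # ATAGCC
print(snp_feature_vector(isolate_snps, [1, 2, 5]))  # [0, 1, 0]
```

The binary vector is far more compact than the full sequence, which is consistent with the remark above that sequence inputs demand more computation.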
93 pages.
Copyright statement
Copyright is held by the author(s).
This thesis may be printed or downloaded for non-commercial research and scholarly purposes.
Supervisor or Senior Supervisor
Thesis advisor: Chindelevitch, Leonid
Thesis advisor: Libbrecht, Maxwell
Member of collection
Attachment Size
etd21671.pdf 1.83 MB