One essential bioinformatics application affected by the high-throughput sequencing data deluge is the sequence alignment problem, in which nucleotide or amino acid sequences are queried against targets to find regions of close similarity. When there are many queries and/or the targets are large, the alignment process becomes computationally challenging. This is usually addressed by preprocessing techniques, in which the queries and/or targets are indexed for fast access while searching for matches. When the target is static, such as an established reference genome, the cost of indexing is amortized by reusing the generated index. However, when the targets are non-static, such as contigs in the intermediate steps of a de novo assembly process, a new index must be computed for each run. To address such scalability problems, we present DIDA, a novel framework that distributes the indexing and alignment tasks into smaller subtasks over a cluster of compute nodes. It provides a workflow beyond the common practice of embarrassingly parallel implementations. DIDA is a cost-effective, scalable and modular framework for the sequence alignment problem in terms of memory usage and runtime. It can be employed in large-scale alignments against draft genomes and intermediate stages of de novo assembly runs. The DIDA source code, sample files and user manual are available through http://www.bcgsc.ca/platform/bioinfo/software/dida. The software is released under the British Columbia Cancer Agency (BCCA) license and is free for academic use.
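The distribute-then-index idea can be illustrated with a minimal sketch. This is not DIDA's actual implementation; the function names, the round-robin partitioning, and the exact k-mer seeding are all simplifying assumptions chosen to show why sharding targets bounds per-node index memory.

```python
from collections import defaultdict

def build_index(target, k=4):
    """Hash every k-mer of a target sequence to its start positions."""
    index = defaultdict(list)
    for i in range(len(target) - k + 1):
        index[target[i:i + k]].append(i)
    return index

def partition_targets(targets, n_nodes):
    """Round-robin the targets across compute nodes; each node indexes
    only its own shard, so per-node memory scales with the shard size."""
    shards = [[] for _ in range(n_nodes)]
    for j, t in enumerate(targets):
        shards[j % n_nodes].append(t)
    return shards

def align(query, shard, k=4):
    """Exact k-mer seeding against one shard; a real aligner would
    extend each seed into a gapped alignment and score it."""
    hits = []
    for tid, target in enumerate(shard):
        index = build_index(target, k)
        for i in range(len(query) - k + 1):
            for pos in index.get(query[i:i + k], []):
                hits.append((tid, pos - i))  # (target id, implied query offset)
    return hits

# Each node aligns the full query set against its own shard;
# per-shard hit lists are merged afterwards.
shards = partition_targets(["ACGTACGT", "TTTTACGT", "GGGGCCCC"], 2)
all_hits = [align("TACG", s) for s in shards]
```

The key design point is that only the partitioning and the final merge are coordinated; indexing and alignment run independently per node, which is what makes the workflow more than embarrassingly parallel query splitting.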
Improvements in high-throughput sequencing (HTS) technologies have made clinical sequencing projects such as ClinSeq and Genomics England feasible. Although there have been significant improvements in the accuracy and reproducibility of HTS-based analyses, the usability of these data for diagnostic and prognostic applications necessitates near-perfect data generation. To assess the robustness of a widely used HTS platform for accurate and reproducible clinical applications, we generated whole genome shotgun (WGS) sequence data from the genomes of two human individuals in two different genome sequencing centers. After analyzing the data to characterize SNPs and indels using the same tools (BWA, SAMtools, and GATK), we observed a significant number of discrepancies between the call sets. As expected, most of the disagreements were found within genomic regions containing common repeats and segmental duplications; only a small fraction of the discordant variants fell within exons and other functionally relevant regions such as promoters. We conclude that although HTS platforms are sufficiently powerful to provide data for first-pass clinical tests, variant predictions still need to be confirmed using orthogonal methods before use in clinical applications.
Colour constancy needs to be reconsidered in light of the limits imposed by metamer mismatching. Metamer mismatching refers to the fact that two objects reflecting metameric light under one illumination may reflect non-metameric light under a second; so two objects appearing as having the same colour under one illuminant can appear as having different colours under a second. Yet since Helmholtz, object colour has generally been believed to remain relatively constant. The deviations from colour constancy registered in experiments are usually thought to be small enough that they do not contradict the notion of colour constancy. However, it is important to determine how the deviations from colour constancy relate to the limits metamer mismatching imposes on constancy. Hence, we calculated metamer mismatching’s effect for the 20 Munsell papers and 8 pairs of illuminants employed in the colour constancy study by Logvinenko and Tokunaga and found it to be so extensive that the two notions—metamer mismatching and colour constancy—must be mutually exclusive. In particular, the notion of colour constancy leads to some paradoxical phenomena, such as the possibility of 20 objects having the same colour under chromatic light dispersing into a hue circle of colours under neutral light. Thus, colour constancy refers to a phenomenon which, because of metamer mismatching, simply cannot exist. Moreover, it obscures the really important visual phenomenon; namely, the alteration of object colours induced by illumination change. We show that colour is not an independent, intrinsic attribute of an object, but rather an attribute of an object/light pair, and then define a concept of material colour in terms of equivalence classes of such object/light pairs. We suggest that studying the shift in material colour under a change in illuminant will be more fruitful than pursuing colour constancy’s false premise that colour is an intrinsic attribute of an object.
Increasing genetic and phenotypic differences found among natural isolates of C. elegans have encouraged researchers to explore the natural variation of this nematode species.
Here we report on the identification of genomic differences between the reference strain N2 and the Hawaiian strain CB4856, one of the most genetically distant strains from N2. To identify both small- and large-scale genomic variations (GVs), we have sequenced the CB4856 genome using both Roche 454 (~400 bp single-end reads) and Illumina GA (101 bp paired-end reads) DNA sequencing methods. Compared to previously described variants (available in WormBase), our effort uncovered twice as many single nucleotide variants (SNVs) and increased the number of small InDels almost 20-fold. Moreover, we identified and validated large insertions in the CB4856 strain, most of which range from 150 bp to 1.2 kb in length. Identified GVs had a widespread impact on protein-coding sequences, including 585 single-copy genes associated with severe reduced-viability phenotypes in RNAi and genetic studies. Sixty of these genes are homologs of human genes associated with diseases. Furthermore, our work confirms previously identified GVs associated with differences in behavioural and biological traits between the N2 and CB4856 strains.
The identified GVs provide a rich resource for future studies that aim to explain the genetic basis for other trait differences between the N2 and CB4856 strains.
Frequent subgraph mining is a useful method for extracting meaningful patterns from a set of graphs or a single large graph. Here, the graph represents all possible RNA structures and interactions. Patterns that are significantly more frequent in this graph than in a random graph are extracted. We hypothesize that these patterns are most likely to represent biological mechanisms. The graph representation used is a directed dual graph, extended to handle intermolecular interactions. The graph is sampled for subgraphs, which are labeled using a canonical labeling method and counted. The resulting patterns are compared to those created from a randomized dataset and scored. The algorithm was applied to the mitochondrial genome of the kinetoplastid species Trypanosoma brucei, which has a unique RNA editing mechanism. The most significant patterns contain two stem-loops, indicative of gRNA, and represent interactions of these structures with target mRNA.
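The label-and-count step can be sketched as follows. This is a toy illustration, not the paper's algorithm: canonical labeling here is brute-force minimization over all node orderings (feasible only for very small subgraphs), and enumeration replaces the sampling step.

```python
from itertools import combinations, permutations
from collections import Counter

def canonical_label(nodes, edges):
    """Canonical form of a small directed subgraph: the lexicographically
    smallest sorted edge list over all relabelings of the nodes."""
    nodes = list(nodes)
    best = None
    for perm in permutations(range(len(nodes))):
        relabel = {nodes[i]: perm[i] for i in range(len(nodes))}
        mat = tuple(sorted((relabel[u], relabel[v]) for u, v in edges))
        if best is None or mat < best:
            best = mat
    return best

def count_patterns(edges, k=3):
    """Count k-node induced subgraph patterns (with at least one edge);
    isomorphic subgraphs collapse onto the same canonical label."""
    nodes = {u for e in edges for u in e}
    counts = Counter()
    for combo in combinations(sorted(nodes), k):
        sub = [(u, v) for (u, v) in edges if u in combo and v in combo]
        if sub:
            counts[canonical_label(combo, sub)] += 1
    return counts
```

In the mining setting, the same counts would be computed on a degree-preserving randomization of the graph, and patterns scored by how much their frequency exceeds the randomized baseline.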
Controlling bias is key to successful randomized controlled trials for behaviour change. Bias can be generated at multiple points during a study, for example, when participants are allocated to different groups. Several allocation methods exist to randomly distribute participants across groups such that their prognostic factors (e.g., socio-demographic variables) are similar, in an effort to keep participants’ outcomes comparable at baseline. Since it is challenging to create such groups when all prognostic factors are taken together, these factors are often balanced in isolation, or only the ones deemed most relevant are balanced. However, the complex interactions among prognostic factors may lead to a poor estimate of behaviour, causing unbalanced groups at baseline, which may introduce accidental bias.
We present a novel computational approach for allocating participants to different groups. Our approach automatically uses participants’ experiences to model (the interactions among) their prognostic factors and infer how their behaviour is expected to change under a given intervention. Participants are then allocated based on their inferred behaviour rather than on selected prognostic factors.
In order to assess the potential of our approach, we collected two datasets regarding the behaviour of participants (n = 430 and n = 187). The potential of the approach on larger sample sizes was examined using synthetic data. All three datasets highlighted that our approach could lead to groups with similar expected behavioural changes.
The computational approach proposed here can complement existing statistical approaches when behaviours involve numerous complex relationships, and quantitative data is not readily available to model these relationships. The software implementing our approach and commonly used alternatives is provided at no charge to assist practitioners in the design of their own studies and to compare participants' allocations.
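The allocation step can be sketched in a deliberately simple form. This is a stand-in, not the released software: the predicted behaviour changes would come from the participants' model described above (not reproduced here), and the serpentine dealing scheme is one of several ways to keep group-level predictions balanced.

```python
def allocate(predicted_change, n_groups=2):
    """Sort participants by model-predicted behaviour change and deal
    them out in serpentine order (1,2,2,1,1,2,...), so that group means
    of the predicted outcome stay close."""
    order = sorted(range(len(predicted_change)),
                   key=lambda i: predicted_change[i], reverse=True)
    groups = [[] for _ in range(n_groups)]
    for rank, i in enumerate(order):
        cycle, pos = divmod(rank, n_groups)
        g = pos if cycle % 2 == 0 else n_groups - 1 - pos
        groups[g].append(i)  # store participant indices
    return groups

# Six participants with hypothetical predicted behaviour changes.
preds = [0.9, 0.1, 0.8, 0.2, 0.5, 0.6]
groups = allocate(preds, n_groups=2)
```

The point of allocating on the single inferred outcome, rather than on each prognostic factor separately, is that interactions among factors are already folded into the prediction.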
Turkey has been a crossroads of major population movements throughout history and a hotspot of cultural interactions. Several studies have investigated the complex population history of Turkey through a limited set of genetic markers. However, to date, no study has assessed genetic variation at the whole-genome level using whole genome sequencing. Here, we present whole genome sequences of 16 Turkish individuals resequenced at high coverage (32×–48×).
We show that the genetic variation of the contemporary Turkish population clusters with South European populations, as expected, but also shows signatures of relatively recent contribution from ancestral East Asian populations. In addition, we document a significant enrichment of non-synonymous private alleles, consistent with recent observations in European populations. A number of variants associated with skin color and total cholesterol levels show frequency differentiation between the Turkish population and European populations. Furthermore, we analyzed the 17q21.31 inversion polymorphism region (MAPT locus) and found an increased allele frequency of 31.25% for the H1/H2 inversion polymorphism, compared to about 25% in European populations.
This study provides the first map of common genetic variation from 16 western Asian individuals and thus helps fill an important geographical gap in analyzing natural human variation and human migration. Our data will help develop population-specific experimental designs for studies investigating disease associations and demographic history in Turkey.
When designing and implementing an intelligent energy conservation system for the home, it is essential to have insight into the activities and actions of the occupants. In particular, it is important to understand what appliances are being used and when. In the computational sustainability research community this is known as load disaggregation or Non-Intrusive Load Monitoring (NILM). NILM is a foundational algorithm that can disaggregate a home’s power usage into the individual appliances that are running, identifying energy conservation opportunities. This depth report will focus on NILM algorithms, their use and evaluation. We will examine and evaluate the anatomy of NILM, looking at techniques for load monitoring, event detection, feature extraction, classification, and accuracy measurement.
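The event-detection stage can be illustrated with a minimal sketch. The threshold, wattages, and sampling are invented for the example; real NILM systems use richer features (transient shapes, reactive power, harmonics) and more robust change detection.

```python
def detect_events(power, threshold=50.0):
    """Flag step changes in an aggregate power signal (watts): any jump
    whose magnitude is at least `threshold` is treated as an appliance
    switching on (positive delta) or off (negative delta)."""
    events = []
    for t in range(1, len(power)):
        delta = power[t] - power[t - 1]
        if abs(delta) >= threshold:
            events.append((t, delta))
    return events

# A hypothetical kettle (+1800 W) switching on at t=3 and off at t=6
# over a 100 W household baseline, sampled once per time step.
signal = [100, 100, 100, 1900, 1900, 1900, 100, 100]
events = detect_events(signal)
```

Downstream, feature extraction would describe each event (step size, duration between matching on/off edges) and classification would map those features to an appliance label.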
Expanding access to highly active antiretroviral therapy (HAART) has become an important approach to HIV prevention in recent years. Previous studies suggest that concomitant changes in risk behaviours may either help or hinder programs that use a Treatment as Prevention strategy.
We consider HIV-related risk behaviour as a social contagion in a deterministic compartmental model, which treats risk behaviour and HIV infection as linked processes, where acquiring risk behaviour is a prerequisite for contracting HIV. The equilibrium behaviour of the model is analysed to determine epidemic outcomes under conditions of expanding HAART coverage along with risk behaviours that change with HAART coverage. We determined the potential impact of changes in risk behaviour on the outcomes of Treatment as Prevention strategies. Model results show that HIV incidence and prevalence decline only above threshold levels of HAART coverage, which depend strongly on risk behaviour parameter values. Expanding HAART coverage while simultaneously reducing risk behaviour acts synergistically to accelerate the decline in HIV incidence and prevalence. Above the thresholds, additional HAART coverage is always sufficient to reverse the impact of HAART optimism on incidence and prevalence. Applying the model to an HIV epidemic in Vancouver, Canada, showed no evidence of HAART optimism in that setting.
Our results suggest that Treatment as Prevention has significant potential for controlling the HIV epidemic once HAART coverage reaches a threshold. Furthermore, expanding HAART coverage combined with interventions targeting risk behaviours amplifies the preventive impact, potentially driving the HIV epidemic to elimination.
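The linked-contagion structure can be sketched as a toy compartmental model. This is not the authors' model, and every parameter value here is an illustrative assumption: risk behaviour spreads socially (S to R), only risk-takers acquire HIV (R to I), and a coverage-scaled rate moves infections onto HAART (I to T) with transmissibility reduced by a factor `eps`.

```python
def simulate(coverage, beta_risk=0.3, beta_hiv=0.5, eps=0.96,
             tau=0.1, steps=1500, dt=0.01):
    """Euler integration of a toy model with four compartments:
    S (no risk behaviour), R (risk behaviour, susceptible),
    I (infected, untreated), T (infected, on HAART).
    Returns HIV prevalence I + T at the end of the run."""
    S, R, I, T = 0.90, 0.09, 0.01, 0.0
    for _ in range(steps):
        risk = R + I + T                        # everyone exhibiting risk behaviour
        force = beta_hiv * (I + (1 - eps) * T)  # HAART cuts transmissibility by eps
        dS = -beta_risk * S * risk              # risk behaviour spreads as a contagion
        dR = beta_risk * S * risk - force * R
        dI = force * R - tau * coverage * I     # treatment uptake scales with coverage
        dT = tau * coverage * I
        S, R, I, T = S + dt * dS, R + dt * dR, I + dt * dI, T + dt * dT
    return I + T
```

Even in this caricature, raising coverage lowers the force of infection and hence prevalence at a fixed time horizon, which is the qualitative mechanism behind Treatment as Prevention; the thresholds and optimism effects in the abstract require the full model.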
Cognitive science has long shown interest in expertise, in part because prediction and control of expert development would have immense practical value. Most studies in this area investigate expertise by comparing experts with novices. The reliance on contrastive samples in studies of human expertise only yields deep insight into development where differences are important throughout skill acquisition. This reliance may be pernicious where the predictive importance of variables is not constant across levels of expertise. Before the development of sophisticated machine learning tools for data mining larger samples, and indeed, before such samples were available, it was difficult to test the implicit assumption of static variable importance in expertise development. To investigate whether this reliance may have imposed critical restrictions on the understanding of complex skill development, we adopted an alternative method: the online acquisition of telemetry data from a common daily activity for many, video gaming. Using measures of cognitive-motor, attentional, and perceptual processing extracted from game data from 3360 Real-Time Strategy players at 7 different levels of expertise, we identified 12 variables relevant to expertise. We show that the static variable importance assumption is false: the predictive importance of these variables shifted as the level of expertise increased, and, at least in our dataset, a contrastive approach would have been misleading. The finding that variable importance is not static across levels of expertise suggests that large, diverse datasets of sustained cognitive-motor performance are crucial for an understanding of expertise in real-world contexts. We also identify plausible cognitive markers of expertise.