Short Abstract: Next generation sequencing has boosted the amount of available genomic data, meaning cancer researchers have access to petabytes of multidimensional data from tens of thousands of patients. However, as the amount of data continues to increase, the emphasis shifts to analysis. An example is the data generated by The Cancer Genome Atlas (TCGA) network. Downloading the complete TCGA repository would require several weeks with a highly optimized network connection, while the cost of its annual storage is measured in millions of US dollars. Once downloaded, integrated analysis of these data remains out of reach for any researcher without access to high-performance compute clusters. The Cancer Genomics Cloud Pilots project was born out of the recognition that as life science research becomes ever more computationally intensive, new approaches are required to support effective data query, storage, computation, and, crucially, scientific collaboration. Funded by the National Cancer Institute (Contract No. HHSN261201400008C), the Cancer Genomics Cloud Pilot project enables researchers to leverage the power of cloud computing to gain biologically relevant and actionable insights from massive public datasets including TCGA. We demonstrate our approach to optimized computation, data mining, and visualization solutions that enable cancer researchers to address the challenges associated with analysis of petabyte-scale datasets and beyond. Researchers can access the system and receive free computation and storage credits at www.cancergenomicscloud.org.
Short Abstract: Human breast cancer has been classified into several distinct molecular intrinsic subtypes based on gene expression profiles for molecular-based treatment. Although there are many well-known gene-expression-based classifiers such as PAM50, subtyping with these methods is affected by the composition of the patient cohort and the data preprocessing method. This makes it less reproducible, mainly because gene expression is measured relative to the cohort and normalization methods differ. While recent studies have tried to build an absolute classifier that uses internal gene expression comparisons for single-sample classification, there has been no attempt to build an RNA-seq based absolute classifier to meet the recent demand for NGS-based diagnosis. Here, we present a new classification method based on pairs of absolute gene expression values within a single sample. In a performance test using four machine learning techniques, our RNA-seq specific classifier achieved up to 87% accuracy, compared with 74% for conventional methods. We also extracted a novel intrinsic gene set from RNA-seq data that may shed light on the biology of breast cancer subtypes and on the discrepancy between microarray and RNA-seq based gene expression characteristics.
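The idea of within-sample gene pair comparisons can be illustrated with a minimal sketch. It is not the authors' classifier: the function name, the gene list, and the expression values are assumptions chosen for illustration. It only shows why pair-based features do not depend on cohort-level normalization.

```python
from itertools import combinations


def pairwise_features(expression, genes):
    """Return one binary feature per gene pair: 1 if the first gene is
    expressed higher than the second within this single sample, else 0."""
    return {
        (a, b): int(expression[a] > expression[b])
        for a, b in combinations(genes, 2)
    }


# Hypothetical single-sample expression values (illustrative only).
genes = ["ESR1", "ERBB2", "KRT5"]
sample = {"ESR1": 120.0, "ERBB2": 15.5, "KRT5": 3.2}
features = pairwise_features(sample, genes)

# Rescaling the whole sample leaves every pairwise comparison unchanged.
scaled = {gene: value * 10 for gene, value in sample.items()}
scaled_features = pairwise_features(scaled, genes)
```

Because each feature depends only on the order of two values in the same sample, any monotone rescaling of that sample yields the same feature vector, which could then be passed to a standard machine learning classifier.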
Short Abstract: In cancer genomics, next generation sequencing data are usually used to detect somatic driver mutations in order to identify the cause of tumorigenesis. However, non-driver somatic mutations, or passenger mutations, can also play an important role in cancer cell survival and treatment by generating aberrant short peptide sequences known as ‘neopeptides’. Neopeptides bind to major histocompatibility complex (MHC) class I molecules and are presented on the cell surface through a series of intracellular processes, where they finally act as antigens. The MHC-peptide complex is immunogenic and is expected to be recognized by cytotoxic T cells. The recent promise of immunotherapy in cancer treatment warrants systematic evaluation of somatic mutations with regard to immunogenicity for personalized adoptive T cell therapy or cancer vaccination. Indeed, high-throughput epitope discovery has enabled the identification of novel neopeptides for personalized therapy, and immunogenic neopeptides have been correlated with better patient survival. Here, we developed Neopepsee, a method that applies machine learning to predict personal neoantigens from next generation sequencing data. Neopepsee not only automates conventional neoantigen prediction processes, but also increases accuracy with novel features including isoform-specific amino acid conversion, sequence similarity search on pathogenic epitopes and locally weighted Naïve Bayes classification. Tests with a validated neoantigen dataset and independent experimental data confirmed the improved performance. By providing a convenient platform with better accuracy, Neopepsee will be useful in cancer immunotherapy research, for example in developing predictive biomarkers and in designing personalized cancer vaccines.
Short Abstract: Identification of compound-target interactions helps not only to develop novel drugs and reposition existing ones but also to understand biological interactions. However, the time and cost of identifying compound-target interactions cannot be disregarded. In silico screening methods can provide important information in a reasonable time. Among these, similarity-based methods, which use similarity scores of each compound and target as features and construct the prediction model with machine learning techniques such as the support vector machine (SVM), show promising results. However, these methods often show lower prediction performance in external validations.
In this study, we constructed a deep-learning prediction model of compound-target interactions using a convolutional neural network (CNN) approach. CNNs have achieved leading performance in prediction tasks such as speech and image recognition. Although drug-target prediction methods based on general machine learning perform quite well, the CNN approach offers an advantage in predicting compound-target protein interactions. In this work, we used compound structural properties and protein properties as features, and constructed a classification model. Our model achieved AUC and AUPR values greater than 0.9 in 10-fold cross-validation. In external validation with unseen data sets, it achieved an accuracy of more than 0.8.
Short Abstract: Duchenne muscular dystrophy (DMD) is a common and devastating genetic disease characterized by muscle wasting. Exon skipping uses small DNA-like molecules, antisense oligos (AOs), that act like stitches to modulate gene products and rescue the mutations. The efficacy of exon skipping at different target positions can vary by more than 20-fold, so the selection of the target site could make the difference between success and failure of clinical trials. However, no effective method has been developed to choose the optimal target site. We propose to develop an in silico (computational) method, which is considered a fast, inexpensive, and effective way to guide the screening. We have recently developed such a framework and identified a "DNA-stitch" that is more than 10 times more effective than current clinical trial molecules. We wish to improve it further and identify new drug candidates that can treat a majority of DMD patients with various mutations.
We plan to pursue the following objectives: 1) to identify influential features in exon skipping, and use bioinformatics techniques to develop an efficient algorithm to predict the efficacy of exon skipping of AOs; 2) to improve the efficacy of both single- and multi-exon skipping, and extend our framework to predict the efficacy of multiple AOs, using a new algorithm that addresses the interaction of random sets of oligos and RNAs; 3) to verify the correlation of predicted and actual efficacy of exon skipping in vitro and in vivo; 4) to launch the web software and incorporate community feedback to improve its quality.
Short Abstract: Determining cancer signatures, that is, cancer genes and the pathways they deregulate, is a challenging task in human cancer research. Over the last few years, different algorithms have been developed to predict signatures related to human cancer by computing their association with disease outcome. In 2013, David Venet et al. reported that gene signatures unrelated to cancer are significantly associated with breast cancer outcome. They compared 47 published breast cancer signatures to random signatures of identical size and found that 60% of these signatures are not significantly better outcome predictors than random signatures.
In this research, we show that significant random signatures carry information. Based on these informative random signatures, a score is assigned to each gene. A PPI network is then obtained from the STRING database. Combined scores from STRING are used as interaction weights between proteins, and gene scores are assigned to the corresponding protein nodes. This network is used to diffuse the gene scores using the diffusion kernel approach proposed by Kondor and Lafferty. To determine the significance of the diffusion scores, a permutation procedure is used. To define significant signatures, the top 10% of genes ranked by diffusion score are tested for pathway enrichment using ConsensusPathDB. The significant pathways are selected, and the enriched genes within these pathways are considered significant signatures. For evaluation, we use the ACES database defined by Staiger et al. in 2012, which is a cohort of 1606 breast cancer samples collected from 12 studies in NCBI's Gene Expression Omnibus. Results show that the predicted signatures are significantly associated with outcome.
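The diffusion step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a small symmetric weight matrix with interaction weights rescaled to the range 0 to 1, builds the graph Laplacian L, and approximates the diffusion kernel exp(-beta * L) applied to the score vector with a truncated Taylor series. The function names, the value of beta, and the toy network are assumptions.

```python
def laplacian(weights):
    """Graph Laplacian L = D - W for a symmetric weight matrix W."""
    n = len(weights)
    return [
        [(sum(weights[i]) if i == j else 0.0) - weights[i][j] for j in range(n)]
        for i in range(n)
    ]


def diffuse(weights, scores, beta=0.5, terms=30):
    """Approximate exp(-beta * L) @ scores with a truncated Taylor series.

    The series converges quickly only when beta times the largest node
    degree is small, as in this toy example.
    """
    lap = laplacian(weights)
    n = len(scores)
    term = list(scores)      # holds (-beta * L)^k @ scores / k!
    result = list(scores)
    for k in range(1, terms):
        term = [
            -beta / k * sum(lap[i][j] * term[j] for j in range(n))
            for i in range(n)
        ]
        result = [r + t for r, t in zip(result, term)]
    return result


# Toy path network 0 - 1 - 2 with hypothetical interaction weights.
weights = [
    [0.0, 0.9, 0.0],
    [0.9, 0.0, 0.4],
    [0.0, 0.4, 0.0],
]
initial_scores = [1.0, 0.0, 0.0]
diffused = diffuse(weights, initial_scores)
```

In this toy network the score placed on node 0 spreads to its neighbor and, more weakly, to the node two steps away, while the total score is preserved. For a real PPI network, a numerical library with a matrix exponential routine would be a more practical choice than the series used here.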
Short Abstract: Hypoplastic left heart syndrome (HLHS) is a congenital heart defect in which the left ventricle is severely underdeveloped. Combining patient-derived induced pluripotent stem cells (iPSCs) with high-throughput RNA-seq and whole genome sequencing (WGS) provides an unprecedented opportunity to investigate the disease-specific transcription profiles linked to potential genetic causes of HLHS.
We have identified 4000 and 6000 differentially expressed genes between the family members in iPSCs and differentiated cells, respectively. Most of the differentially expressed genes show a high expression pattern in iPSCs from the proband. However, in differentiated cells the pattern included both higher and lower expression in the proband compared with the parents. Analysis of the WGS data according to rarity, functional impact and mode of inheritance identified 62 genes with recessive or de novo variants potentially involved in the pathogenesis of HLHS. Fourteen of them displayed transcriptional differences in undifferentiated iPSCs from the HLHS-affected individual, while 25 of the 62 mutated genes showed significantly different expression levels in differentiated cells. Eleven genes were differentially expressed in both iPSCs and differentiated cells and were further characterized in a time-course of guided cardiac differentiation. Four of the 11 genes, ELF4, HSPG2, SGMS2, and SDHD, displayed significantly different profiles between differentiating HLHS-iPSCs and their control counterparts (p=0.04, 0.008, 0.03 and 0.04, respectively). Notably, none of these genes had previously been linked to HLHS.
Integration of WGS and RNA-seq data using an in vitro iPSC disease modeling platform prioritized candidate genes that may contribute to HLHS and could be targets for future mechanistic studies toward disease-specific clinical applications.
Short Abstract: Mycobacterium tuberculosis, the causative agent of tuberculosis (TB), infects an estimated two billion people worldwide and is the leading cause of mortality due to infectious disease. Targeting the host-pathogen interaction as a new therapeutic paradigm has advantages over antibacterials against TB. Here we performed a comprehensive meta-analysis of host gene expression profiles in response to both latent and active TB infections in a total of 7 public Gene Expression Omnibus (GEO) datasets. Common genes and pathways in response to tuberculosis infections were identified through a rigorous, iterative analysis pipeline. A total of 241 genes were significantly differentially expressed across at least 4 out of 7 active TB comparisons and were significantly enriched in 41 canonical human pathways. The role of protein kinase R (PKR or EIF2AK2) in antiviral cell response was the most significant pathway. This pathway is known to play a critical role in host response to TB infections by regulating MAPK, NF-κB and inflammatory cytokines. Several novel pathways were also identified, including two pathways related to thrombocytosis that might occur in active pulmonary tuberculosis infections, and two pathways in Parkinson’s disease. Interestingly, genetic analyses of the 241 genes identified 28 genetic variants with significant association with Parkinson’s disease based on public datasets of genome-wide association studies, further demonstrating a strong connection between the two diseases. Based on the gene and pathway analyses we propose several drug-repurposing opportunities, such as intravenous immunoglobulin and Palivizumab for anti-TB therapeutics, supported by literature or connectivity map analysis.
Short Abstract: In this study, we describe a keyword extraction technique that uses Latent Semantic Analysis (LSA) to identify semantically important single topic words, or keywords. We compared our method against two other automated keyword extractors, Tf-idf (Term frequency-inverse document frequency) and MetaMap, on PubChem BioAssay text descriptions, using human-annotated keywords as the reference. Our results suggest that the LSA-based keyword extraction method performs comparably to the other techniques. It can therefore be used effectively and efficiently to extract keywords from text descriptions in an incremental update setting.
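For orientation, the Tf-idf baseline named above can be sketched in a few lines. This is a generic illustration of term frequency-inverse document frequency ranking, not the implementation evaluated in the study; the whitespace tokenizer, the function name, and the three example descriptions are assumptions.

```python
import math
from collections import Counter


def tfidf_keywords(documents, top_n=3):
    """Rank the terms of each document by tf-idf and return the top ones."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    ranked = []
    for tokens in tokenized:
        if not tokens:
            ranked.append([])
            continue
        counts = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked.append(sorted(scores, key=lambda t: (-scores[t], t))[:top_n])
    return ranked


# Hypothetical assay descriptions (illustrative only).
documents = [
    "kinase inhibition assay in cells",
    "luciferase reporter assay in cells",
    "kinase binding assay",
]
keywords = tfidf_keywords(documents, top_n=1)
```

Terms that appear in every description, such as "assay" here, receive a score of zero, so the ranking favors words that distinguish one description from the others.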
Short Abstract: Lung adenocarcinoma shows distinct patterns of EGFR/KRAS mutations between East Asian and Western patients and between male and female patients. However, beyond the well-known EGFR/KRAS distinction, gender- and ethnicity-specific molecular aberrations and their effects on prognosis remain largely unexplored. Association modules capture the dependency between an effector molecular aberration and target gene expression. We established association modules from the copy number variation (CNV), DNA methylation and mRNA expression data of a Taiwanese female cohort. The inferred modules were validated in four external datasets of East Asian and Caucasian patients by examining the coherence of the target gene expression and its association with prognostic outcomes. Modules 1 (cis-acting effects with chromosome 7 CNV) and 3 (DNA methylation of UBIAD1 and VAV1) were significantly negatively associated with survival times in two East Asian patient cohorts. Module 2 (cis-acting effects with chromosome 18 CNV) was significantly negatively associated with survival times in the East Asian female subpopulation alone. By examining the genomic locations and functions of the target genes, we identified several putative effectors of the two cis-acting CNV modules: RAC1, EGFR, CDK5 and RALBP1. Furthermore, module 3 targets were enriched with genes involved in cell proliferation and division, consistent with the negative associations with survival times. We demonstrated that association modules in lung adenocarcinoma with significant links to prognostic outcomes were ethnicity and/or gender specific. This discovery has profound implications for the diagnosis and treatment of lung adenocarcinoma and echoes the fundamental principles of the personalized medicine paradigm.
Short Abstract: The development of multiple drug resistance in bacteria such as Escherichia coli (E. coli) has placed a tremendous burden on the pharmaceutical industry to find novel drugs. The resistance mechanism of bacteria can be viewed as the fine tuning of a plethora of intrinsic and extrinsic resistance gene cascades, such as β-lactamases, outer membrane proteins (OMP), two-component systems, etc. To provide a comprehensive view of E. coli drug resistance, a database, uCARE (www.e-bioinformatics.net/ucare), was developed, consisting of drugs with reported resistance, resistance genes, strain- and segment-specific operons, and pathogenicity islands. The pharmacological activity of a drug may also be perceived as a function of its physicochemical properties and structural framework. Therefore, drugs with reported resistance, apart from serving as a catalog for physicians, may also contain fingerprints of the factors or patterns associated with the susceptibility of drugs to resistance. With this rationale, the present work was an attempt to evaluate the feasibility of predicting these patterns by quantitative structure-property relationship (QSPR) modeling. Data fitting was carried out for 25 physicochemical attributes of drugs using 28 different machine learning algorithms with 10-fold cross-validation. To improve statistical reliability, the process was repeated with random seed values from 1 to 10, taking each attribute in turn as the class variable, followed by calculation of the mean and standard deviation of the correlation coefficient over all iterations for each class attribute. The models generated were further manually scrutinized for biological support. They have expanded our understanding of the relationship between drug resistance and drug structure.
Short Abstract: Virus discovery from high throughput sequencing data often follows a bottom-up approach in which taxonomic annotation takes place prior to association with disease. Although effective in some cases, this approach fails to detect novel pathogens and remote variants not present in reference databases. We have developed a species-independent pipeline that uses sequence clustering to identify nucleotide sequences that co-occur across multiple sequencing data instances. We applied the workflow to 686 sequencing libraries from 252 cancer samples of different cancer and tissue types, 32 non-template controls, and 24 test samples. Recurrent sequences were statistically associated with biological, methodological or technical features, with the aim of identifying novel pathogens or plausible contaminants that may be associated with a particular kit or method. We provide examples of identified inhabitants of the healthy tissue flora as well as experimental contaminants. Unmapped sequences that co-occur with high statistical significance potentially represent the unknown sequence space where novel pathogens can be identified.
Short Abstract: This article reviews the specialist literature for relevant articles about the association between antidepressant use during pregnancy and the appearance of autism spectrum disorder (ASD) symptoms in the offspring. After selection of the relevant articles and calculation of the effect size, the meta-analysis was done using R code. At the end of our preliminary work we retained only 3 articles that were pertinent to the purpose of our study. We extracted the available data into Excel files and then performed a meta-analysis. The final results showed a positive association between exposure to antidepressants in utero and ASD.
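The pooling step of such a meta-analysis can be sketched as follows. The abstract does not state which model was used, so this sketch assumes a fixed-effect, inverse-variance model on log-scale effect sizes; the function name is hypothetical, and the three effect sizes and standard errors are invented for illustration and are not the study's data.

```python
import math


def pool_fixed_effect(log_effects, standard_errors):
    """Inverse-variance pooled estimate with a 95% confidence interval."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled = sum(w * y for w, y in zip(weights, log_effects)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    interval = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)
    return pooled, pooled_se, interval


# Invented log odds ratios and standard errors for three studies.
log_odds_ratios = [0.35, 0.50, 0.20]
standard_errors = [0.20, 0.25, 0.15]
pooled, pooled_se, interval = pool_fixed_effect(log_odds_ratios, standard_errors)
pooled_odds_ratio = math.exp(pooled)
```

A pooled odds ratio above 1 with a confidence interval that excludes 1 would indicate a positive association. With only three studies, a random-effects model and a heterogeneity check would also be worth considering.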
Short Abstract: The GeneCards suite knowledgebase integrates ~150 diverse biomedical data sources, providing an extensive network of relationships and annotations leveraged in next generation sequencing (NGS) analyses. GeneCards, the human gene compendium, enables researchers to effectively navigate and inter-relate the wide universe of human genes, variants, proteins, cells, biological pathways, expression, diseases, and drugs. MalaCards, the human disease database, unifies copious disease-related terms and annotations, and provides a high quality scored gene-disease network with varying stringency levels, defining 4,500 “elite” genes for 8,000 diseases. VarElect, our NGS prioritizer, scores and ranks variant-containing genes based on relevant diseases/phenotypes, exploiting the rich information in the GeneCards suite, and taking advantage of its sophisticated search facility. This allows one to infer both direct and indirect gene-phenotype links, thereby facilitating comprehensive NGS interpretation. The indirect mode benefits from GeneCards’ diverse gene-to-gene relationships to add “guilt by association” connections. Comprehensive annotation is crucial for browsing, filtering and selecting candidate variants. Thus, VarElect enables rapid push-button NGS interpretation, replacing laborious manual database scrutiny. To further leverage this tertiary NGS analysis capacity, we developed TGex, the translational genomics expert, which incorporates the VarElect interpretation engine in a VCF-to-report pipeline. TGex performs extensive VCF annotation and filtering, consolidating numerous attributes and meta-data into a single view. It features distinct tabs dedicated to relevant genetic models, facilitating the elucidation of clinical projects involving thousands of exome/whole genome NGS analyses, and providing fast comprehensive results, a tangible asset for future trends in genome analysis.
Short Abstract: Precision oncology aims to improve cancer patient outcomes by tailoring drug treatments to each individual. In a search for genetic markers that predict response, several large cancer cell line (CCL) screens have been performed measuring the growth of CCLs when treated with a panel of drugs at varying doses. The current computational tools used in this area reduce these data to a single value indicating response for each CCL/drug combination. Although this simplification greatly reduces the data, it may overlook important modes of response and contribute to the difficulty in finding agreement across studies.
In this work I have developed a new supervised machine learning method called BaTFLED (Bayesian Tensor Factorization Linked to External Data) to predict the full dose-response curves for CCL/drug combinations. BaTFLED uses subtype, mutation, copy number and expression data for CCLs as well as target and structural features for drugs to estimate projection matrices that link these data to a three-dimensional array (tensor) of responses. Distributions on the values in the projection matrices are estimated using variational approximation and can be used to predict response for new samples. When applied to the largest data set of this type recently released by the Cancer Target Discovery and Development (CTD2) Network (907 cell lines, 545 drugs and 16 doses) the method accurately predicts responses for CCLs and drugs not used for training. Additionally, by utilizing sparsity-inducing prior distributions, the model can select predictors and highlight relationships between the CCL and drug features that govern response.
Short Abstract: Numerous studies and clinical trials have been carried out using various microarray platforms over the last two decades. A huge amount of microarray data with valuable clinical information has been accumulated. Meanwhile, NGS-based gene expression profiling (RNA-seq) has superior performance and is poised to replace microarray-based assays in the near future. Therefore, for both economic and scientific reasons, there is a tremendous need to establish a mechanism to bridge these two technologies in order to reuse the legacy microarray data.
Our previous study indicated that with appropriate gene mapping and data transformation, gene signatures can be transformed between microarray and RNA-seq gene expression assays. However, application of RNA-seq data trained models to microarray data largely depends on the gene mapping and the training algorithms.
In this study, we developed BRIDGES to construct predictive models that can be applied across platforms. BRIDGES uses the Kolmogorov-Smirnov (K-S) test on Hub-Genes, the genes that show a consistent expression order in microarray and RNA-seq data for the same samples. We used 498 neuroblastoma samples profiled with both microarray and RNA-seq to estimate the transferability of prediction models trained with three different algorithms on four clinical endpoints. The predictive models using Hub-Genes had similar intra-platform and inter-platform prediction performance, regardless of gene mapping and modeling algorithm. Furthermore, we validated BRIDGES with an unrelated acute myeloid leukemia data set of 170 samples to predict the Cyto-Risk of patients. Since the selection of Hub-Genes is independent of technology platform, cross-platform gene mapping and model building algorithm, BRIDGES can be generally applied to any cross-platform prediction.
Short Abstract: A major barrier to successful integration of information from biology and the bedside (i2b2) is often the lack of intuitive, user-friendly interactive tools that allow researchers in clinical and basic science to access, understand, and analyze data readily. To address this limitation, we developed HUeMR (Howard University electronic Medical Records), a secure web application that enables researchers to investigate de-identified medical records derived from the Howard University Hospital (HUH) electronic medical record (EMR) system. Notably, HUH is a tertiary academic medical center with a level 1 trauma center handling over 50,000 emergency department visits and over 8,000 inpatient admissions per year. HUH cares primarily for the minority population in the District of Columbia metropolitan area and provides a broad spectrum of clinical services, which generate a significant amount of data specific to this population. Researchers can investigate disease severity, progression, and treatment response during one or more hospitalizations. To this end, researchers construct queries using a highly intuitive query builder and then visualize data using interactive charts that support drill-down functionality. Through the HUeMR application interface, we expose the underlying database schema to users, thus giving them the knowledge required to construct queries and promoting knowledge discovery. Compared with the i2b2 platform, which has been widely adopted within academic health centers, we believe HUeMR lowers the barriers to entry into i2b2 research for novice and experienced investigators regardless of computing background. Therefore, this tool will accelerate clinical and translational science research.
Short Abstract: Fast and affordable benchtop sequencers (e.g. PGM from Ion Torrent) are becoming more important in improving personalized medical treatment. Still, distinguishing genetic variants between healthy and diseased individuals from sequencing errors remains a challenge.
Here we present VARIFI, a pipeline for finding reliable genetic variants (SNPs and INDELs). We optimized parameters in VARIFI by analyzing more than 170 amplicon sequenced cancer samples produced on the PGM. In contrast to existing pipelines, VARIFI combines different analysis methods and, based on their concordance, assigns a confidence score to every identified variant. In addition, VARIFI includes methods to identify low-frequency variants, which is necessary for early stage cancer diagnostics. Furthermore, VARIFI applies variant filters for biases associated with the sequencing technologies (e.g. incorrectly called homopolymer-associated indels with Ion Torrent). VARIFI automatically extracts variant information from publicly available databases and incorporates methods for variant effect prediction.
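A concordance-based confidence score of the kind described above might look like the following sketch. It is not VARIFI's scoring scheme, which the abstract does not specify: it simply assumes that the confidence of a variant is the fraction of analysis methods that report it. The method names, the variant tuples, and the function name are hypothetical.

```python
from collections import Counter


def concordance_scores(calls_by_method):
    """Score each variant by the fraction of methods that called it."""
    n_methods = len(calls_by_method)
    support = Counter(
        variant
        for calls in calls_by_method.values()
        for variant in set(calls)   # count each method at most once
    )
    return {variant: count / n_methods for variant, count in support.items()}


# Hypothetical variants as (chromosome, position, reference, alternate).
snp = ("chr7", 55191822, "T", "G")
indel = ("chr12", 25245350, "C", "CA")
artifact = ("chr17", 7675088, "G", "A")

calls_by_method = {
    "method_a": [snp, indel],
    "method_b": [snp],
    "method_c": [snp, indel, artifact],
}
scores = concordance_scores(calls_by_method)
```

A variant reported by every method receives a score of 1.0, while a call made by a single method receives a low score and could be flagged for closer inspection.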
VARIFI requires little computational experience and no in-house compute power, since the analyses are done on our server. Running VARIFI ensures that all sequenced data are processed in a coherent and standardized way, which in turn facilitates reproducibility and comparability between different samples. VARIFI is an open source web-based tool available at varifi.cibiv.univie.ac.at.
Short Abstract: There are a number of pharmacologically interesting targets, but only a small portion of them is exploited in the drug market. Identifying novel therapeutic targets is therefore a promising field of study that should ultimately benefit the development of the pharmaceutical industry.
To computationally measure the alteration of a disease phenotype induced by a target gene, we used a gene regulatory network.
Based on the direction of regulation, a novel classification model was constructed to identify new therapeutic targets. Known targets and disease-related genes (DisGs) were extracted from TTD and OMIM, respectively. We built a directed gene regulatory network using information on interactions between genes from PID. Genes known to be therapeutic targets for Alzheimer’s disease, breast cancer and prostate cancer were used as the positive sets for the respective diseases. For each disease, 10 negative sets were made from randomly extracted genes. Shortest paths from target genes to DisGs were obtained by exploring the network. As a result, a vector composed of the regulation values of DisGs was created for each target. The reciprocal of the distance was assigned to each DisG as its regulation value. If multiple shortest paths existed, all the values were added up and the sum was assigned to the destination node. DisGs disconnected from known target genes were excluded from the vector. Finally, we applied random forest to the lists of vectors from the positive and negative sets. The model was validated using LOOCV, and the AUCs for Alzheimer’s disease, breast cancer and prostate cancer were 0.975, 0.972 and 0.957, respectively.
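The construction of the regulation vector can be sketched with a breadth-first search that counts shortest paths. This is a minimal reading of the description above, not the authors' code: it ignores the sign of regulation, skips a disease gene that coincides with the target itself, and uses a hypothetical toy network and function name.

```python
from collections import deque


def regulation_vector(network, target, disease_genes):
    """Map each reachable disease gene to (number of shortest paths) / distance.

    `network` is a directed adjacency mapping: gene -> list of regulated genes.
    """
    distance = {target: 0}
    n_paths = {target: 1}
    queue = deque([target])
    while queue:
        gene = queue.popleft()
        for neighbor in network.get(gene, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[gene] + 1
                n_paths[neighbor] = 0
                queue.append(neighbor)
            if distance[neighbor] == distance[gene] + 1:
                # Every shortest path to `gene` extends to `neighbor`.
                n_paths[neighbor] += n_paths[gene]
    return {
        gene: n_paths[gene] / distance[gene]
        for gene in disease_genes
        if gene in distance and distance[gene] > 0   # drop unreachable genes
    }


# Toy directed network: T reaches D1 by two shortest paths of length 2.
network = {
    "T": ["A", "B"],
    "A": ["D1"],
    "B": ["D1"],
    "D1": ["D2"],
    "X": ["D3"],
}
vector = regulation_vector(network, "T", ["D1", "D2", "D3"])
```

Here D1 receives 1/2 + 1/2 = 1.0, D2 receives 2/3 because both shortest paths of length 3 pass through D1, and D3 is excluded because it cannot be reached from the target. Vectors of this form for positive and negative gene sets would then be the input to the classifier.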
Short Abstract: Cost-effective sequencing-based assays have become the dominant methods for studying gene expression and epigenetic regulation. The development of analysis methods, however, lags far behind the still rapidly advancing sequencing technologies and their applications. Some existing tools are no longer sufficient for new and more advanced sequencing applications. For example, many early tools, designed for data with few replicates, cannot take full advantage of sequencing data with more replicates, and some of these tools rely on unrealistic assumptions (e.g., a Poisson distribution) that typically lead to an inflated false positive rate.
Taking advantage of the increasing number of biological replicates in more recent sequencing data, and of the parallel computing power of high-performance computing (HPC) clusters, we introduce a novel balanced permutation method, implemented with parallel computing, to identify differential expression or epigenetic changes from sequencing data. Our method uses balanced permutation to estimate the distribution under the true-null model, and then uses the estimated distribution to calculate the statistical significance of the observed data. Our method works even if the numbers of replicates in the two groups differ, and it automatically switches from systematic permutation to random permutation when the number of possible permutations exceeds a defined threshold.
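The switch between systematic and random permutation can be illustrated with a short sketch. It is written in Python for illustration only (the method itself is implemented in R) and is not the authors' algorithm: it assumes that "balanced" means each permuted group takes about half of its samples from each original group, uses the difference in means as the test statistic, and omits the parallel computing part. All names, thresholds, and values are assumptions.

```python
import itertools
import math
import random
from statistics import mean


def balanced_null(group_a, group_b, max_exhaustive=1000, n_random=1000, seed=0):
    """Null distribution of the mean difference under balanced relabeling.

    Each permuted first group keeps about half of group_a and is filled up
    with samples from group_b, so both permuted groups mix the two
    conditions. The two groups may have different sizes.
    """
    take_a = len(group_a) // 2
    take_b = len(group_a) - take_a
    if take_b > len(group_b):
        raise ValueError("group_b is too small for a balanced split")

    total = sum(group_a) + sum(group_b)
    n_all = len(group_a) + len(group_b)

    def statistic(idx_a, idx_b):
        first = [group_a[i] for i in idx_a] + [group_b[j] for j in idx_b]
        second_mean = (total - sum(first)) / (n_all - len(first))
        return mean(first) - second_mean

    n_possible = math.comb(len(group_a), take_a) * math.comb(len(group_b), take_b)
    if n_possible <= max_exhaustive:
        # Systematic mode: enumerate every balanced relabeling once.
        pairs = itertools.product(
            itertools.combinations(range(len(group_a)), take_a),
            itertools.combinations(range(len(group_b)), take_b),
        )
    else:
        # Random mode: sample balanced relabelings with a fixed seed.
        rng = random.Random(seed)
        pairs = (
            (rng.sample(range(len(group_a)), take_a),
             rng.sample(range(len(group_b)), take_b))
            for _ in range(n_random)
        )
    return [statistic(idx_a, idx_b) for idx_a, idx_b in pairs]


def p_value(observed, null):
    """Two-sided p-value with the usual +1 correction."""
    extreme = sum(1 for value in null if abs(value) >= abs(observed))
    return (extreme + 1) / (len(null) + 1)


# Illustrative expression values for one gene in two conditions.
treated = [10.0, 11.0, 12.0, 13.0]
control = [0.0, 1.0, 2.0, 3.0]
null = balanced_null(treated, control)
p = p_value(mean(treated) - mean(control), null)
```

With four replicates per group there are only 36 balanced relabelings, so the sketch enumerates them all; lowering `max_exhaustive` or adding replicates makes it fall back to random sampling.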
We evaluated the performance of our method with both simulated and real RNA-seq data. The results showed that our method, in comparison with traditional permutation methods, not only increases power in detecting significant changes, but also reduces the computation cost significantly. Our method is implemented in R, and will be freely available to the public once it is released.
Short Abstract: Trials involving genomics-driven treatment selection require the coordination of many teams interacting with a great variety of information. The need for better informatics support to manage this complex set of operations motivated the creation of OpenGeneMed.
OpenGeneMed is a standalone and customizable version of GeneMed (Zhao et al., 2015), a web-based interface developed for the Molecular Profiling based Assignment of Cancer Therapy (NCI-MPACT) clinical trial coordinated by the NIH.
OpenGeneMed streamlines clinical trial management and it can be used by clinicians, lab personnel, statisticians and researchers as a communication hub. It automates the annotation of genomic variants identified by sequencing tumor DNA, classifies the actionable mutations according to customizable rules, and facilitates quality control in reviewing variants. The system generates summarized reports with detected genomic alterations that a treatment review team can use for treatment assignment.
OpenGeneMed allows collaboration to happen seamlessly along the clinical pipeline; it helps reduce errors made when transferring data between groups and facilitates clear documentation along the pipeline.
OpenGeneMed is distributed as a standalone virtual machine, ready for deployment and use from a web browser; its code is customizable to address the specific needs of different clinical trials and research teams. Examples of how to change the code are provided in the technical documentation distributed with the virtual machine. In summary, OpenGeneMed offers an initial set of features inspired by our experience with GeneMed, a system that has proven efficient and successful for coordinating the application of next-generation sequencing in the NCI-MPACT trial.
Short Abstract: Retrieving significant genes from gene expression data has long been one of the most important topics in the biomedical field. With the prevalence of high-throughput technologies, genomic data such as RNA-seq data have become widely available for study. Although many feature selection techniques have been applied to detect significant genes in these kinds of datasets, conventional methods often suffer from poor reproducibility because of the large number of features in the datasets.
In this paper, we propose an ensemble feature selection method for high-dimensional gene expression data. We apply an L1-norm support vector machine to filter out irrelevant features efficiently, taking into account the robustness of features under instance perturbation. The optimal set of features is then obtained by applying recursive feature elimination to the filtered feature set. The proposed method is compared with several well-known feature selection methods on cancer RNA-seq datasets, and we show that it performs better in terms of classification accuracy and feature robustness. As the proposed approach performs moderately on datasets consisting of a large number of features, it is expected to be applicable to various kinds of research on biomarker discovery.
Short Abstract: Many human diseases share risk factors and involve overlapping biological processes. Moreover, multiple diseases often co-occur in vulnerable individuals, and we need a knowledge discovery framework that can simultaneously handle combinations of morbidities, molecular and physiological features and environmental factors across diverse study designs to obtain more accurate disease subtypes.
We have developed an R package, Numero, specifically designed for defining subtypes of samples with partially overlapping features or a continuum of phenotypic characteristics. In our framework, the self-organizing map (SOM), an unsupervised pattern recognition technique, is adopted to organize high-dimensional data on a 2D canvas according to rank-based similarity criteria, with pre-defined divisions on the canvas for estimating regional descriptive statistics. The obtained map is coloured according to locally averaged values for a particular variable, thus revealing the differences in the phenotypic profiles between specific subpopulations in an easily observable visual format.
Here, we introduce our new implementation of the framework and demonstrate its use in several case studies (such as the Framingham Cohort and the FinnDiane Study). The Numero package provides biomedical scientists with the means to combine heterogeneous data types in a highly intuitive and flexible way with the necessary suite of statistics to verify significant multivariate patterns in an era of big data.
Short Abstract: Vast amounts of high-throughput patient data have become more accessible in the last decade. The Cancer Genome Atlas (TCGA) project publishes various patient data for 34 cancer types and regularly enlarges the repository. Survival time prediction is important for personalizing treatment strategies. Our study aims to classify cancer patients by survival time (long- or short-term) using RNA-sequencing (mRNA) and reverse phase protein array (RPPA) data obtained from the TCGA project. We apply the personalized PageRank algorithm to a protein-protein interaction (PPI) network to uncover the most predictive features in the RPPA data. The mRNA data of the selected features are then used to train machine-learning methods. Learning is evaluated with a 100-fold cross-validation scheme. We tested the proposed method on 35 glioblastoma multiforme (GBM) and 243 kidney renal clear cell (KIRC) cancer patients. The proposed method correctly predicts the survival classes with an average accuracy of 73% for GBM and 77% for KIRC patients. It performs significantly better than individual classifiers trained with either mRNA (66% for GBM, 72% for KIRC) or RPPA (65% for GBM, 70% for KIRC) data alone. Thus, integrating the two types of patient data with PPI information leads to better survival time prediction. In the next phase of the study, the biological functions of the predictive features will be analyzed.
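As an illustration of the personalized PageRank step, here is a minimal power-iteration sketch in Python. The toy network, the seed set, the damping factor of 0.85, and the function name are assumptions for the example; the abstract does not state which parameters or implementation the authors used.

```python
def personalized_pagerank(adjacency, seeds, damping=0.85, n_iter=100):
    """Personalized PageRank by power iteration.

    adjacency: dict mapping each node to a list of its neighbours
               (list both directions for an undirected network).
    seeds:     set of nodes that receive all of the restart probability.
    """
    nodes = sorted(adjacency)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    scores = dict(restart)

    for _ in range(n_iter):
        # Restart term: with probability (1 - damping) jump back to a seed.
        new_scores = {n: (1.0 - damping) * restart[n] for n in nodes}
        for node in nodes:
            neighbours = adjacency[node]
            if not neighbours:
                # Dangling node: send its mass back to the seeds.
                for n in nodes:
                    new_scores[n] += damping * scores[node] * restart[n]
                continue
            share = damping * scores[node] / len(neighbours)
            for neighbour in neighbours:
                new_scores[neighbour] += share
        scores = new_scores
    return scores


# Hypothetical four-protein chain A - B - C - D, seeded at A.
toy_network = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
toy_scores = personalized_pagerank(toy_network, seeds={"A"})
ranked = sorted(toy_scores, key=toy_scores.get, reverse=True)
```

Nodes close to the seed accumulate more probability mass than distant ones, which is what makes the scores usable for ranking candidate features around a set of proteins of interest.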
Short Abstract: Many studies have sought to elucidate therapeutic similarity between drugs. Among the available approaches, text mining is promising because it can extract valuable information from vast amounts of unstructured data. In this study, we count the co-occurrence of drugs and diseases, and of drugs and genes, in the sentences of literature abstracts, and generate a drug-disease co-occurrence matrix and a drug-gene co-occurrence matrix. For each drug pair, a drug similarity is calculated from these matrices using mutual information (MI). Side-effect, chemical, and GO similarities are also calculated for each drug pair. For side-effect similarity, a binary vector is generated for each drug, indicating which of all the side effects are associated with that drug, and the Jaccard coefficient is used to calculate the similarity between two binary vectors. Each drug pair is given the class label “same” if the level 4 ATC codes of the two drugs are equal and “different” otherwise. One feature group consists of the side-effect, chemical, and GO similarities; the other additionally includes the MI similarity. To compare the two groups, we build classifiers and compare their AUCs. The AUC improves with the addition of the MI similarity, which supports the use of text mining to identify similar drugs.
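The side-effect similarity and the class labelling can be sketched in a few lines of Python. This is only an illustration: the function names and the example vectors are hypothetical, and the labelling helper assumes that a level 4 ATC code corresponds to the first five characters of a full seven-character code.

```python
def jaccard(vec_a, vec_b):
    """Jaccard coefficient of two equal-length binary vectors."""
    if len(vec_a) != len(vec_b):
        raise ValueError("vectors must have the same length")
    both = sum(1 for a, b in zip(vec_a, vec_b) if a and b)
    either = sum(1 for a, b in zip(vec_a, vec_b) if a or b)
    # Two drugs with no recorded side effects are treated as dissimilar here.
    return both / either if either else 0.0


def atc_label(code_a, code_b):
    """Label a drug pair by comparing ATC codes at level 4.

    Assumes full ATC codes, whose first five characters give level 4.
    """
    return "same" if code_a[:5] == code_b[:5] else "different"


# Hypothetical side-effect membership vectors for two drugs.
drug_1 = [1, 1, 0, 0, 1]
drug_2 = [1, 0, 0, 1, 1]
similarity = jaccard(drug_1, drug_2)  # 2 shared / 4 present in either
```

The resulting similarity would be one feature of a drug pair, alongside the chemical, GO, and MI similarities, with the `atc_label` output as the class to predict.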
Short Abstract: Ebola Virus (EBOV), a member of the Filoviridae family, causes a severe hemorrhagic fever known as Ebola Virus Disease (EVD), with a mortality rate of up to 90%. We aimed to identify the protein regions that have been conserved since the first outbreak, in order to gain more insight into the molecular biology of EVD, and to map functional information to those conserved residues. We employed an array of computational biology tools to i) create a large collection of protein sequences based on EBOV genomes sequenced during recent and previous outbreaks and correlate protein conservation with function, ii) collect known and predict novel post-translational modifications (PTMs) on EBOV proteins, iii) collect protein-protein interactions between virus and host proteins, iv) map conserved residues onto three-dimensional structures and protein complexes to identify modified residues present at interaction interfaces, and v) find motifs that may mediate protein-protein interactions. We identified the most conserved residues in EBOV proteins and complexes and explored their functional attributes by predicting (a) post-translationally modified sites, (b) the presence of eight conserved PTMs in protein-protein interactions, and (c) two linear motifs. Phosphorylation is the most frequent PTM type in our analysis, and we predicted three potential kinases responsible for these modifications. The presence of ATM kinase motifs in all EBOV proteins is the most important finding of our analyses. Based on our results and the current understanding of the pathways activated in EVD, we anticipate an association of ATM kinase with related pathways and kinases through which Ebola pathology is achieved.
Short Abstract: Enteropathy-associated T cell lymphoma (EATL) is an intestinal tumor, with a median survival time of less than 1 year. It is a rare disease with two main subtypes described. Very little is known about the genetic mutations and gene expression signatures that define this disease, or the extent to which the two types of EATL are genetically distinct.
In this study, we performed whole exome sequencing to 100-fold depth of 69 EATL tumors including 41 type I cases and 23 type II cases. We defined somatic mutations, copy number alterations, and HLA genotypes in these cases from sequencing data. Additionally, we generated RNA sequencing data on the same EATL tumors. Corresponding clinical and outcome data was collected on the same cohort.
We found that both type I and type II EATLs had overlapping patterns of mutations and similar overall survival. The most commonly mutated genes were chromatin modifier genes (34%) including ATRX and ARID1B. We also identified recurrent somatic mutations in signal transduction genes, including JAK1 and BCL9L. TP53 mutations were also recurrent (12%). Copy number amplifications in 9q, 1q, and 5q occurred most frequently and were present in both subtypes. RNAseq identified the gene expression signatures that distinguish the two types of EATL. The DQ2 or DQ8 HLA genotype is present in the majority of type I cases (90%) while occurring at population level frequency in type II cases (36%). Our study defines the genetic landscape of EATL and highlights the genetic and clinical overlap between the two types.
Short Abstract: Viruses and bacteria are known factors in lymphomagenesis. For instance, Epstein-Barr Virus (EBV) is recognized as an important trigger for Burkitt lymphoma (BL), while Helicobacter pylori infections are strongly associated with MALT lymphoma. Defining the microbiomes of these cancers is a necessary first step toward understanding the interactions between pathogens and tumor cells.
In this study, we performed total RNAseq on 60 cases of Burkitt lymphoma, 400 cases of diffuse large B-cell lymphoma (DLBCL), and 40 cases of enteropathy-associated T-cell lymphoma (EATL). All samples were sequenced on an Illumina platform with a read length of 100 bp. In addition, the HPV-infected HeLa cell line was used as a control sample for testing the sensitivity and specificity of our analysis method.
We developed a bioinformatics pipeline to generate the microbiome profile of each sample. We aligned the sequencing reads to the reference human genome (GRCh37) and the GenBank transcriptome. The pathogen reference genomes for all viral and bacterial genomes were extracted from the NCBI database. We found that 22% of Burkitt lymphomas were EBV positive compared to 8% of DLBCL and zero EATL. EBV positive cases were verified using EBER expression with a concordance of over 90%. We also identified HPV18 in the HeLa cell line.
These data represent a starting point for understanding the interactions between pathogens and tumor cells in lymphomas.
Short Abstract: Introduction
Colorectal Cancer is one of the most common forms of cancer and is the second leading cause of cancer deaths in the world. While patient-derived xenograft models have emerged as an important tool to study tumor growth, progression and response to therapy, the extent to which they recapitulate the genetic features of the primary tumors is unknown. In this study, we compare colorectal patient tumors and patient-derived tumor xenografts (PDX) obtained from the same patient to identify recurrently mutated genes and their overlap in colorectal cancer.
Method
We generated patient-derived xenografts from 8 different colorectal cancer tumors. We sequenced the exomes of these tumors, paired germline DNA and PDXs to identify somatic and germline mutations in the 8 patients with colorectal cancer. We compared somatic mutations along with copy number alterations between the tumors and PDXs. We further applied copy number analyses along with somatic allele frequencies to infer tumor purity. The integration of allelic fraction and copy number information also helped us to identify tumor subpopulations.
Results
We identified significant recurrent mutations in the PI3K pathway gene PIK3CA, the ERBB-RAS pathway gene NRAS, and the Wnt pathway genes TCF7L2 and APC. We found hotspot mutations in the tumor suppressor gene TP53 and the transcriptional modifier gene SMAD2 in the patients and PDXs. We observed significant subclonal heterogeneity in genes frequently mutated in colorectal cancer, both in patient tumors and in PDXs.
Our study demonstrates that tumor-specific PDX models faithfully recapitulate the genetic heterogeneity and clonality in tumors and are viable models for targeted therapies.
Short Abstract: A rate limiting step in the development of clinical diagnostics is the identification of biomarker targets to which specific diagnostic reagents can be produced. IDRIS allows users to explore potential biomarker targets for sensor-based diagnostic tools. The system has a database that comprises all bacterial sequences and annotations in NCBI's RefSeq. The raw sequence data has been processed by a bioinformatics pipeline consisting of several tools, resulting in a large integrated dataset that can be queried for biomarker targets.
Short Abstract: Recently, there has been a major effort by neuroscientists to systematically organize and integrate vast quantities of brain data. Here, as part of the NeuroElectro project (www.neuroelectro.org), we employ large scale text-mining, supplemented with manual curation, to extract quantitative measurements from >100K Neuroscience full-text articles. Our initial analysis revealed that a portion of the variance in the electrophysiology measurements can be explained using experimental conditions metadata (animal age, recording electrode type, temperature…), thus we decided to expand the types of metadata we collect and analyze their effect.
Specifically, we use a combination of regression algorithms, including random forests, linear regression, principal component analysis and heuristic approaches, to create statistical models for the prediction of electrophysiological values. We found that electrophysiology properties could be predicted with an R^2 value of ~0.65 (10-fold cross-validation). Using the curated articles as a gold standard, we are also able to rank the metadata entities by their predictive power for each electrophysiological property. This allows us to discover previously unknown effects of certain experimental conditions on the results.
Ultimately, our models will enhance the existing NeuroElectro database by enabling normalization of neuronal electrophysiology values for differences in experimental conditions. It would further allow us to link electrophysiological diversity across neuron types to corresponding differences in gene expression levels and disease phenotypes thus creating new ways of diagnosing neurological disorders and providing new targets for drug treatments.
Short Abstract: Dynamic pathway interactions in cancer may provide insights into cancer progression and pathogenesis in systems biology. However, most high-throughput human datasets are static rather than dynamic, which makes it hard to observe the dynamic changes of systemic disease progression. Here we suggest a method for converting static datasets into dynamic datasets by grouping samples by survival time and/or cancer subtype, such as grade or stage. We tested 485 mRNA expression datasets of glioblastoma multiforme with 595 clinical records downloaded from TCGA and 186 KEGG pathways. Four pairs of dynamic pathways were found with a Spearman correlation coefficient cutoff of 0.87. The first pair is ‘GRAFT_VERSUS_HOST_DISEASE’ and ‘AUTO_IMMUNE_THYROID_DISEASE’ with Spearman correlation 0.9083. The second pair is ‘INTESTINAL_IMMUNE_NETWORK_FOR_IGA_PRODUCTION’ and ‘ASTHMA’ with Spearman correlation 0.8857. The third pair is ‘LEISHMANIA_INFECTION’ and ‘ASTHMA’ with Spearman correlation 0.8782, and the last pair is ‘SYSTEMIC_LUPUS_ERYTHEMATOSUS’ and ‘INTESTINAL_IMMUNE_NETWORK_FOR_IGA_PRODUCTION’ with Spearman correlation 0.8752.
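The pairwise screening of pathways by Spearman correlation can be sketched as below. This is a minimal pure-Python illustration, not the authors' pipeline: the pathway names and activity profiles are hypothetical, how samples are grouped into ordered time points is not shown, and applying the 0.87 cutoff to the signed coefficient (rather than its absolute value) is an assumption.

```python
from itertools import combinations


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # undefined for a constant profile; reported as 0 here
    return cov / (var_x * var_y) ** 0.5


def correlated_pairs(profiles, cutoff=0.87):
    """Pathway pairs whose profiles correlate at or above the cutoff."""
    pairs = []
    for (name_a, prof_a), (name_b, prof_b) in combinations(sorted(profiles.items()), 2):
        rho = spearman(prof_a, prof_b)
        if rho >= cutoff:
            pairs.append((name_a, name_b, rho))
    return pairs


# Hypothetical pathway activity across four ordered survival-time groups.
toy_profiles = {
    "PATHWAY_A": [1.0, 2.0, 3.0, 4.0],
    "PATHWAY_B": [2.0, 4.0, 6.0, 9.0],
    "PATHWAY_C": [4.0, 3.0, 2.0, 1.0],
}
selected = correlated_pairs(toy_profiles)
```

Only the pair whose profiles rise together across the ordered groups passes the cutoff in this toy example; the anti-correlated pairs are dropped because the signed coefficient is used.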
Acknowledgement. This work was supported by the National Research Foundation Grants funded by the Korean Government (NRF-2015R1D1A1A01060287) and (NRF-2013S1A2A2034953).
Short Abstract: Recent developments have allowed the 3D visualization of biomolecular structures in web browsers. Beyond the ability to visualize, manipulate, compare or edit 3D molecular structures without the need for a separate app, data can be exchanged dynamically between the 3D viewer and a 1D genome browser. This enables both automatic and user-controlled annotation of genomes with 3D molecular structure information. A straightforward application consists of mapping and visualizing SNPs from dbSNP and ClinVar in protein-coding regions onto 3D protein structures directly from a genome browser. The residues in contact with a SNP position can be returned as a set of sequence marks in the genome browser (as a track). This is important because little is known about the impact of these variations, in particular on protein structures and their assemblies. These positions can then be correlated with any track of information available in the genome browser, including any other structural information, such as ligand/drug binding, oligomeric interfaces, protein-DNA and protein-RNA interfaces, post-translational marks and their binding sites, and structurally and evolutionarily conserved residues. This web interface, which maps sequence-based information to structure and, more importantly, interrogates structure directly, with little to no expertise required, to annotate sequences with molecular structure marks and interactions, opens new perspectives. Molecular interactions can be fingerprinted, visualized and compared in 1D. We present here two applications: the first maps pathogenic SNPs onto 3D structures for selected proteins; the second maps ligand-protein interactions from GPCR structures to 1D and compares their fingerprints.
Short Abstract: With over twenty-four million articles and an exponential growth rate, it has become difficult to stay abreast of the PubMed literature. To address this problem, we have created a novel biological network that aggregates data from millions of PubMed articles. This network, called MeTeOR (MeSH Term Objective Reasoning network), converts the manually curated MeSH terms that tag most PubMed articles into a global, structured summary of biological information that is then available for data-driven discovery. When compared to the current knowledge representations in many standard curated databases regarding associations among genes, drugs, and diseases, MeTeOR contains both confirmatory and novel information. Furthermore, when a hypothesis-generating algorithm is applied to the MeTeOR network, it suggests new potential disease or drug associations for most genes. In the most realistic test of performance, a time-stamped analysis, hypotheses generated from a MeTeOR network based on the literature prior to 2014 were shown to have significant predictive power for discoveries that were published after 2014. These preliminary data support MeTeOR as a promising representation of the biomedical literature that may provide ready access to high-quality information about the relationships linking genes, drugs, and diseases, and that may support novel hypotheses for systems analysis and precision medicine.
Short Abstract: Background: The genesis of Alzheimer's disease is associated with dysregulation at different levels. The availability of large-scale RNA-seq expression data allows us to study the transcriptional dysregulation associated with the genesis of Alzheimer's disease.
Methods: We collected RNA-seq expression data from the AMP-AD program and constructed the expression profiles of AD patients and normal individuals. All gene pairs were investigated for coexpression changes in the transition from normal to disease. Disease-related dysregulated genes were selected by three criteria: differential expression, association with disease genes, and number of dysregulation partners.
Results: 64 genes were predicted to be dysregulated and related to Alzheimer's disease. Both the co-expression and the differential co-expression analyses suggest that these genes are interconnected as a regulatory network. Functional annotation suggests that the predicted genes have strengthened connections with genes related to synaptic function. We also checked the evolutionary conservation of gene co-expression and found that human and mouse brains have divergent transcriptomic coregulation.
Conclusion: Our study uncovers a transcriptional regulatory network that tends to have dysregulated interactions with other genes. This network is associated with the genesis of Alzheimer's disease through its effect on genes related to synaptic function.
Short Abstract: Alzheimer's disease (AD) is a common neurodegenerative disease. Age is a known main risk factor for AD. We analyzed the epigenetic mark histone 3 lysine 9 acetylation (H3K9ac) in the human prefrontal cortex of 676 samples from the ROSMAP study. Participants were not cognitively impaired upon study entry. After death, AD pathologies including neurofibrillary tangles were measured and anti-H3K9ac ChIP-seq experiments were conducted. We identified 26384 H3K9ac domains in the ChIP-seq data. The numbers of sequence reads falling into each domain were determined for each sample, and normalized by regressing out technical nuisance variables.
We split the dataset into training (n=446) and test data (n=230). An L1-penalized regression model was fitted on the training data with age of death as the outcome and H3K9ac domains as penalized explanatory variables. Gender was added as an unpenalized covariate. The penalty parameter was determined by maximizing the cross-validated likelihood on the training set. The coefficients of 10 domains were nonzero. This model was used to predict the epigenetic age of the test samples. Predicted epigenetic age showed a moderate correlation of 0.25 with age of death. We defined accelerated aging as the residuals resulting from regressing epigenetic age on age of death and gender. Accelerated aging was positively associated with neurofibrillary tangles (p=0.022).
We further discuss accelerated aging in AD and limitations of our study. We also calculate accelerated aging based on DNA methylation from the same samples [Levine et al., 2015] and compare those estimations to the H3K9ac-derived estimations.
Short Abstract: Alternative splicing (AS) can critically affect gene function and disease, yet mapping splicing variations remains a challenge, particularly for complex splicing patterns that do not fit the mold of classically defined AS.
We propose a new computational tool that defines and quantifies AS in units of local splicing variations (LSVs). LSVs capture both classical AS events and more complex patterns of splicing previously ignored by other tools. Our LSV analysis of over 250 RNA-seq experiments reveals that complex splicing variations, involving more than two alternative junctions, are much more prevalent than previously appreciated. Such complex variations comprise approximately 30% of human LSVs and are significantly enriched among tissue-dependent splicing changes. This suggests that complex LSVs are an important aspect of gene regulation.
Our tool MAJIQ allows us to detect de novo junctions, assess differentially spliced LSVs between groups of experiments, and create a visual summary of how different pairs or groups agree or disagree on differentially spliced LSVs. We find MAJIQ is significantly more sensitive and more accurate than other methods for detecting classical binary splicing events, in addition to its detection of more complex splicing patterns.
Improving disease studies, enhancing predictive models for splicing or mapping the effect of genetic variants are just some of the immediate applications of our novel LSV framework and the MAJIQ software.
Short Abstract: In text mining for regulatory missions, it is a major challenge to digest and interpret prodigious quantities of largely unstructured or poorly structured textual information. The means to sift out information germane to regulatory questions is paramount, yet it is difficult to define the specific terms needed to retrieve relevant documents. Probabilistic topic modeling offers a viable approach, in which unstructured documents are characterized as probability distributions over latent topic themes that, in turn, are probability distributions over words. With such a model, the untenable process of searching and reading for answers to a regulatory question in a vast corpus reduces to more careful scrutiny of a small set of documents thematically related to the question. To test the effectiveness and validity of topic modeling, we constructed a ground truth data set by randomly mixing 59201 abstracts downloaded from PubMed that contained 39 tobacco use-related themes and 2 entirely unrelated negative control themes. Latent Dirichlet allocation (LDA) was applied to build topic models, which segregated documents into the proper thematic truth categories, even those containing small fractions (<0.1%) of the documents, demonstrating high specificity and sensitivity of thematic characterization. Topic modeling was then applied to discover the latent associations in menthol tobacco documents combined from 4 companies with a pseudo balance. The discovered latent topics are not only helpful for understanding the knowledge within the documents, but also useful for defining the specific terms for retrieving documents relevant to the regulatory question.
Short Abstract: Personalized medicine aims to provide more precise guidelines for drug treatment based on the patient’s genomic makeup. Pharmacogenomics is the study of genetic variants such as single nucleotide polymorphisms (SNPs) that are associated with drug response. In addition to DNA sequence variants, there are chemical modifications of DNA bases, e.g. DNA methylation, that do not alter the sequence. DNA methylation regulates gene expression, and so could influence drug response. An important mechanism in gene regulation is alternative splicing (AS). AS is the means by which a cell generates diverse proteins from a single gene by selecting different combinations of exons from the gene for inclusion into the mature mRNA. SNPs and DNA methylation are tightly connected to splicing regulatory elements that control AS, and so contribute to the choice of exons transcribed from a gene. We propose a public resource for pharmacogenomics centered on AS. We will describe the Pharm-AS website to document SNPs and DNA methylation loci that may select exons, via AS, that can alter the translated protein’s response to drug treatment. This will assist identification of transcript isoforms and loci variants that could improve guidelines for drug treatment based on the patient’s genomic information, namely their SNP alleles and DNA methylation profile.
Short Abstract: The immune system detects and attacks cancerous cells, imposing a selective force on tumor cell populations that promotes the emergence of clones capable of escaping immune surveillance. Immunoediting, the process by which tumor genomes evolve to escape the immune system, frequently results in genetic changes that circumvent major histocompatibility complex (MHC) based activation of immune cells. To quantify the impact of MHC-based antigen presentation on immunoediting at the population and individual level, we predicted MHC-I alleles for thousands of tumors and analyzed their effect on the frequencies of somatic mutations. We find that peptide sequences containing residues that are frequently mutated in cancers are significantly less presentable by human MHC-I complexes either because they have poorer binding affinity to MHC-I proteins or are less likely to be generated by proteasomal cleavage. Using a residue-level presentability score that integrates binding affinity and proteasomal processing, we show that somatic mutation frequency is anti-correlated with presentability. Individuals are less likely to acquire specific mutations if they have multiple MHC-I complexes capable of presenting them. Moreover, age at diagnosis of a tumor with a specific mutation increases with the number of MHC-I alleles that are capable of presenting that mutation. Thus, the landscape of somatic mutations in cancer is influenced by immunogenicity through MHC-I presentation.
Short Abstract: Identifying regulatory regions that differ in activity between cancerous and noncancerous cell lines and tissues holds promise for identifying new mechanisms involved in cancer progression. Enhancers and promoters are the regulatory regions involved in regulation of gene transcription – they control genes and pathways that can be investigated as putative therapeutic targets, or these regions may serve as targets themselves. Assays such as ATAC-seq and DHS-seq can identify regulatory region activity by identifying open, transcriptionally active chromatin. To streamline regulatory region data analysis we are developing a Bioconductor R-package. The package will bundle the various analysis steps into a single, compact tool with a well-documented workflow. Additionally, we plan to develop an R-shiny interface to the R-package – R-shiny eliminates the barrier of command-line proficiency, which will make our package accessible to a wider audience. Our example dataset is DHS-seq data for cancerous and normal liver cell lines downloaded from ENCODE. Our workflow guides users through visualizing areas of open chromatin as peaks in the UCSC genome browser, identifying hotspots (location of chromatin peaks), dividing hotspots into enhancer and promoter regions, and identifying significantly altered enhancer and promoter regions. There are 146,556 hotspots in the example dataset: 38,103 classified as promoters (+/- 1,500 bp from transcription start site); the remaining 108,453 classified as enhancers. Methods of identifying altered regulatory regions (based on intensity or peak presence) will be discussed and compared. Finally, overrepresentation analysis with genes linked to these regions will identify pathways linked to liver cancer.
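The promoter/enhancer split by distance to a transcription start site (TSS) can be sketched as follows. This is a Python illustration rather than the planned Bioconductor package; the function name, the input layout, and the rule that a hotspot counts as a promoter when any part of it lies within the window of a TSS are assumptions. The counts reported above are consistent with a two-way split: 38,103 + 108,453 = 146,556.

```python
import bisect


def classify_hotspots(hotspots, tss_by_chrom, window=1500):
    """Label each hotspot as 'promoter' or 'enhancer'.

    hotspots:     list of (chrom, start, end) intervals.
    tss_by_chrom: dict mapping chromosome to a sorted list of TSS positions.
    window:       distance in bp around a TSS that defines a promoter.
    """
    labels = []
    for chrom, start, end in hotspots:
        tss_positions = tss_by_chrom.get(chrom, [])
        # First TSS at or beyond the left edge of the extended interval.
        i = bisect.bisect_left(tss_positions, start - window)
        near_tss = i < len(tss_positions) and tss_positions[i] <= end + window
        labels.append("promoter" if near_tss else "enhancer")
    return labels


# Hypothetical coordinates, for illustration only.
toy_tss = {"chr1": [10_000, 50_000]}
toy_hotspots = [
    ("chr1", 9_000, 9_200),    # 800 bp upstream of a TSS
    ("chr1", 20_000, 20_300),  # far from any TSS
    ("chr1", 51_400, 51_600),  # 1,400 bp downstream of a TSS
    ("chr2", 100, 200),        # chromosome with no annotated TSS
]
toy_labels = classify_hotspots(toy_hotspots, toy_tss)
```

Keeping the TSS positions sorted lets each hotspot be classified with a binary search instead of a scan over all TSSs, which matters at the scale of the roughly 150,000 hotspots mentioned above.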
Short Abstract: The Plant Pathways Elucidation Project (P2EP) seeks to identify and map plant pathways in food crops to better understand how they function and can benefit human health. The components of this research project include significant empirical work sequencing, phenotyping, genotyping, gene and pathway annotation, and nutrient composition measurement for multiple food crops and varieties. We combine these data with existing research on human nutrition and disease pathways into an integrated knowledgebase.
Our query and analysis interface is suitable for both domain experts and computational biologists alike. The knowledgebase encompasses hundreds of millions of associations of varying data types and semantic relationships. It uses modular analysis components and third-party data hosting, which allows for future expansion and a scalable solution as the needs of the project grow.
As a public-private partnership between industry and academia, the P2EP project has paid special attention to data security while maintaining performance and ease-of-use with the integrated knowledgebase. Our partners can integrate in-house marker identifiers and other information without risk of exposing intellectual property over an external network.
This knowledgebase empowers personalized nutrition research by increasing the data available to both sides of the nutrition cycle. Breeders use the data to select and cross nutrient-rich varieties, increasing the biodiversity of edible crops. Individuals could select or avoid foods that affect their health status through personalized nutrition programs. Together these tools may highlight non-invasive dietary interventions to improve quality of life and disease management.
Short Abstract: Motivation: Methods for extracting knowledge from Next Generation Sequencing data are in high demand. In this work, we focus on RNA-seq gene expression analysis and specifically on case-control studies with rule-based supervised classification algorithms that build a model able to discriminate cases from controls. State-of-the-art algorithms compute a single classification model that contains few features (genes). In contrast, our goal is to elicit more knowledge by computing many classification models, and therefore to identify most of the genes related to the predicted class.
Results: We propose CAMUR, a new method that extracts multiple and equivalent classification models. CAMUR iteratively computes a rule-based classification model, calculates the power set of the genes present in the rules, iteratively eliminates those combinations from the data set, and repeats the classification procedure until a stopping criterion is met. CAMUR includes an ad-hoc knowledge repository (database) and a querying tool.
We analyze three different types of RNA-seq data sets (Breast, Head and Neck, and Stomach Cancer) from The Cancer Genome Atlas (TCGA) and we validate CAMUR and its models also on non-TCGA data. Our experimental results show the efficacy of CAMUR: we obtain several reliable equivalent classification models, from which the most frequent genes, their relationships, and the relation with a particular cancer are deduced.
Availability: dmb.iasi.cnr.it/camur.php
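The iterative procedure described in the Results can be sketched loosely as follows. This is one possible reading of the description, not CAMUR's actual code: the classifier is passed in as a hypothetical callable `fit_rules`, and the accuracy threshold and model cap used as stopping criteria are assumptions made for illustration.

```python
from itertools import combinations


def nonempty_subsets(genes):
    """Yield every non-empty subset (the power set minus the empty set)."""
    ordered = sorted(genes)
    for size in range(1, len(ordered) + 1):
        for combo in combinations(ordered, size):
            yield frozenset(combo)


def extract_models(all_genes, fit_rules, min_accuracy=0.8, max_models=50):
    """Collect several classification models by repeated gene elimination.

    fit_rules(available_genes) is a hypothetical stand-in for a rule-based
    classifier; it must return (genes_used_in_rules, accuracy).
    """
    models = []
    seen_exclusions = set()
    queue = [frozenset()]  # each entry is a set of genes removed from the data
    while queue and len(models) < max_models:
        excluded = queue.pop(0)
        if excluded in seen_exclusions:
            continue
        seen_exclusions.add(excluded)
        available = [g for g in all_genes if g not in excluded]
        if not available:
            continue
        used, accuracy = fit_rules(available)
        used = frozenset(used)
        if not used or accuracy < min_accuracy:
            continue  # assumed stopping criterion for this branch
        if all(used != genes for genes, _ in models):
            models.append((used, accuracy))
        # Remove each combination of the rule genes and classify again.
        for subset in nonempty_subsets(used):
            queue.append(excluded | subset)
    return models
```

Because the power set grows exponentially with the number of genes in a rule set, a practical run relies on rule models containing only a few genes, which matches the abstract's remark that such models use few features.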
Short Abstract: Hematopoietic stem cell (HSC) gene therapy (GT) applications exploit retroviral vectors, such as HIV-derived lentiviral vectors (LVs), to transduce relevant target cells and insert therapeutic transgenes within the host cell genome. We recently developed a new bioinformatics method that can handle vector integration sites (ISs) landing in repetitive elements (~30% of the overall ISs in GT patients), previously discarded by other tools. Repetitive DNA sequences cover ~50% of the human genome and are associated with species evolution. The identification of ISs in repeats, and consequently the corresponding clones, has a major impact on the clonal abundance estimations, increasing the power of clonal tracking analyses.
In our ongoing clinical trial for metachromatic leukodystrophy, we retrieved more than 5 million ISs from 3 gene therapy patients with a follow-up of 18 months after treatment. The analysis of ISs landing in repetitive elements compared with the whole genome distribution in GT patients revealed an enrichment in Alu-SINE repeats (~23% of total ISs, p<0.0001). Moreover, we observed an under-representation of ISs landing in L1-L2 LINE repeats (5% of the ISs versus 22%, p<0.0001) and in LTR repeats (observed 2.5%, expected 9%, p<0.0001). We also observed an INT-motif (~5% of Alu) TG-(N)5-7-CA in the genomic sequences flanking the IS. The distribution of ISs within the prototype Alu sequence showed the highest peak (14.8%) surrounding the A-Box of the RNA-polymerase-III promoter corresponding to the INT-motif. These results suggest that novel viral integration specificities can be unraveled by the analysis of genomic repeats.
Short Abstract: Genomic studies and high-throughput experiments often produce large lists of candidate genes among which only a few are truly relevant to the disease, phenotype, or biological process of interest. Gene prioritization tackles this problem by profiling candidate genes across multiple genomic data sources and integrating this heterogeneous information into a global ranking. We describe an extended version of our gene prioritization method, Endeavour, now available for 6 species and integrating 75 data sources. Validation of our results indicates that this extended version of Endeavour efficiently prioritizes candidate genes. The Endeavour web server is freely available at https://endeavour.esat.kuleuven.be/
Short Abstract: Biological systems employ multiple levels of regulation that enable them to respond to genetic, epigenetic, genomic, and environmental perturbations. Advances in high-throughput technologies have generated comprehensive datasets measuring multiple aspects of biological regulations. Public databases, such as TCGA (The Cancer Genome Atlas), have been created for depositing diverse types of omics data for public dissemination. However, sample errors, such as sample-swapping or mis-labeling, are inevitable during the process of data generation and management. Because data errors could lead to wrong scientific conclusions, it is critical to properly match different types of omics data pertaining to the same individual before applying integrative analysis.
We applied a systematic alignment method to TCGA datasets. For example, in the breast cancer dataset (BRCA) consisting of ~1000 samples, we detected multiple sample errors in different types of molecular data. In each type of data, about 3-8% of profiles were not consistent with the labels based on the sample barcodes (16 profiles in microarray, 4 in HM27, 18 in HM450, 9 in GAmiRNA, 84 in HiSeq-miRNA, 31 in CNV). Multi-omics alignments identified sample-swapping of the 16 samples in microarray and mis-labeling of the 8 miRNA samples. Errors in gender or sample labeling were also observed in other cancer datasets in TCGA (such as glioblastoma, lung, prostate, stomach). These results suggest that sample errors are not a dataset-specific problem but a more global problem in public databases and, therefore, our approach will provide a critical QC step to clean data for integrative analysis using large-scale datasets.
Short Abstract: Understanding the molecular mechanisms disrupted by mutations that cause human genetic disease is a major challenge in translational bioinformatics. Toward this goal, many studies investigate the structural and functional impacts of amino acid substitutions from protein features and then infer the molecular disruptions caused by disease-causing variants. However, most of these studies either ignore protein 3D structure, or are restricted to specific molecular functions. Therefore, there is a need for large-scale studies that use the structural environment of a residue to analyze the effects of genetic variation that disrupt or introduce a protein functional site. In this study, we collect a data set of germline disease-causing and putatively neutral human sequence variants mapped to protein 3D structures and perform a systematic study of loss and gain of multiple types of functional site as the underlying molecular changes in disease. In particular, we propose a new model to probabilistically reason about function-impacting variants, develop several structure-based functional residue predictors, and assess the impact of disease-associated substitutions on metal binding, post-translational modifications, catalytic activity, macromolecular binding, ligand binding and allosteric regulation. Our results show that a significant fraction of disease-associated human variants are function-altering. Additionally, we use mutagenesis experimental data to demonstrate the feasibility of computationally predicting loss of function events. Finally, we report that our methodology generates confident biological hypotheses for 15% of disease-causing amino acid substitutions and argue that it can be used to guide experimental validation.
Short Abstract: Mouse studies have been instrumental in forming our current understanding of early cell-lineage decisions; however, similar insights into early human development are severely limited. Here we present a comprehensive transcriptional map of human embryo development, including the sequenced transcriptomes of 1,529 individual cells from 88 human preimplantation embryos. These data show that cells undergo an intermediate state of co-expression of lineage-specific genes, followed by a concurrent establishment of the trophectoderm, epiblast and primitive endoderm lineages, which coincides with blastocyst formation. Female cells of all three lineages undergo X-chromosome dosage compensation prior to implantation. In contrast to the mouse, XIST is transcribed from both alleles throughout the progression of this expression dampening, and X-chromosome genes maintain biallelic expression while dosage compensation proceeds. We envision broad utility of this transcriptional atlas in future studies on human development as well as in stem cell research.
Short Abstract: Computerized decision-support systems can improve diagnostic and prognostic accuracy for lung cancer tumor behavior, incorporating routine medical images (CT) and gene expression data. We present the Radiomics Database System (RDS), developed to transform medical images into mineable and meaningful data via image analysis and integration with external data sources. Radiomics generates imaging “biomarkers” that, alongside traditional genomic biomarkers, can connect biological measurements to tumor morphology and characteristics. RDS integrates a research imaging archive, image metadata, image features, tumor data, patient outcomes and demographics, and gene expression profiles. RDS uses an ETL (Extract Transform Load) pipeline to extract and integrate data from different sources containing molecular data correlated with derived imaging features and clinical outcomes. In-house tools were developed to extract image metadata from DICOM headers and to identify 24 2D and 32 3D unique tumor features that represent tumor behavior.
We present a novel system that both stores integrated molecular and radiomics data, and allows users to define cohorts to compare imaging features with clinical characteristics through a Web-based query tool. Further, imaging features can be summarized and visualized as quantitative biomarkers, similar to cBioPortal tools for visualizing gene expression. Currently, the RDS holds integrated information on 235 patients diagnosed with lung adenocarcinoma, including gene expression array data, clinical outcomes, 56 tumor features, and ~400 imaging biomarkers. This dataset allows users to visualize the imaging, compare imaging features to outcomes, and identify relevant cohorts for further study.
Short Abstract: We present an original roadmap to perform the bio-mechanical characterization of proteins using the Static Mode method and the FleXible software. We will show, through various examples, how this approach allows the mapping of residues that are relevant for the enzymatic activity and how we can characterize and predict mutations that impact this activity. We will focus in particular on Ras, which has been determined to be a key signaling protein in cellular growth pathways. In its wild-type form, it acts as a transducer, switching from the on to the off conformation through GTP hydrolysis. A mutation of the glutamine at position 61 drastically diminishes its GTP hydrolysis rate, which has been associated with abnormal cellular growth leading to the development of cancer cells. In particular, the Q61R Ras mutant is found in more than 25% of malignant melanomas. We aim to target Ras oncomutants numerically in order to determine the mechanisms that lead to their malfunctioning and the way to chemically counteract them. For this purpose, using ab-initio data from QM/MM simulations, Ras oncoproteins are each separately bio-mechanically characterized and, thus, relevant criteria for GTP hydrolysis are inferred with FleXible, software developed in-house at LAAS and based on the Static Mode method.
Short Abstract: Prostate cancer is the most prevalent cancer among men in the US, and patients often face unnecessary surgeries because biomarkers and the classification models built on them often have poor discrimination accuracy. One layer of gene regulation that has been undervalued for years but is now emerging as associated with prostate cancer is lncRNA-mediated gene regulation. Individual long non-protein coding RNAs (lncRNAs) have been linked to cancer-specific deaths [Prensner et al. 2013 Nature Genetics], but a robust systems-biology model for abnormal gene regulation in prostate cancer including lncRNAs is yet to be found. Assessing expression variation of protein-coding and non-protein coding RNAs with high-throughput methods is a promising approach toward such models, but is accompanied by the problem of high dimensionality. The number of differentially regulated genes exceeds the number of available samples by several orders of magnitude. Reliable models can only be derived through a reasonably small parameter set. Therefore, genes showing correlated expression patterns are grouped into clusters to reduce dimensionality. One prominent method for such an approach is weighted correlation network analysis as implemented in the R package WGCNA [Zhang et al. 2005 Statistical Applications in Genetics and Molecular Biology].
We used WGCNA to generate a co-expression network model of protein-coding and non-protein coding genes in 124 prostate cancer samples and were able to assign selected lncRNAs to cancer-related pathways. Modules of co-expressed genes were correlated to different clinical metadata (traits) of interest and modules were detected that are significantly correlated with cancer-specific deaths.
Short Abstract: Heterogeneous tissues, like blood, contain a mixture of cells. Thus, measuring differences in gene expression (e.g., between healthy controls vs. disease cases) in heterogeneous tissues represents a challenge in deciding which cell type contributes to the observed gene expression differences. Given heterogeneous gene expression measures and a matrix of cell proportions, the contribution of gene expression from each cell type to the heterogeneous gene expression level can be modeled using linear regression (LR). Modeling cell type-specific gene expression separately for a group of healthy controls and a group of patients (cases) allows identifying cell type-specific gene expression differences. However, several parameters affect the power of LR in detecting cell type-specific gene expression differences. Specifically, group size, multicollinearity (interdependence) among cell proportions (quantified by variance inflation factor, VIF), and “goodness of fit” of linear regression for each gene (quantified by mean squared error, MSE) affect the power of cell type-specific differential expression analysis. Our findings suggest that the power of LR-based cell type-specific differential expression needs to be assessed on a gene-by-gene basis. Using Random Forest classification we outline a set of rules for LR-based differential expression analysis given gene-specific parameters (VIF, MSE). We compared our method with the csSAM and DSection algorithms. Our method, implemented as an R package, allows thorough but transparent quality control of the LR power in detecting significant cell type-specific differential expression. Applied to any high-throughput data (e.g., methylation level of CpG probes), our method allows maximizing biological understanding of health and disease differences in heterogeneous data.
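The linear regression model described above can be illustrated with a small sketch. It is not the authors' R package: it assumes a no-intercept least-squares fit of one gene's measured expression on the cell proportions, solved through the normal equations, and it defines MSE as the mean of the squared residuals. All function names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: cell proportions are collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_cell_type_expression(proportions, expression):
    """Fit expression ~ proportions without an intercept for one gene.

    proportions: one row per sample, one column per cell type.
    expression: measured heterogeneous expression, one value per sample.
    Returns (estimated expression per cell type, mean squared error).
    """
    k = len(proportions[0])
    xtx = [[sum(row[i] * row[j] for row in proportions) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(proportions, expression))
           for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    residuals = [y - sum(b * p for b, p in zip(beta, row))
                 for row, y in zip(proportions, expression)]
    mse = sum(r * r for r in residuals) / len(residuals)
    return beta, mse
```

Fitting the model once for controls and once for cases gives two sets of per-cell-type estimates whose differences can then be compared. Strongly correlated proportion columns make the normal equations ill-conditioned, which is the multicollinearity problem the abstract quantifies with VIF.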
Short Abstract: The protozoan parasite Leishmania donovani causes visceral leishmaniasis (VL), a disease that is lethal without treatment and affects 500,000 people each year. With only four drugs available and rapidly emerging drug resistance, knowledge about the parasite’s molecular resistance mechanisms is essential to boost the development of new drugs. However, little is known about gene regulation in Leishmania, and the few findings indicate major differences from known gene expression systems. Since integration of different ‘omics could shed light on these regulatory systems, we here present an integrative database that contains and connects all genomics, transcriptomics, proteomics and metabolomics experiments that are currently available for Leishmania donovani and related trypanosomes. Relations between the different ‘omics layers were explicitly defined and provided with a level of confidence. A Python framework was developed to analyse and import all data again from scratch to avoid any data analysis biases. Additionally, we developed a user interface which contains analysis tools for new datasets. These tools use data mining strategies such as frequent itemset mining to link together results from different ‘omics layers. Using the compendium and its tools, we characterized the development and drug resistance of Leishmania donovani in a systems biology context. The genomes of more than 200 strains were examined for associations with phenotypical features and a subset was linked to transcriptomics, proteomics and metabolomics results. The compendium and its associated tools could be used for other organisms with only minor changes.
Short Abstract: Constant improvements in the development of Next Generation Sequencing (NGS) allow exploring whole tumor genomes in different contexts like chromosomal rearrangements and epigenetic markers. One key aspect of tumor genomics is the occurrence of genomic breakpoints (BPs), which can lead to gene fusions resulting in chimeric proteins. An association between chromatin states and BPs has been hypothesized (Berger et al. 2011). In this study, we aim at investigating the association of genetic breakpoints and open chromatin markers in prostate cancer.
The implemented pipeline for this analysis consists of three connected, interacting modules: feature analysis, statistical analysis, and simulation analysis. A systematic evaluation of the enrichment and depletion of BPs in the vicinity of chromatin immunoprecipitation sequencing (ChIP-seq) peaks was performed using BEDTools (Quinlan and Hall 2010) and the binomial hypothesis test. Interestingly, our results showed that binding sites of the androgen receptor are enriched in the subgroup containing the fusion gene TMPRSS2-ERG, whereas the other subgroup showed depletion in the vicinity of closed chromatin.
We validated the implemented pipeline on prostate cancer cell lines as well as simulated data. In summary, our analysis pipeline revealed strong associations between chromatin markers and genetic breakpoints in tumor cell lines. It can also be extended to analyze BPs in different contexts such as CpG islands and repeats or analyzing epigenetic properties of viral insertion sites.
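The binomial hypothesis test mentioned above can be sketched in a few lines. This covers only the statistical step, not the BEDTools overlap counting, and it rests on an assumption that is not stated in the abstract: under the null hypothesis, each breakpoint falls near a peak with probability equal to the fraction of the genome covered by the peak windows. The counts in the example are hypothetical.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * (p ** k) * ((1.0 - p) ** (n - k))


def binomial_tail_pvalues(k, n, p):
    """Return (P[X >= k], P[X <= k]) for X ~ Binomial(n, p).

    The first value is a one-sided p-value for enrichment and the second
    a one-sided p-value for depletion. Direct summation is adequate for
    a few hundred breakpoints; larger n would call for log-space
    arithmetic or a statistics library.
    """
    enrichment = sum(binom_pmf(i, n, p) for i in range(k, n + 1))
    depletion = sum(binom_pmf(i, n, p) for i in range(0, k + 1))
    return min(enrichment, 1.0), min(depletion, 1.0)


# Hypothetical numbers: 200 breakpoints, 30 of them within the peak
# windows, and peak windows covering 5% of the genome.
p_enriched, p_depleted = binomial_tail_pvalues(k=30, n=200, p=0.05)
```

With an expectation of 10 breakpoints near peaks under the null, observing 30 gives a very small enrichment p-value and a depletion p-value close to 1.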
Short Abstract: Genome-wide association studies (GWAS) require a very large number of samples to identify single nucleotide polymorphisms (SNPs) significantly linked with the trait under study. Most of these studies leave moderate signals unexplored. Set-enrichment analysis can reveal SNPs and pathways among peri-significant associations which otherwise may have been ignored. Here, we present SNP-based genomicper: a genomic permutation algorithm which uses the SNP association p-values to evaluate the significance of pre-defined pathways. The GWAS SNP p-values are ordered in a “circular genome” according to their genomic position. In each permutation, the complete set of SNP p-values rotates with respect to their genomic locations, assigning new simulated association p-values for each permutation. The permutations allow us to calculate the empirical SNP-set p-value through a count distribution, while the genomic structure and functional correlations among the SNPs are preserved.
As a proof of concept, we applied the genomic permutation approach to a GWAS in hypertension. This allowed us to identify associations at a pathway level, which may have not been identified by traditional methods. We were also able to identify associations in genes known to be strongly associated with blood pressure traits. Genomicper allows us to investigate the enrichment of moderate GWAS signals in pathways, leading to a better understanding of the complex genomic architecture of the trait.
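The circular permutation idea can be sketched as follows. This is not the genomicper implementation: the significance threshold used to count hits in a SNP set, the add-one correction in the empirical p-value, and all function names are assumptions made for illustration. Only the rotation of the complete, position-ordered p-value vector comes from the description above.

```python
import random


def rotate(values, offset):
    """Rotate circularly so that values[i] moves to index (i + offset) % n."""
    n = len(values)
    offset %= n
    return values[-offset:] + values[:-offset] if offset else list(values)


def count_hits(pvalues, snp_indices, threshold):
    """Number of SNPs in the set whose p-value is at or below the threshold."""
    return sum(1 for i in snp_indices if pvalues[i] <= threshold)


def empirical_set_pvalue(pvalues, snp_indices, threshold=0.05,
                         n_perm=1000, seed=0):
    """Empirical p-value for a SNP set under circular genomic permutation.

    pvalues must be ordered by genomic position (at least two SNPs).
    Rotating the whole vector keeps neighbouring p-values together, so
    local correlation structure is preserved in every permutation.
    """
    rng = random.Random(seed)
    observed = count_hits(pvalues, snp_indices, threshold)
    n = len(pvalues)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        offset = rng.randrange(1, n)  # any non-zero rotation
        permuted = rotate(pvalues, offset)
        if count_hits(permuted, snp_indices, threshold) >= observed:
            at_least_as_extreme += 1
    return observed, (at_least_as_extreme + 1) / (n_perm + 1)
```

A pathway's SNP set would be passed as the indices of its SNPs in the position-ordered vector, and the same rotations could be reused across pathways to build the count distribution for each of them.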
Short Abstract: Chronic inflammatory autoimmune diseases show an increasing incidence and a still unknown etiopathogenesis. However, the microbiome is known to have a role in these diseases, and is associated with immune responses gone awry. As this function is also carried out by blood cells, we here present a methodology to explore the crosstalk between the gastrointestinal (GI) microbiome and the immune response in blood. We apply our method to rheumatoid arthritis (RA) modeled in rats with collagen-induced arthritis (CIA). This allows us to highlight that the beneficial control of sphingolipid metabolism observable in our experiment (with sphingosine-1-phosphate being a known up-regulated factor in RA) is synergistically promoted by host blood and microbial factors, namely Ppap2c and Akkermansia.
Short Abstract: Background
Synergistic drug combinations can increase the number of selective therapies using the current pharmacopeia, and overcome unwanted off-target effects that limit the utility of many potential drugs. Identification of synergistic compound pairs has been one of the essential tasks in cancer therapy research. Although computational methods for predicting compound synergy can potentially complement high-throughput synergy screens, the few that have been published lack rigorous experimental validation or are appropriate only for compounds that modulate well-studied molecular pathways or that are equivalent to previously established combinations. A key but inadequately addressed issue is how to predict compound synergy from molecular profiles of single compound activity and generalize to arbitrary compound combinations, even in the absence of monotherapy data. We therefore propose a new strategy to improve the prediction performance.
Methods and results
Both the area under the dose-response curve and drug targets were used to select the drug-effect-relevant molecular features. The features were then combined into drug-combination similarity profiles. Random Forest was applied to train regression models using synergy-relevant gene expression, methylation, CNVs and mutations across cell lines, respectively. Stacking regression was further conducted to integrate the Random Forest models to improve the prediction precision and stability. We predicted the synergy of 537 combinations in 85 cell lines. The predictions were highly correlated with the known experimental results.
Conclusion
The proposed strategy is valid for the de novo prediction of drug combination synergy for unobserved pairs of compounds, and thus may accelerate the development of precision anti-cancer therapy.
Short Abstract: Epidermal growth factor receptor (EGFR) mutation is a pathogenic factor of non-small cell lung cancer (NSCLC). Tyrosine kinase inhibitors (TKIs), such as gefitinib, are widely used in NSCLC treatment. In this work, we investigated the relationship between the number of EGFR residues connected with gefitinib and the response level for each EGFR mutation type. Three-dimensional trimmed Delaunay triangulation was applied to construct connections between EGFR residues and gefitinib atoms. Through molecular dynamics (MD) simulations, we discovered that when the number of EGFR residues connected with gefitinib increases, the response level of the corresponding EGFR mutation tends to decrease.
Short Abstract: Prediction problems in biomedical sciences, such as those posed in DREAM challenges, are notoriously difficult. This is due to incomplete knowledge of the target biomedical phenomena, the appropriateness and quality of the variables and measurements used for prediction, and a lack of consensus regarding the ideal predictor(s) for specific problems. These issues are reflected in the diversity of prediction techniques, datasets, domain knowledge and other ingredients used to develop submissions to DREAM challenges. In such scenarios, a powerful approach to improving prediction performance is to construct ensemble predictors that combine individual predictors. Traditional ensemble methods like boosting and random forest are insufficient for this task as they (generally) assume that the individual/base predictors are of the same type. We propose the use of heterogeneous ensemble methods, such as stacking, for building ensembles for DREAM challenges and other biomedical prediction problems, using a large diverse set of potentially independent base predictors. On several protein function and genetic interaction prediction datasets, we illustrate that such ensembles can provide significant gains over individual predictors, including boosting and random forests. Deeper analysis shows that the superior predictive ability of these methods, especially stacking, can be attributed to: effective balancing of ensemble diversity and performance, effective calibration of outputs, and robust incorporation of additional individual predictors. Motivated by these results, we built stacking-based ensembles for the Rheumatoid Arthritis anti-TNF drug response prediction challenge. Using only six of the individual predictors, these ensembles (AUPR=0.5228) provided prediction gains over the two best individual predictors (AUPR=0.5099 and 0.5071).
Short Abstract: Targeting kinases for cancer therapy is an active area of research made difficult because the binding sites of kinase inhibitors share a highly similar folding scaffold that binds ATP. Thus, for lead discovery with desired selectivity, the first priority is to determine ligand-binding profiles across the whole kinome. Experimentally, different kinase-profiling assay technologies have been developed. However, they are both time-consuming and expensive when screening vast numbers of small molecules. As an alternative, we present an in silico approach to predict ligand binding selectivity across the whole human kinome for any given compound using structural systems pharmacology. We utilize all human kinase-ligand co-crystal structures (208 kinases, 2383 structures) to develop a function-site interaction fingerprint technique. Then we build a support vector machine (SVM) predictive model using the fingerprint as features. In a prediction for 51 type-I inhibitors against 344 kinases, for which experimental data are also available, the results show that our SVM model has a promising predictive performance for lead discovery and polypharmacology of kinase inhibitors. We also assembled 344 kinase structures (344 kinases) from the Protein Data Bank and Protein Model Portal as a dataset of kinome targets for screening. We believe that our in silico method provides us with an efficient and inexpensive tool for kinome-wide high-throughput screening. A similar strategy can be extended to other gene families beyond kinases [1].
[1]. Zheng Zhao, Li Xie, Lei Xie, Philip E. Bourne. Delineation of polypharmacology across the human structural kinome using a functional site interaction fingerprint approach. J Med Chem. 2016. DOI: 10.1021/acs.jmedchem.5b02041.
Short Abstract: So far, bioinformatics has focused on omics that are mainly related to the human host (e.g., genomics, transcriptomics, proteomics), especially when studying disease. However, recent studies have highlighted the relevance of environmental exposure and its influence on disease. Therefore, in order to understand the causes and eventually be able to prevent disease, the effect of the environment on health must be investigated.
In 2005, Dr. Christopher Wild first defined the term exposome. The original concept has evolved to comprise “every exposure to which an individual is subjected from conception to death, requiring consideration of the nature of the exposures and their changes”, conceptualizing the environment’s contribution to health. In order to describe the Exposome, accurately measuring these exposures and their effect on human health is crucial. Omic technologies, such as metabolomics or metagenomics, seem to have the potential to further our understanding of disease causation and progression. However, this is not a trivial task. An individual’s exposome is extremely complex and dynamic throughout a lifetime, and the impact of exposures can also change depending on an individual’s stage of life.
With the aim of considering all the factors that characterize disease predisposition and progression, a standard modeling procedure is required. Hence, an inductive approach for exposome data mapping must be developed. Integrating the Exposome in the current genome-phenome picture will pave the way towards biomarker discovery and, as a future result, prevention and better treatment of disease.
Short Abstract: A cornerstone of modern biomedical research is the use of mouse models to explore basic disease mechanisms, evaluate new therapeutic approaches, and make decisions to carry new drug candidates forward into clinical trials. However, few of these human trials have shown success. We reported previously that, although acute inflammatory stresses from different etiologies result in highly similar genomic responses in humans, the responses of murine ortholog genes correlated poorly with their human counterparts in inflammatory diseases.
A vibrant discussion of the merits and limitations of animal models is long overdue. In order to help the research community to better explore the similarities and differences between human diseases and murine models and to better translate findings from disease models, we developed a model validation database of human inflammatory diseases (MOVD). MOVD compares the genomic response of 6 human inflammatory diseases and conditions (Burns, Trauma, Infection, Sepsis, Endotoxin and Acute respiratory distress syndrome) and matched mouse models, using 2,257 curated samples from our studies and other representative studies in Gene Expression Omnibus. A researcher can browse, query, visualize, and compare the response patterns of genes, pathways and functional modules across different diseases and corresponding murine models. In addition, a set of interactive visualization and analysis web-tools allow the comparisons of user’s own datasets of animal models with the datasets of human diseases.
Short Abstract: The utility of tumor-derived cell lines is dependent on their ability to recapitulate the underlying genomic aberrations found in primary tumor biology. Here, we analyzed the exome sequences of 25 bladder cancer (BCa) cell lines and compared mutations, copy number alterations, gene expression and drug response to BCa patient samples in The Cancer Genome Atlas (TCGA). We show that the genomic aberrations found in BCa cell lines mimic patient samples, including similar mutation patterns associated with altered CpGs and APOBEC-family cytosine deaminases, activating mutations in the TERT promoter, mutations in known BCa-associated genes (TP53, RB1, CDKN2A and TSC1), and alterations in chromatin-associated proteins (MLL3, ARID1A, CHD6 and KDM6A). We confirmed non-silent sequence alterations in 76 cancer-associated genes. Next, we used PARADIGM to infer pathway activities for cisplatin-treated BCa cell lines based on the cell lines’ gene expression and copy number data. We used the inferred pathway activities to build a predictive model of platinum drug response. The predictive model was based on an elastic net regression, which provided an implicit feature selection that identified important pathway concepts relevant to cisplatin response. When applied to BCa patients gathered from TCGA, the model predicted overall response, showing a clear separation in survival of predicted nonresponders vs predicted responders in the platinum-treated patient cohort (p=0.05) and no separation in the untreated patient cohort (p=0.62). Together, these data and predictive models represent a valuable community resource to model basic tumor biology and to study the pharmacogenomics of BCa.
Short Abstract: Recursion has more than 40 TB of high-resolution images of human cells, some diseased, some healthy, which we use to seek treatments for rare genetic diseases. There are more than 5,000 untreated rare genetic diseases, which together affect nearly ten million people in the US alone. Each of these diseases affects too few people for traditional pharmaceutical companies to approach them, so we're building a way to seek treatments for hundreds of these diseases in parallel.
We grow human cells, rapidly build in-vitro models of thousands of rare diseases using RNAi and CRISPR to make genetic perturbations, take pictures of them using high-throughput automated microscopes, computationally extract 1000 structural features like shapes and textures from every cell, and quantify the structural differences that separate diseased from healthy cells. We then apply thousands of drugs to the cells corresponding to each disease, take pictures, and identify drugs that make the cells look healthy again using statistical and machine learning approaches. All this would be impossible without the recent advancements in the fields of imaging, computer vision, and machine learning we have applied to this problem, which will be visually illustrated in the poster.
We currently have one candidate moving towards clinical trials, several other leads in validation, and drug screens for dozens more diseases scheduled for 2016.
Short Abstract: We developed an analysis protocol for individual genome interpretation and used its distinctive features to diagnose numerous clinical cases. We applied the protocol to exomes from newborn patients with undiagnosed primary immune disorders. To yield high quality sets of possible causative variants, we used multiple callers with multisample calling and integrated variant annotation, variant filtering, and gene prioritization.
Our protocol has been similarly revealing in other SCID and CID cases including early diagnosis of Ataxia Telangiectasia, Nijmegen Breakage Syndrome, as well as several novel syndromes. These cases highlight unique features of the analysis framework that facilitate genetic discovery. These help provide crucial information to offer prompt appropriate treatment, family genetic counseling, and avoidance of diagnostic odyssey.
Short Abstract: Integrating cancer genomic data from diverse experimental sources (e.g. in vitro, in vivo, or ex vivo biological samples) and platforms (e.g. microarrays, RNA-seq) poses specific technical challenges due to batch effects, tumor heterogeneity, and ambiguous subtyping. Triple Negative Breast Cancer (TNBC) is especially difficult to treat because of rapid propagation and clonal variance. A potential avenue for improving TNBC treatment is to determine more finely grained subtypes of TNBCs that respond differently to treatments. While a large corpus of TNBC data is available from heterogeneous data sources, conventional unsupervised analysis approaches cannot distinguish between noise related to experimental and platform disparities and biologically relevant signal. We have built a computational framework that uses statistically sound data mining techniques to analyze and compare gene expression signatures gathered from multiple studies of TNBC including patient derived xenograft (PDX) models, cell lines, and primary patient biopsies. By utilizing these methods our pipeline elucidates better potential subtyping on diverse compendia of TNBC data. Our results demonstrate the challenges of heterogeneous data integration, but also reveal potential insights into techniques that can allow comparisons across samples generated by multiple labs using different experimental protocols and data collection platforms. In addition to comparisons among TNBC subtypes, we were also able to cluster and compare TNBC data with other tumor types from 462 PDX models to assess more global patterns and trends, possibly related to tumor microenvironment and the PDX process. Continuing work focuses on correlating TNBC subtypes with tumor response profiles for different therapeutics.
Short Abstract: Malaria is among the most burdensome diseases, with a high rate of resistance outbreaks. The recent report on the emergence of resistance to the highly effective artemisinin-based combination therapy (ACT) regimen makes it important to identify chemical classes and parasite targets that have not previously been exploited in antimalarial chemotherapy. Malarial aminopeptidases have emerged as promising new drug targets for the development of novel antimalarial drugs. Studies showed that M18AAP of Plasmodium falciparum (PfM18AAP) has highly restricted specificity for peptides with an N-terminal Glu or Asp residue. Thus, the enzyme may function effectively in the complete degradation or turnover of proteins, such as host hemoglobin, alongside other aminopeptidases, providing a free amino acid pool for the growing parasite. Inhibition of PfM18AAP's function using antisense RNA is harmful to the intra-erythrocytic malaria parasite, and it has therefore been proposed as a potential novel drug target. In our work, we studied the structure-activity relationship of PfA-M18 inhibitors by identifying structurally novel, diverse compounds through structure-based virtual screening with AutoDock. Natural product molecules from various publicly available databases were screened against PfM18AAP, and the 20 ligands with the highest docking scores were selected. Further protein-ligand interaction studies were carried out using LigPlot and PyMOL. Finally, the screened compounds were analyzed on the basis of energetics, stereochemical considerations and ADMET properties. The proposed novel lead molecules might cure malaria by blocking parasitic growth through inhibition of the M18 enzyme.
Short Abstract: The laboratory mouse is used extensively as a model system to investigate the etiopathogenesis of human disease based on its extensive genetics, fully-sequenced genome, published mouse mutations and large-scale mutagenesis programs that have created point mutations (by ENU and CRISPR) and knock-out (null) mutations covering most of its genome. In addition, programs to develop genetically-defined variant populations mimicking human population variation provide new means to study quantitative traits and complex inherited syndromes. This genomic knowledge, together with increasing phenotyping precision, adds intricate detail to knowledge about morphological and physiological variation in mouse.
Mouse Genome Informatics (MGI, www.informatics.jax.org) has recently revised the gene-level presentation of this phenotype and disease model data. In addition to information about the number of gene alleles and variants, the number of phenotype annotations from single and multigenic genotypes, images and references, we now provide ‘at-a-glance’ yet comprehensive graphical displays of the systems affected by variants in genes, with quick links to more detailed information. A section on Human Diseases gives a quick summary of mouse models in orthologs of human genes.
In addition to the more than 46,000 mutant alleles in mice representing almost 12,000 genes, together with QTL and transgenes that have been published or submitted to MGI, the large scale mouse phenotyping projects are set to deliver phenotype data for thousands more genes. MGI provides a new data model for clear and comprehensive gene summary views of simple mutant genotypes to conditional and more complex genome rearrangements.
Funding: NHGRI HG000330 and OD011190-05
Short Abstract: The cellular functions of the molecular elements (genes, proteins, noncoding RNAs, etc.) of biological cells are strongly associated with cell development, malfunctions, and genetic disorder pathways. Moreover, those molecular components cooperate with each other, establishing a complex intertwined regulatory network that governs regular cellular pathways as well as the dysregulation or malfunction in pathological processes. Therefore, revealing these critical regulatory interactions in complex living systems is considered one of the main goals of modern systems biology.
Advances in next generation sequencing technologies enable the generation of high-throughput datasets that allow for genome-wide association studies and facilitate dissecting the regulation patterns between various molecular elements. In the light of the availability of genomic, transcriptomic, and epigenomic data from different sources and experiments, new integrative approaches are needed to boost the probability of identifying genetic key players and critical regulatory pathways that could drive complex diseases and tumorigenesis.
To this end, we present here three recently published computational approaches that were implemented as freely available software tools to integrate heterogeneous sources of large-scale omics data and unravel the combinatorial regulatory links between different molecular elements. We demonstrated the efficacy of our tools on breast cancer omics data that were downloaded from the TCGA portal. Finally, the provided topological and functional analyses of our approaches promote them as reliable bioinformatics tools for researchers across the life science communities.
Short Abstract: Concern about the reproducibility and reliability of biomedical research has been rising. A bedrock principle of research conduct is that the samples analyzed are correctly identified and not mixed up during processing, but this has rarely been assessed formally.
Here we studied the prevalence of sample misannotation in a large corpus of genomics studies by comparing meta-data annotations of sex to predictions from expression of sex-specific genes. We identified apparently misannotated samples in 46% of the datasets sampled. Extrapolating beyond our corpus, we estimate that at least 33% of all studies have at least one such mix-up (99% confidence interval). Because this method can only identify a subclass of potential misannotations, this provides a conservative estimate for the breadth of the problem. In an additional set of studies that used samples from the same subjects, 2/4 had misannotated samples. These misannotations are likely to result from laboratory mix-ups rather than subject meta-data collection errors.
Our findings emphasize the need for genomics researchers to implement more stringent sample tracking and data quality control steps, and suggest that re-use of published data should be done in conjunction with careful re-examination of meta-data.
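The sex-based consistency check described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the abstract does not name the genes or the decision rule it used, so the marker genes (XIST, RPS4Y1, KDM5D), the "higher mean wins" rule, and the sample records below are all assumptions.

```python
from statistics import mean

# Assumed marker genes: the abstract does not say which sex-specific genes were used.
FEMALE_MARKERS = ["XIST"]
MALE_MARKERS = ["RPS4Y1", "KDM5D"]


def predict_sex(expression):
    """Predict sex from one sample's expression dict (gene -> log-scale value).

    Assumed rule: whichever marker set has the higher mean expression wins.
    """
    female_score = mean(expression[g] for g in FEMALE_MARKERS)
    male_score = mean(expression[g] for g in MALE_MARKERS)
    return "F" if female_score > male_score else "M"


def find_discordant(samples):
    """Return IDs of samples whose annotated sex disagrees with the prediction."""
    return [
        sample_id
        for sample_id, (annotated_sex, expression) in samples.items()
        if predict_sex(expression) != annotated_sex
    ]


# Hypothetical samples: ID -> (annotated sex, expression values).
samples = {
    "s1": ("F", {"XIST": 9.1, "RPS4Y1": 2.0, "KDM5D": 1.8}),
    "s2": ("M", {"XIST": 2.2, "RPS4Y1": 8.7, "KDM5D": 7.9}),
    "s3": ("M", {"XIST": 9.4, "RPS4Y1": 2.1, "KDM5D": 2.3}),  # annotation and expression disagree
}

print(find_discordant(samples))  # ['s3']
```

A check of this kind only catches mix-ups between samples of different sex, which is why the resulting estimate is conservative.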
Short Abstract: The NIH’s BRAIN (Brain Research through Advancing Innovative Neurotechnologies) initiative is part of a presidential directive to harness and develop technologies to better understand the brain. An important component is description of transcript activity at a single-cell level within a spatial and temporal context. An essential part of leveraging new technologies is the centralization and dissemination of research data and results. The Broad Institute’s Single-Cell Portal is designed to create a space where the community can interact with and contribute to scientific studies. We describe the currently available functionality and initial workflows including uploading, downloading, and interacting with single-cell datasets. Studies hosted by the portal have several levels of visualization. Initially, a selected study is presented as a global ordination of cells (t-SNE on PCA components from single-cell gene expression) with annotated clusters and a short description to orient the user. From this view of the study, clusters can be selected and explored as clustered heat maps of single-cell gene expression highlighting groups of genes whose expression may drive and discriminate cell clusters. For additional resolution, the expression of an individual gene across the study’s cell clusters can then be explored, as well as the expression levels of the gene in the original study ordination. The metadata, expression matrices, and visualizations associated with the study can be uploaded and downloaded through the portal. For further upstream analysis, links to the primary sequencing data are made available.
Short Abstract: Zika virus (ZIKV) is an emerging mosquito-borne flavivirus, first isolated in 1947 from the serum of a pyrexial rhesus monkey caged in the Zika Forest (Uganda/Africa). In 2007 ZIKV was reported to be responsible for an outbreak of relatively mild disease on Yap Island in the western Pacific Ocean. In the past year, ZIKV has been circulating in the Americas, probably introduced through Easter Island (Chile), by French Polynesians. In early 2015, a new outbreak was recognized in northeast Brazil, where concerns over its possible links with infant microcephaly have been discussed. Providing a definitive link between ZIKV infection and birth defects is still a big challenge. Small noncoding RNAs (small ncRNAs) play important roles in biological processes, mainly regulating post-transcriptional gene expression through mechanisms of translation repression and gene silencing. It is well known that some classes of small ncRNA are able to influence viral pathogenesis and brain development. The potential for flavivirus-mediated small ncRNA signaling dysfunction in brain-tissue development provides a compelling mechanism underlying the perceived link between ZIKV and microcephaly. A collaborative database called ZIKV-CDB has been assembled that could help target mechanistic investigations of this possible relationship between ZIKV symptoms and small ncRNA-mediated human gene expression control, helping to foster potential targets for therapy. The database is under development, but already includes predicted miRNAs involved in ZIKV/human-host interaction, and is available at http://zikadb.cpqrr.fiocruz.br.
Short Abstract: Cell-types of the brain are often poorly characterized. Most gene expression studies that attempt to do so focus on a small number of related cell-types and the differences between them. To characterize a cell-type precisely, it has to be compared to all other cell-types in the relevant region. Here, we compiled a database of cell-type specific expression profiles from ~20 sources in order to select marker genes for each of them. The datasets included are manually screened to eliminate contaminated and low quality samples. To select reliable and relevant marker genes, we developed a clustering-based method that takes spatial co-existence of the cell-types into account.
Marker genes that we found through our analysis include ones believed to be absent from the brain, and ones that offer biological insights into the inner workings of cell-types. Additionally, we found that several widely used markers lack specificity and sensitivity and hence should be used with caution.
We used a variety of methods and independent datasets (bulk tissue expression data, single cell RNA-seq) from the literature and in-situ hybridization to validate our marker genes in both mice and man to ensure robustness of the selected genes.
Finally we show that the marker gene expression in whole tissue samples can be used to estimate cell-type proportions by analysing whole tissue datasets of different brain regions with known cell type proportion differences and of neurological diseases with known effects on cell type proportions.
Short Abstract: The PDGFRβ gene is expressed in mural cells of the vasculature, including the mesangium of the kidney. The mesangium is of particular interest due to its role in renal scarring. Very little is known about the mesangium, and the characteristics of this cell type are difficult to investigate as no known marker specific to the mesangium exists. The PDGFRβ gene expresses a transmembrane protein that is used to label mesangial cells; however, many other cell types in the kidney also express PDGFRβ and are consequently labeled using this approach. For this reason, a marker specific to the mesangium is desired to assist investigations of the characteristics and function of the mesangium.
To better understand the regulation of PDGFRβ, candidate cis-regulatory modules (CRMs) were identified upstream and in the first intron of PDGFRβ using the computational tool GAMI (Genetic Algorithms for Motif Inference). These candidate CRMs were then compared to epigenetic data made available by the ENCODE project. This allowed for the identification of CRMs that were predicted to be active in the embryonic kidney of Mus musculus. CRMs were synthesized and reporter constructs were sub-cloned downstream of CRMs. These constructs were then micro-injected into M. musculus embryos, where they were integrated into the host genome. Transgenic embryos were then harvested at E17.5 and screened for reporter activity. This revealed reporter activity specific to the glomeruli. These results suggest that the computationally derived CRMs for PDGFRβ assessed here are biologically active and specific to the glomeruli of the M. musculus kidney.
Short Abstract: Despite advances in cancer diagnosis and treatment strategies, it has been difficult to identify robust prognostic signatures in cancer. Cell proliferation has long been recognized as a prognostic marker in cancer, but has not been investigated across multiple cancers using tissue-based RNA sequencing. Here we explore the role of cell proliferation across 19 cancers (n=6,312 patients) from The Cancer Genome Atlas project by employing a ‘proliferative index’ derived from gene expression associated with PCNA expression. This proliferative index is significantly associated with patient survival (Cox, p-value<0.05) in 8/19 cancers, which we have defined as ‘proliferation-informative cancers’ (PICs). In PICs the proliferative index is strongly correlated with tumor stage and nodal invasion. Furthermore, PICs demonstrate lower proliferation machinery expression relative to other cancers (Spearman, p=1.76E-23). Transcriptome-wide predictive survival modeling using multivariate Cox regression with L1-penalized log partial likelihood (LASSO) for feature selection outperformed the ‘proliferative-index’ in 18/19 cancers. Survival-associated expression patterns were relatively unique between cancers; however, PICs have a common survival signature of 86 genes (Cox, p<0.05 across all 8 cancers). Additionally, we find that proliferative index is significantly associated with somatic mutation burden (Spearman, p=1.76E-23). This study presents new cancers for which cell proliferation may be an important prognostic marker, but demonstrates that modern machine learning techniques can identify survival models more predictive than and independent of proliferative index for most cancers. We also present evidence for cell proliferation as a proxy for clinical parameters and confirm an association between cell proliferation and somatic mutation burden across cancers.
Short Abstract: Despite the recent resurgence in productivity, drug development remains an incredibly costly task. There is a general acceptance that there is a need for complementary approaches to the current paradigm of R&D. One such complementary approach is that of drug repositioning, which focuses on the identification of novel uses for existing drugs. Marketed examples of repositioned drugs include those identified through serendipitous or rational observations, highlighting the need for systematic methodologies. Systems approaches have the potential to enable the development of novel methods to understand the action of therapeutic compounds, but require an integrative approach to biological data. Here, we present DReNIn, an integrated RDF dataset for drug repositioning. DReNIn integrates data from over 20 sources describing drugs in relation to their effect on targets and diseases associated with Homo sapiens. A SPARQL endpoint for querying DReNIn is also provided. Furthermore, we introduce DReSMin, an exact exhaustive algorithm developed for the identification of connected sub-components within a target graph, semantic subgraphs. Instances of semantic subgraphs allow for the inference of novel interactions not immediately evident in a network. Finally, we introduce two applications that make use of DReSMin. The first of these approaches infers novel drug-target associations, whilst the second uses gene-disease associations, along with other relevant data, to infer novel drug-disease associations. It is hoped that the work presented here will provide useful data sources and tools for the wider community, enabling the identification of drug repositioning opportunities for novel disease treatments.
Short Abstract: Background:
Large-scale efforts to measure genomic patterns across several cancer types have helped to identify the genetic diversity of cancer. When variance within a single condition is large, or samples are highly correlated, as in cancer, fixed-effects models incur more false-positives.
Description:
Our method identifies differentially-expressed genes between cancer types with high expression variability by using a mixed-effects model that incorporates relatedness between samples to account for variance within cancer types. Relatedness is the correlation of somatic and germline variants.
We validated the method on simulated and real data and compared it against a baseline fixed-effects model. When simulated samples are highly correlated and have >20 samples within a sample group, the mixed-effects model achieves an FPR of 0.012 and an FNR of 0.19. Comparatively, the fixed-effects model has an FPR of 0.02 and an FNR of 0.63. This shows that a mixed-model approach is able to account for structured variability within cancer types and identifies fewer false positives and significantly fewer false negatives.
We applied this on TCGA samples of uterine carcinosarcoma and uterine corpus endometrial carcinoma. Our mixed-effects model identifies 5505 genes that are significantly different between cancers. GO-analysis revealed enrichment for cell adhesion processes, a known difference between epithelial cancers and carcinosarcomas. Conversely, the top differentially-expressed genes from the fixed-effects model were not enriched for cell adhesion related processes.
Conclusion:
We validated our model on simulated data and successfully applied it to real data to identify differentially-expressed genes. We recommend accounting for sample relatedness in comparative analyses, especially in cancer.
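For readers comparing the FPR and FNR figures quoted above, the two rates can be computed from simulated ground truth as in the following sketch. The function name and the toy gene labels are illustrative and not taken from the study.

```python
def error_rates(truth, called):
    """Compute (FPR, FNR) from parallel lists of booleans.

    truth[i]  -- gene i is truly differentially expressed (known in a simulation)
    called[i] -- the model reported gene i as differentially expressed
    """
    tp = sum(1 for t, c in zip(truth, called) if t and c)
    fn = sum(1 for t, c in zip(truth, called) if t and not c)
    fp = sum(1 for t, c in zip(truth, called) if not t and c)
    tn = sum(1 for t, c in zip(truth, called) if not t and not c)
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one true positive and one true negative case")
    fpr = fp / (fp + tn)  # fraction of true nulls that were wrongly called
    fnr = fn / (fn + tp)  # fraction of true signals that were missed
    return fpr, fnr


# Toy example: 4 truly different genes followed by 6 null genes.
truth = [True] * 4 + [False] * 6
called = [True, True, True, False, True, False, False, False, False, False]

fpr, fnr = error_rates(truth, called)
print(f"FPR={fpr:.3f} FNR={fnr:.3f}")  # FPR=0.167 FNR=0.250
```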
Short Abstract: An orphan disease is any disease that affects a small percentage of the population. There are an estimated 8000 orphan diseases in the USA, and most of them are genetic. Orphan diseases are a great burden to patients and society because they commonly afflict people early in life and persist throughout the lifetime. Discovering genes causing these diseases would help biomedical researchers understand pathogenic mechanisms of disease and may enable better diagnosis and treatment. However, the experimental approaches to identify disease-causing genes are time-consuming and labor-intensive. Therefore, developing effective computational algorithms for the prioritization of candidate genes is a critical step in the research pipeline. Fortunately, high-throughput techniques have generated a lot of genome-wide protein-protein interaction (PPI) data. Several computational approaches have been proposed to use PPI data to identify and prioritize candidate disease-causing genes. Nevertheless, these previous approaches have limited accuracy. In this presentation, we report a method using PPI network-based features to discover and rank the candidate disease-causing genes. Furthermore, since genes with similar phenotypes tend to be functionally related, we integrate PPI data with gene ontology (GO) annotations and protein complex data to further improve the performance. Results of 128 orphan diseases with 1184 known disease genes collected from the Orphanet show that our proposed methods outperform existing methods for discovering candidate disease-causing genes. Importantly, in a case study of several orphan eye diseases, the top predictions are consistent with literature reports, suggesting that other prioritized genes from our approach may be excellent candidates for further investigation.
Short Abstract: Motivation:
Recent studies show evidence that non-coding variants may play an important role in disease etiology, including psychiatric disorders. However, prioritizing these less-understood variants and selecting the right candidates for further investigation remains a central challenge in the field. Current tools for annotating non-coding genetic variations provide a general indicator of their deleteriousness, but lack, for example, tissue-specific context that could better illuminate their role in a particular disease.
Results:
In this work, we propose a new machine learning-based approach that relies on tissue-specific data to estimate variant impact on brain tissues. By integrating information from several genome-scale databases, including GTEx and RoadMap Epigenomics, we derive tissue-related features. Using this data representation, we train a predictive model to discriminate variants with prior evidence for brain relevance from variants unlikely to affect the brain. The resulting model predictions, which we call the Brain Relevance Score (BRS), are an estimate of how related a genome position is to the brain. After computing BRS for every nucleotide position in the human genome, we validate it on genomic regions known to be related to psychiatric disorders, such as autism spectrum disorder. We then use BRS as a filter, combine it with state-of-the-art deleteriousness scores (CADD and DANN), and report higher sensitivity in detecting brain-related damaging variants in the Simons Simplex Collection data for autism. The learning framework we demonstrate here is broadly applicable and could be easily adapted for any other tissue beyond the human brain.
Short Abstract: Variation of immune cells across patient blood and biopsy samples provides insights into the immunobiology of autoimmune disorders and infectious diseases. Research has demonstrated the importance of DNA methylation of CpG dinucleotides in defining cellular identity. Here we present a method that utilizes bisulfite sequencing to measure millions of CpG sites to yield a robust platform for resolving blood profiles into their constituent cell types. Through the selection of cell type-specific DNA methylation signatures and multiple linear regression, we demonstrate the reconstruction of cell type proportions from in silico and in vitro cell mixture experiments. We also demonstrate our method on clinical blood samples. Our method provides an approach to quantify cell type proportions with potential clinical applications beyond immunobiology.
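The regression step can be illustrated with a small in silico mixture. The sketch below assumes the mixture's methylation level at each CpG site is a proportion-weighted sum of cell type-specific reference levels, and fits the proportions by ordinary least squares. The reference values, the number of sites, and the final clip-and-renormalize step are assumptions for illustration, not details taken from the abstract.

```python
def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def estimate_proportions(signatures, mixture):
    """Least-squares estimate of cell type proportions.

    signatures[i][k] -- reference methylation fraction at CpG site i in cell type k
    mixture[i]       -- observed methylation fraction at site i in the mixed sample
    """
    k = len(signatures[0])
    # Normal equations: (S^T S) p = S^T m
    sts = [[sum(row[a] * row[b] for row in signatures) for b in range(k)]
           for a in range(k)]
    stm = [sum(row[a] * y for row, y in zip(signatures, mixture)) for a in range(k)]
    p = solve_linear(sts, stm)
    # Assumed post-processing: clip negatives and rescale so proportions sum to 1.
    p = [max(v, 0.0) for v in p]
    total = sum(p)
    return [v / total for v in p]


# Hypothetical reference signatures: 4 CpG sites x 3 cell types.
signatures = [
    [0.9, 0.1, 0.1],
    [0.1, 0.9, 0.2],
    [0.2, 0.1, 0.8],
    [0.5, 0.5, 0.1],
]
true_proportions = [0.5, 0.3, 0.2]
# In silico mixture: proportion-weighted sum of the reference signatures.
mixture = [sum(s * p for s, p in zip(row, true_proportions)) for row in signatures]

print([round(p, 3) for p in estimate_proportions(signatures, mixture)])
```

With noise-free input the fit recovers the mixing proportions. Real data would need many more sites than cell types, and possibly a constrained solver in place of the clipping step.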
Short Abstract: To date, the main challenge of Molecular Simulations is to reach the time and length scales of biologically relevant processes. To bridge the current gap of several orders of magnitude, reduction of the solvent's degrees of freedom is a promising strategy, particularly for large systems. Explicit solvent simulations can provide the reference data for the parametrisation of an efficient implicit solvent model representing the water forces. By considering the solvation forces as a purely Gaussian stochastic process, the aim is to model the mean and variance of the observed force distributions. We present here a novel approach for the determination of atomic solvation parameters by fitting implicit solvent forces representing a SASA-based mean field to atomic solvent forces from explicit water simulations of 188 proteins with different folds [1,2]. The resulting atomic solvation parameters \sigma_i couple the accessible atomic surface area to the solvation mean force.
The variance of the explicit solvent mean force was modeled by the atom-specific friction parameters \gamma_i that couple the solvation force variance to the atomic velocities. These friction parameters were derived using the explicit force variance and the characteristic force correlation decay time \tau_cor.
The derived implicit solvation model reproduces water mean forces and force fluctuations, which is useful not only for the simulation or refinement of large macromolecular structures and assemblies, but also for the prediction of free solvation energies associated with specific mutations at the protein surface.
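As a reading aid, the relations described above can be written out in a standard form. The equations below are a sketch under common textbook assumptions (a SASA-based solvation free energy, Langevin dynamics, and an exponentially decaying force autocorrelation) and are not copied from the authors' derivation. The symbols A_i (solvent-accessible surface area of atom i), \mathbf{R}_k (random force), and \delta F_k (one Cartesian component of the fluctuation of the explicit solvent force) are introduced here for illustration.

```latex
% Mean solvation force on atom k from a SASA-based mean field
\Delta G_{\mathrm{solv}} = \sum_i \sigma_i A_i(\mathbf{r}), \qquad
\mathbf{F}^{\mathrm{mean}}_k = -\nabla_{\mathbf{r}_k} \Delta G_{\mathrm{solv}}
  = -\sum_i \sigma_i \frac{\partial A_i}{\partial \mathbf{r}_k}

% Langevin equation of motion with atom-specific friction \gamma_k
m_k \ddot{\mathbf{r}}_k = \mathbf{F}^{\mathrm{int}}_k + \mathbf{F}^{\mathrm{mean}}_k
  - \gamma_k \dot{\mathbf{r}}_k + \mathbf{R}_k(t), \qquad
\langle \mathbf{R}_k(t) \cdot \mathbf{R}_k(t') \rangle = 6\, \gamma_k k_B T\, \delta(t - t')

% Friction from the force variance and correlation decay time,
% assuming \langle \delta F_k(0)\, \delta F_k(t) \rangle = \langle \delta F_k^2 \rangle e^{-t/\tau_{\mathrm{cor}}}
\gamma_k = \frac{1}{k_B T} \int_0^\infty \langle \delta F_k(0)\, \delta F_k(t) \rangle \, dt
  \approx \frac{\langle \delta F_k^2 \rangle \, \tau_{\mathrm{cor}}}{k_B T}
```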
Short Abstract: Modern multi’omic screens of biological samples readily produce enormous numbers of measurements, yet finding statistically significant association patterns among features within these data remains challenging, in part due to the loss of statistical power inherent in testing large numbers of hypotheses. Here, we present and validate a novel hierarchical framework, HAllA (Hierarchical All-against-All association testing), for general purpose and well-powered association discovery in high-dimensional heterogeneous datasets. HAllA combines hierarchical nonparametric hypothesis testing with false discovery rate correction to enable high-sensitivity discovery of linear and non-linear associations in high-dimensional datasets (which may be categorical, continuous, or mixed). HAllA operates by 1) discretizing data to a unified representation, 2) hierarchically clustering paired high-dimensional datasets, 3) applying dimensionality reduction to boost power and potentially improve signal-to-noise ratio, and 4) iteratively testing associations between blocks of progressively more related features. We validated and optimized HAllA using synthetic datasets of known correlation structure. At a fixed false discovery rate, HAllA is consistently better-powered than naive all-against-all association testing across a range of association types. As an example application, we used HAllA to identify associations between high-throughput profiles of microbial genera and metabolites of the human gut microbiome. In addition to recapitulating known associations, we identified 60 previously unobserved associations, including between Ruminococcus and lithocholic acid. Our implementation of HAllA is highly modular, enabling addition or substitution of alternative methods at each step, and is available with documentation at http://huttenhower.sph.harvard.edu/halla.
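To make the false discovery rate step concrete, here is a minimal sketch of the Benjamini-Hochberg procedure applied to a flat list of p-values, as in the naive all-against-all baseline. The abstract does not say which FDR method HAllA uses or how it is applied within the hierarchy, so the choice of Benjamini-Hochberg and the p-values below are assumptions.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return booleans marking which hypotheses are rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p-value
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values from eight pairwise association tests.
p_values = [0.041, 0.001, 0.205, 0.008, 0.039, 0.060, 0.074, 0.042]
print(benjamini_hochberg(p_values))
# [False, True, False, True, False, False, False, False]
```

Because each threshold scales with the number of tests m, testing fewer, better-targeted hypotheses leaves more power per test, which is the motivation for testing blocks of related features before individual pairs.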
Short Abstract: The Gram-negative outer membrane lipopolysaccharide (LPS) is the first molecular barrier to antimicrobial peptides (AMPs). The non-stoichiometric addition of phosphoethanolamine (PEtN) to the LPS increases bacterial resistance to polymyxin B (PolB) and other AMPs. To understand how PEtN addition contributes to antimicrobial resistance, we performed experimental and theoretical assays using artificial vesicles of LPS extracted from E. coli strains with PEtN added to the Lipid-A, KDO and HepI. Interestingly, PEtN incorporation at HepI modifies PolB activity when compared with the other PEtN incorporations, as observed by time-dependent changes in the z-potential, size stability and LPS vesicle-vesicle interactions. These observations suggest that PEtN stabilizes the LPS layer, modifying its surface potential and supramolecular properties. To understand how PEtN modifies the LPS properties, we describe its intramolecular and intermolecular interactions using molecular dynamics simulations. We observed that PEtN at HepI interacts with the KDO carboxyl group, without modifying the LPS area per lipid or LPS-LPS interactions. On the other hand, modification at the KDO reduces the LPS inner core length, shielding the described binding site for PolB. This biophysical characterization gives new insight into AMP-LPS interactions and their contribution to the outer membrane destabilization process, and into how bacteria modulate LPS physicochemical properties as an antibiotic resistance mechanism, which is essential for developing future antibiotic strategies. Funded by FONDECYT de Inicio en la Investigacion (Daniel Aguayo V) 11130576.
Short Abstract: Oncologists experience an enormous volume of often-contradictory evidence regarding the therapeutic context of chemopredictive biomarkers and have to search, comprehend and apply knowledge from various data sources about predictive biomarkers when developing therapy plans for patients. This motivates the need to design and develop Big Data wrangling approaches including Natural Language Processing (NLP) and information retrieval methods to automatically extract personalized-therapy information from PubMed abstracts, open access articles and conference proceedings.
This work describes MACE2K, a natural language processing tool to extract evidence to determine the predictive effect of cancer biomarkers on therapy response. The tool extracts entities including cancer types, gene/protein names, mutations and various types of other genetic anomalies, therapies and patient outcomes. To aid in the development of the web portal that displays text mining results to the end users, we have established a data exchange mechanism for NLP output using the community standard BioC and JSON formats. The exchange formats provide tagged bio-entities and their relationships, along with character locations in text passages for evidence attribution. We have applied human factors engineering (HFE) methods to design interfaces for displaying summary results from studies with most compelling evidence.
Through this work we seek to create improved software systems to harvest the wealth of information contained in biomedical Big Data in order to advance our understanding of human health and disease, and advance the field of precision medicine.
Short Abstract: Background: DNA double-stranded breaks (DSBs) can result from endogenous processes, such as replication stress, or exogenous ones, like chemotherapeutics. Here, we study DSBs induced by a cancer drug, hydroxyurea (HU), in the budding yeast Mec1 mutant. Both Mec1 mutation and HU treatment induce replication stress.
Methods: We used our method (BLESS) to label DSBs created in yeast samples with different levels of replication stress (Mec1 mutation and HU treatment). Cells without HU and in G1 phase were used as controls. Statistical methods, mathematical modeling and Fourier analysis were used to distinguish DSB patterns related to replication stress from background patterns.
Results: We found that the DSBs occur preferentially around DNA replication origins in the yeast HU-treated samples. We identified BLESS read patterns, in which the mapped reads showed the strand bias around replication origins, as resulting from collapsed replication forks. Using the model we constructed based on the most active known replication origins, we predicted 169 early origins in the budding yeast genome. Our predictions were confirmed by BrdU data (indicating active origins) in the treated samples and lack of signal in the samples not undergoing replication. We also used filtering based on Fourier methods to identify less efficient origins that were not apparent in the data before filtering.
Conclusions: We provide a computational method to identify replication stress-induced breaks in BLESS-Seq data. Understanding the mechanisms of DSB creation will advance our understanding of the underlying causes of cancer and may ultimately guide therapy.
Short Abstract: Large-scale projects such as ENCODE and FANTOM5 have produced a wealth of sequencing data that can be used to study epigenetic features associated with gene regulation. Our aim is to use the omics data in these publicly available resources to develop novel strategies for identifying regulatory elements that directly or indirectly contribute to mutant mouse phenotypes and their associated human diseases.
To systematically analyse the mouse regulatory landscape, we built a comprehensive map of chromatin states in 22 mouse tissues. We performed chromatin segmentation using a hidden Markov model on data for nine histone marks from ENCODE to predict genome-wide promoters and enhancers in the 22 tissues. Using a k-means algorithm, we clustered regulatory regions across multiple tissues to reveal distinct groups of tissue-specific regulatory elements, including enhancers and promoters. To determine whether these regions directly influence tissue-specific functions, we integrated the Mouse Genome Database and obtained the mammalian phenotypes associated with the nearest genes in each cluster. We found that the mouse phenotypes associated with genes in each cluster correlate with the tissues in which the regulatory elements are active. Finally, by comparing the neighbouring genes of tissue-specific promoters and enhancers with mouse phenotype annotations in the International Mouse Phenotyping Consortium (IMPC) database, we identified several enhancer-gene partnerships that independently support novel mouse phenotypes in IMPC for numerous previously unannotated genes.
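As a rough illustration of the k-means step described above, the sketch below clusters regulatory regions by their activity across tissues using plain Lloyd's algorithm. The region-by-tissue activity matrix, the three tissue columns and the function name `kmeans` are hypothetical toy assumptions for illustration; the abstract does not specify the features, the number of clusters or the implementation used.

```python
import random


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means on lists of floats; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to the nearest centroid
        # (squared Euclidean distance).
        new_labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels, centroids


# Toy activity matrix (hypothetical): rows are regulatory regions,
# columns are activity scores in three tissues (e.g. heart, liver, brain).
regions = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.1],
    [0.0, 0.8, 0.2],
    [0.1, 0.0, 0.9],
    [0.2, 0.1, 0.8],
]
labels, centroids = kmeans(regions, k=3)
```

In this toy setting each resulting cluster groups regions that are active in the same tissue, which is the kind of tissue-specific grouping the abstract refers to.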
Short Abstract: Discovery of novel prognostic biomarkers is very important to identify high risk cancer patients. Glioblastoma (GBM) is an aggressive brain tumor with a 5-year survival rate of <10%. It is more common in males, but the molecular basis of this gender bias is not understood. To this end, we looked for genes that have a gender-specific effect on GBM survival using two microarray (Agilent-UNC and Affymetrix-Broad) expression datasets from The Cancer Genome Atlas (TCGA). We ran a Cox model using the gene expression, gender, interaction between gene expression and gender, age and batch as covariates. Our study showed that the platform was an important factor and that the results were partially reproducible at the individual gene level. There were over 300 common gender-specific genes significant in both datasets, with the Agilent-UNC set having a larger number of significant genes. Further, on performing Pre-ranked Gene Set Enrichment Analysis (GSEA), we found overlapping relevant molecular signatures that had the most significant enrichment across the TCGA datasets. Interestingly, the GBM proneural subtype signature was significantly enriched in both datasets for Hazard ratio (HR) greater than 1, whereas the GBM mesenchymal subtype signature was significant for HR less than 1. This suggests that males with high gene expression for the proneural signature have a worse prognosis compared to males with lower expression, while males with high expression for the mesenchymal signature have better survival. A follow-up study will identify specific genes.
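The model structure described above can be sketched as a Cox partial log-likelihood whose covariates include an expression-by-gender interaction term. This is a minimal sketch under stated assumptions: the helper names `design_row` and `cox_partial_loglik`, the Breslow-style handling of ties, the toy patient values and the omission of the batch covariate are all illustrative choices, not details taken from the study.

```python
import math


def design_row(expression, is_male, age):
    """Covariate vector with an expression-by-gender interaction term.

    Batch is left out here for brevity (an assumption of this sketch).
    """
    return [expression, is_male, expression * is_male, age]


def cox_partial_loglik(beta, rows):
    """Breslow-style Cox partial log-likelihood.

    rows: list of (time, event, covariates); event is 1 for an observed
    death and 0 for a censored observation.
    """
    total = 0.0
    for t_i, event_i, x_i in rows:
        if not event_i:
            continue  # censored subjects contribute only through risk sets
        linear_i = sum(b * x for b, x in zip(beta, x_i))
        # Risk set: everyone still under observation at time t_i.
        risk = sum(
            math.exp(sum(b * x for b, x in zip(beta, x_j)))
            for t_j, _, x_j in rows
            if t_j >= t_i
        )
        total += linear_i - math.log(risk)
    return total


# Hypothetical toy cohort: (survival time, event flag, covariates).
cohort = [
    (1.0, 1, design_row(2.1, 1, 60)),
    (2.0, 1, design_row(0.4, 0, 55)),
    (3.0, 1, design_row(1.7, 1, 48)),
    (4.0, 1, design_row(0.2, 0, 52)),
]
null_loglik = cox_partial_loglik([0.0, 0.0, 0.0, 0.0], cohort)
```

A gene would count as having a gender-specific effect when the coefficient on the interaction term (the third entry of `beta`) is significantly different from zero after fitting; the fitting itself is normally delegated to a survival analysis package rather than written by hand.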
Short Abstract: Large scale cancer molecular profiling, which includes the measurement of mutation variants, copy number variations, RNA, protein expression and immunohistochemistry levels, is becoming a more accessible strategy in treating patients. The understanding of these molecular profiles may advance the treatment of cancer patients by the application of precision medicine. In this study, advanced, metastatic cancer patients seen at the Georgetown-MedStar hospital system were offered tumor molecular profiling through CLIA-CAP certified commercial diagnostic labs. In total, we have obtained and analyzed the test results of ~1000 patients using the Caris Molecular Intelligence™ service with the eventual goal of informing treatment decisions.
By integrating molecular profiling data with clinical data, including therapy and outcome information, we created an interactive R module to visualize the data at the marker and patient level and to perform exploratory and survival analyses. In addition, we applied statistical correlation and pathway analysis to the integrated clinical/molecular dataset to identify patterns and enable hypothesis generation.
Results revealed important gene changes in transcription regulation, immune responses and DNA repair. Furthermore, network analysis revealed crucial genes and pathways central to cell proliferation and immune response; these highly connected “hub” genes represent likely control points in the biological system.
In conclusion, the statistical modules we developed are accessible to other biomedical researchers, allowing them to interact intuitively with important datasets for reproducible research, and the summary view of tumor molecular profiling may yield important insights into patient information and help improve treatment plans.
Short Abstract: Insertions of transposable elements (TEs) can disrupt genes and cause dysregulation of gene expression. TE suppression has been studied in detail in model organisms, but in humans the detailed mechanism of TE regulation remains unknown. Our objective is to understand TE dysregulation with cancer as a model system, and to identify miRNAs and genes that control the transcript level of TEs. We measured TE transcript levels in the RNA-seq data of cancer samples in The Cancer Genome Atlas (TCGA), and identified patients with differential expression of L1HS transcripts. We tested the correlation between small RNA transcript levels and L1HS, and between gene transcript levels and L1HS. We found that, unlike other transposon families, L1HS transcripts are always overexpressed in cancer compared to normal tissue, although the degree of overexpression varied across patients and cancer types. We identified several miRNAs and genes that are significantly associated with L1HS expression across 512 patients. Known host factors that co-immunoprecipitate with L1 ORF protein in humans are not over-represented in the list of genes significantly correlated with TE expression. We cannot distinguish whether the associated genes and miRNAs are controlling or responding to TE overexpression. Nevertheless, we have identified a list of candidate genes and miRNAs functioning in the TE control pathways in human somatic cells.
Short Abstract: Assigning cancer patients to the most effective treatments requires an understanding of the molecular basis of their disease. While DNA-based molecular profiling approaches have flourished over the past several years and transformed our understanding of driver pathways across a broad range of tumors, a systematic characterization of key driver pathways based on RNA data has not been undertaken. Here we introduce a new approach for predicting the status of driver cancer pathways based on signature functions derived from RNA sequencing data. To identify the driver cancer pathways of interest, we mined DNA variant data from TCGA and nominated driver alterations in seven major cancer pathways in breast, ovarian, and colon cancer tumors. The activation status of these driver pathways was then characterized using RNA sequencing data by constructing classification signature functions in training datasets and then testing the accuracy of the signatures in test datasets. The signature functions clearly differentiate tumors with nominated pathway activation from tumors with no signs of activation, with an average AUC of 0.83. Our results confirm that driver genomic alterations are distinctively displayed at the transcriptional level and that transcriptional signatures can generally provide an alternative to DNA sequencing methods in detecting specific driver pathways.
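For readers unfamiliar with the AUC figure quoted above, the sketch below computes the area under the ROC curve as the probability that a randomly chosen tumor with pathway activation receives a higher signature score than a randomly chosen tumor without it. The function name `roc_auc`, the scores and the labels are hypothetical toy values, not data from the study.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical signature scores; label 1 = nominated pathway activation.
toy_scores = [0.9, 0.8, 0.4, 0.35, 0.7, 0.2]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_auc = roc_auc(toy_scores, toy_labels)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so the pairwise formulation makes the reported average directly interpretable.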
Short Abstract: Diet and nutrition affect the development and amelioration of human phenotypes and disease. Vitamins and essential nutrients are necessary for survival, but secondary metabolites in plant-based foods are linked to human health benefits such as anti-oxidant and anti-inflammatory activity.
Association databases and the scientific literature provide evidence supporting these links, but the number and volume of these sources pose a problem. Current search tools return a specific subset of relevant documents but cannot extract important concepts and relationships.
Text mining was used to extract relationships from the scientific literature to supplement manually curated relationships found in association databases.
In this study, we aggregated concepts and relationships of plant components and human health from association databases and text mining. Concepts and relationships were integrated into a high density semantic network developed using Neo4j. This network can assist in the elucidation of health benefits conferred by plants. Researchers can extract paths from this network that explain how bioactive components in plants affect human health at the molecular level. We developed a statistical metric that ranks these paths based on their plausible biological interest as bioactive components that prevent or ameliorate human health phenotypes. This work allows for data driven hypothesis generation for nutrigenomics, allowing for targeted, efficient validation experiments.
Short Abstract: One of the major challenges in realizing the promise of cancer precision medicine is the identification of new therapeutically targetable driver genes. Many tumors show no alterations in known driver genes, and not all currently known driver genes can be therapeutically targeted. Consequently, many cancer types have few or no FDA-approved targeted therapies administered on the basis of genomic alterations – e.g. breast cancer has only one target (HER2) with approved therapies based on somatic alterations, while ovarian and endometrial cancer have none. Here, we take advantage of multiple approaches to nominate novel candidate targets. Using copy number alterations, we focus on genes with statistically significant enrichment of amplification versus deletion (compared to other genes in a given cancer type), as well as statistically significant overexpression when amplified. We prioritize targets by focusing on candidate kinases and GTPases and investigating their biological properties with respect to known or potential druggability, using multiple sources of information including databases and knowledge bases such as DGIdb, DrugBank, Gene Drug Knowledge Database and canSAR. Applying our approach to the TCGA datasets of ovarian, breast and endometrial cancers, we recovered several known drivers (such as PIK3CA, MYC and JAK3), drug targets with FDA-approved therapies (HER2 in breast cancer; BRAF, with approved drugs in melanoma), and candidate targets for which clinical trials are already underway in other cancers (including BRD4 and AURKA). Novel druggable targets include ABL2, JAK3, EPHB6, MAP4K1, TUBB, TUBB1 and MYLK2. Our approach can be readily adapted to any cancer type characterized by aneuploidy.
Short Abstract: MicroRNA-mediated gene regulation is a complex dynamic process involved in fine-tuning fundamental cellular processes and human disease development. Interactions between microRNAs and genes are highly dependent on the interplay of several factors that may change over time, including microRNA expression, availability of target mRNA, binding affinity, and competing activity of other endogenous RNAs. Here, we investigated microRNA regulation as a combinatory stochastic process by examining condition-specific microRNA regulatory modules in tumorigenesis and progression. To stratify the gene regulation network, we employed an integrated meta-regression analysis based on large-scale genomic data from nine types of solid cancer, which include multi-level information such as microRNA and mRNA expression, transcription factor regulation, DNA copy number variation (CNV), and methylation, from 4,206 cancer patients. Our analysis identified a total of 10,934 unique microRNA-mRNA interactions in these cancers. While only 0.4% of interactions were consistent across four stages of the same cancer, 77% were highly specific to a certain condition, e.g. occurring in only one stage of the respective cancer. Based on conditional microRNA-mRNA interactions, we uncovered 17 microRNA regulatory modules that are involved in multiple functional pathways under various conditions. For example, miR-92a, -193b and -186 co-regulated ErbB and Wnt signaling pathways in kidney, lung, and stomach cancers. In contrast to the varied interaction patterns across cancer conditions, the functional roles of these modules were highly consistent, which may indicate a function-driven mechanism underlying microRNA regulation. We developed the first tool for studying dynamic microRNA regulation, which can be accessed freely at http://sbbi.unl.edu/miRNADynamicReg.
Short Abstract: Evidence has accumulated that the humoral immune response contributes to protection against HIV infection. Statistical analyses of primate and human clinical vaccine trials suggest an association between reduced risk of infection and Fc-mediated antibody effector functions. A number of in vitro cell line-based assays have been developed to identify their divergent mechanisms in protection against HIV infection. Although various immune cells are involved in different effector functions, how antibody Fc and Fcγ receptor interactions tune specific effector functions remains poorly understood. In this study, we aimed to characterize the critical similarities and differences among antigen-specific antibodies in HIV-infected subjects, vaccine recipients and appropriate controls. A common set of polyclonal antibodies was characterized by both cellular effector function and biophysical assays. Both supervised and unsupervised approaches were used to evaluate the association between effector functions and antibody-antigen interactions. Unique sets of biophysical features were identified as capable of predicting effector functions with varying degrees of success, which provides meaningful insights into the mechanisms of antibody effector functions. Our results captured unique antibody functionalities in cellular effector functions, and might facilitate a better understanding of Fc-mediated immune responses in future HIV studies.
Short Abstract: Type 1 Diabetes mellitus (T1D) is caused by autoimmune destruction of insulin-producing beta cells in the pancreas. Chronically elevated blood glucose levels lead to metabolic and inflammatory changes that can perturb the function of the immune system and multiple organs. In this study, we analyzed the urinary proteome of 100 T1D patients and their respective healthy siblings in an age range from 3-18 in order to identify candidate protein biomarkers. We used mass spectrometry-based shotgun proteomics.
The comparative urinary proteome analysis was performed on a dataset of 618 proteins, 110 of which were differentially abundant with statistical significance (adjusted p-values < 0.05). Thirty-two proteins had a fold change greater than +/-1.4; LRG1, CD14, AZGP1, GM2A, and CTSD were the most upregulated proteins, and SDC4, ICOSLG, CD320, CD44, and INAFM2 were the most downregulated proteins in T1D patients.
Protein-protein interaction analysis using the 32 proteins was performed in silico using the PINA2 server. The network was uploaded into Cytoscape with the cerebral layout plugin for visualization of interaction. Many of the proteins in this network were localized in one subcellular organelle, the lysosome, and revealed a computationally inferred T1D pathway from CPE (process proinsulin) to INS (insulin), then to CTSD (acid protease), FN1 (fibronectin), LRG1 (granulocyte differentiation), and SDC4 (intracellular signaling). LRG1 and SDC4 were the most upregulated and downregulated proteins, respectively, in the T1D patient cohort. This pathway and the functions of many differentially abundant proteins suggest that lysosomal metabolic activities are increased in T1D patients.
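The differential abundance filter described in this abstract (adjusted p-value below 0.05 combined with a fold-change cutoff of 1.4 in either direction) can be sketched as follows. The abstract does not name the multiple-testing correction, so the Benjamini-Hochberg procedure used here is an assumption, as are the function names, the protein names and the abundance values, which are hypothetical toy data.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_proteins(records, alpha=0.05, min_fold=1.4):
    """Keep proteins that are significant and change by more than min_fold
    in either direction (ratio above min_fold or below 1 / min_fold)."""
    adjusted = benjamini_hochberg([r["p"] for r in records])
    hits = []
    for record, q in zip(records, adjusted):
        fold = record["t1d"] / record["control"]
        if q < alpha and (fold > min_fold or fold < 1.0 / min_fold):
            hits.append(record["name"])
    return hits


# Hypothetical toy abundances (arbitrary units) and raw p-values.
toy_records = [
    {"name": "PROT_A", "t1d": 3.0, "control": 1.0, "p": 0.001},
    {"name": "PROT_B", "t1d": 1.1, "control": 1.0, "p": 0.002},
    {"name": "PROT_C", "t1d": 0.5, "control": 1.0, "p": 0.01},
    {"name": "PROT_D", "t1d": 2.0, "control": 1.0, "p": 0.6},
]
toy_hits = select_proteins(toy_records)
```

In this toy example one protein is dropped for a small fold change despite a low p-value and another for a non-significant p-value despite a large fold change, which mirrors how the two criteria narrow 110 significant proteins to 32 in the abstract.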
Short Abstract: More accurate methods are urgently needed to diagnose breast cancer, the most common malignant cancer in women worldwide. Blood-based metabolomics is a promising diagnostic method for breast cancer. However, many metabolic biomarkers are difficult to replicate among studies. We propose that higher-order functional representations of metabolomics data, such as pathway-based metabolomic features, can be used as robust biomarkers for breast cancer. To this end, we have developed a new computational method that uses personalized pathway dysregulation scores for disease diagnosis. We applied this method to predict breast cancer occurrence, in combination with correlation feature selection (CFS) and classification methods. The resulting all-stage and early-stage diagnosis models are highly accurate in two sets of testing blood samples, with average AUCs (area under the receiver operating characteristic curve) of 0.968 and 0.934, sensitivities of 0.946 and 0.954, and specificities of 0.934 and 0.918. These two metabolomics-based pathway models are further validated by RNA-Seq-based TCGA (The Cancer Genome Atlas) breast cancer data, with AUCs of 0.995 and 0.993. Moreover, important metabolic pathways, such as taurine and hypotaurine metabolism and the alanine, aspartate, and glutamate pathway, are revealed as critical biological pathways for early diagnosis of breast cancer. In conclusion, we have developed a new type of pathway-based model for studying metabolomics data for disease diagnosis and applied it to blood-based breast cancer metabolomics data. This modeling approach may be generalized to other omics data types for disease diagnosis.
Short Abstract: Recent advances in high-throughput technologies have enabled the comprehensive characterization of various cancer types at multiple omic levels. Extracting relevant biological knowledge from this huge amount of information represents a remarkable opportunity in cancerology. However, this achievement is limited by the presence in the data of various overlapping biological factors linked to the tumor cells or to the tumor microenvironment and non biological factors linked to sample processing or data generation.
To deconvolute these factors, Independent Component Analysis (ICA), originally developed to solve the blind source separation problem, is perfectly suited. In this work ICA is applied to transcriptomic and methylomic data obtained from 32 different tumor types. Each data matrix is thus decomposed into a number of components, each of which is characterized by an activation pattern both across genes and across samples.
Our analysis identified multi-cancer-shared and single-cancer-specific components. Using colorectal cancer (CRC) as a paradigm, we showed that our approach can significantly contribute to the puzzling problem of CRC subtypes identification and characterization. Indeed, the recently published CRC consensus subtypes were consistently retrieved in our analysis and new molecular insights concerning these subtypes were highlighted. Notably, other signals of promising interest, not included in the already known CRC subtypes, were detected. Among them, of particular interest is a component jointly regulated by STAT1 and IRF4.
Ongoing analysis aims to comprehensively characterize all the identified pan-cancer components and to integrate the results obtained with methylomic and transcriptomic data to gain more insight into cancer complexity.
Short Abstract: Loss of sensory hair cells is the major cause of hearing and balance disorders [1]. Sensory hair cells reside in sensory epithelia inside the ears of all vertebrates. Although mammals are unable to regenerate these epithelia once they have matured, some lower vertebrate species, including birds, are capable of sustained hair cell regeneration and thus reversal of hearing loss. The regeneration process therefore has major implications for medicine and human health, but the underlying genetic mechanism is currently unknown.
As a first step in unravelling this process, we aim to find a small set of candidate genes responsible for it. To this end, we perform gene expression analysis on two original datasets consisting of short time series of gene expression measurements in two relevant tissues, the chick utricle and cochlea. The two time series each comprise measurements of 17,685 genes across 7 time points 24 hours apart, roughly the time course necessary for regeneration following drug-induced damage of cultured hair cells. For each tissue, we have both treated and control samples and at least two biological replicates, which we use to perform a preliminary gene filtering that reduces the original gene set to around 3000 genes. We then feed the filtered gene set into a two-step clustering analysis, combining ideas from [2] with a spectral clustering step. We obtain a small set of genes (around 30) that shows significant enrichment [3] for the biological process of interest.
Short Abstract: We present INDRA, a novel framework for building mathematical models of biochemical mechanisms. INDRA is integrated with natural language parsers and with pathway databases. INDRA allows the user to define models using natural language descriptions of molecular mechanisms (e.g. “GRB2 binds EGFR that is phosphorylated on a tyrosine residue.”). It can also extract mechanisms from the literature and databases. INDRA aggregates mechanistic information in an intermediate knowledge representation called INDRA-statements. It then automatically assembles a rule-based dynamical model from these statements. Automated model assembly involves synthesizing a set of molecular agents and their interaction rules from the collected mechanisms using assembly policies and biochemical rule templates.
INDRA produces models in the PySB programmatic rule-based modeling language from which both BioNetGen and Kappa models can be obtained. This workflow supports rapid and extensible model building in which the user is allowed to focus on defining the content of the model rather than its implementation. Grounding (i.e. database IDs for proteins) and provenance (source text or database entry from which a mechanism was extracted) are also maintained by INDRA and propagated into the final model as annotations. We demonstrate the capabilities of INDRA with a model automatically assembled from a natural language description of growth factor and MAP kinase signaling. The INDRA-assembled model is able to explain, through simulation experiments, early resistance mechanisms to the cancer drug Vemurafenib in BRAF-V600E mutation driven cancers.
Availability: https://github.com/sorgerlab/indra
Short Abstract: This study investigated the effect of training intensity on high-fat diet-induced fatty liver in male C57BL/6 mice. Mice at 5 weeks old (N=40) were assigned to a standard chow diet (n=10) or a high-fat diet (HFD) (n=30) for 8 weeks. After the 8-week dietary treatment, mice in the HFD group were further assigned to HFD, HFD plus moderate exercise training, or HFD plus vigorous exercise training for an additional 8 weeks. Global mRNA expression profiling of hepatic tissues was performed with cDNA microarrays. Regardless of intensity, exercise training was effective in alleviating HFD-induced metabolic complications and in suppressing hepatic inflammatory and fibrotic symptoms. HFD dysregulated the expression of 1075 genes involved in de novo lipogenesis, oxidative stress, inflammation and fibrosis. Exercise training effectively normalized 291 of the 1075 dysregulated hepatic genes. Both moderate and interval exercises were equally effective in normalizing the dysregulated expression of several hepatic genes, including those involved in cholesterol metabolism (i.e., Tm7sf2, Nsdhl, cyp51, and Squle), inflammation (i.e., pscam1 and cdh5), and fibrosis (i.e., col4a1, mmp23, and anxa2). The current findings suggest that, regardless of intensity, exercise training is protective against HFD-induced fatty liver and fibrosis in this animal model of NAFLD.
Acknowledgement. This work was supported by the National Research Foundation Grants funded by the Korean Government (NRF-2013S1A2A2034953) and (NRF-2015R1D1A1A01060287).
Short Abstract: Tremendous amounts of chemical and biological data that could facilitate modern drug discovery are being generated by various high-throughput biotechnologies. However, lack of integration makes it very challenging for individual scientists to access and understand all the data related to a specific protein of interest. To overcome this challenge, we developed PyMine, a PyMOL plugin that retrieves chemical, structural, pathway and other related biological data on a receptor and small molecules from a variety of high-quality databases and presents them in a graphical and uniform way. Developed as an interactive and user-friendly tool, PyMine can be used as a central data hub for users to access and visualize multiple types of data and to generate new ideas intuitively for structure-based molecule design. This work was supported in part by Grant Number UL1RR024134 from the National Center for Research Resources and in part by the Institute for Translational Medicine and Therapeutics' (ITMAT) Transdisciplinary Program in Translational Medicine and Therapeutics at University of Pennsylvania.
Short Abstract: The Library of Integrated Network-based Cellular Signatures (LINCS) program generates diverse, multidimensional datasets. The LINCS consortium has developed LINCS metadata standards for the material entities used to describe its data. The LINCS metaData Ontology (LINDO) classifies material entities such as cells, proteins, genes, and small molecules into class hierarchies. LINDO works concurrently with the BioAssay Ontology (BAO) and the Drug Target Ontology (DTO). The three ontologies are designed to work together and enable contextual data integration to help scientists analyze and query diverse datasets, such as those generated at LINCS. LINDO also makes use of existing ontologies such as Disease Ontology, Protein Ontology, Gene Ontology, ChEBI, and many others. The development of LINDO, as well as BAO and DTO, is based on a modularization approach that simplifies development, maintenance and re-use of ontology modules. Here we present how the ontologies are modeled and how they work together to describe and integrate complex, diverse datasets such as LINCS, for example by linking diseases, protein targets, phenotypic responses, and drug information.
Short Abstract: Breast cancer is the most common cancer in women worldwide [1]. Thus far, more than 60 genetic loci have been linked to breast cancer formation [2]. A number of studies have shown the diversity of BRCA1 and BRCA2 mutations present in different ethnic groups. This study addresses mechanisms in the development of breast cancer by comparing somatic (tumor) with germline (blood) samples in a selection of patients from a local academic hospital in Pretoria, South Africa.
The aims are to obtain information on variants in germline samples that make patients more prone to developing cancer, and to analyze the changes that occurred in the somatic samples with respect to single nucleotide variants, insertions and deletions, copy number variations and gene expression profiles.
Whole exomes of germline and somatic samples have been / are being sequenced using Illumina sequencing technology together with Agilent exome selection to a depth of 30x – 50x. Variant detection has been / is being performed using Bowtie2, the Picard Tools and GATK with Mutect2 being employed for the selection of somatic variants. Variants are filtered and annotated using Snpeff, CANDRA, Oncotator and MutSigCV. Additionally, samples are being analyzed with Affymetrix OncoScan FFPE Express for copy number variant detection and with Nanostring nCounter technology for transcription levels.
A set of pilot samples has been analyzed for somatic single nucleotide variants as well as insertions and deletions, and a summary of results will be presented.
Short Abstract: The effort to determine the role played by specific genes in the clustering of cancer susceptibility within family groups has been bolstered by rapid advances in various sequencing technologies, as well as variant detection and analysis software. This has resulted in the production of a number of cancer panels, each consisting of multiple genes known or suspected to play a role in disease formation. These panels have been built using predominantly European population genomic data however. Considering the high degree of genetic diversity within and between African populations, there is a potentially large degree of unknown variation contributing to disease formation in these groups.
This project attempts to make a comparison of several African and European populations from the 1,000 genomes project against genes from a combination of cancer panels, using the GRCh37 reference genome in order to determine any differences in the roles that certain variants may play in increasing disease risk in African populations.
1000 Genomes data for 5 African and 5 European populations were collected and modified using the FastaAlternateReferenceMaker tool (GATK) against the hs37d5 custom reference genome, along with a BED file comprising the exonic sequences of a comprehensive set of cancer panel genes to specify the gene regions of interest.
The aligned files will be analysed using BEAST software to perform Bayesian analysis and construct sequence trees to test the evolutionary relationships between possible causal genes across multiple populations and consequently their respective roles in disease formation and potential as therapeutic targets.
Short Abstract: We have explored new approaches for determining the impacts of disease-associated mutations on protein structure and function. Different types of disease-causing mutations have been studied, including germline disease mutations, somatic cancer mutations in oncogenes and tumour suppressors, and known activating and inactivating mutations in kinases.
The proximity of disease-associated mutations has been analysed with respect to known functional sites reported by CSA, IBIS and ELMS, along with predicted functional sites derived from the CATH classification of domain structure superfamilies. The latter are called FunSites, and are highly conserved residues within a CATH functional family (FunFam) – which is a functionally coherent subset of a CATH superfamily. Such sites include key catalytic residues as well as specificity-determining residues and interface residues. Clear differences were found between oncogene, tumour suppressor and germline mutations, with oncogene mutations more likely to lie close to FunSites.
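As a rough illustration of a proximity analysis of this kind, the sketch below computes the fraction of mutated residues that lie within a distance cutoff of any functional-site residue. The abstract does not state how proximity was measured, so the use of Euclidean distance between residue coordinates, the function names, and the 8 Å default cutoff are all assumptions made for illustration only.

```python
from math import dist


def min_distance_to_sites(mutation_xyz, site_xyzs):
    """Smallest Euclidean distance from one mutated residue to any functional site.

    Coordinates are (x, y, z) tuples in angstroms, e.g. C-alpha positions
    (an assumption; the source does not say which atoms were used).
    """
    return min(dist(mutation_xyz, site) for site in site_xyzs)


def fraction_near_sites(mutation_xyzs, site_xyzs, cutoff=8.0):
    """Fraction of mutated residues within `cutoff` of a functional site.

    The 8.0 angstrom default is a hypothetical threshold, not a value
    taken from the abstract.
    """
    if not mutation_xyzs:
        return 0.0
    near = sum(
        1 for m in mutation_xyzs if min_distance_to_sites(m, site_xyzs) <= cutoff
    )
    return near / len(mutation_xyzs)
```

Comparing this fraction between, for example, oncogene and tumour suppressor mutation sets would give one simple view of whether one class tends to sit closer to functional sites.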
We have also identified functional families that are highly enriched in disease mutations and exploited structural data to identify clusters within proteins in these families that are enriched in mutations (using our MutClust program). We examined the tendencies of these clusters to lie close to the functional sites discussed above.
For selected genes, the stability effects of cancer mutations have also been investigated with particular focus on activating mutations in FGFR3. These studies, which were supported by experimental validation, showed that activating mutations implicated in cancer tend to cause stabilisation of the active FGFR3 form, leading to its abnormal activity and oncogenesis (Patani et al. OncoTarget 2016).
Short Abstract: Lysine (K)-Specific Methyltransferase 2D (KMT2D) encodes a methyltransferase responsible for mono-methylation of the fourth lysine of histone 3 (H3K4me1), marking active enhancers. Although KMT2D is one of the most frequent targets of all types of somatic mutation in human cancers, the cellular consequences of KMT2D mutations and their role in cancers are not fully understood. While recent literature has provided evidence to suggest a link between KMT2D mutation and genome instability in cell lines, this association has not been explored in primary tumor samples. To investigate the impact of KMT2D mutations on genome stability in primary tumor samples, we performed in-depth bioinformatics analyses of whole exome sequencing data (8,366 samples, 32 cancer types) and genome-wide SNP array data (10,059 samples, 32 cancer types) from The Cancer Genome Atlas (TCGA) project. Genome instability of primary tumor samples with loss-of-function mutations in KMT2D was inferred by comparing their mutational load (defined as the total number of mutations in a sample) and frequency of copy number alterations with those of KMT2D wildtype samples within each individual cancer type. Consistent with the notion that KMT2D mutations give rise to genome instability in cell lines, our results suggest that this phenomenon occurs in three types of primary tumors, across distinct tissue contexts.
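The within-cancer-type comparison of mutational load described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the abstract does not name the statistical test used, so the sketch only groups samples and reports per-group medians, and the input layout and function name are assumptions.

```python
from collections import defaultdict
from statistics import median


def compare_mutational_load(samples):
    """Summarise mutational load for KMT2D mutant vs wildtype samples per cancer type.

    `samples` is an iterable of (cancer_type, has_kmt2d_lof, mutation_count)
    tuples; this layout is hypothetical. Cancer types lacking either group
    are skipped, because no within-type comparison is possible for them.
    """
    groups = defaultdict(lambda: {True: [], False: []})
    for cancer_type, has_lof, mutation_count in samples:
        groups[cancer_type][bool(has_lof)].append(mutation_count)

    summary = {}
    for cancer_type, by_status in groups.items():
        mutant, wildtype = by_status[True], by_status[False]
        if not mutant or not wildtype:
            continue
        summary[cancer_type] = {
            "n_mutant": len(mutant),
            "n_wildtype": len(wildtype),
            "median_mutant": median(mutant),
            "median_wildtype": median(wildtype),
        }
    return summary
```

In practice the medians would be accompanied by a significance test chosen for skewed count data; that choice is left open here because the source does not specify one.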
Short Abstract: As the availability of genetic and genomic data and analysis tools from large-scale genomics initiatives continues to increase, the need has become more urgent for a software environment that supports the entire “idea to dissemination” cycle of an integrative genomics analysis. Such a system would provide access to a large number of tools without the need for programming, be sufficiently flexible to accommodate non-programming biologists as well as bioinformaticians, and would allow researchers to encapsulate their work into a single “executable document” that combined the analytical workflow with the associated descriptive text, graphics, and supporting research. To address these needs, we have developed GenePattern Notebook, based on the GenePattern environment for integrative genomics and the Jupyter Notebook system.
GenePattern Notebook presents a familiar lab notebook format that allows researchers to build a record of their work by creating “cells” containing text, graphics, executable code, or GenePattern analyses. Researchers add, delete, and modify cells as the research evolves, supporting the initial research phases of prototyping and collaborative analysis. When an analysis is ready for publication, the same document that was used in the design and analysis phases becomes a research narrative that serves as the complete, reproducible, in silico methods section for a publication.
GenePattern Notebook is freely available for all platforms. We are also developing an online repository where researchers can create, collaborate on, and share their own notebooks, as well as a collection of example notebooks demonstrating integrative genomics research scenarios, which users can adapt to their own work.
Short Abstract: The search for causative genetic variants in rare diseases of unknown etiology and presumed monogenic inheritance has been boosted by the implementation of whole genome sequencing (WGS). WGS is helpful in studies of monogenic disorders thanks to its evenly distributed coverage and the possibility of analyzing non-coding variants, but the analysis and visualization of the vast amounts of data are demanding. To meet this challenge, we apply RareVariantVis to the analysis of genome sequence data (including non-coding regions) for both germline and somatic variants. We have extensively tested the RareVariantVis tool on two WGS data sets: the Genome in a Bottle Ashkenazim Trio data (Complete Genomics) and about 30 in-house samples (Illumina X Ten and Illumina HiSeq4000) obtained from families with rare inherited disorders. This work has demonstrated the usefulness and efficiency of RareVariantVis in the screening and identification of possible causative variants for monogenic disorders.
The RareVariantVis tool accepts VCF files and annotated variant tables. It runs efficiently on a desktop computer: a whole genome is loaded, filtered and visualized in about 10 minutes. The tool and its documentation are available for download at the following link:
http://bioconductor.jp/packages/3.2/bioc/html/RareVariantVis.html
Short Abstract: Multi-gene expression signatures are routinely assessed in genome-wide expression studies because they can provide information about concerted transcriptional regulation of biological processes. Whereas a multitude of methods are available to assess gene sets in new expression data for association with a known grouping of samples, there are hardly any methods to allow assessment of the relevance of gene sets for new expression studies in which nothing is known about the grouping of samples.
Here, we compare one published and two novel methods to score the relevance of gene sets in novel unstratified expression data. We compare the methods using simulated and original data from published studies. A simple measure that captures the gene-versus-gene correlation of a multi-gene signature performed well compared to computationally more intensive methods.
We present a computational framework for assessing the relevance of multi-gene expression signatures in novel data sets based on signature coherence. This fills a niche in the space of methods for gene set evaluation, as it can be used to rank signature-based hypotheses even when the grouping of samples is unknown, as is the case when signatures are translated for use in profiling unstratified clinical samples from large cohorts. We present an exemplary application of our new methodology as a first step in an analysis pipeline for comprehensive classification of breast cancers. The validity of our method is shown by the detection of known cancer subtype signatures, in addition to other signatures not yet linked to breast cancer.
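To make the idea of signature coherence concrete, the sketch below scores a gene set by the mean pairwise Pearson correlation of its genes across samples. The abstract does not give the exact formula, so this particular definition, the function names, and the input layout (a mapping from gene name to a list of expression values, one per sample) are assumptions for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def coherence_score(expression, signature):
    """Mean pairwise correlation among signature genes present in `expression`.

    No sample grouping is needed: the score uses only gene-versus-gene
    correlation across all samples.
    """
    genes = [g for g in signature if g in expression]
    if len(genes) < 2:
        raise ValueError("need at least two signature genes with expression data")
    pairs = list(combinations(genes, 2))
    return sum(pearson(expression[a], expression[b]) for a, b in pairs) / len(pairs)
```

Signatures can then be ranked by this score within a data set; genes that vary together across samples give a score near 1, while unrelated genes give a score near 0.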