Selected Talks

Pathogenic impact of transcript isoform switching in 1209 cancer samples covering 27 cancer types using an isoform-specific interaction network

Tülay Karakulak, Abdullah Kahraman, Damian Szklarczyk and Christian von Mering

Presenting author: Tülay Karakulak

Abstract

Integrated analysis of transcriptomic and proteomic data to understand the effect of aneuploidy on cancer genomes

Gökçe Senger and Martin Schaefer

Presenting author: Gökçe Senger

Abstract

Inference Attacks Against Differentially-Private Query Results from Genomic Datasets Including Dependent Tuples

Nour Alserr, Erman Ayday and Ozgur Ulusoy

Presenting author: Nour Alserr

Abstract

Systematic analysis of phosphorylation structure

Altuğ Kamacıoğlu, Nurhan Ozlu and Nurcan Tuncbag

Presenting author: Altuğ Kamacıoğlu

Abstract

DriveWays: A method for identifying possibly overlapping driver pathways in cancer

Ilyes Baali, Cesim Erten and Hilal Kazan

Presenting author: Ilyes Baali

Abstract

Validation of LOAD-RF-RF selected risk SNVs for the early and differential diagnosis of Alzheimer’s disease

Sevda Rafatov, Hüseyin Cahit Burduroğu, Yavuzhan Çakır, Onur Erdoğan, Cem İyigün and Yeşim Aydın Son

Presenting author: Yeşim Aydın Son

Abstract

Age-related diseases share common genetic associations

Handan Melike Donertas, Daniel K Fabian, Matias Fuentealba Valenzuela, Linda Partridge and Janet M. Thornton

Presenting author: Handan Melike Donertas

Abstract

ChemBoost: A chemical language based approach for protein - ligand binding affinity prediction

Rıza Özçelik, Hakime Öztürk, Arzucan Özgür and Elif Ozkirimli

Presenting author: Rıza Özçelik

Abstract

Heterogeneous COVID-19 knowledge graphs in comprehensive resource of biomedical relations (CROssBAR) system

Tunca Dogan, Heval Ataş, Vishal Joshi, Ahmet Atakan, Ahmet Süreyya Rifaioğlu, Esra Nalbat, Andrew Nightingale, Rabie Saidi, Vladimir Volynkin, Hermann Zellner, Rengul Atalay, Maria Martin and Volkan Atalay

Presenting author: Tunca Dogan

Abstract

Genotyping macro-satellites in the human population

Marzieh Eslami Rasekh and Gary Benson

Presenting author: Marzieh Eslami Rasekh

Abstract

Robust inference of kinase activity using functional networks

Serhan Yılmaz, Marzieh Ayati, Daniela Schlatzer, A. Ercument Cicek, Mark Chance and Mehmet Koyuturk

Presenting author: Serhan Yılmaz

Abstract

Cardiac atrial transcriptomic landscaping reveals defects in various pathways in patients with ischemic heart disease or heart failure

Arda Eskin, Severi Mulari, Nurcan Tunçbağ and Esko Kankuri

Presenting author: Arda Eskin

Abstract

New solutions to old problems: Mitigating data loss and bias in ancient genome data processing

Dilek Koptekin, Etka Yapar, Ekin Sağlıcan, Can Alkan and Mehmet Somel

Presenting author: Dilek Koptekin

Abstract

Ligand switching mutations in PDZ domain explained by centrality of amino acids

Tandac Guclu, Canan Atilgan and Ali Rana Atilgan

Presenting author: Tandac Guclu

Abstract

Alternative splicing regulation is often disturbed in various cancers, leading to cancer-specific switches in the Most Dominant Transcripts (cMDT). To understand how these switches drive oncogenesis, we have analyzed isoform-specific protein interaction disruptions in the Pan-Cancer Analysis of Whole Genomes (PCAWG) project. Our study identified large variations in the number of cMDT, with the highest frequency in cancers of female reproductive organs. Surprisingly, in contrast to the mutational load, cancers arising from the same primary tissue showed similar numbers of cMDT. Some cMDT were found in almost all samples of a cancer type, making them ideal candidates for diagnostic biomarkers. Other cMDT tended to be located in densely populated regions of the protein network, disrupting interactions next to pathogenic cancer gene products in enzyme signalling, protein translation, and RNA splicing pathways. The highlighted common and distinct patterns of alternative splicing deregulation open new avenues for novel therapeutic targets in the fight against cancer.

Aneuploidy, meaning whole-chromosome or chromosome-arm-level changes, is a hallmark of human cancer cells, but its role in cancer remains to be fully elucidated. In this work, we focus on understanding how cancer cells deal with the excess expression induced by chromosome gains at both the transcriptome and proteome levels, and how this excess expression affects protein complex stoichiometry. For 298 tumor samples with aneuploidy, transcriptomic and proteomic data made available by the TCGA and CPTAC consortia, we first identified cancer-type-specific chromosomes that are altered at higher frequencies than would be expected by chance. Then we profiled transcriptomic changes in response to chromosome number changes. To our surprise, we found that a relatively small number of genes on the aneuploid chromosomes changed expression, while many expression changes happened on other chromosomes. Those differentially expressed genes on other chromosomes often form complexes and, moreover, are often in the same complexes as differentially expressed genes on aneuploid chromosomes. These observations are even more pronounced at the proteome level. To further investigate the differential co-regulation between co-complex members, we calculated protein-level correlations between proteins of aneuploid chromosomes and their partner proteins on other chromosomes. We found that proteins involved in a smaller number of complexes have stronger correlations with their partners, highlighting the importance of compensation for stoichiometric imbalance in protein complexes. Aggregation-prone complex members also show stronger expression correlations, suggesting that the proteotoxicity of unpaired complex members makes this compensation necessary. Our ongoing efforts focus on deciphering the regulatory control of gene expression of complex members (at both the transcriptome and proteome levels) to understand the molecular mechanisms of cancer cell adaptation to aneuploidy.

Fast-paced high-throughput sequencing technologies have resulted in large-scale datasets and biobanks. The number of sequenced human genomes has been increasing at an exponential rate, and about 2.5 million genomes have now been sequenced around the world. This number is projected to reach 105 million, and possibly many more, in 2025, especially after the COVID-19 pandemic, as many countries have decided to study genomic data at population scale. These rich troves of data can empower scientific advances. However, owing to the sensitive nature of genetic information, shared genomic datasets that include sensitive genetic or medical information about individuals can be misused if they land in the wrong hands. Hence, in the hope of sharing genomic datasets to gain a better understanding of human genetics, differential privacy (DP) has been proposed as one of the privacy concepts for sharing the summary statistics of genomic datasets in a private manner. DP mechanisms provide a rigorous mathematical foundation for preserving privacy, but they do not consider dependency between the data tuples in a dataset, which is common in genomic datasets due to the inherent correlations between the genomes of family members. We show how kin relationships between individuals in a genomic dataset cause a significant reduction in the privacy guarantees of traditional DP-based mechanisms. We formulate this as an attribute inference attack and show the privacy loss using differentially-private results of minor allele frequency (MAF) and chi-square queries over two real-life genomic datasets.
Our results show that, using the results of differentially-private MAF queries and exploiting the dependency between tuples, an adversary can reveal up to 50% more sensitive information about the genome of a target (compared to the original privacy guarantees of standard DP-based mechanisms), while differentially-private chi-square queries can reveal up to 40% more sensitive information. Furthermore, we show that these inferred genomic records (obtained through the attribute inference attack) can be utilized to perform successful membership inference attacks against other statistical genomic datasets (e.g., those associated with a sensitive trait). Using a log-likelihood-ratio (LLR) test, our results also show that the inference power of the adversary can be significantly high in such an attack, even when using inferred (and hence partially incorrect) genomes. This work was presented at the 28th conference on Intelligent Systems for Molecular Biology (ISMB2020). The full paper is available at: https://doi.org/10.1093/bioinformatics/btaa475
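
For readers unfamiliar with the baseline under attack, the following is a minimal sketch of a standard Laplace mechanism applied to a single MAF query. It is not the authors' code: the function names, the genotype encoding (0, 1 or 2 minor alleles per individual) and the example values are assumptions made for illustration. The sensitivity bound treats every individual as an independent tuple, which is exactly the assumption that breaks down when relatives are in the dataset.

```python
import random


def minor_allele_frequency(genotypes):
    """MAF at one SNP; each genotype is a minor-allele count (0, 1 or 2)."""
    return sum(genotypes) / (2 * len(genotypes))


def laplace_noise(scale, rng):
    """Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_maf(genotypes, epsilon, rng):
    """Release the MAF under epsilon-DP, assuming independent tuples."""
    # Replacing one individual's genotype changes the allele count by at
    # most 2, so the MAF moves by at most 2 / (2n) = 1 / n.
    sensitivity = 1.0 / len(genotypes)
    noisy = minor_allele_frequency(genotypes) + laplace_noise(sensitivity / epsilon, rng)
    # Clamping is post-processing and does not weaken the guarantee.
    return min(1.0, max(0.0, noisy))


rng = random.Random(0)  # fixed seed only to make the illustration repeatable
genotypes = [0, 1, 2, 0, 1, 0, 0, 1]  # hypothetical toy cohort
print(minor_allele_frequency(genotypes))  # 0.3125
print(dp_maf(genotypes, epsilon=1.0, rng=rng))
```

If a target's parent or sibling is also in the cohort, changing the target's genome is correlated with changes in the relatives' genomes, so the effective sensitivity is larger than 1/n and the stated epsilon understates the leakage.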

Phosphorylation is an essential post-translational modification for the regulation of almost all cellular processes. Widespread quantitative phosphoproteomics analyses have revealed many phosphorylation sites involved in diverse cellular mechanisms, their corresponding kinases, and quantitative changes in phosphorylation. Although the structures of single proteins and their phosphorylation sites have been studied, no systematic analysis of the structure of the whole phosphoproteome has been performed. In this study, we focused on the structural mechanism of phosphorylation: we used the relative solvent accessibility of phospho-sites to determine where they are located in the protein, and we characterized their features according to that location. We built on data from all phosphorylation sites in current databases and from a selected paper that filters false-positive phosphorylation sites via quality control. We find that a subset of phosphorylation sites is located in the protein core, with extremely low solvent accessibility, and we observed that core phosphorylation sites are enriched among the false-positive phosphorylation sites in databases. Core phosphorylation sites are significantly less functional and more rigid than other types of phosphorylation sites. Nevertheless, we found that some core phosphorylation sites are very dynamic and highly functional. Lastly, we performed the same analysis on the data of Karayel et al., which cover phosphorylation regulation in cell division, and almost all core phosphorylation sites regulated throughout cell division were detected as dynamic.

The majority of previous methods for identifying cancer driver modules output non-overlapping modules. This assumption is biologically inaccurate, as genes can participate in multiple molecular pathways. This is particularly true for cancer-associated genes, as many of them are network hubs connecting functionally distinct sets of genes. It is important to provide combinatorial optimization problem definitions that model this biological phenomenon and to suggest efficient algorithms for their solution. We provide a formal definition of the Overlapping Driver Module Identification in Cancer (ODMIC) problem and show that the problem is NP-hard. We propose a seed-and-extend based heuristic named DriveWays that identifies overlapping cancer driver modules from a graph built from the IntAct PPI network. DriveWays incorporates the mutual exclusivity, coverage, and network connectivity information of the genes. In evaluations on TCGA pan-cancer data, we show that DriveWays outperforms state-of-the-art methods in recovering well-known cancer driver genes. Additionally, DriveWays’ output modules show a stronger enrichment for the reference pathways in almost all cases. Overall, we show that enabling modules to overlap improves the recovery of functional pathways filtered with known cancer drivers, which essentially constitute the reference set of cancer-related pathways. The data, the source code, and useful scripts are available at: https://github.com/abucompbio/DriveWays.
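
As an illustration of two of the signals named above, the sketch below scores candidate modules by coverage and mutual exclusivity and lets one gene belong to two modules. The scoring formulas are commonly used definitions and are an assumption here, not necessarily the ones implemented in DriveWays; the gene-to-patient mutation sets are hypothetical toy data.

```python
def covered_patients(module, mutations):
    """Patients carrying a mutation in at least one gene of the module."""
    return set().union(*(mutations.get(gene, set()) for gene in module))


def coverage(module, mutations, n_patients):
    """Fraction of the cohort covered by the module."""
    return len(covered_patients(module, mutations)) / n_patients


def mutual_exclusivity(module, mutations):
    """1.0 when no patient is mutated in more than one gene of the module."""
    total = sum(len(mutations.get(gene, set())) for gene in module)
    return len(covered_patients(module, mutations)) / total if total else 0.0


# Hypothetical toy data: gene -> set of patient identifiers with a mutation.
mutations = {"TP53": {1, 2, 3}, "KRAS": {4, 5}, "PIK3CA": {3, 4}}

# Overlapping modules: TP53 is a hub that participates in both.
modules = [{"TP53", "KRAS"}, {"TP53", "PIK3CA"}]
for module in modules:
    print(sorted(module),
          coverage(module, mutations, n_patients=10),
          mutual_exclusivity(module, mutations))
```

A method that forces a partition would have to assign TP53 to only one of the two modules, which lowers the coverage of the other.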

Late-Onset Alzheimer’s Disease (LOAD) is the most common type of dementia in aging populations, characterized by deterioration of memory and other cognitive domains. The complex genetic etiology of LOAD is still unclear, which hinders its early and differential diagnosis. Genome-Wide Association Studies (GWAS) allow exploration of the statistical associations of individual variants, but the univariate analysis overlooks interactions between variants. Machine learning algorithms can capture hidden, novel, and significant patterns by considering nonlinear interactions between variants, which helps in understanding the genetic predisposition to complex genetic disorders where multiple variants determine the risk. We developed in-silico LOAD models based on genotyping data from three different datasets from the ADNI and dbGaP initiatives, obtained through controlled access. GWAS datasets provided by ADNI (210 controls and 344 cases), GenADA (777 controls and 798 cases), and NCRAD via dbGaP (1310 controls and 1289 cases) were analyzed. In the first step, the GenADA, NCRAD, and ADNI datasets were analyzed independently; after preprocessing, PLINK was used for GWAS, followed by p-value filtering for the initial dimension reduction. For each dataset, a two-step Random Forest (RF) was then implemented with 5-fold cross-validation (CV) using the RANGER R package. Test performances of the LOAD-RF models of the ADNI, NCRAD, and GenADA datasets were 72.9%, 68.8%, and 92.4%, respectively. 390 SNVs from ADNI, 1740 from NCRAD, and 434 from GenADA were selected by the individual LOAD-RF models, considering the permutation importance of variants at 95% confidence. There were no consensus variants, but 62 genes common to at least two datasets were identified. Additionally, six genes common to all three LOAD-RF models were identified. The test performances of the LOAD-RF-RF models of the ADNI, NCRAD, and GenADA datasets were 74.0%, 72.1%, and 85.1%, respectively.
32 SNVs from ADNI, 581 from NCRAD, and 107 from GenADA were selected by the individual LOAD-RF-RF models, considering the permutation importance of variants at 95% confidence. The LOAD-RF-RF analysis identified highly significant SNVs, six of which were selected for experimental validation by pyrosequencing. Initially, we genotyped 41 LOAD patients for the SPOCK1 variant and observed a minor allele frequency of 0.317, which is significantly higher than the expected global minor allele frequency of 0.154. The experimental validation of the rest of the LOAD-RF-RF selected risk variants is still ongoing. SNVs identified and validated in this study will be utilized for the development of a genotyping kit for the early and differential diagnosis of LOAD. The kit will support clinicians' decisions in the early and differential diagnosis of LOAD and benefit patients and their families in planning treatment and support strategies.
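
The abstract does not say which test was used for the SPOCK1 comparison. One simple way to check a claim of this kind is a one-sided exact binomial test on allele counts, sketched below. The minor-allele count of 26 is not reported in the abstract; it is reconstructed here from 0.317 x 82 alleles, and the test assumes that the 82 alleles are independent draws.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_patients = 41
n_alleles = 2 * n_patients            # diploid: 82 alleles
observed_maf = 0.317                  # reported in the abstract
expected_maf = 0.154                  # global MAF reported in the abstract

# Assumption: reconstruct the integer allele count from the reported MAF.
minor_count = round(observed_maf * n_alleles)

p_value = binomial_upper_tail(minor_count, n_alleles, expected_maf)
print(minor_count, p_value)
```

With only 41 patients, a case-control comparison against matched controls would be a stronger design than a comparison against a global reference frequency, since allele frequencies differ between populations.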

Ageing is the major risk factor for many diseases. With the rise in life expectancy, the overall burden of ageing-related diseases increases. The molecular link between ageing and age-related diseases, however, remains elusive. In this study, we test whether diseases with similar age-of-onset share a genetic component that is also implicated in ageing. We perform GWAS on UK Biobank data, which includes genomic, medical and lifestyle measures for almost half a million participants. Our analysis comparing 116 diseases suggested four disease clusters defined by their age-of-onset. We found that diseases with the same onset profile are genetically more similar, suggesting a common aetiology. Moreover, this similarity cannot be explained by disease categories (e.g. cardiovascular, endocrine), co-occurrences, or disease cause-effect relationships. Two of the clusters showed an age-dependent profile, starting to increase in prevalence after the age of 20 and 40 years. These clusters had genetic risk factors associated with senescence regulators and targets of the pro-longevity drugs. However, they had distinct functional enrichment and risk allele frequency distributions. We also tested predictions of mutation accumulation and antagonistic pleiotropy theories of ageing and found support for both. We are now working on a drug repurposing approach to find drugs targeting the common genetics between age-related diseases. This approach has the potential to identify drugs targeting multiple diseases simultaneously and alleviate the effects of multimorbidity and polypharmacy in late ages.

Identification of high affinity drug-target interactions is a major research question in drug discovery. Proteins are generally represented by their structures or sequences. However, structures are available only for a small subset of biomolecules and sequence similarity is not always correlated with functional similarity. We propose ChemBoost, a chemical language based approach for affinity prediction using SMILES syntax. We hypothesize that SMILES is a codified language and ligands are documents composed of chemical words. These documents can be used to learn chemical word vectors that represent words in similar contexts with similar vectors. In ChemBoost, the ligands are represented via chemical word embeddings, while the proteins are represented through sequence-based features and/or chemical words of their ligands. Our aim is to process the patterns in SMILES as a language to predict protein-ligand affinity, even when we cannot infer the function from the sequence. We used eXtreme Gradient Boosting to predict protein-ligand affinities in KIBA and BindingDB data sets. ChemBoost was able to predict drug-target binding affinity as well as or better than state-of-the-art machine learning systems. When powered with ligand-centric representations, ChemBoost was more robust to the changes in protein sequence similarity and successfully captured the interactions between a protein and a ligand, even if the protein has low sequence similarity to the known targets of the ligand.
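
To make the "ligands as documents" idea concrete, here is a minimal sketch that splits SMILES strings into overlapping fixed-length "chemical words" and counts them across a small corpus. The segmentation scheme (overlapping k-character substrings with k = 8 by default) is an assumption chosen for illustration; the abstract does not specify how ChemBoost forms its chemical words, and the function names are hypothetical.

```python
from collections import Counter


def chemical_words(smiles, k=8):
    """Split a SMILES string into overlapping k-character 'words'."""
    if len(smiles) <= k:
        return [smiles]
    return [smiles[i:i + k] for i in range(len(smiles) - k + 1)]


def build_vocabulary(smiles_corpus, k=8):
    """Count chemical words over a corpus of ligands (the 'documents')."""
    counts = Counter()
    for smiles in smiles_corpus:
        counts.update(chemical_words(smiles, k))
    return counts


# Acetic acid and ethanol, with a short word length for readability.
print(chemical_words("CC(=O)O", k=3))  # ['CC(', 'C(=', '(=O', '=O)', 'O)O']
print(build_vocabulary(["CC(=O)O", "CCO"], k=3).most_common(2))
```

Word sequences of this kind can be fed to any word-embedding method to obtain the chemical word vectors described above; a ligand vector can then be formed, for example, by averaging the vectors of its words.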

Systemic analysis of available biological/biomedical data is critical for developing novel and effective treatment approaches against both complex diseases and rapidly emerging outbreaks (e.g., COVID-19). Because different sections of the biomedical data are produced by different organizations/institutions using various technologies, the data is scattered across individual resources without any explicit relations/connections, hindering comprehensive multi-omics-based analysis. We aimed to address this issue by constructing a comprehensive biological/biomedical resource, CROssBAR, which integrates large-scale data from various sources, enriches this data with deep learning-based prediction of relations, and presents it via cutting-edge knowledge graph (KG) representations in our open-access web service at https://crossbar.kansil.org. Starting from late 2019, the new coronavirus pandemic has wreaked havoc and caused nearly 850K deaths. Systemic evaluation of the current biomedical knowledge about SARS-CoV-2 infection is expected to aid researchers in developing effective drugs and vaccines. With the aim of contributing to this endeavor, we have constructed two COVID-19 KGs (https://crossbar.kansil.org/covid_main.php) using the CROssBAR system: (i) a large-scale version including the entirety of related information on various CROssBAR-integrated resources, and (ii) a simplified version distilled to include only the most relevant terms, ideal for fast interpretation. CROssBAR COVID-19 KGs incorporate relevant virus and host genes/proteins, interactions, pathways, phenotypes and other diseases, as well as drugs/compounds, some of which are new. These new drugs have been incorporated into the KGs either by our network analysis-based pipeline or as predictions of our deep-learning-based tools.
We conducted a literature-based validation study and found that many of these drugs are now being tested at preclinical/clinical stages against COVID-19. It is interesting to observe direct/indirect relations between the phenotypes/diseases in the KGs and COVID-19 over the incorporated host genes/proteins and enriched pathways, and between COVID-19 and our computationally predicted drugs/compounds, as they may reveal further evidence to be utilized against this disease.

Macrosatellite repeats (MSRs) are DNA patterns of 100 bp or longer that repeat tandemly throughout the genome. MSRs that change copy number are called variable number tandem repeats (VNTRs), which have been predicted to have biological effects and have been linked to diseases. However, MSRs have not been studied in a high-throughput fashion. Therefore, we have developed a computational tool named Macro-Satellites Using Depth (MaSUD) to genotype MSR loci in the human genome. To predict copy number changes, MaSUD compares the number of reads mapping inside each MSR locus to a background distribution of similarly simulated reads of the reference allele. The performance of MaSUD was demonstrated on simulated datasets (precision>90% and recall>50%) and validated using long PacBio reads (linear regression p-value<2e-16 and r²=0.55, correlation=74.76%). We ran MaSUD on 2,504 genomes from five super-populations of the 1000 Genomes Project using 3,875 reference MSR loci. MaSUD predicted that >95% of these MSRs have a copy number variant in at least one individual and that, on average, a locus was variant in 1,457 individuals. A total of 2,512 VNTRs overlapped with 1,190 genes that were enriched in pathways related to cancer, diabetes, neuron differentiation, and neurogenesis. To identify VNTRs affecting gene expression, we compared the mean B-cell mRNA expression levels from 448 individuals using probes overlapping VNTRs (t-test, FDR<5%). Expression of 84 genes was significantly correlated with the corresponding VNTR allele. Top genes correlated with VNTRs include FANCA, AMFR, SPG7, INPP5E, DPYSL4, GPR35, PIGN, PEX5, PRPF6, EXOC2, MXRA7, and LRCH3. Alternative splicing was among the UniProt keywords enriched for these genes (FDR=5e-3). In addition, unsupervised clustering shows that VNTRs separate human super-populations, and using a Random Forest model we could predict ancestry with 78% accuracy. This represents the first high-throughput analysis of macrosatellites in humans.
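
The depth-based comparison described above can be illustrated with a small sketch: the observed read count at a locus is compared with counts obtained from simulated reads of the reference allele, and a z-score decides between gain, loss and reference. The z-score rule, the cutoff of 3 and all names and numbers below are assumptions for illustration and do not reproduce MaSUD's actual statistical model.

```python
import math
from statistics import mean, stdev


def call_copy_change(observed_reads, background_reads, z_cutoff=3.0):
    """Compare an observed read count with a simulated reference background."""
    mu = mean(background_reads)
    sigma = stdev(background_reads)
    if sigma == 0:
        # Degenerate background: any deviation is treated as extreme.
        diff = observed_reads - mu
        z = 0.0 if diff == 0 else math.copysign(math.inf, diff)
    else:
        z = (observed_reads - mu) / sigma
    if z >= z_cutoff:
        return "gain", z
    if z <= -z_cutoff:
        return "loss", z
    return "reference", z


# Hypothetical read counts from simulations of the reference allele.
background = [98, 101, 100, 99, 102, 100, 97, 103]
for observed in (100, 150, 60):
    print(observed, call_copy_change(observed, background))
```

In practice the background would be simulated at the sample's own sequencing depth and read length, so that coverage differences between samples are not mistaken for copy number changes.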

Mass spectrometry enables high-throughput screening of phospho-proteins across a broad range of biological contexts. When complemented by computational algorithms, phospho-proteomic data allows the inference of kinase activity, facilitating the identification of dysregulated kinases in various diseases including cancer, Alzheimer’s disease and Parkinson’s disease. To enhance the reliability of kinase activity inference, we present a network-based framework, RoKAI, that integrates various sources of functional information to capture coordinated changes in signaling. Through computational experiments, we show that the phosphorylation of sites in the functional neighborhood of a kinase is significantly predictive of its activity. The incorporation of this knowledge in RoKAI consistently enhances the accuracy of kinase activity inference methods while making them more robust to missing annotations and quantifications. This enables the identification of understudied kinases and will likely lead to the development of novel kinase inhibitors for targeted therapy of many diseases. RoKAI is available as a web-based tool at http://rokai.io.
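
The general idea can be sketched in two steps: first borrow information from functionally related sites, then score each kinase from its known substrates. The code below uses a plain neighbor average as a stand-in for network propagation and a z-score of mean substrate phosphorylation as the inference method. Both choices, along with every site name, kinase name and value, are assumptions for illustration and are not RoKAI's actual algorithm.

```python
from math import sqrt
from statistics import mean, pstdev


def smooth_over_neighbors(site_log_fc, neighbors):
    """Average each site's log fold change with its quantified neighbors."""
    smoothed = {}
    for site, value in site_log_fc.items():
        linked = [site_log_fc[n] for n in neighbors.get(site, ()) if n in site_log_fc]
        smoothed[site] = mean([value] + linked)
    return smoothed


def kinase_activity_zscores(site_log_fc, kinase_substrates):
    """Score each kinase by the mean log fold change of its substrates."""
    sigma = pstdev(list(site_log_fc.values()))
    scores = {}
    for kinase, sites in kinase_substrates.items():
        quantified = [site_log_fc[s] for s in sites if s in site_log_fc]
        if quantified and sigma > 0:
            scores[kinase] = mean(quantified) * sqrt(len(quantified)) / sigma
    return scores


# Hypothetical phosphosite log fold changes and annotations.
site_log_fc = {"A_S10": 2.0, "B_T5": 1.0, "C_Y7": -1.0, "D_S3": 0.0}
neighbors = {"A_S10": ["B_T5"], "B_T5": ["A_S10"]}
kinase_substrates = {"K1": ["A_S10", "B_T5"], "K2": ["C_Y7", "unquantified_site"]}

refined = smooth_over_neighbors(site_log_fc, neighbors)
print(kinase_activity_zscores(refined, kinase_substrates))
```

Smoothing before scoring is what lets a kinase with few quantified substrates still receive a usable score, which is the robustness to missing quantifications mentioned above.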

Ischemic heart disease (IHD), causing high morbidity and mortality, continues to be the leading cause of death worldwide. In this study, samples of right atrial appendage were collected for transcriptomic profiling from 40 patients with IHD undergoing elective coronary artery bypass grafting (CABG) surgery. Additionally, 8 samples from patients with solitary valvular disease undergoing corrective valvular surgery were harvested to serve as controls. Clinical and follow-up data, including medication and laboratory measurements, were also collected for each patient. We obtained the transcriptomic data of healthy right atrial appendage tissue from GTEx (n = 429). Our aim in this study is to find novel associations and genes related to IHD and to predict the risk of having IHD by integrating transcriptomic, clinical and interactome data. We found 357 upregulated and 310 downregulated differentially expressed genes (DEGs) in IHD samples compared to healthy tissues (FDR < 0.05 and |logFC| > 2). Among these, genes from the protocadherin gamma subfamily were significantly differentially expressed in the patient group with an ejection fraction lower than 55% (p value < 0.05); the ejection fraction is the percentage of blood leaving the heart each time it contracts. We inferred the most critical pathways from the list of DEGs and found that agrin interactions at the neuromuscular junction, epithelial adherens junction signaling, sirtuin signaling and oxidative phosphorylation are significantly enriched. DEGs associated with oxidative phosphorylation are downregulated. Additionally, functional analysis of miRNAs and their targets that have significantly different expression values between patient groups resulted in the enrichment of lipid metabolism. Overall, our results provide a transcriptome-level understanding of processes reactive to IHD and of the association of gene-level data with phenotypic information.

DNA in ancient samples is highly fragmented due to decay after death, has exogenous contamination and contains a low amount of endogenous DNA. Consequently, ancient DNA processing usually involves studying genome data with <1x coverage, composed of short reads with frequent C-to-T transitions at their ends. These create two types of challenges. The first is the inability to call full diploid genotypes. Solutions include pseudo-haploidization and genotype likelihood methods. However, it has been observed that such ancient genome data is “reference biased”, i.e. it contains more reference alleles than alternative alleles at heterozygous positions. This appears to be caused by the loss of reads bearing alternative alleles due to their slightly lower mapping quality. The second challenge is to avoid confusing postmortem C-to-T transitions with authentic variation. The usual solution is to use variants identified in worldwide populations instead of de novo calls. Further, one may use only transversions, or both transversions and transitions but after trimming 2-10 nucleotides from read ends, where postmortem damage accumulates. Unfortunately, the former approach means not using c. 67% of SNP data, while the latter means losing up to 30% of data due to short read lengths. Here we propose solutions to mitigate these effects in ancient genome data preprocessing. The first addresses reference bias. We show that aligning read data to a graph genome, or aligning to a linear reference genome after masking common polymorphic sites in the reference, effectively removes reference bias in ancient genotype data. The second involves avoiding postmortem damage effects while minimizing data loss. Here, instead of trimming read ends, we mask potential sites where the read’s genotype can be affected by postmortem cytosine deamination.
Our primary analysis increases genotyping by 15% especially in the lowest coverage samples without compromising accuracy, thereby significantly boosting statistical power in downstream population genetics analyses.
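
The contrast between trimming and masking can be sketched as follows. Instead of discarding a fixed number of bases from both read ends, only the bases that overlap damage-prone SNPs near the ends are replaced by N. The rule used here (C/T SNPs near the 5' end and G/A SNPs near the 3' end of a forward-strand read, with a 10-base window) is an assumption for illustration, as are the function name and the toy coordinates; the abstract does not describe the authors' exact masking criteria.

```python
def mask_damage_prone_sites(read, start, snp_alleles, window=10):
    """Mask read bases at SNPs where deamination could mimic an allele.

    read:        forward-strand read sequence, aligned without indels
    start:       reference coordinate of the first read base
    snp_alleles: dict mapping reference coordinate -> set of SNP alleles
    window:      number of bases at each read end considered damage-prone
    """
    n = len(read)
    masked = list(read)
    for offset in range(n):
        alleles = snp_alleles.get(start + offset)
        if alleles is None:
            continue
        near_5p = offset < window
        near_3p = offset >= n - window
        # C-to-T damage appears at 5' ends, its G-to-A complement at 3' ends.
        if (near_5p and alleles == {"C", "T"}) or (near_3p and alleles == {"G", "A"}):
            masked[offset] = "N"
    return "".join(masked)


read = "TGCATGCATGCATGCATGCATGCAA"  # hypothetical 25-base read
snps = {100: {"C", "T"}, 112: {"C", "T"}, 124: {"G", "A"}}
print(mask_damage_prone_sites(read, start=100, snp_alleles=snps, window=5))
```

In this toy case trimming 5 bases from each end would discard 10 of 25 bases, whereas masking removes only the two bases whose genotype call could actually be affected; the C/T SNP at position 112 lies outside the damage window and is kept.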

Mutations occasionally affect protein structure and/or function. Among such changes, alterations in ligand specificity are important because they may have significant consequences, such as the emergence of antibiotic resistance or the disruption of cell signaling. Here we study the PDZ3 domain, which has an important role in mammalian neural cell signaling. PDZ domains construct the PSD-95 complex by binding the CRIPT (ligand I) and T-2F (ligand II) ligands. Previously, specific mutations of this domain have been demonstrated to change its preferred ligand: the wild-type (WT) protein has higher binding affinity for ligand I and the G330T mutant binds both ligands I and II, while the H372A mutant and the G330T-H372A double mutant (DM) tend to bind only ligand II. To scrutinize the structural features that emerge due to the mutations, we conducted network analyses on snapshots from 400-ns long molecular dynamics simulations. We then utilized betweenness centrality (BC) to find the nodes that act as hubs for information communication in biological function. ΔBC results show that the N-terminus has an impact on the formation of H372AL2. Furthermore, we employed the Girvan-Newman algorithm to investigate the modularity of the PDZ3 protein. The results indicate that the N and C termini of the structure are in the same community, while the N-terminus and the ligand tend to be located in the same community only in the favorable WT and single-mutation cases. We explain how changes in residue centralities, caused by perturbations introduced in the form of mutations, lead to the ligand switching behavior in the PDZ domain, and discuss why this behavior is governed by the N-terminus region.
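
The ΔBC analysis can be illustrated with a small sketch: betweenness centrality is computed on a wild-type and a mutant residue contact network, and the per-residue difference is reported. The implementation below is Brandes' algorithm for unweighted, undirected graphs. The toy three-residue networks, the residue labels and the function names are hypothetical, and the way contacts are defined from simulation snapshots is not specified in the abstract.

```python
from collections import deque


def betweenness_centrality(adjacency):
    """Brandes' algorithm; adjacency maps each node to its neighbor list.

    Every neighbor must also appear as a key of the adjacency mapping.
    """
    bc = {v: 0.0 for v in adjacency}
    for source in adjacency:
        order = []
        preds = {v: [] for v in adjacency}
        n_paths = dict.fromkeys(adjacency, 0)
        dist = dict.fromkeys(adjacency, -1)
        n_paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:                       # breadth-first search from source
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        dependency = dict.fromkeys(adjacency, 0.0)
        for w in reversed(order):          # accumulate from farthest nodes back
            for v in preds[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1 + dependency[w])
            if w != source:
                bc[w] += dependency[w]
    # Each undirected pair was counted from both endpoints.
    return {v: value / 2 for v, value in bc.items()}


def delta_bc(wild_type, mutant):
    """Per-residue change in betweenness, mutant minus wild type."""
    bc_wt, bc_mut = betweenness_centrality(wild_type), betweenness_centrality(mutant)
    return {res: bc_mut[res] - bc_wt[res] for res in bc_wt if res in bc_mut}


# Hypothetical contact networks: the mutation adds a direct r1-r3 contact.
wt = {"r1": ["r2"], "r2": ["r1", "r3"], "r3": ["r2"]}
mut = {"r1": ["r2", "r3"], "r2": ["r1", "r3"], "r3": ["r2", "r1"]}
print(delta_bc(wt, mut))  # {'r1': 0.0, 'r2': -1.0, 'r3': 0.0}
```

In the toy example the bridging residue r2 loses its hub role once r1 and r3 are in direct contact, which is the kind of rerouting that a negative ΔBC value flags.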