STUDY OF MARKERS FOR REGULATORY ELEMENTS IN HUMAN GENOME

by Vatsal Agarwal

A thesis submitted to Johns Hopkins University in conformity with the requirements for the degree of Master of Science in Engineering

Baltimore, Maryland
October 2013

© 2013 Vatsal Agarwal
All rights reserved

Abstract

Most genetic traits and diseases in humans, from height to cancer or sudden cardiac death, do not follow Mendelian principles but originate from complex combinatorial effects of multiple genes, each with possibly multiple variants. Most of these variants lie within non-coding regions of the genome such as promoters, enhancers, or insulators, which regulate the expression levels of genes. Numerous algorithms predict the likely location of these regulatory regions using biological features such as conservation, transcription factor binding, deoxyribonuclease I (DNaseI) hypersensitivity, and others. The first part of the thesis presents software to compile such annotations and visualize them in a customizable manner. The second part discusses the distribution of one of these features, DNaseI sensitivity, across the human genome.

In the first part, we developed the software and used it to study the NOS1AP (NO-synthase adapter protein) gene locus and the beta-globin gene locus. Since single nucleotide polymorphisms (SNPs) at the NOS1AP locus are known to affect the electrocardiographic QT interval, we collected the corresponding data from a genome-wide association study. We plotted the genetic effect and frequency of these SNPs across the length of the NOS1AP locus, along with genes and other functional annotations from various public databases including RefSeq, the University of California Santa Cruz (UCSC) Genome Browser, TRANSFAC, and the Encyclopedia of DNA Elements (ENCODE) project. We also added SNPs from the 1000 Genomes project to increase the number of variants available for analysis.
We observed a lack of known annotations at almost all variants, which suggested the following possibility: although particular regions of the human genome may not be significant enough to be designated as regulatory regions, they may still contain weak sites that affect overall gene expression. This motivated the study of the distribution of DNaseI sensitivity across the human genome, which forms the second part of the thesis.

In the second part, we modeled DNaseI sensitivity, a marker for chromatin accessibility and regulatory elements, using data collected by the University of Washington (UW) as part of the ENCODE project. We used a Gamma-weighted Poisson distribution as our model and an ordinary Poisson distribution as noise. Maximum-likelihood fitting over the entire genome as well as over individual chromosomes, across different cell lines, indicated that most of the human genome is inactive and that the remainder generally has very low DNaseI sensitivity. Only a very small fraction of the genome (<1%) is DNaseI hypersensitive.

Primary reader: Dr. Aravinda Chakravarti (Advisor).
Secondary readers: Dr. Michael Beer, Dr. Liliana Florea.

Acknowledgement

I am grateful to Dr. Aravinda Chakravarti, my mentor and advisor, not only for providing me the opportunity and guidance to do this project but also for teaching me the methods and ethics of conducting proper research. I would like to thank Dr. Ashish Kapoor for his thorough input on this project and thesis, as well as for his personal help and support over the past year and a half. I would also like to take this opportunity to thank all the members of my lab and my friends for suggesting ideas for some aspects of the project from time to time and for keeping me motivated to complete the thesis. Finally, I am truly grateful to my parents for their encouragement to pursue graduate studies and for their endless love and support in all aspects of my life, without which it would not have been possible for me to complete this thesis.
Table of Contents

Abstract
Acknowledgement
List of figures
Chapter 1 Introduction
1.1 Non-mendelian genetics and complex traits
1.1.1 Overview and application
1.2 Dissertation outline
1.2.1 Software to compile and visualize various known functional annotations
1.2.2 Genome-wide modeling of DNase I sensitivity
Chapter 2 Annotation visualization software
2.1 Introduction
2.2 Samples
2.2.1 Sample selection
2.2.2 Setting up the software
2.3 Results
2.3.1 Analysis of the NOS1AP locus
2.3.2 Analysis of Beta-globin locus
2.4 Summary & Discussion
Chapter 3 Modelling DNaseI sensitivity across human genome
3.1 Introduction
3.2 Samples & Methods
3.2.1 Sample selection and preparation
3.2.2 Model proposition and fitting
3.3 Results
3.3.1 Parameters for final fitting
3.3.2 Comparison of replicate datasets
3.3.3 Variation across chromosomes
3.3.4 Differences among different cell lines
3.4 Conclusion & Discussion
References
Appendices
Appendix A
Appendix B
Appendix C
Curriculum Vitae

List of figures

Chapter 2
Figure 2.1 Software output for 30kb region around NOS1AP locus, along the length of the chromosome on X-axis
Figure 2.2 Software output for 70kb region around beta-globin protein gene, along the length of the chromosome on X-axis

Chapter 3
Figure 3.1 Ideal Poisson curve for uniformly sensitive DNA against bar curve of real values from chromosome 1 of HCF cell line (replicate 1)
Figure 3.2 Fitted model curve against bar curve representing raw data from chromosome 1 of HCF cell line (replicate 1)
Figure 3.3 Comparison of parameters between replicates
Figure 3.4 Gamma distributions from best-fit parameters for each chromosome in (a) HCF cell line and (b) GM12864 cell line
Figure 3.5 Comparison of genome-wide fitting parameters from different cell lines

Chapter 1 Introduction

1.1 Non-mendelian genetics and complex traits

1.1.1 Overview and application

Physical traits studied in early genetics were simple and monogenic, following Mendelian principles in which a significant mutation in a single gene caused a distinct phenotype or disease. Fisher's model extended this logic to multiple genes and quantitative trait loci, where the expression of multiple genes has an additive effect on the phenotype [1]. However, these principles account for only a small number of traits.

Improvements in sequencing technologies, from exomes to complete genomes, coupled with steeply falling sequencing costs, have provided the scientific community with a huge amount of genetic data with which to analyze the correlation between sequence variation and human genetic traits and diseases. Genome-wide association studies (GWAS) have been performed for many traits and diseases, but these explain only a small portion of the observed phenotypic variation [2].
Moreover, GWAS are based on the principle of linkage disequilibrium [3, 4] and hence only highlight the target loci rather than identifying the causal variation. However, data from GWAS of over 240 traits and diseases, identifying over 3500 associated SNPs, show that about 88% of these SNPs lie within non-coding regions of the genome [5]. These non-coding variants are hypothesized to lie in regulatory regions of the genome, which regulate gene expression. Locating the regulatory regions in the genome would therefore bring us a step closer to identifying the causal variation.

Unfortunately, there are many classes of regulatory elements with significantly different structures and functions. Promoters are responsible for initiating and regulating transcription and lie upstream of the gene on the same strand; enhancers increase the rate of transcription whereas suppressors decrease it, and both may lie far from the gene they regulate; insulators act as a barrier that prevents certain enhancers and suppressors from acting beyond a certain region; transcription factor binding sites, as the name suggests, are locations bound by transcription factors.