Identification of common and unique stress responsive genes of Arabidopsis thaliana under different abiotic stress through RNA-Seq meta-analysis

Shamima Akter

Thesis submitted to the faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE
in
Crop and Soil Environmental Sciences

Song Li, Chair
M. A. Saghai Maroof
Bo Zhang

December 19, 2017
Blacksburg, VA

Keywords: Abiotic stress, Reactive oxygen species, RNA-Seq, RNA-Seq pipeline, Gene Expression Omnibus Series (GSE), Differentially Expressed Genes (DEG), Jaccard similarity index, Gene Ontology (GO)

ABSTRACT

Abiotic stress is a major constraint on crop productivity worldwide. To better understand the common biological mechanisms of abiotic stress responses in plants, we performed a meta-analysis of 652 RNA sequencing (RNA-Seq) samples from 43 published abiotic stress experiments in Arabidopsis thaliana. These samples were categorized into eight different abiotic stresses, including drought, heat, cold, salt, light and wounding. We developed a multi-step computational pipeline that performs data downloading, preprocessing, read mapping, read counting and differential expression analysis for RNA-Seq data. We found that 5729 and 5062 genes are induced or repressed, respectively, by only one type of abiotic stress. Only 18 and 12 genes are induced or repressed, respectively, by all stresses. The commonly induced genes are related to the regulation of gene expression by the stress hormone abscisic acid (ABA). The commonly repressed genes are related to reduced growth and chloroplast activities. We compared stress responsive genes between every pair of stress types and found that heat and cold regulate similar sets of genes.
We also found that high light affects a different set of genes than blue light and red light. Interestingly, ABA-regulated genes are different from those regulated by other stresses. Finally, we found that membrane-related genes are repressed by ABA, heat, cold and wounding but are up-regulated by blue light and red light. The results from this work will be used to further characterize the gene regulatory networks underlying stress responsive genes in plants.

GENERAL AUDIENCE ABSTRACT

Abiotic stress is a major constraint on crop productivity worldwide. To better understand the common biological mechanisms of abiotic stress responses in plants, we analyzed 652 RNA sequencing samples from 43 published abiotic stress experiments in Arabidopsis thaliana. These samples were collected from eight different abiotic stresses, including drought, heat, cold, salt, light and wounding. We identified genes that were induced or repressed by each of these stresses. We found that 5729 and 5062 genes are induced or repressed, respectively, by only one type of abiotic stress. Only 18 and 12 genes are induced or repressed, respectively, by all stresses. The commonly induced genes are related to the regulation of gene expression by a stress hormone. The commonly repressed genes are related to reduced growth. We compared stress responsive genes between every pair of stress types and found that heat and cold regulate similar sets of genes. We also found that high light affects a different set of genes than blue light and red light. Finally, we found that membrane-related genes are repressed by the stress hormone, heat, cold and wounding but are up-regulated by blue light and red light. The results from this work will be used to further characterize the gene regulation underlying stress responsive genes in plants.

DEDICATION

My Mother (Mrs. Rowshan Akter)
A strong, beloved, and gentle soul who taught and encouraged me to work hard, so that much could be done with little effort.

My Father (Md. Fazlul Haque Sarker)
For earning an honest living for us and for supporting and encouraging me to believe in myself.

My Husband (Manik Ahmed)
For his relentless efforts for my future and for always encouraging me to take on the toughest jobs on earth.

My daughters (Mumtahina S. Ahmed & Mahdita S. Ahmed)
Without them I could not have written a single word; I probably could not even have taken a breath.

ACKNOWLEDGEMENTS

I would like to express my sincere gratitude and thanks to Dr. Song Li for giving me the opportunity to work in the area of genomics and bioinformatics, which caught my interest when I rotated through his lab. I am very grateful to him for providing direction in every way to help me improve as a researcher. I believe that because of his guidance, advice and suggestions I was able to complete my journey in this new area successfully, and I am now on the verge of earning a degree that will help me move forward with new vision and ambitions. He taught me to think critically about my research. I would also like to thank my committee members, Dr. M. A. Saghai Maroof and Dr. Bo Zhang, for their guidance, support and time. I am thankful and grateful to my lab member Jiyoung Lee, who always gave me critical and realistic suggestions for handling situations and problems. Several times I sought her suggestions and assistance while working on the project. Her advice and solutions are something I will always remember. I would like to thank my lab member Alex Qi Song, whom I met when I did not know anything about coding. He helped me in this project by writing a critical script to obtain important results. I am also grateful to Dr. Zhang and her lab members. She gave me the opportunity to work in her lab on her GWAS project.
I want to thank Ph.D. student Matthew Colson in Dr. Zhang's lab, who helped me organize some of the sugar extraction work when I was going through great hardship during my daughter's illness. I would like to thank Hazem Sharaf, my friend and classmate, who helped me several times to troubleshoot and solve problems while building the pipeline. I am grateful to my husband, Manik Ahmed, for his continuous support and encouragement at every step of my study. He taught me to move forward and to keep trying rather than lose hope. I am ever grateful to my daughters for allowing me to study even when they needed me most.

TABLE OF CONTENTS

TITLE
ABSTRACT
DEDICATION
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1
LITERATURE REVIEW
  1.1 ABIOTIC STRESS AND AGRICULTURE PRODUCTIVITY
  1.2 MECHANISMS OF STRESS TOLERANCE: GENE EXPRESSION CHANGE
  1.3 RNA-SEQ
  1.4 RNA-SEQ PIPELINE
  1.5 SOFTWARE TOOLS OF THE RNA-SEQ PIPELINE
    1.5.1 READ ALIGNMENT: STAR
    1.5.2 READ COUNTING: FeatureCounts
    1.5.3 DIFFERENTIAL EXPRESSION: DESeq2
    1.5.4 DIFFERENTIAL EXPRESSION: edgeR
  1.6 REFERENCES
CHAPTER 2
RNA-SEQ ANALYSIS OF PUBLISHED RNA-SEQ DATA FOR ARABIDOPSIS THALIANA UNDER DIFFERENT ABIOTIC STRESSES
  2.1 INTRODUCTION
  2.2 OBJECTIVES
  2.3 METHODS
    2.3.1 PROCESS OF RNA-Seq ANALYSIS
    2.3.2 PIPELINE OVERVIEW
    2.3.3 FLOW CHART OF THE PIPELINE
    2.3.4 SCRIPTS OF THE PIPELINE
    2.3.5 SOFTWARES OF THE PIPELINE
    2.3.6 DATA SETS
  2.4 RESULTS
  2.5 DISCUSSION
  2.6 CONCLUSIONS
  2.7 FUTURE DIRECTIONS
  2.8 REFERENCES
3.0 APPENDIXES
  3.1 APPENDIX 1: SCRIPT FOR MAPPING READS ON GENOME
  3.3 APPENDIX 3: SCRIPT FOR MERGING READS AND FPKM
  3.4 APPENDIX 4: RSCRIPT FOR DIFFERENTIAL EXPRESSION ANALYSIS

LIST OF FIGURES

Figure 2.1 Process of RNA-Seq analysis. Three steps in RNA-Seq analysis: 1) Experimental design 2) Sequencing 3) Data analysis in a high performance computing system.

Figure 2.2 Flowchart of the RNA-Seq pipeline. There are seven steps in this pipeline. In each step, software names are shown in orange text. Each step needs one input file, shown in blue text, and produces an output file, shown in green text. The output file of each step is the input for the next step.

Figure 2.3 Summary of the 43 experiments that were analyzed through the pipeline.

Figure 2.4 Comparison of the number of up- and down-regulated genes by number of stresses. The x axis represents the number of stresses and the y axis represents the number of genes. Blue bars denote down-regulated genes and orange bars denote up-regulated genes. The bar plot shows that the number of genes decreases as the number of stresses increases.

Figure 2.5 The Jaccard similarity index for induced genes for each stress. The Jaccard index is defined as Jaccard(A,B) = |A∩B| / |A∪B|, where A represents the genes in one stress and B represents the genes in another stress. Dendrograms are on the left and upper sides of the heatmap. Stress names are listed along the right and lower sides. The scale in the upper corner gives the color for similarity: red indicates the highest similarity, blue indicates no similarity and white denotes moderate similarity. The three groups are group 1: heat and cold; group 2: ABA, salt and high light; and group 3: blue light, red light and wounding.

Figure 2.6 The Jaccard similarity index for repressed genes for each stress. Dendrograms are on the left and upper sides of the heatmap. Stress names are listed along the right and lower sides. The scale in the upper corner gives the color for similarity: red for the highest similarity, blue for no similarity and white for moderate similarity. Group 1: heat and cold. Group 2: salt and high light. Group 3: blue light and red light.

Figure 2.7 A schematic model of abiotic stress response based on genes found in our analysis. Transcription of many genes results in different mechanisms of stress tolerance. In this analysis of abiotic stress data from Arabidopsis thaliana, ABF3/RD26 encodes ABREB, which induces genes (GBSS1 and AT1G02660) that positively regulate energy production. P5CS1 was induced by stress and has ROS scavenging activity. EXPA16 has cell expansion or cell wall modification activity. SRI kinases work in the ABA signal transduction pathway. Stress also inhibits CAB3, a positive regulator of chloroplast activity, and RGF9, a positive regulator of cell proliferation. This model was drawn using the BEACON pathway editor (Elmarakeby et al., 2017).

LIST OF TABLES

Table 2.1 Summary of the 43 published Arabidopsis thaliana experiments collected from the GEO database. Together these data comprise 3 time series experiments, 652 SRRs with 260 conditions, 8 stress types and 19 tissue types.

Table 2.2 Eight stresses (blue light, high light, red light, cold, heat, salt, ABA and wounding) from 10 experiments were selected. These experiments were selected because they have properly designed biological replicates. One experiment (GSE72806) has a combination of salt and heat stresses. One experiment has three combined stresses: heat, cold and wounding. Two separate experiments (GSE63406 and GSE67332) tested the cold stress.

Table 2.3 Summary of read mapping results from the STAR software. There were 652 samples used in this analysis.

Table 2.4 Summary of reads counted by the FeatureCounts software. There were 644 samples used in this analysis.

Table 2.5 Summary from the DESeq2 run for six selected GSE experiments.

Table 2.6 Functional annotation of common stress responsive genes obtained using the Thalemine tool in Araport https://apps.araport.org/thalemine/.

Table 2.7 Functional annotation of the common repressed genes obtained using the Thalemine tool in Araport https://apps.araport.org/thalemine/. Highlighted genes are known to be involved in plant stress responses and are discussed in the main text.

Table 2.8 The detailed functions of the common induced and repressed genes. The genes in the light blue area are the induced genes and the genes in the light orange area are the repressed genes.

Table 2.9.1 Summary of gene ontology (GO) analysis of unique up- and down-regulated genes using the agriGO GO analysis tool. Four GO terms are significant with very low FDR. Genes in these GO terms have nutrient reservoir, transcription factor (TF) and TF binding activity.

Table 2.9.2 A single GO term, GO:0044425, is significant for genes repressed by ABA, cold, heat and wounding and for genes induced by blue light and red light, with function of