ebook img

Gene set bagging for estimating replicability of gene set analyses PDF

1.2 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Gene set bagging for estimating replicability of gene set analyses

Gene set bagging for estimating replicability of gene set anal- yses Andrew E. Jaffe1,2 , John D. Storey3 , Hongkai Ji1 and Jeffrey T. Leek∗1 3 1 0 1DepartmentofBiostatistics,JohnsHopkinsBloombergSchoolofPublicHealth,BaltimoreMD21205 2 2LieberInstituteforBrainDevelopment,JohnsHopkinsMedicalCampus,BaltimoreMD21205 n 3Lewis-SiglerInstituteandDepartmentofMolecularBiology,PrincetonUniversity,Princeton,NJ08544 a J Email: JeffreyTLeek∗: [email protected]; 8 1 ∗Correspondingauthor ] E M Abstract . t a t s [ 2 Background v 3 Significance analysis plays a major role in identifying and ranking genes, transcription factor binding sites, DNA 3 9 methylationregions,andotherhigh-throughputfeaturesforassociationwithdisease. Weproposeanewapproach, 3 . called gene set bagging, for measuring the stability of ranking procedures using predefined gene sets. Gene set 1 0 bagging involves resampling the original high-throughput data, performing gene-set analysis on the resampled 3 1 data, and confirming that biological categories replicate. This procedure can be thought of as bootstrapping : v i gene-set analysis and can be used to determine which are the most reproducible gene sets. X r a Results Hereweapplythisapproachtotwocommongenomicsapplications: geneexpressionandDNAmethylation. Even with state-of-the-art statistical ranking procedures, significant categories in a gene set enrichment analysis may be unstable when subjected to resampling. Conclusions Wedemonstratethatgenelistsarenotnecessarilystable,andthereforeadditionalstepslikegenesetbaggingcan improve biological inference of gene set analysis. 1 Key Words Gene Set Analysis, Gene Expression, DNA Methylation, Gene Ontology Background The biology of many organisms is organized naturally as a series of diverse pathways, and the genetic landscape of cells is no different - genes also group together in pathways to perform specific functions [1]. Human health depends on the functionality of these pathways; de-regulation at the pathway level may be moreimportantfordiseaseslikecancerthande-regulationofspecificgenes[2]. Themostcommonstatistical approach for identifying pathways of interest in a high-throughput experiment is to perform a significance analysisgene-by-geneandthensummarizethesignificanthitsusinggenesetorgenepathwayanalyses. Each pathway or gene-set analysis is performed once on the entire data set. However, there is variability in the identified gene sets due to both the instability in gene rankings from the original gene ranking analysis and from the pathway/set analysis. Hereweproposeanewapproachtoevaluatethestabilityofbiologicalinferencedrawnfromanexperiment. Ourapproach,calledgenesetbagging,performsaresamplingoftheentirediscoveryalgorithm-significance analysis and gene set enrichment - to identify the most stable and reproducible enriched gene sets. We perform resampling by drawing an equal number of samples with replacement from the full (observed) dataset, performing a significance analysis followed by gene set analysis, and then identifying which sets are enriched. We can identify which observed gene sets are consistently enriched in resampled data, and compute the gene set replication probability (R), a measure of gene set stability based directly on the biological quantity of interest, representing the probability that an observed gene set will be enriched in future experiments. Thereplicationproportion(R)hassomeimportantadvantagesoverthetraditionally-reportedp-valuefor summarizing gene set enrichment. The structure of the gene set testing problem is fundamentally different than other multiple hypothesis testing problems - correlations between genes, different gene set sizes, and different levels and fraction of differential expression within gene sets makes the hypotheses fundamentally notcomparablewithstandardsignificancetesting[3,4]. Wethereforeinsteadproposetoestimatedirectlythe probability that a gene set will replicate, as in this more complicated multiple testing scenario, an estimate of the probability of replication may be of more interest than a measure of statistical significance. Lastly, given the emphasis for replication in genetics/genomics studies, this replication proportion may be another 2 metric for directing molecular validation of important biological processes involved in human disease. We perform our gene set bagging method on two genomics measurements: gene expression and DNA methylation. Even after adjusting the genomic data for potential batch effects, we demonstrate that some significantgenesetsfailtoreplicatewell,yetothernon-significantsetshavehighreplicationrates. Theresults forthesedifferentgenomictechnologiessuggestthatthesignalandnoisestructureofthespecificgenomicdata type contribute greatly to stability of gene sets. We use a simulation study to assess replication across two simulated datasets, and evaluate the concordance between replication probability (R) and the traditionally- reportedsignificance metric(p-value). We show thatthe replicationprobability betterquantifies thechance that a significant gene set will be consistent across studies, and the result of our analyses suggests that: (1) gene set enrichment analyses from a typical high-throughput study may be highly unstable, and (2) gene set bagging is a resampling approach for measuring the stability of gene sets and ensuring reproducibility of biological conclusions. Methods Bagging, also known as bootstrap aggregating, is traditionally used for assessing the predictive accuracy and stability of prediction models [5]. While bagging procedures have been used for differential expression analyses [6], here we introduce a new bagging procedure for significance analysis of gene sets called gene set bagging. This procedure can be useful for both evaluating significance rankings and also for describing the most reproducible genes and biological gene sets within genomics experiments in a platform-independent fashion. For gene set l, the goal is to estimate: R =Pr(Genesetlwillbesignificantinanewstudy). l The quantity R is useful as a measure of the stability of the significance of an identified gene set. Gene sets l are frequently used to interpret the biological results of studies, so it is important to know if the biological ”story” would change if the study was repeated. This is particularly true since gene set analysis is subject to errors in annotation, variation due to technological noise, and variation due to biological noise. Asanexampleofourgeneralapproach,wefocusonacigarettesmokingdataset(furtherexplainedinthe followingDatasetsandImplementationsection),whichexaminedgeneexpressiondifferencesassociatedwith smoking exposure in 40 smokers and 39 never-smokers. We define gene expression measurements m for ij each of j = 1,...,79 samples over i = 1,...,M genes/probes (corresponding to gene g ) and a covariate of i 3 interest per sample (z ∈[currentsmoker,neversmoker]). We first want to identify differentially expressed j genes between the two outcome groups, so we calculate an empirical Bayes t-statistic and resulting p-value for each gene [7]. We can call any gene significant if α<0.05 (or, alternatively, we can assume α to be the family-wise error rate as to control for multiple testing), and look for enrichment in L predefined gene sets. Each gene set gets a p-value (p ), reflecting the degree of enrichment. The prevalent wilcoxon mean rank l gene set enrichment test [8] available in the limma Bioconductor package [9] and traditional hypergeometric test were used as enrichment tests for each dataset. We can then perform gene set bagging using B = 100 iterations. In each iteration (b), we resample the 40 smokers and 39 smokers separately with replacement. Each gene or probe gets a p-value via calculating at-statisticintheresampleddata, andthesestatisticsarepassedtogenesetanalysisalgorithmstoproduce a p-value of enrichment for each gene set (pb), which are stored in the first column of a [#gene sets by B] l matrix. In the next iteration, we resample again and create a different resampled dataset, and get another set of p-values that are put in the second column of the storage matrix. We fill in the columns of the matrix if we dothis100times. Foreachrow, whichrepresentsasinglegeneorprobe, wecountthenumberoftimeseach subsampled p-value (pb) is less than α (here, 0.05), and divide it by the number of iterations (B), resulting l in an estimate of the replication probability for that gene set (Rˆ ). l Regardlessoftheapplication,estimatedreplicationprobabilities(Rˆ)arebetween0and1,where0means that the gene set always had a p-value greater than α in every iteration, and 1 means that the category always had a p-value less than α in each iteration. For analyses where the gene ranking is stable and the gene set calculation is stable, the replication probability will be higher. This estimate of replication assesses the stability of the gene sets, and might be a better estimate of biological reproducibility than the traditionallyreportedp-values. Ourgoalistoidentifythemorestablesetofgenesets,akintoMeinshausen, and Bu¨hlmann 2010 [10] in selecting a more stable set of covariates in a regression model. Datasets and Implementation Simulated Data We designed two simulation studies to assess different properties of the replication probability based on the Affymetrix Human Genome 133 Plus 2.0 gene expression microarray. Basing the simulation on an existing arraydesign,withprobesannotatedtogenesthatwerealreadymappedtogeneontologycategories,allowed ustorealisticallyadddifferentialexpressionsignaltospecificgenesets. Wefirstselectedarandomsampleof 4 Algorithm 1 Gene set bagging procedure 1. Estimate a test statistic for each gene Tˆ i 2. Use the test statistics to calculate a P-value for each gene set, p , l=1...,L, using any standard gene l set analysis algorithm. 3. For (b∈1,...,B): i. Resample individuals within outcome groups ii. Estimate a bootstrap test statistic for each gene Tˆ∗b i iii. Use the test statistics to calculate a bootstrap p-value for each gene set, p∗b, l=1...,L, using any standard gene l set analysis algorithm. 4. Estimate the replication probability Rˆl = (cid:80)Bb=1IB[p∗lb<α] for each gene set. 100 gene sets to use in our simulation, which corresponded to 2288 unique genes. Then, for each simulation, we selected first 100 and then 500 genes (generally denoted G) to insert signal into, via the following model: m =β +β z +(cid:15) ij 0 i j ij where (cid:15) ∼ N(6,1) and β ∼ N(1,0.5) if g ∈ G, and m and z (defined above) correspond to the ij i i ij j expression value and binary outcome respectively. Inthefirstsimulation,wegenerated1000datasets, whereeachconsistedof100individuals(50casesand 50controls). Foreachdataset,weinsertedsignalinto100genesandcomputedtheobservedp-value(p )and l then the replication probabilities (R ) for each gene set l. In the second simulation we generated 100 pairs l of datasets, where each dataset contained 50 individuals (25 cases and 25 controls), and inserted signal into 500 genes, and computed observed p-values and replications probabilities for each gene set. Gene expression: cigarette smoking data We tested the gene set bagging method in a differential expression analysis with data obtained from Gene ExpressionOmnibus(GSE17913). Thisstudyexaminedtheeffectofcigarettesmokingontheoralepithelial transcriptomebycomparingbuccalbiopsiesin39never-smokerswith40active-smokersusingtheAffymetrix Human Genome U133 Plus 2.0 microarray [11]. We processed the raw CEL files using the RMA algorithm toperformintra-arraynormalizationandthenperformedquantilenormalizationtoadjustforbetween-array biases [12]. We performed surrogate variable analysis (SVA) to adjust for potential batch effects [13,14]. Briefly, this approach identifies the number of right singular vectors that are associated with more variation than expected by chance, and then in the subsets of genes driving this variation, constructs a ’surrogate’ variable 5 for each subset. These surrogate variables are then included as covariates in our differential expression analysis (so that the model becomes: m =β z+β SV +(cid:15) ). ij i SV ij We identified differentially expressed genes comparing cases and controls while controlling for the surro- gatevariablesusinganempiricalBayesapproach[15]. Todeterminestatisticalsignificance,resultingp-values were converted to q-values to control for the false discovery rate [16] and all transcripts with q-values less than0.05wereconsideredsignificant. Weperformedafullgeneontologyanalysis, andthenranthegeneset bagging algorithm. DNA methylation: brain tissue Webelievethisapproachtobegeneralizabletomostgenomicsplatforms,andfirsttestedthishypothesisusing DNA methylation data processed on the Illumina HumanMethylation27 platform using freely available data from GEO [17] [GSE15745]. Our analysis utilized DNA methylation data from a recent paper that assessed quantitativetraitlociusingmethylationandexpressiondatainfourdifferentbraintissues. Previousworkhas identifiedthatDNAmethylationsignaturescandistinguishbraintissues,andmightplayaroleindetermining and stabilizing normal brain differentiation [18]. We conducted our gene set bagging algorithm on the differential DNA methylation analysis between the frontal and temporal cortices. Detailed preprocessing information can be found in the supplementary material. We performed the full differential methylation analysis comparing 131 front cortex and 126 temporal cortex samples, adjusting for plate number, tissue bank site, sex, and age, using the exact same approach as the gene expression example (with and without SVA, then empirical Bayes and multiple testing correction on each). All probes with q-values less than 0.05 were considered significant. We performed a full gene ontology analysis on the gene associated with each probe (from the annotation table), and ran the gene set bagging algorithm. Interpretation The replication probability (R) reflects stability The interpretation of the replication probability reflects the underlying stability of each outcome group. Suppose we observe a p-value for a gene set that we call significant at significance level α. The replication probability estimate is defined as the fraction of times that feature is significant at the same α level in resamplingsoftheoriginaldata. Ifwecalledallp-valuessignificantatα=0.05, R=0.8meansthatfeature had a p-value less than 0.05 in 80% of resampling datasets. If the statistical signal is stable, significant 6 featureswillhavehighreplicationprobabilityestimates,andnon-significantfeatureswillhavelowreplication probabilityestimates(R≈0),becausetheresampleddatashouldberepresentativeoftheoverallpopulation. To better understand how the replication probability therefore addresses stability, suppose we perform a gene expression experiment, and further study two gene sets: Set A with p=0.001 and Set B with p=0.2. We then calculate the replication probability for these two gene sets, and want to interpret the results. Consistency between the replication probability and p-value means that the direction of statistical inference is identical: high replication probabilities with low p-values are consistent. Table 1: Interpretations of sequential replication probabilities (Rˆ) for two different experiment features. R ”Feature A (p = 0.001) is significant, 0 but very inconsistent” 0.25 but very inconsistent” 0.5 inconsistent” 0.75 somewhat consistent” 1 very consistent” R ”Feature B (p = 0.2) is non-significant, 0 very consistent” 0.25 somewhat consistent” 0.5 inconsistent” 0.75 very inconsistent” 1 very inconsistent” Tounderstandtheapparentlycounterintuitiveinterpretations, wewillfocusontherowinTable1where R = 1.0. Feature A is extremely stable because the significance of this feature seems constant throughout resampling. FeatureBisalsoextremelystablebecausetheresamplinginferencesarerelativelynon-variable, but the replication probability is very inconsistent with the p-value. Interpreting stability and consistency throughthereplicationprobabilitythereforerequiresobservedp-valuesforgenesets. However,itisimportant torememberthatthereplicationprobabilityisafunctionofthestatisticalsignificancelevel(α); algebracan demonstrate that R increases as α increases. R estimates the probability a gene set will be significant in a repeated study We simulated 1,000 identical data sets (as described in the Datasets section; Simulation 1), these data sets represent repeated experiments performed under the same conditions. We spiked in specific genes as differentially expressed in these simulated data sets. We then performed gene set analysis using both the hypergeometric and Wilcoxon tests and calculated the replication probability estimates for each gene set in each of the 1,000 simulated studies. The average replication probability estimate across all 1,000 repeated 7 studies very closely approximates the frequency that a gene set is observed to be significant in those 1,000 studies (Figures 1A and 1B). In other words, the estimate of the replication probability is close to the probability a gene set will be significant in a future study. Replication adds biological interpretability In the gene expression dataset (Figure 2), there were 8 GO categories with p > 0.05 and R > 0.8 under the hypergeometric test, including sets associated with phosphorylation (GO:0006468, GO:0016310), a process affected by cigarette smoking [19] and regulation of metabolic processes (GO:0019222, GO:0044267). Sim- ilarly, examining the categories associated with DNA methylation differences across brain tissue that had at least moderate replication and non-significant p-values demonstrates support for the gene set bagging approach as well as the shortcomings of relying on strict p-values cutoffs for gene ontology analysis (Figure 3). Several biologically plausible GO categories for a comparison of methylation differences in brain tis- sues fell into the ”marginally significant” bin of observed p-values between 0.05 and 0.1 but had consistent replication. Similarly, there were many smaller gene sets that had statistically significant p-values (p < 0.05) but never appeared in any of the resampled datasets (R=0) in both the gene expression (N = 32) and DNA methylation datasets (N = 12). These represent very unstable gene sets, and should be interpreted with caution. Categories like the first set (p > 0.05,R > 0.8) would have been ignored in a traditional gene set analysis given their statistical significance measure, but might be biologically important to the question of interest. Likewise, gene sets in the second category (p < 0.05,R = 0) are probably less biologically meaningful even though they are ”statistically significant”. Results R correlates better with replication in independent datasets Besidesidentifyingwhichgenesetsarethemoststable,wecanalsoassesshowwellthereplicationprobability (R)reflectsbiologicalreplicationbyspiking-ingenesetenrichmentacrosstwoindependentsimulateddatasets (described fully in the Datasets section). We performed traditional gene ontology analysis on both datasets, obtaining p-values for each gene set and study calculated from the hypergeometric distribution, and then performed our gene set bagging algorithm on these same two datasets. There was very strong Spearman correlation between datasets across 100 simulation runs when all gene sets were considered regardless of whether the replication proportion (median=0.854, IQR: 0.826-0.876) or p-value (median = 0.836, IQR: 8 0.809-0.869)wasused(Figure1C).However,whenonlygenesetswhereatleast1of2datasetswassignificant at p < 0.05 per simulation run, the replication proportion had much stronger correlation (median = 0.755, IQR=0.678-0.817)thanthep-value(median=0.535,IQR:0.387-0.648)(Figure1D).Theseresultssuggest that globally, there might not be a large difference between the replication proportion and the p-value, but when there is any signal in a particular gene set, the replication proportion better captures independent replication of that set in future studies. We also performed the more robust Wilcoxon rank rest on these simulatedpaireddatasets,whichalsohadlesscorrelationbetweentheresultingp-valuesthanthereplication probability (Figure 1E). There were many fewer significant gene sets by this enrichment approach than the hypergeometric test, and it was rare that both independent datasets within a simulation were significant at p < 0.05. The p-values derived from the Wilcoxon method therefore appear more conservative than collapsing each gene set into a 2x2 table and performing the hypergeometric test. Relationship to the problem of regions The set of test statistics corresponding to genes within an individual set can be viewed as a multivariate random variable. When viewed in this way, a gene set is significant if the vector of test statistics falls into a multi-dimensional region defined by the significance threshold. The replication probability is then a first approximationestimateoftheposteriorprobabilityagenesetwillbesignificant,assuminganon-informative priordistributiononthevectorofteststatistics. Thisproblemhasbeenconsideredinthecaseofmultivariate normal data [20] and for estimating confidence in inferred phylogenies [21]. As has been previously pointed out, this posterior probability is a reasonable first approximation to the posterior probability in question, but should not be interpreted as a frequentist measure of statistical significance [20,22]. Asanexampleoftherelationshipbetweenthebootstrapandaposteriorprobability,supposez ,...,z ∼ 1 n N(µ,σ2). A non-informative prior distribution for the parameters (µ,σ2) is the Jeffrey’s prior [23]. The Jeffrey’spriorforµisanimproperuniformprioracrossthereallineandtheJeffrey’spriorforσ2 ∝ 1 . Using σ2 thesepriordistributions,theposteriordistributionforµisN(z¯,τ2)whereτ ∼InverseWishart ((ns2)−1) n−1 and s2 = 1 (cid:80)n (z −z¯)2. In this case, since µ is one dimensional, the InverseWishart distribution is n i=1 i equivalent to an InveseGamma distribution. Drawing bootstrap samples from the z and recalculating i the mean approximates sampling from the posterior distribution of µ (see supplemental R code). It is important to note that the variance of the posterior for µ is inflated compared to σ2 assuming a frequentist model [20,22]. Note that the p-values these bootstrap samples should not be interpreted as measures of statistical significance, because they are no longer distributed uniformly. 9 Discussion We have developed a resampling-based strategy for estimating the probability a gene set will replicate as statistically significant. This direct approach to estimating replicability may be more useful than statistical significanceforinvestigatorswhoaimtoidentifystableandreproduciblebiologicalstoriesfortheirresults. By utilizing outcome-based resamplings of the observed data, the reproducibility of gene sets can be quantified, represented by the replication probability R of each gene set category across all subsamples. This approach can offer an additional metric beyond the p-value for identifying which biological pathways to follow-up. We have successfully applied this method to gene expression and DNA methylation under two commonly- used enrichment metrics: the hypergeometric test and the wilcoxon rank test, and demonstrate that many seemingly statistically significant GO categories fail to replicate consistently. A strength of our approach is thegeneralizabilityofthisalgorithmtomostothergenomicsapplications,includingfollowingbias-correcting approaches like SVA during the analysis, to assess the stability of results lists. Overall, between the two most commonly used gene set enrichment measures, the Wilcoxon rank test appears more stable than the hypergeometric p-value, using simulated and real data. There were fewer inconsistent gene sets (significant either by R or the p-value, but not the other), and the relationship between the replication probability and p-value was more precisely defined (Figure 1B and 2B). Gene sets with high replication probabilities and low p-values represent statistically significant, stable, and consistent sets that might best represent the underlying biology within the experiment. Given that most genomics studies require some form of external replication and that R appears more correlated with replication in second follow-up study than p-values alone, we might also suggest following-up gene sets that have high replication probabilities (R) even if the p-values are marginally, or even non-, significant. From a practical perspective, the gene set bagging algorithm has been turn into the R package ”GeneSet- Bagging”, available through GitHub (https://github.com/andrewejaffe/GeneSetBagging). While defining a recommendedcutoffseemscounterproductive, asindifferentapplications, usersmaychoosedifferentcutoffs depending on their resources for replication and how willing they are to be correct, a cutoff value of R above 0.5 means that gene set is more likely to replicate than not, and could be used as a lower bound for replicability. Typical genomics practice often involves drawing the majority of biological conclusions of an experiment from the results of a gene set analysis without assessing the stability of the results. We envision replication probabilities used in conjunction with standard measures of statistical significance, as the emphasis on replication in genetics/genomics makes the replication probability a useful quantity to estimate and use in 10

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.