
Principles of Statistical Genomics PDF

424 Pages·2013·4.296 MB·English

Preview Principles of Statistical Genomics

Principles of Statistical Genomics
Shizhong Xu

Shizhong Xu
Department of Botany and Plant Sciences
University of California
900 University Avenue
Riverside, California
USA

ISBN 978-0-387-70806-5
ISBN 978-0-387-70807-2 (eBook)
DOI 10.1007/978-0-387-70807-2
Springer New York Heidelberg Dordrecht London
Library of Congress Control Number: 2012942325
© Springer Science+Business Media, LLC 2013

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher’s location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.
Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)

Preface

Statistical genomics is a new interdisciplinary area of science, including statistics, genetics, computer science, genomics, and bioinformatics. The rapid advances in these areas have dramatically changed the amount and type of information available for characterization of genes. In many genomic applications, existing methods coupled with new computational technology have successfully directed the exploration of high-dimensional data. What remains to be accomplished is the successful statistical modeling of genomic data to support hypothesis-driven biological research. This will ultimately lead to the exploitation of the predictive wealth that much of the current and impending genomic data have the potential to offer. Statistical development will continue to significantly amplify and focus the molecular advances of the last decades toward general improvements in agriculture and human health.

Using advanced statistical technology to study the behavior of one or a few Mendelian loci defines the field of statistical genetics. For complex traits, such as grain yield in crops and cancers in humans, one or two loci are rarely sufficient to explain the majority of the trait variation. People then study the behavior of all genes influencing a trait without distinguishing the effects of individual genes, creating a field called quantitative genetics. Taking advantage of saturated DNA markers generated with advanced molecular technology, we are now able to localize individual genes on the genome that affect a complex trait, which leads to this new field of statistical genomics or quantitative genomics. In statistical genomics, we emphasize the notion of whole genome analysis and evaluate the joint effect of the entire genome on a quantitative trait.

Any genome study requires a sample of individuals from a target population and genomic data collected from this sample.
Genomic data include (a) genotypes of molecular markers, (b) microarray gene expressions, and (c) phenotypes of clinical or economic traits measured from all individuals of the sample. Any particular genomic study may involve all three types of data or any two of them. With advanced biotechnology, molecular marker data will soon be replaced by whole genome sequences. In the narrow sense, phenotypic data are not genomic data, but the ultimate purpose of genomic data analysis is to dissect these traits and understand the genetic architectures of these traits. Therefore, phenotypic data are essential in genomic data analysis. This is why phenotypic data are included as genomic data. When a study involves phenotypes and marker genotypes, it is called QTL mapping, where QTL stands for quantitative trait loci. A study involving phenotypes and microarray gene expressions is called differential expression (DE) analysis. If a study involves marker genotypes and microarray gene expressions, it is called expression quantitative trait locus (eQTL) analysis. The purpose of QTL mapping is to find the genome locations, the sizes, and other properties of QTL through associations of marker genotypes with the variation of a quantitative trait. In DE analysis, the phenotype of interest is usually binary, such as case (represented by one) and control (represented by zero). The primary interest of DE is to find genes that express differently in case and control. The purpose of eQTL mapping is to find regulation pathways of the genes. Transcripts mapped to the same locus of the genome are considered in the same regulation pathway.

Many statistical models, methodologies, and computing algorithms are involved in the textbook. Major statistical models include the linear model (LM), the generalized linear model (GLM), the linear mixed model (LMM), and the generalized linear mixed model (GLMM). In a few places, the hidden Markov model (HMM) is required to infer the unobserved genotypes of QTL given observed marker genotypes. Another important model is the Gaussian mixture model for cluster analysis.
Commonly used statistical methods include the least squares (LS) estimation, the maximum likelihood (ML) estimation, the Bayesian estimation implemented via the Markov chain Monte Carlo (MCMC) algorithm, and the Bayesian method via the maximum a posteriori (MAP) estimation. Optimization technologies include the Newton–Raphson algorithm, the Fisher scoring algorithm, and the expectation–maximization (EM) algorithm. For the Newton–Raphson algorithm, if the first- and second-order partial derivatives of the target function with respect to the parameters are easy to derive, an explicit form of the iteration equation will be given. Otherwise, numerical evaluations of the partial derivatives are calculated using some powerful numerical differentiation subroutines. In genomic data analysis, the number of parameters is often very large, so updating all parameters simultaneously can be prohibitive. In this case, a coordinate descent approach may be taken, in which one parameter is updated at a time conditional on current values of all other parameters. This approach can improve the robustness of the optimization algorithm and save much computer memory, but at the cost of computing time and the risk of being trapped in a local solution of the parameters.

This book was compiled from a collection of lecture notes for the statistical genomics course (BPSC234) offered to UCR graduate students by the author. Approximately half of the material was collected from studies published by the author and his research team. A small proportion of the remaining half consists of some unpublished works conducted by the author. Much of the remaining half of the book represents a collection of the most updated statistical genomic methods published in various journals for the last couple of decades. The topics selected purely reflect the author’s choices for the course according to the level of understanding of the target students.
The book is not an introduction to statistical genomics because statistical genomics is a diversified area including many different topics, and this book only covers a proportion of the topics. However, the statistical technologies chosen represent the core of statistical genomics. Understanding the principles of these technologies, students will easily extend the methods to other analyses of genomic data generated from different experimental designs. Although the book narrowly focuses on a few topics, each topic introduced is provided with the derivation of the method or at least a direction leading to the derivation. Statistical genomics is a multidisciplinary area undergoing rapid development. Writing a comprehensive book in such an area is like shooting a moving target. For example, during the time between the completion of the first draft and the publication of this book, new technologies and methodologies may have already been developed. Therefore, the book can only focus on the principles of statistical genomics. Most recently developed methods may not be covered, for which the author owes an apology to those researchers whose works are relevant but not cited in the book.

The book consists of three parts. Part I contains Chaps. 1–4 and covers topics related to linkage map construction for DNA markers. Part II consists of Chaps. 5–16 and is the main part of the book. These chapters cover topics related to genetic mapping for quantitative trait loci using various designs of experiments. Part III (Chaps. 17–25) covers topics related to microarray gene expression data analysis. This book is intended to be used as a textbook for graduate students in statistical genomics, but it can be used by researchers as a reference book. Advanced readers can choose to read any particular chapters as they desire. However, for junior researchers and graduate students, it is better to study from the beginning and not to skip any chapters because some of the methods introduced in early chapters will be used later in the book and they will only be referenced.
Former and current postdocs and graduate students in the lab all contributed to the material published by the UCR quantitative genetics team. Postdocs who contributed to the material relevant to this book include Damian Gessler, Chongqing Xie, Shaoqi Rao, Nengjun Yi, Claus Vogl, Chenwu Xu, Yuan-Ming Zhang, Lide Han, Zhiqiu Hu, and Fuping Zhao. Graduate students involved in the research include Lang Luo, Yun Lu, Hui Wang, Yi Qu, Zhenyu Jia, Xin Chen, Xiaohong Che, and Haimao Zhan. Without their hard work, the author would not have been able to publish this book. Their contributions are highly appreciated. In the main text, I choose to use the first person plural pronoun “we” instead of “I” for the very reason that the book material was mainly contributed by my research team. In the UCR quantitative genetics team, Nengjun Yi made the largest contribution to the material included in the book and thus he deserves a special acknowledgement. A special appreciation goes to the three current members of the UCR quantitative genetics team, Zhiqiu Hu (postdoc), Haimao Zhan (student), and Xiaohong Che (student), for their help in drawing the figures, checking the accuracy of equations, and correcting errors that occurred in an early draft of the book.

Riverside, California, USA
Shizhong Xu

Contents

Part I Genetic Linkage Map

1 Map Functions
1.1 Physical Map and Genetic Map
1.2 Derivation of Map Functions
1.3 Haldane Map Function
1.4 Kosambi Map Function

2 Recombination Fraction
2.1 Mating Designs
2.2 Maximum Likelihood Estimation of Recombination Fraction
2.3 Standard Error and Significance Test
2.4 Fisher’s Scoring Algorithm for Estimating r
2.5 EM Algorithm for Estimating r

3 Genetic Map Construction
3.1 Criteria of Optimality
3.2 Search Algorithms
3.2.1 Exhaustive Search
3.2.2 Heuristic Search
3.2.3 Simulated Annealing
3.2.4 Branch and Bound
3.3 Bootstrap Confidence of a Map

4 Multipoint Analysis of Mendelian Loci
4.1 Joint Distribution of Multiple-Locus Genotype
4.1.1 BC Design
4.1.2 F2 Design
4.1.3 Four-Way Cross Design
4.2 Incomplete Genotype Information
4.2.1 Partially Informative Genotype
4.2.2 BC and F2 Are Special Cases of FW
4.2.3 Dominance and Missing Markers
4.3 Conditional Probability of a Missing Marker Genotype
4.4 Joint Estimation of Recombination Fractions
4.5 Multipoint Analysis for m Markers
4.6 Map Construction with Unknown Recombination Fractions

Part II Analysis of Quantitative Traits

5 Basic Concepts of Quantitative Genetics
5.1 Gene Frequency and Genotype Frequency
5.2 Genetic Effects and Genetic Variance
5.3 Average Effect of Allelic Substitution
5.4 Genetic Variance Components
5.5 Heritability
5.6 An F2 Family Is in Hardy–Weinberg Equilibrium

6 Major Gene Detection
6.1 Estimation of Major Gene Effect
6.1.1 BC Design
6.1.2 F2 Design
6.2 Hypothesis Tests
6.2.1 BC Design
6.2.2 F2 Design
6.3 Scale of the Genotype Indicator Variable
6.4 Statistical Power
6.4.1 Type I Error and Statistical Power
6.4.2 Wald-Test Statistic
6.4.3 Size of a Major Gene
6.4.4 Relationship Between W-test and Z-test
6.4.5 Extension to Dominance Effect

7 Segregation Analysis
7.1 Gaussian Mixture Distribution
7.2 EM Algorithm
7.2.1 Closed Form Solution
7.2.2 EM Steps
7.2.3 Derivation of the EM Algorithm
7.2.4 Proof of the EM Algorithm
7.3 Hypothesis Tests
7.4 Variances of Estimated Parameters
7.5 Estimation of the Mixing Proportions

8 Genome Scanning for Quantitative Trait Loci
8.1 The Mouse Data
8.2 Genome Scanning
8.3 Missing Genotypes
8.4 Test Statistics
8.5 Bonferroni Correction
8.6 Permutation Test
8.7 Piepho’s Approximate Critical Value
8.8 Theoretical Consideration

9 Interval Mapping
9.1 Least-Squares Method
9.2 Weighted Least Squares
9.3 Fisher Scoring
9.4 Maximum Likelihood Method
9.4.1 EM Algorithm
9.4.2 Variance–Covariance Matrix of θ̂
9.4.3 Hypothesis Test
9.5 Remarks on the Four Methods of Interval Mapping

10 Interval Mapping for Ordinal Traits
10.1 Generalized Linear Model
10.2 ML Under Homogeneous Variance
10.3 ML Under Heterogeneous Variance
10.4 ML Under Mixture Distribution
10.5 ML via the EM Algorithm
10.6 Logistic Analysis
10.7 Example

11 Mapping Segregation Distortion Loci
11.1 Probabilistic Model
11.1.1 The EM Algorithm
11.1.2 Hypothesis Test
11.1.3 Variance Matrix of the Estimated Parameters
11.1.4 Selection Coefficient and Dominance
11.2 Liability Model
11.2.1 EM Algorithm
11.2.2 Variance Matrix of Estimated Parameters
11.2.3 Hypothesis Test
11.3 Mapping QTL Under Segregation Distortion
11.3.1 Joint Likelihood Function
11.3.2 EM Algorithm
11.3.3 Variance–Covariance Matrix of Estimated Parameters
11.3.4 Hypothesis Tests
11.3.5 Example

12 QTL Mapping in Other Populations
12.1 Recombinant Inbred Lines
12.2 Double Haploids
12.3 Four-Way Crosses
12.4 Full-Sib Family
12.5 F2 Population Derived from Outbreds
12.6 Example

13 Random Model Approach to QTL Mapping
13.1 Identity by Descent
13.2 Random Effect Genetic Model
13.3 Sib-Pair Regression
13.4 Maximum Likelihood Estimation
13.4.1 EM Algorithm
13.4.2 EM Algorithm Under Singular Value Decomposition
13.4.3 Multiple Siblings
13.5 Estimating the IBD Value for a Marker
13.6 Multipoint Method for Estimating the IBD Value
13.7 Genome Scanning and Hypothesis Tests
13.8 Multiple QTL Model
13.9 Complex Pedigree Analysis

14 Mapping QTL for Multiple Traits
14.1 Multivariate Model
14.2 EM Algorithm for Parameter Estimation
14.3 Hypothesis Tests
14.4 Variance Matrix of Estimated Parameters
14.5 Derivation of the EM Algorithm
14.6 Example

15 Bayesian Multiple QTL Mapping
15.1 Bayesian Regression Analysis
15.2 Markov Chain Monte Carlo
15.3 Mapping Multiple QTL
15.3.1 Multiple QTL Model
15.3.2 Prior, Likelihood, and Posterior
15.3.3 Summary of the MCMC Process
15.3.4 Post-MCMC Analysis
15.4 Alternative Methods of Bayesian Mapping
15.4.1 Reversible Jump MCMC
15.4.2 Stochastic Search Variable Selection
15.4.3 Lasso and Bayesian Lasso
15.5 Example: Arabidopsis Data

16 Empirical Bayesian QTL Mapping
16.1 Classical Mixed Model
16.1.1 Simultaneous Updating for Matrix G
16.1.2 Coordinate Descent Method
16.1.3 Block Coordinate Descent Method
16.1.4 Bayesian Estimates of QTL Effects
