contents

supplement to nature genetics • september 2002
© 2002 Nature Publishing Group http://www.nature.com/naturegenetics
Cover art by Darryl Leja

editorial
Spreading the word
Alan Packer

foreword
Power to the people
Andreas D Baxevanis & Francis S Collins

perspective
Genomic empowerment: the importance of public databases
Harold Varmus

user’s guide
A user’s guide to the human genome
Tyra G Wolfsberg, Kris A Wetterstrand, Mark S Guyer, Francis S Collins & Andreas D Baxevanis

Introduction: putting it together

Question 1
How does one find a gene of interest and determine that gene’s structure? Once the gene has been located on the map, how does one easily examine other genes in that same region?

Question 2
How can sequence-tagged sites within a DNA sequence be identified?

Question 3
During a positional cloning project aimed at finding a human disease gene, linkage data have been obtained suggesting that the gene of interest lies between two sequence-tagged site markers. How can all the known and predicted candidate genes in this interval be identified? What BAC clones cover that particular region?

Question 4
A user wishes to find all the single nucleotide polymorphisms that lie between two sequence-tagged sites. Do any of these single nucleotide polymorphisms fall within the coding region of a gene? Where can any additional information about the function of these genes be found?

Question 5
Given a fragment of mRNA sequence, how would one find where that piece of DNA mapped in the human genome? Once its position has been determined, how would one find alternatively spliced transcripts?

Question 6
How would one retrieve the sequence of a gene, along with all annotated exons and introns, as well as a certain number of flanking bases for use in primer design?

Question 7
How would an investigator easily find compiled information describing the structure of a gene of interest? Is it possible to obtain the sequence of any putative promoter regions?

Question 8
How can one find all the members of a human gene family?

Question 9
Are there ways to customize displays and designate preferences? Can tracks or features be added to displays by users on the basis of their own research?

Question 10
For a given protein, how can one determine whether it contains any functional domains of interest? What other proteins contain the same functional domains as this protein? How can one determine whether there is a similarity to other proteins, not only at the sequence level, but also at the structural level?

Question 11
An investigator has identified and cloned a human gene, but no corresponding mouse ortholog has yet been identified. How can a mouse genomic sequence with similarity to the human gene sequence be retrieved?

Question 12
How does a user find characterized mouse mutants corresponding to human genes?

Question 13
A user has identified an interesting phenotype in a mouse model and has been able to narrow down the critical region for the responsible gene to approximately 0.5 cM. How does one find the mouse genes in this region?

Commentary: keeping biology in mind

Acknowledgments

References

Web resources: Internet resources featured in this guide


editorial

Spreading the word
doi:10.1038/ng961

There was a time, not too long ago, when the wisdom of genome-sequencing projects was up for discussion. Would they be too expensive, draining funds from other areas of the life sciences? Would they be worth the trouble? Not much more than 15 years have passed since those early debates, and the importance of sequenced genomes to biology and medicine has now gained wide acceptance. This is in part owing to the relatively rapid fall in the cost of sequencing, followed by the undeniably important insights gained from the annotation of several bacterial genomes, and those of a few of our favorite eukaryotes. The news has been so relentlessly upbeat that one might even have expected some ‘genome fatigue’ to set in, especially given the saturation coverage of the publication of the drafts of the human genome sequence 18 months ago. Not so, however; witness the recent jockeying by different groups for inclusion of ‘their’ model organism in the next round of sequencing projects. The honeymoon goes on.

And yet there are important issues to be addressed. One is the concern surrounding any bestseller: that it will have far fewer actual readers than one might expect. At first glance, this would seem not to apply to the human genome. After all, one is hard pressed these days to pick up a copy of Nature Genetics, or any genetics journal, and not find evidence that sequenced genomes inform many of the most important advances. A survey published last year by the Wellcome Trust, however, found that only half of the researchers who were using sequence data were fully conversant with the services provided by the freely accessible databases.

There is also the concern that genome sequencers might be victims of their own success. As computational biologist David Roos recently put it, “We are swimming in a rapidly rising sea of data…how do we keep from drowning?” And if geneticists and bioinformaticians are struggling to stay afloat, what of the non-geneticists who are eager to exploit the sequences but are relative newcomers to the tools needed to navigate all of this information?

It is with these questions in mind that we present A User’s Guide to the Human Genome. Written by Tyra Wolfsberg, Kris Wetterstrand, Mark Guyer, Francis Collins and Andreas Baxevanis of the National Human Genome Research Institute (NHGRI), this peer-reviewed how-to manual guides the reader through some of the basic tasks facing anyone whose work might be facilitated by an improved understanding of the online resources that make sense of annotated genomes. The directors of these online resources (Ewan Birney of Ensembl, David Haussler of the University of California, Santa Cruz and David Lipman of the National Center for Biotechnology Information) have served as advisors during the development of this guide, ensuring a balanced and accurate treatment of their respective web portals. The online version of the guide will also evolve, with an initial update scheduled for April 2003.

As noted by Harold Varmus in his eloquent perspective on A User’s Guide and the public databases it examines, one of the important legacies of the Human Genome Project is its ethos of open access to the data. In this spirit, and with the generous sponsorship of the NHGRI and the Wellcome Trust, the online version of this supplement will be freely available on the Nature Genetics website.

Alan Packer
Nature Genetics


foreword

Power to the people
doi:10.1038/ng962

The National Human Genome Research Institute of the National Institutes of Health is delighted to sponsor this special supplement of Nature Genetics. The primary aim of this supplement is to provide the reader with an elementary, hands-on guide for browsing and analyzing data produced by the International Human Genome Sequencing Consortium, as well as data found in other publicly available genome databases. The majority of this supplement is devoted to a series of worked examples, providing an overview of the types of data available and highlighting the most common types of questions that can be asked by searching and analyzing genomic databases. These examples, which have been set in a variety of biological contexts, provide step-by-step instructions and strategies for using many of the most commonly used tools for sequence-based discovery. It is hoped that readers will grow in confidence and capability by working through the examples, understanding the underlying concepts, and applying the strategies used in the examples to advance their own research interests.

One of the motivating factors behind the development of this User’s Guide comes from the general sense that the most commonly used tools for genomic analysis still are terra incognita for the majority of biologists. Despite the large amount of publicity surrounding the Human Genome Project, a recent survey conducted on behalf of the Wellcome Trust indicated that only half of biomedical researchers using genome databases are familiar with the tools that can be used to actually access the data. The inherent potential underlying all of this sequence-based data is tremendous, so the importance of all biologists having the ability to navigate through and cull important information from these databases cannot be overstated.

The study of biology and medicine has truly undergone a major transition over the last year, with the public availability of advanced draft sequences of the genomes of Homo sapiens and Mus musculus, rapidly growing sequence data on other organisms, and ready access to a host of other databases on nucleic acids, proteins and their properties. Yet for the full benefits of this dramatic revolution to be felt, all scientists on the planet must be empowered to use these powerful databases to unravel longstanding scientific mysteries. As pointed out by Harold Varmus in the Perspective, free accessibility of all of this basic information, without restrictions, subscription fees or other obstacles, is the most critical component of realizing this potential. It is our modest hope that this User’s Guide will provide another useful contribution.

Andreas D. Baxevanis and Francis S. Collins
National Human Genome Research Institute


perspective

Genomic empowerment: the importance of public databases
doi:10.1038/ng963

Over the past twenty-five years, a mere sliver of recorded time, the world of biology, and indeed the world in general, has been transformed by the technical tools of a field now known as genomics. These new methods have had at least two kinds of effects. First, they have allowed scientists to generate extraordinarily useful information, including the nucleotide-by-nucleotide description of the genetic blueprint of many of the organisms we care about most: many infectious pathogens; useful experimental organisms such as mice, the roundworm, the fruit fly, and two kinds of yeast; and human beings. Second, they have changed the way science is done: the amount of factual knowledge has expanded so precipitously that all modern biologists using genomic methods have become dependent on computer science to store, organize, search, manipulate and retrieve the new information.

Thus biology has been revolutionized by genomic information and by the methods that permit useful access to it. Equally importantly, these revolutionary changes have been disseminated throughout the scientific community, and spread to other interested parties, because many of those who practice genomics have made a concerted effort to ensure that access is simplified for all, including those who have not been deeply schooled in the information sciences. The goal of providing genomic information widely has also inevitably attracted the interests of those in the commercial sector, and privately developed versions of various genomes are also now available, albeit for a licensing fee.

The operative principle most prominently involved in transmitting the fruits of genomics (the one that has captured the imagination of the public and served as a standard for the sharing of results and methods more generally in modern biology) has been open access. Funding by public and philanthropic organizations, such as the U.S. National Institutes of Health, the U.S. Department of Energy, the Wellcome Trust in Britain, and many other organizations, has made this altruistic behavior possible and has fostered the idea that genomic information about biological species should be available to all. (Such information about individual human beings is, of course, an entirely different matter and should be protected by privacy rules.) The attitude of open access to new biological knowledge has also been embodied in the databases of the International Nucleotide Sequence Database Collaboration, comprising the DNA Data Bank of Japan, the European Molecular Biology Laboratory, and GenBank at the US National Library of Medicine. The same focus on open access is exemplified by PubMed (operated by the NLM), other gateways to the scientific literature, and the assemblies of genomic sequence now found at the several Web portals described in this guide.

The Human Genome Project (HGP), which has supported the public genome sequencing effort, has been the mainstay of the effort to make genomes accessible to the entire community of scientists and all citizens. This effort has, in fact, been quite naturally extended to instruct the public about many themes in modern biological science. This has occurred in part because the human genome itself has been such an exciting concept for the public; in part because genomes are natural entry points for teaching many of the principles of biological design, including evolution, gene organization and expression, organismal development, and disease; and in part because those who work on genomes have been tireless in attempts to explain the meaning of genes to an eager public. Endless metaphors, artistic creations, lively journalism, monographs about social and ethical implications, televised lectures from the White House, and many other cultural happenings have been among the manifestations of this fascination. In this way, the HGP has had a strong hand in raising the public’s awareness of new ideas in biology and of the powerful implications of genomics in medicine, law and other societal institutions.

Some of these cultural effects come as much from the behavioral aspects of the HGP as from the genomic sequences themselves. The sharing of new information, even before its assembly into publishable form, has spurred efforts to share other kinds of research tools and has encouraged the notion of making the scientific literature freely accessible through the Internet. The contribution of scientists in many countries to the sequencing of many genomes, including the human genome, has inspired efforts to develop gene-based sciences, from basic genomics to biotechnology, throughout the world, including the poorest developing nations. Indeed, the World Health Organization, the United Nations, and the World Bank have all contributed recently to the growth of the ideas that science is both possible and valuable in all economies and that science can be a means to help unify the world’s population under a banner of enlightenment, demonstrating a virtue of globalization.

From this perspective, the availability of the sequences of many genomes through the Internet is a liberating notion, making extraordinary amounts of essential information freely accessible to anyone with a desktop computer and a link to the World Wide Web. But the information itself is not enough to allow efficient use. Interested people who reside outside the centers for studying genomes need to be told where best to view the information in a form suitable for their purposes and how to take advantage of the software that has been provided for retrieval and analysis.

The manual before us now offers such help to those who might otherwise have had trouble in attempting to use the products of genomics. Furthermore, the advice is offered in that spirit of altruism that has come to characterize the public world of genomics. The information is provided in a highly inviting and understandable format by casting it in the form of answers to the questions most commonly posed when approaching big genomes. The information, made freely available on the World Wide Web, has been assembled by some of the best minds in the HGP, who have generously given their time and intellect to encourage widespread use of the great bounty that has been created over the past two decades.

In other words, the guide to use of genomes provided here is simply another indication that the HGP should take great pride in much more than the sequencing of genomes.

Harold Varmus
Memorial Sloan-Kettering Cancer Center


user’s guide

A user’s guide to the human genome
doi:10.1038/ng964

The primary aim of A User’s Guide to the Human Genome is to provide the reader with an elementary hands-on guide for browsing and analyzing data produced by the International Human Genome Sequencing Consortium and other systematic sequencing efforts. The majority of this supplement is devoted to a series of worked examples, providing an overview of the types of data available, details on how these data can be browsed, and step-by-step instructions for using many of the most commonly used tools for sequence-based discovery. The major web portals featured throughout include the National Center for Biotechnology Information Map Viewer, the University of California, Santa Cruz Genome Browser, and the European Bioinformatics Institute’s Ensembl system, along with many others that are discussed in the individual examples. It is hoped that readers will become more familiar with these resources, allowing them to apply the strategies used in the examples to advance their own research programs.

Authors
Tyra G. Wolfsberg
Kris A. Wetterstrand
Mark S. Guyer
Francis S. Collins
Andreas D. Baxevanis

National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA.
e-mail: [email protected]


Introduction: putting it together
doi:10.1038/ng965

In its short history, the Human Genome Project (HGP) has provided significant advances in the understanding of gene structure and organization, genetic variation, comparative genomics and appreciation of the ethical, legal and social issues surrounding the availability of human sequence data. One of the most significant milestones in the history of this project was met in February 2001 with the announcement and publication of the draft version of the human genome sequence [1]. The significance of this milestone cannot be overstated, as it firmly marks the entrance of modern biology into the genome era (and not the post-genome era, as many have stated). The potential usefulness of this rich databank of information should not be lost on any biologist: it provides the basis for ‘sequence-based biology’, whereby sequence data can be used more effectively to design and interpret experiments at the bench. The intelligent use of sequence data from humans and model organisms, along with recent technological innovation fostered by the HGP, will lead to important advances in the understanding of diseases and disorders having a genetic basis and, more importantly, in how health care is delivered from this point forward [2].

Although this flood of data has enormous potential, many investigators whose research programs stand to benefit in a tangible way from the availability of this information have not been able to capitalize on its potential. Some have found the data difficult to use, particularly with respect to incomplete human genome draft sequence information. Others are simply not sufficiently conversant with the seeming myriad of databases and analytical tools that have arisen over the last several years. To assist investigators and students in navigating this rapidly expanding information space, numerous World Wide Web sites, courses and textbooks have become available; many individuals, of course, also turn to their friends and colleagues for guidance. We have prepared this Guide in that same spirit, as an additional resource for our fellow scientists who wish to make use (or better use) of both sequence data and the major tools that can be used to view these data. The Guide has been written in a practical, question-and-answer format, with step-by-step instructions on how to approach a representative set of problems using publicly available resources. The reader is encouraged to work through the examples, as this is the best way to truly learn how to navigate the resources covered and become comfortable using them on a regular basis. We suggest that readers keep copies of the Guide next to their computers as an easy-to-use reference.

Before embarking on this new adventure, it is important to review a number of basic concepts regarding the generation of human genome sequence data. This review does not discuss the chronological development of the HGP or provide an in-depth treatment of its implications; the reader is referred to Nature’s Genome Gateway (http://www.nature.com/genomics/human/) for more information on these topics.

Current status of human genome sequencing
Sequencing of the human genome is nearing completion. The target date for making the complete, high-accuracy sequence available is April 2003, the 50th anniversary of the discovery of the double helix [3]. As we go to press, however, the work is still a mosaic of finished and draft sequence. A sequence becomes finished when it has been determined at an accuracy of at least 99.99% and has no gaps. Sequence data that fall short of that benchmark but can be positioned along the physical map of the chromosomes are termed ‘draft’. Currently, 87% of the euchromatic fraction of the genome is finished and less than 13% is at the draft stage.

Even in this incomplete state, the available data are extremely useful. This usefulness was apparent early on, leading the International Human Genome Sequencing Consortium (IHGSC) to pursue a staged approach in sequencing the human genome. The first stage generated draft sequence across the entire genome [1]. The project is now well advanced into its second stage, with draft sequence being improved to ‘finished quality’ across the entire genome, a necessarily localized process. As a result, and as it has been presented to date, the human genome sequence is an evolving mix of both finished and unfinished regions, with the unfinished regions varying in data quality. As the data are initially made available in raw format, with subsequent refinement and improvement, and because data of different quality are found in different places in the genome, users must understand the kinds of data presented by the various tools available.

Determining the human sequence: a brief overview
As with all systematic sequencing projects, the basic experimental problem in sequencing lies in the fact that the output of a single reaction (a ‘read’) yields about 500–800 bp [1,4]. To determine the sequence of a DNA molecule that is millions of bases long, it must first be fragmented into pieces that are within an order of magnitude of the read size. The sequence at one or both ends of many such fragments is determined, and the pieces are then ‘assembled’ back into the long linear string from which they were originally derived. A number of approaches for doing this have been suggested and tested; the most commonly used is shotgun sequencing [4]. The application of shotgun sequencing to the multimegabase- or gigabase-sized genomes of metazoans is still evolving. A small number of strategies are currently being evaluated, for example, hierarchical or map-based shotgun sequencing, whole-genome shotgun sequencing and hybrid approaches. These approaches are described in detail elsewhere [4].

The IHGSC’s human sequencing effort began as a purely map-based strategy and evolved into a hybrid strategy [1]. The ‘pipeline’ that the IHGSC used to generate the human sequence data involved the following steps.

1. Bacterial artificial chromosome (BAC) clones were selected, and a random subclone library was constructed for each one in either an M13- or a plasmid-based vector.

2. A small number of members of the subclone library (usually 96 or 192) were sequenced to produce very-low-coverage, single-pass or ‘phase 0’ data. These data were used for quality control and can be found in the Genome Survey Sequence division of the DNA Data Bank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL) and GenBank (of the National Center for Biotechnology Information; NCBI).

3. If a BAC clone met the requisite standard, subclones were derived and sufficient sequence data generated from these to provide four- to fivefold coverage (that is, enough data to represent an average base in the BAC clone between four and five times). This is known as ‘draft-level’ coverage, and permits the assembly of sequence using computer programs that can detect overlaps between the random reads from the subclones, yielding longer ‘sequence contigs’. At this stage, the sequence of a BAC clone could typically exist on between four and ten different contigs, only some of which were ordered and oriented with respect to one another. The BAC ‘projects’ were submitted, within 24 hours of having been assembled, to the High-Throughput Genomic Sequences (HTGS) division of DDBJ/EMBL/GenBank [5], where each was given a unique accession number and identified with the keyword ‘htgs_draft’. (The DDBJ, EMBL and GenBank are members of the International Nucleotide Sequence Database Collaboration, whose members exchange data nightly and ensure that the sequence data generated by all public sequencing efforts are made available to all interested parties freely and in a timely fashion.) Less-complete high-throughput genomic (HTG) records are also known as ‘phase 1’ records. As the sequence is refined, it is designated ‘phase 2’. In the context of a BLAST search at the NCBI, these sequences would be available in the HTGS database.

4. In late 2000, the draft sequence of the entire human genome was assembled from the sequence of 30,445 clones (BAC clones and a relatively small number of other large-insert clones). This assembled draft human genome sequence was published in February 2001 and made publicly available through three primary portals: the University of California, Santa Cruz (UCSC), Ensembl (of the European Bioinformatics Institute; EBI) and the NCBI. The use of all three of these sites to obtain annotated information on the human genome sequence is the primary subject of this guide.

5. Subsequent to the generation and publication of the draft human genome sequence, work has continued towards finishing the sequencing. The final stage initially targeted draft-quality BAC clones. For each of these clones, enough additional shotgun sequence data are obtained to bring the coverage to eight- to tenfold, a stage referred to as ‘fully topped-up’. The data from each fully topped-up BAC are reassembled, typically resulting in a smaller number of contigs (often in just a single contig) than at the draft level. The new assembly is again submitted to the HTGS division as an update of the existing BAC clone, now identified with the keyword ‘htgs_fulltop’. The accession number of the clone stays the same, and the version number increases by one (AC108475.2, for example, becoming AC108475.3).

6. At this stage, there are, even for clones comprising a single contig, typically some regions that are of insufficient quality for the clone to be considered finished. If this is the case, the fully topped-up sequence is analyzed by a sequence finisher (an actual person) who collects, in a directed manner, the additional data that are needed to close the few remaining gaps and to bring any regions of low quality up to the finished sequence standard. While the clone is worked on by the finisher, the HTGS entry in GenBank is identified by the keyword ‘htgs_activefin’. Once work on the clone has been completed, the keyword of the HTG record is changed to ‘htgs_phase3’, the version number is once again increased, and the record is moved from the HTGS division to the primate division of DDBJ/EMBL/GenBank. In the context of a BLAST search at NCBI, these finished BAC sequences would now be available in the nr (“non-redundant”) database.

7. The finished clone sequences are then put together into a finished chromosome sequence. As with the initial draft assemblies, there are a number of steps involved in this process that use map-based and sequence-based information in calculating the maps. The final assembly process involves identifying overlaps between the clones and then anchoring the finished sequence contigs to the map of the genome; details of the process can be found on the NCBI web site (http://www.ncbi.nlm.nih.gov/genome/guide/build.html).

Initially, both the UCSC and NCBI groups generated complete assemblies of the human genome, albeit using different approaches. As noted on the UCSC web site, the NCBI assembly tended to have slightly better local order and orientation, whereas the UCSC assembly tended to track the chromosome-level maps somewhat better. Rather than having different assemblies based on the same data, IHGSC, UCSC, Ensembl and NCBI decided that it would be more productive (and obviously less confusing) to focus their efforts on a single, definitive assembly. To this end, and by agreement, the NCBI assembly will be taken as the reference human genome sequence. It is this NCBI assembly that is displayed at the three major portals covered in this guide.

Box: NCBI reference sequences
The data release and distribution practices adopted by the HGP participants have led not only to very early, pre-publication access to this treasure trove of information, but also to a potentially confusing variety of formats and sources for the sequence data. To address this and other issues, the NCBI initiated the RefSeq project (http://www.ncbi.nlm.nih.gov/locuslink/refseq.html).

The goal of the RefSeq effort is to provide a single reference sequence for each molecule of the central dogma: DNA, the mRNA transcript, and the protein. The RefSeq project helps to simplify the redundant information in GenBank by providing, for example, a single reference for human glyceraldehyde-3-phosphate dehydrogenase mRNA and protein, out of the 14 or so full-length sequences in GenBank. Each alternatively spliced transcript is represented by its own reference mRNA and protein. The RefSeq project also includes sequences of complete genomes and whole chromosomes, and genomic sequence contigs. The human genomic contigs that NCBI assembles, which form the basis of the presentations in the different genome browsers, are part of the RefSeq project. Most RefSeq entries are considered provisional and are derived by an automated process from existing GenBank records. Reviewed RefSeq entries are manually curated and list additional publications, gene function summaries and sometimes sequence corrections or extensions.

Reference sequences are available through NCBI resources, including Entrez, BLAST and LocusLink. They can be easily recognized by the distinctive style of their accession numbers. NM_###### is used to designate mRNAs, NP_###### to designate proteins and NT_###### to designate genomic contigs. The NCBI and UCSC use alignments of the mRNA RefSeqs with the genome to annotate the positions of known genes. Ensembl aligns mRNA RefSeqs to the genome. The NCBI also provides model mRNA RefSeqs produced from genome annotation. These are derived by aligning the NM_ mRNAs and other GenBank mRNAs to the assembled genome and then extracting the genomic sequence corresponding to the transcripts. The resulting model mRNA and model protein sequences have accession numbers of the form XM_###### and XP_######. As the XM_ and XP_ records are derived from genomic sequence, they may differ from the original NM_ or GenBank mRNAs because of real-sequence polymorphisms, errors in the genomic or mRNA sequences or problems in the mRNA/genomic sequence alignment. A complete list of types of RefSeqs, along with details on how they are produced, is available from http://www.ncbi.nlm.nih.gov/LocusLink/refseq.html.

Annotating the assemblies
Once the assemblies have been constructed, the DNA sequence undergoes a process known as annotation, in which useful sequence features and other relevant experimental data are coupled to the assembly. The most obvious annotation is that of known genes. In the case of NCBI, known genes are identified by simply aligning Reference Sequence (RefSeq) mRNAs (see box), GenBank mRNAs, or both to the assembly. If the RefSeq or GenBank mRNA aligns to more than one location, the best alignment is selected. If, however, the alignments are of the same quality, both are marked on to the contig, subject to certain rules (specifically, the transcript alignment must be at least 95% identical, with the aligned region covering 50% or more of the length, or at least 1,000 bases). Transcript models are used to refine the alignments. Ensembl identifies ‘best in genome’ positions for known genes by performing alignments between all known human proteins in the SPTREMBL database [6] and the assembly using a fast protein-to-DNA sequence matcher [7]. UCSC predicts the location of known genes and human mRNAs by aligning RefSeq and other GenBank mRNAs to the genome using the BLAST-like alignment tool (BLAT) program [8]. In addition to identifying and placing known genes onto the assemblies, all of the major genome browser sites provide ab initio gene predictions, using a variety of prediction programs and approaches.

Genome annotation goes well beyond noting where known and predicted genes are. Features found in the Ensembl, NCBI and UCSC assemblies include, for example, the location and placement of single-nucleotide polymorphisms, sequence-tagged sites, expressed sequence tags, repetitive elements and clones.

Over the next year, sequence producers will continue to add finished sequence to the nucleotide sequence databases, and the NCBI will continue to update the human sequence assembly until its ultimate completion. The human genome sequence will, however, continue to improve even after April 2003, as new cloning, mapping and sequencing technologies lead to the closure of the few gaps that will remain in the euchromatic regions. It is hoped that such technological advances will also allow for the sequencing of heterochromatic regions, regions that cannot be cloned or sequenced using currently available methods.

The sequence-based and functional annotations presented at the three major genome portals will certainly continue to evolve long after April 2003. Computational annotation is a highly active area of research, yielding better methods for identifying coding regions, noncoding transcribed regions and noncoding, non-transcribed functional elements contained within the human sequence.

Accessing human genome sequence data
Although each of the three portals through which users access genome data has its own distinctive features, coordination among the three ensures that the most recent version and annotations of the human genome sequence are available.

Ensembl (http://www.ensembl.org) is the product of a collaborative effort between the Wellcome Trust Sanger Institute and EMBL’s European Bioinformatics Institute and provides a bioinformatics framework to organize biology around the sequences of large genomes [7]. It contains comprehensive human genome annotation through ab initio gene prediction, as well as information on putative gene function and expression. The web site provides numerous different views of the data, which can be either map-, gene- or protein-centric. Ensembl is actively building comparative genome sequence views, and presents data from human, mouse, mosquito and zebrafish. In addition,
Full details on the types of annotation available and the numerous sequence-based search tools are available, and the N 2 methods underlying sequence annotation for each of these dif- Ensembl system itself can be downloaded for use with individ- 00 ferent types of sequence feature can be found by accessing the ual sequencing projects. 2 URLs listed under Genome Annotation in the Web Resources The UCSC Genome Browser (http://genome.ucsc.edu) was © section of this guide. At UCSC, many of the annotations are pro- originally developed by a relatively small academic research vided by outside groups, and there may be a significant delay group that was responsible for the first human genome assem- between the release of the genome assembly and the annotation blies. The genome can be viewed at any scale and is based on of certain features. Furthermore, some tracks are generated for the intuitive idea of overlaying ‘tracks’ onto the human only a limited number of assemblies. For an in-depth discussion genome sequence; these annotation tracks include, for exam- of genome annotation, the reader is referred to an excellent ple, known genes, predicted genes and possible patterns of review by Stein9 and the references cited therein. This review, alternative splicing. There is also an emphasis on comparative along with the Commentary in this guide, also provides cautions genomics, with mouse genomic alignments being available. on the possible overinterpretation of genome annotation data. The browser also provides access to an interactive version of the BLAT algorithm8, which UCSC uses for RNA and compar- The data—and sometimes the tools—change every day ative genomic alignments. 
The steps outlined in the previous section should emphasize Given its Congressional mandate to store and analyze biologi- that the state of the human genome sequence will continue to be cal data and to facilitate the use of databases by the research com- in flux, as it will be updated daily until it has actually been munity, the NCBI (http://www.ncbi.nlm.nih.gov) serves as a declared ‘finished’. (Finished sequence is properly defined as the central hub for genome-related resources. NCBI maintains Gen- “complete sequence of a clone or genome, with an accuracy of at Bank, which stores sequence data, including that generated by least 99.99% and no gaps”2. A more practical definition is that of the HGP and other systematic sequencing projects. NCBI’s Map “essentially finished sequence,” meaning the complete sequence Viewer provides a tool through which information such as exper- of a clone or genome, with an accuracy of at least 99.99% and no imentally verified genes, predicted genes, genomic markers, gaps, except those that cannot be closed by any current physical maps, genetic maps and sequence variation data can be method.) The reader should be mindful of this, not just when visualized. The Map Viewer is linked to other NCBI tools—for reading this guide, but also, when referring back to it over time. example, Entrez, the integrated information retrieval system that Similarly, the tools used to search, visualize and analyze these provides access to numerous component databases. sequence data also undergo constant evolution, capitalizing on Although we have chosen to illustrate each example using new knowledge and new technology in increasing the usefulness resources available at a single site, almost all the questions in this of these data to the user. guide can be answered using any of the three browsers. The supplement to nature genetics • september 2002 7 user’s guide informational sidebars that follow some of the questions provide Browser problems? 
pointers on how to format the search at other sites. Furthermore, the three sites link to each other wherever possible. Examples In following the question-and-answer portion of this guide, presented in this Guide rely on the data and genome browser some readers may find that their web browsers are not be able interfaces that were available in June 2002. As new versions of the to render the web pages properly. If this occurs, do one or genome assembly and viewing tools will come online every few more of the following: months, the specifics of some of the examples may change over 1. Install the most recent version of either Netscape Navi- time. Regardless, the basic strategies behind answering the ques- gator or Internet Explorer. tions in the examples will remain the same. This underscores the 2. Increase the amount of memory available to the web importance of readers working through the examples at their browser. own computers so that they may understand and be able to navi- 3. Try a different web browser. In general, Macintosh users s c gate these public databases. The readers are encouraged to who seek to gain access to these three genome portals will see neti explore the alternative methods for answering the questions. better performance with Internet Explorer. e g e r u at n m/ o c e. r u at n w. w w p:// htt p u o r G g n hi s bli u P e r u at N 2 0 0 2 © 8 supplement to nature genetics • september 2002
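The NCBI placement thresholds quoted under "Annotating the assemblies" can be read in more than one way, so the following is only a minimal sketch of one plausible reading: an alignment qualifies if it is at least 95% identical and the aligned region either covers 50% or more of the transcript or spans at least 1,000 bases. The `TranscriptAlignment` record and the `passes_placement_rule` function are hypothetical names introduced for illustration; they are not part of any NCBI tool or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranscriptAlignment:
    """One alignment of an mRNA to the assembly (hypothetical record type)."""
    percent_identity: float   # identity over the aligned region, 0-100
    aligned_bases: int        # number of transcript bases in the alignment
    transcript_length: int    # full length of the mRNA, in bases


def passes_placement_rule(aln: TranscriptAlignment) -> bool:
    """Apply the quoted thresholds under one assumed reading:
    identity >= 95% AND (coverage >= 50% OR aligned_bases >= 1,000)."""
    if aln.transcript_length <= 0:
        raise ValueError("transcript_length must be positive")
    coverage = aln.aligned_bases / aln.transcript_length
    long_enough = coverage >= 0.5 or aln.aligned_bases >= 1000
    return aln.percent_identity >= 95.0 and long_enough
```

Under this reading, a 3,000-base mRNA with a 1,200-base alignment at 98% identity would qualify through the 1,000-base clause even though its coverage is only 40%. A stricter reading that requires both coverage and length would reject it, so the grouping of the conditions should be checked against the NCBI build documentation before relying on it.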