Award Number: DAMD17-98-2-8005

TITLE: Malaria Genome Sequencing Project

PRINCIPAL INVESTIGATOR: Malcolm J. Gardner, Ph.D.

CONTRACTING ORGANIZATION: The Institute for Genomic Research, Rockville, Maryland 20850

REPORT DATE: January 2000

TYPE OF REPORT: Annual

PREPARED FOR: U.S. Army Medical Research and Materiel Command, Fort Detrick, Maryland 21702-5012

DISTRIBUTION STATEMENT: Approved for public release; distribution unlimited

The views, opinions and/or findings contained in this report are those of the author(s) and should not be construed as an official Department of the Army position, policy or decision unless so designated by other documentation.

SF298

REPORT TYPE AND DATES COVERED: Annual (17 Dec 98 - 16 Dec 99)

ABSTRACT (Maximum 200 Words): The objectives of this 5-year Cooperative Agreement between TIGR and the Malaria Program, NMRC, were to: Specific Aim 1, sequence 3.5 Mb of P. falciparum genomic DNA; Specific Aim 2, annotate the sequence; Specific Aim 3, release the information to the scientific community. To date, we have published the first complete sequence of a malarial chromosome (chromosome 2 [4]), completed the random phase sequencing of 3 other large chromosomes totaling 7.2 Mb, and have initiated functional genomics studies using glass slide microarrays to characterize the expression of chromosome 2, 3, and 14 genes throughout the erythrocytic cycle. We have also collaborated in the construction of a two-enzyme optical restriction map of the entire P. falciparum genome [7], and are continuing to develop the GlimmerM gene finding software introduced in year 1. In addition, we have begun small scale sequencing of the rodent malaria P. yoelii and are collaborating in the sequencing of part of a P. vivax chromosome. Discussions with the Malaria Program, NMRC, aimed at development of a program to use genomics and functional genomics to accelerate vaccine research are in progress.

SUBJECT TERMS: Plasmodium falciparum, malaria, genome, chromosome, sequencing, microarray, software

NUMBER OF PAGES: 53

SECURITY CLASSIFICATION OF REPORT: Unclassified
SECURITY CLASSIFICATION OF THIS PAGE: Unclassified
SECURITY CLASSIFICATION OF ABSTRACT: Unclassified
LIMITATION OF ABSTRACT: Unlimited

FOREWORD

Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the U.S. Army.

[illegible] Where copyrighted material is quoted, permission has been obtained to use such material.

[illegible] Where material from documents designated for limited distribution is quoted, permission has been obtained to use the material.

[illegible] Citations of commercial organizations and trade names in this report do not constitute an official Department of Army endorsement or approval of the products or services of these organizations.

N/A In conducting research using animals, the investigator(s) adhered to the "Guide for the Care and Use of Laboratory Animals," prepared by the Committee on Care and Use of Laboratory Animals of the Institute of Laboratory Resources, National Research Council (NIH Publication No. 86-23, Revised 1985).

N/A For the protection of human subjects, the investigator(s) adhered to policies of applicable Federal Law 45 CFR 46.

N/A In conducting research utilizing recombinant DNA technology, the investigator(s) adhered to current guidelines promulgated by the National Institutes of Health.

N/A In the conduct of research utilizing recombinant DNA, the investigator(s) adhered to the NIH Guidelines for Research Involving Recombinant DNA Molecules.

N/A In the conduct of research involving hazardous organisms, the investigator(s) adhered to the CDC-NIH Guide for Biosafety in Microbiological and Biomedical Laboratories.

PI - Signature Date

Table of Contents

Front Cover
SF298
Foreword
Table of Contents
Introduction
Body
Sequencing of P. falciparum chromosome 14 (Specific Aim 1)
Sequencing of chromosomes 10 and 11 (Specific Aim 1)
Optical mapping of P. falciparum chromosomes (added to Specific Aim 1)
Development and utilization of a P. falciparum gene finding program (added to Specific Aim 2)
Microarray studies (added to Specific Aim 1)
Sequencing of other Plasmodium species (Specific Aims 1,2,3)
Modifications to the Specific Aims
Key Research Accomplishments
Reportable Outcomes
Conclusions
References
Appendix

Introduction

Malaria is caused by apicomplexan parasites of the genus Plasmodium. It is a major public health problem in many tropical areas of the world, and also affects many individuals and military forces that visit these areas. In 1994 the World Health Organization estimated that there were 300-500 million cases and up to 2.7 million deaths caused by malaria each year, and because of increased parasite resistance to chloroquine and other antimalarials the situation is expected to worsen considerably [1]. These dire facts have stimulated efforts to develop an international, coordinated strategy for malaria research and control [2]. Development of new drugs and vaccines against malaria will undoubtedly be an important factor in control of the disease. However, despite recent progress, drug and vaccine development has been a slow and difficult process, hampered by the complex life cycle of the parasite, a limited number of drug and vaccine targets, and our incomplete understanding of parasite biology and host-parasite interactions.

The advent of microbial genomics, i.e. the ability to sequence and study the entire genomes of microbes, should accelerate the process of drug and vaccine development for microbial pathogens. As pointed out by Bloom, the complete genome sequence provides the "sequence of every virulence determinant, every protein antigen, and every drug target" in an organism [3], and establishes an excellent starting point for this process.
In 1995, an international consortium including the National Institutes of Health, the Wellcome Trust, the Burroughs Wellcome Fund, and the US Department of Defense was formed (Malaria Genome Sequencing Project) to finance and coordinate genome sequencing of the human malaria parasite Plasmodium falciparum, and later, a second, yet to be determined, species of Plasmodium. Another major goal of the consortium was to foster close collaboration between members of the consortium and other agencies such as the World Health Organization, so that the knowledge generated by the Project could be rapidly applied to basic research and antimalarial drug and vaccine development programs worldwide.

Body

This report describes progress in the Malaria Genome Sequencing Project achieved by The Institute for Genomic Research and the Malaria Program, Naval Medical Research Center, under Cooperative Research Agreement DAMD17-98-2-8005, over the 12 month period from Dec. '98 to Dec. '99. The specific aims of the work covered under this cooperative agreement were to:

1. Determine the sequence of 3.5 megabases of the P. falciparum genome (clone 3D7):
   a) Construct small-insert shotgun libraries (1-2 kb inserts) of chromosomal DNA isolated from preparative pulsed-field gels.
   b) Sequence a sufficiently large number of randomly selected clones from a shotgun library to provide 10-fold coverage of the selected chromosome.
   c) Construct P1 artificial chromosome (PAC) libraries (inserts up to 20 kb) of chromosomal DNA isolated from preparative pulsed-field gels.
   d) If necessary, generate additional STS markers for the chromosome by i) mapping unique-sequence contigs derived from assembly of the random sequences to chromosome, ii) mapping end-sequences from chromosome-specific PAC clones to YACs.
   e) Use TIGR Assembler to assemble random sequence fragments, and order contigs by comparison to the STS markers on each chromosome.
   f) Close any remaining gaps in the chromosome sequence by PCR and primer-walking using P. falciparum genomic DNA or the YAC, BAC, or PAC clones from each chromosome as templates.

2. Analyze and annotate the genome sequence:
   a) Employ a variety of computer techniques to predict gene structures and relate them to known proteins by similarity searches against databases; identify untranslated features such as tRNA genes, rRNA genes, insertion sequences and repetitive elements; determine potential regulatory sequences and ribosome binding sites; use these data to identify metabolic pathways in P. falciparum.

3. Establish a publicly-accessible P. falciparum genome database and submit sequences to GenBank.

We are pleased to report that excellent progress has been made towards achievement of these goals. In last year's annual report we announced the publication in Science of the first complete sequence of a malarial chromosome (chromosome 2) [4]. In addition, we reported on work done by the TIGR/NMRC team and collaborators to provide new tools and resources for the Malaria Genome Project, including development of a Plasmodium gene finding program, GlimmerM [5], and introduction of optical restriction mapping technology for rapid mapping of whole Plasmodium chromosomes [6]. We also reported that sequencing of 3 additional P. falciparum chromosomes was underway, and that we were investigating the use of microarray technology to examine the expression of all genes from chromosomes 2 and 3 of Plasmodium. To facilitate community access to the sequence data, a P. falciparum genome web site was also established at TIGR; it contains all of the chromosome 2 sequence data and annotation, as well as preliminary data for other chromosomes currently being sequenced (http://www.tigr.org/tdb/mdb/pfdb/pfdb.html).

In the past year we have completed the high-throughput sequencing phase of chromosomes 10, 11, and 14, which together account for 30% of the genome.
These chromosomes are now in the gap closure phase; chromosome 14 is expected to be completed this year, with chromosomes 10 and 11 following shortly after. We also collaborated with David Schwartz's laboratory in construction of a two-enzyme optical restriction map of the entire P. falciparum genome; this was published recently in Nature Genetics [7]. As indicated in last year's report, we also initiated a functional genomics program in collaboration with the Malaria Program, NMRC. Glass slide microarrays containing PCR fragments from almost all genes from chromosomes 2 and 3 have been prepared, and experiments to profile the expression of these genes through the erythrocytic stage of the life cycle are underway. We have also assisted NMRC in their pilot project to apply the techniques of proteomics towards the identification of novel antigens in parasite (sporozoite) extracts. Finally, we are currently reviewing with NMRC further steps that can be taken to more rapidly apply Plasmodium genomics, functional genomics, and proteomics to problems of vaccine development for malaria.

Sequencing of P. falciparum chromosome 14 (Specific Aim 1)

Sequencing of chromosome 14 (3.4 Mb) is being funded primarily by a grant from the Burroughs Wellcome Fund; funds from this collaborative agreement are being used to accelerate the sequencing, assist in closure and annotation, develop microarrays for chromosome 14, and facilitate rapid utilization of the sequence data by the DoD vaccine and drug development groups. In last year's report we described the isolation of chromosome 14 DNA on pulsed field gels, preparation of shotgun libraries in pUC18, and high throughput sequencing of these libraries. The high-throughput sequencing phase of the project was completed in December 1998. A total of 74,292 sequences with an average read length of 530 nt were produced.
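
The read count and average read length give a quick estimate of shotgun coverage. The sketch below is illustrative only: the function name is hypothetical, and it assumes the 3.4 Mb chromosome size stated above and the estimate used in this report that roughly 20% of shotgun reads derive from other chromosomes.

```python
def shotgun_coverage(n_reads: int, mean_read_len: float, target_size_bp: float,
                     off_target_fraction: float = 0.0) -> float:
    """Estimated fold coverage of the target after discounting off-target reads."""
    usable_bases = n_reads * mean_read_len * (1.0 - off_target_fraction)
    return usable_bases / target_size_bp


# Figures from this report: 74,292 reads, 530 nt mean length, 3.4 Mb chromosome.
# The 20% off-target fraction is the report's own contamination estimate.
coverage = shotgun_coverage(74_292, 530, 3.4e6, off_target_fraction=0.20)
print(f"{coverage:.1f}X")  # prints 9.3X, in line with the roughly 9X quoted
```

Without the contamination correction the same data would suggest more than 11X, so the correction matters when judging how many gaps to expect.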
All of these sequencing reactions were performed with FS+ dye terminator chemistry, which we had previously found to be superior to dye primer chemistry for the sequencing of AT-rich P. falciparum DNA. The 74,292 sequences are equivalent to 9X coverage, assuming that, due to co-migration of sheared nuclear DNA with the chromosome 14 DNA on pulsed field gels, 20% of sequences in the shotgun library were derived from other chromosomes. The sequences were assembled in a 2-step procedure with TIGR Assembler [8]. The first assembly was performed at 99.5% stringency to produce a robust set of contigs; these contigs and the remaining unassembled sequences were then used to start a second assembly at 97.5% stringency. 1,750 contigs were obtained and the largest contig was 99 kb. In comparison, the largest contig obtained after the first assembly of the chromosome 2 data was about 20 kb, indicating that exclusive use of the dye terminator chemistry for chromosome 14 resulted in the production of high quality sequence data.

The gap closure process began in December 1998. The procedures being used to close gaps are basically the same as those used previously on the chromosome 2 project [4], namely:

1) use of GROUPER software to identify groups (contigs linked by shotgun clones), physical gaps and sequence gaps;
2) editing of contig ends to remove untrimmed vector sequence, low quality sequence data, and chimeric clones that prevent merging of contigs;
3) resequencing of missing mates and short sequences at contig ends;
4) sequencing of shotgun clones spanning sequence gaps using primers at the ends of the gaps;
5) PCR with genomic DNA to span physical gaps; and
6) use of the transposon insertion method to close very AT-rich gaps.

In practice, GROUPER is run on the set of contigs produced by an assembly and some or all of steps 1-6 are performed until no further progress is possible. Another assembly is then performed with the edited contigs, new sequences (e.g. primer walks and missing mates), and unassembled sequences left over from the previous assembly. The new assembly incorporates the sequences produced or edited during closure, together with sequences that were not merged into the previous assembly, thereby providing new starting points for additional work. This process is repeated until the sequence is closed.

As noted above, due to cross-contamination of the chromosome 14 DNA with sheared nuclear DNA, up to 20% of the sequence data is derived from chromosomes other than chromosome 14. In order to focus the closure efforts on chromosome 14 contigs, chromosome 14 markers are used to identify which contigs and groups of contigs are from chromosome 14. For chromosome 2, about 30 markers were available (1 marker per 30 kb). In contrast, for chromosome 14 there are 98 STS markers derived from YACs (provided by Alister Craig) plus an additional 101 SSLP markers [9], providing a marker about every 17-20 kb. The higher density of markers will allow identification of more chromosome 14 contigs and should simplify the gap closure process. In addition, with funding provided by the BWF, David Schwartz's group has completed a 2-enzyme optical restriction map of the P. falciparum genome [7]. We will use the optical map and the chromosome 14 markers to determine the order of contig groups on the chromosome; this should permit us to reduce the number of PCR reactions required for closure of the physical gaps.

To date we have performed 2 cycles of the closure procedure on the chromosome 14 contigs and have nearly completed the third cycle (Table 1). In the first cycle, between 12/98 and 2/99, most of our efforts were focused on editing of the contig ends and on performing the sequencing reactions for the missing mates and short sequences identified at physical gaps. The most time consuming and labor-intensive part of the process is editing.
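
The marker-based identification of chromosome 14 contigs described earlier in this section can be illustrated with a small sketch. This is not the procedure used in the project: it assumes exact substring matching of marker sequences on either strand (in practice a similarity search would be needed), and the function names, contig names, marker names, and sequences are all made up for illustration.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def contigs_with_markers(contigs: dict[str, str],
                         markers: dict[str, str]) -> dict[str, list[str]]:
    """Map each contig ID to the markers found in it on either strand.

    Exact matching is a simplifying assumption made for this sketch.
    """
    hits: dict[str, list[str]] = {}
    for contig_id, contig_seq in contigs.items():
        found = [marker_id for marker_id, marker_seq in markers.items()
                 if marker_seq in contig_seq
                 or reverse_complement(marker_seq) in contig_seq]
        if found:
            hits[contig_id] = found
    return hits


# Toy data (hypothetical): contig_C carries the marker on the opposite strand.
contigs = {
    "contig_A": "TTATTAGGCATCGATTAATA",
    "contig_B": "ATATATATTTTAAATATA",
    "contig_C": "AATTATCGATGCCTATTA",
}
markers = {"STS_1": "GGCATCGA"}
hits = contigs_with_markers(contigs, markers)
print(hits)  # {'contig_A': ['STS_1'], 'contig_C': ['STS_1']}
```

Contigs with no marker hit (contig_B here) are not necessarily from another chromosome; with a marker roughly every 17-20 kb, short contigs can simply fall between markers.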
Three individuals spent 6 weeks editing the ends of contigs from the initial assembly in order to remove untrimmed vector and low-quality sequences that prevented the merging of overlapping contigs. In subsequent rounds of closure we have re-sequenced an additional 1412 missing mates and short sequences from sequence gaps and have performed 755 primer walks. Between 12/98 and 7/99 we closed 47% of the physical gaps and 65% of the sequence gaps, and one-fourth of the chromosome is now covered by contigs larger than 100 kb. About one-third of the primer walks have yet to be completed and additional editing is underway. Once these steps are completed another assembly will be performed. We expect that > 80% of sequence gaps will have been closed at this point. The remaining gaps are likely to be composed of very AT-rich sequence; closure of these AT-rich gaps will require use of the transposon insertion technique that was used for closure of AT-rich gaps in chromosome 2.

As shown in Table 1, closure of physical gaps has lagged behind closure of the sequence gaps. This is primarily because most of our work, apart from the use of database queries to identify missing mates and short sequences at physical gaps, has focused on closure of the sequence gaps. This was done in order to obtain larger contigs that could be placed more accurately on the YAC, SSLP, and optical restriction maps of the chromosome. Once groups of contigs have been located on the chromosome map, PCR reactions using primers from adjacent groups can be used to close physical gaps. About one-third of the physical gaps on chromosome 2 were closed in this way. Once these gaps are closed the remaining gaps can be closed by performing a series of combinatorial PCRs using one primer from a mapped group and another primer from an unmapped group.

Table 1. Progress in gap closure of P. falciparum chromosome 14.

                       12/98     2/99      3/99      7/99
Sequences              74,292    74,994    75,929    76,406
Contigs                1,750     1,555     1,466     1,418
Largest contig (kb)    99        124       124       164
Total groups           458       293       291       ND
Chr 14 groups          63        37        34        ND
Cum. length (Mb)       2.99      3.49      3.45      ND
Physical gaps          62        36        33        ND
Sequence gaps          184       180       112       ~64

ND, not determined.

Recently, however, we prepared primers for all of the physical gaps and have performed PCR reactions with the primers and genomic DNA in order to span these gaps. So far, by performing PCR reactions with primers from the ends of adjacent groups on the chromosome, we have obtained products spanning about 75% of the physical gaps and are in the process of sequencing these products. Many of these PCR products are very AT-rich and have been difficult to sequence. As was done with chromosome 2, many of these PCR products may need to be cloned and subjected to the transposon insertion protocol in order to obtain good sequence data in the AT-rich areas. To obtain PCR products from the remaining physical gaps we have begun a combinatorial PCR procedure in which a primer from one end of a mapped group is tested in a series of PCR reactions with primers from the ends of unmapped groups. This process has already generated several new PCR products that are currently being sequenced. We are also investigating use of a multiplex PCR strategy in which pools of four or more primers are used in PCR reactions [10]. This reduces the number of PCR reactions that must be performed during closure and has been very successful in accelerating closure of microbial genomes. The AT-richness of Plasmodium DNA makes multiplex PCR more difficult than with other genomes, but we recently obtained PCR products for several physical gaps via multiplexing that are being sequenced.

Perhaps the biggest obstacle faced during the closure process is sequencing through long stretches (up to 50 bp) of As or Ts.
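
Stretches of this kind can be located in a draft contig with a simple scan, which may help in planning where extra sequencing reactions are likely to be needed. The following is a minimal sketch, not a tool used in the project; the function name, the 10-base threshold, and the example sequence are assumptions chosen for illustration.

```python
import re


def find_at_homopolymers(seq: str, min_len: int = 10) -> list[tuple[int, int, str]]:
    """Return (start, end, base) for each run of A or T of at least min_len bases.

    Coordinates are 0-based with an exclusive end.
    """
    pattern = re.compile(r"A{%d,}|T{%d,}" % (min_len, min_len))
    return [(m.start(), m.end(), m.group()[0])
            for m in pattern.finditer(seq.upper())]


# Hypothetical contig fragment: a 12-base A run and a 25-base T run.
example = "GCATG" + "A" * 12 + "CGC" + "T" * 25 + "GC"
runs = find_at_homopolymers(example, min_len=10)
print(runs)  # [(5, 17, 'A'), (20, 45, 'T')]
```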
We and others have found that the sequence quality deteriorates rapidly as the Taq polymerase passes through these homopolymer stretches, such that accurate sequence data is very difficult to obtain in these regions. These regions of lower than average sequence quality have the effect of introducing sequence gaps, which in this case are regions of DNA for which good sequence data cannot be obtained. The solution we devised in the chromosome 2 project was to use the transposon insertion method to insert primer binding sites into the AT-rich areas. Frequently, by priming the sequencing reaction within or very close to the homopolymer regions, adequate sequence data could be obtained. However, this is a very labor intensive process and entails performing 50-100 sequencing reactions for every gap caused by a homopolymer stretch. To try to improve sequencing of these regions, we are currently testing modifications to our standard sequencing reactions, including changes in extension temperatures, nucleotide mixes, salt concentrations, etc. If these simple modifications improve sequence quality in the AT-rich regions, the gap closure process could be accelerated.

Once all gaps have been closed, the sequence will be evaluated with the program check_coverage to ensure that a) all regions of the assembly are covered by at least two shotgun clones, and b) every base pair in the sequence has been sequenced in both directions with one chemistry, or in one direction with two chemistries. These criteria ensure that the sequence has been assembled correctly and validate individual base calls. The latter criterion is often satisfied by performing 10% of the sequencing reactions with dye-primer chemistry. However, given the frequency of sequence artifacts in AT-rich regions observed with the dye-primer chemistry, this may not be appropriate for P. falciparum.
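
The two coverage criteria stated above can be expressed as a per-position check. The sketch below is only an illustration of the criteria as worded in this report; it is not the check_coverage program, whose implementation is not described here. The Read record, its field names, and the toy reads are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    start: int      # 0-based, inclusive
    end: int        # 0-based, exclusive
    strand: str     # "+" or "-"
    chemistry: str  # e.g. "terminator" or "primer"
    clone: str      # shotgun clone the read was derived from


def failing_positions(length: int, reads: list[Read]) -> dict[str, list[int]]:
    """List the positions that fail each criterion (hypothetical helper)."""
    clone_fail: list[int] = []
    base_fail: list[int] = []
    for pos in range(length):
        covering = [r for r in reads if r.start <= pos < r.end]
        # Criterion (a): at least two distinct shotgun clones cover the position.
        if len({r.clone for r in covering}) < 2:
            clone_fail.append(pos)
        # Criterion (b): both directions with one chemistry,
        # or one direction with two chemistries.
        strands_by_chem: dict[str, set[str]] = {}
        chems_by_strand: dict[str, set[str]] = {}
        for r in covering:
            strands_by_chem.setdefault(r.chemistry, set()).add(r.strand)
            chems_by_strand.setdefault(r.strand, set()).add(r.chemistry)
        both_directions = any(len(s) == 2 for s in strands_by_chem.values())
        two_chemistries = any(len(c) >= 2 for c in chems_by_strand.values())
        if not (both_directions or two_chemistries):
            base_fail.append(pos)
    return {"clone_coverage": clone_fail, "base_validation": base_fail}


# Toy 10-base assembly: positions 6-9 are covered by a single read only.
reads = [
    Read(0, 6, "+", "terminator", "c1"),
    Read(3, 10, "-", "terminator", "c2"),
    Read(0, 3, "+", "primer", "c3"),
]
report = failing_positions(10, reads)
print(report)
```

The loop is quadratic in the worst case and is meant only to make the criteria concrete; a chromosome-scale check would use interval sweeps or per-base counters instead.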
As we discovered with chromosome 2, inclusion of sequences containing artifacts in an assembly inhibits contig formation and increases the number of sequence gaps in the assembly and the effort required to close them. Consequently, all chromosome 14 sequencing was done with dye-terminator chemistry, and late in the closure phase the coverage status of the assembly will be assessed. Regions with one-direction coverage will be identified, and additional dye-terminator reactions selected from the database will be performed to convert as many as possible to two-direction coverage. Regions with one-direction coverage that remain and which have unresolved sequence ambiguities will then be re-sequenced with dye-primer chemistry. This process will ensure that the coverage criteria are satisfied and minimize potential assembly problems arising from use of dye-primer chemistry.

Finally, the sequence will be edited using the program TIGR_Editor, which displays all gel reads and electropherograms for each base in the sequence. Discrepancies will be noted and additional sequencing reactions will be performed to resolve ambiguities. As a last step to confirm colinearity of the assembled sequence and genomic DNA, restriction maps predicted from the sequence will be compared with the chromosome 14 optical restriction maps described above.

Elucidation of gene structure will be performed with the program GlimmerM, a eukaryotic gene-finding program developed at TIGR specifically for the malaria genome project (see section below). Before the annotation of chromosome 14 begins, GlimmerM will be refined to improve accuracy and the training set will be updated with newly-published sequences, so that a more robust gene-finding tool will be available once the sequence is completed. Predicted coding regions will be searched against the sequence and protein databases using our standard methods. Repetitive elements and other features will also be identified and annotated.
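
The colinearity check mentioned above, in which restriction maps predicted from the sequence are compared with optical maps, amounts to an in silico digest followed by a comparison of ordered fragment sizes. The sketch below is a simplified illustration under stated assumptions: the BamHI site (GGATCC, cut after the first G) is used only as an example, since this section does not name the mapping enzymes; the function names and the 10% tolerance are hypothetical; and the toy sequence is far shorter than real optical map fragments, which are resolved at the kilobase scale.

```python
def predicted_fragments(seq: str, site: str, cut_offset: int) -> list[int]:
    """Ordered fragment lengths for a linear molecule cut at every site.

    cut_offset is the cut position within the recognition site. A single-strand
    search is sufficient only because the example site is palindromic.
    """
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = seq.find(site, pos + 1)
    boundaries = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(boundaries, boundaries[1:])]


def maps_agree(predicted: list[int], observed: list[float],
               rel_tol: float = 0.10) -> bool:
    """True if both maps have the same fragment count and each predicted size
    lies within rel_tol of the observed size (tolerance is an assumption)."""
    if len(predicted) != len(observed):
        return False
    return all(abs(p - o) <= rel_tol * o for p, o in zip(predicted, observed))


# Toy AT-rich sequence with two BamHI sites (hypothetical data).
seq = "AT" * 10 + "GGATCC" + "TA" * 15 + "GGATCC" + "AT" * 5
fragments = predicted_fragments(seq, "GGATCC", cut_offset=1)
print(fragments)                            # [21, 36, 15]
print(maps_agree(fragments, [20, 38, 15]))  # True
```

A mismatch in fragment count or order would point to a misassembly or a collapsed repeat, which is the kind of error the comparison is meant to catch.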
Since many genes will have no database matches, defining the boundaries of genes will be challenging. Most of the software necessary for annotation was tested during the chromosome 2 project, and will require only a few minor modifications for use on chromosome 14. The annotation performed under this grant will by necessity be preliminary. Our goal is to provide a starting point for further biological characterization.

We will facilitate public access to the sequence by release of preliminary and finished sequence on the TIGR web site (http://www.tigr.org/tdb/mdb/pfdb/pfdb.html). This will include full text- and sequence-based searching of chromosomes 2 and 14, as well as links to other sources of P. falciparum sequence data such as the Sanger Centre and Stanford University. Since the start of the random sequencing phase, raw shotgun sequences and contigs from test assemblies have been released on the TIGR web site. Upon completion of the random phase of the project the complete set of > 74,000 shotgun sequences and the contigs from the first full assembly were placed on the web site. These contigs have been updated approximately every 6-8 weeks as gap closure has progressed. In addition, early this year we installed a new BLAST server that returns the BLAST output as well as the FASTA-formatted sequence of the best hit plus 1 kb on either side. This enables