Daphnia pulex genome assembly and annotation plan
The Joint Genome Institute and the Daphnia Genomics Consortium are pleased to announce the availability of a draft assembly and automated annotation of the Daphnia pulex genome sequence. Dpul JAZZ 1.0 is an 8.7x genome coverage assembly of 2,729,325 shotgun sequence reads from an isolate called the Chosen One.
Daphnia pulex the Chosen One – arenata clade
Assembly build date
Main genome assembly
Total reads in the assembly
Total number of scaffolds
Total size of scaffolds
N50 scaffold number
N50 scaffold size
Total number of contigs
Total size of contigs: 166.2 Mb (therefore 26.8% gap)
N50 contig number
N50 contig size
8.72x +/- 0.10x
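For readers unfamiliar with the statistics listed above, the following minimal Python sketch shows how N50 values are conventionally computed and how a gap percentage relates contig and scaffold totals. The function names and the toy lengths are illustrative assumptions and are not part of the JGI assembly pipeline. The scaffold total derived from the quoted 166.2 Mb and 26.8% gap is only an estimate, and it assumes that the gap percentage is the fraction of total scaffold length not covered by contigs.

```python
def n50(lengths):
    """Return (N50 number, N50 size) for a list of sequence lengths.

    Conventional definition (assumed here): sort lengths from longest to
    shortest and accumulate; the N50 size is the length of the sequence at
    which the running total first reaches half of the summed length, and
    the N50 number is how many sequences were needed to get there.
    """
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return count, length


def scaffold_total_from_gap(contig_total_mb, gap_fraction):
    """Estimate total scaffold size, assuming gap = 1 - contigs / scaffolds."""
    return contig_total_mb / (1.0 - gap_fraction)


# Toy lengths (hypothetical, not Daphnia data).
toy = [80, 70, 50, 40, 30, 20, 10]
number, size = n50(toy)

# Estimate from the two figures quoted in the table above.
estimated = scaffold_total_from_gap(166.2, 0.268)
```

The same `n50` function applies to scaffolds or contigs; only the input list of lengths differs.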
The data can be found at the JGI Daphnia Genome Portal, which has links to the sequences via ftp and a blast server:
The data with additions and genome exploration tools will soon be found in wFleaBase at:
Updates and the launching of new research tools will be announced to the DGC mailing list:
Participate in the community manual annotation of the genome sequence. How do I begin?
- Notify us if your contact information is not listed in the DGC people database.
- Subscribe to the DGC mailing list.
- Read this document and explore the sequence data.
- Register as an annotator. This registration is needed to record your identification with your annotation work.
- Manual annotations are primarily conducted via the JGI Genome Portal Track Editor and Annotation Page.
- Complete and submit the Annotation Plan form. First deadline is April 1, 2007.
- Once your project proposal is approved:
- Keep in close contact with the section leaders via the Listservs and the WIKI pages.
- Submit your annotations to be incorporated into the database by following the Standard Operation Procedures (coming soon).
- Quickly respond to requests from JGI/wFleaBase experts and from the section leaders.
- Make significant discoveries and meet the established deadlines for publishing this work in a journal volume devoted to this annotation project.
The release of this first draft assembly of a crustacean genome sequence represents a significant landmark for a collaborative venture among researchers aiming to create a model system for ecological and evolutionary genomic investigations. This opportunity to explore the genome biology of the waterflea D. pulex is made possible by the US Department of Energy's Joint Genome Institute - which conducted all of the sequencing for this project - and the Daphnia Genomics Consortium, whose members received funding from the National Science Foundation and from grants from member institutions.
Although we often hear claims that biology enters the post-genomic era following the release of genome sequences, this declaration is far from the truth. With access to a first release of the Daphnia genome sequence, we now face the enormous task of producing a reliable genome sequence assembly, of annotating genes and other functional elements within the Daphnia genome and of finding and filling sequence gaps. In short, our research community is now called to participate in the Daphnia Genome Project, beginning with our immediate need to validate the assembly and give meaning to the long strings of nucleotides. This effort will add value to the sequence data by informing everyone about Daphnia's biology and providing insights into the organization of a crustacean genome.
The purpose of this document is to outline a plan for producing an accurate sequence assembly and to organize a consortium-wide annotation effort. The Steering Committee asks you to tell us how you plan to contribute to this project, so that we can reduce redundancy and coordinate the publication of manuscripts. The major headings of this plan are:
- Description of the available data on the JGI and wFleaBase servers
- Validating the genome sequence assembly
- Preliminary annotations based on ab initio methods, by EST sequence matching and by Blast
- Analysis support for a consortium-wide annotation
- Support for publications describing the first analysis of the genome sequence
- Timeline and description of future releases
1. Description of the available data
The genome sequence data are linked below:
2. Validating the genome sequence assembly
Jeong-Hyeon Choi (CGB, Indiana University) leads a project that assigns confidence limits to the contigs and scaffolds, finds potentially misassembled regions, and develops methods for correcting errors in the genome scaffolds caused by repeated segments.
3. Automated annotations
To facilitate the community effort of annotating defined features of the genome, the following groups either have or will be conducting automated analyses of the full sequence data. The results will be added to the genome database as they become available with announcements being made at the DGC website and by the mailing list. Note that there is some overlap among the groups and that reconciliation efforts will follow.
JGI annotation plan
Andrea Aerts, Igor Grigoriev, Jeff Boore and their teams at the JGI offer an interface to several bioinformatics tools for studying genomes through a Genome Portal.
- Simple Search allows you to search for InterPro domain predictions or Smith-Waterman alignments to one or more JGI-predicted genes (gene models). You can also jump to a specific JGI gene model.
- Advanced Search provides powerful search features such as wildcard searches, searches against homologous proteins, and iterative searches. The tool also provides links to the Genome Browser and Protein page views of results and to the locus and transcript annotation areas.
- Alignment Search (BLAST and other programs) provides an interface to NCBI's Basic Local Alignment Search Tool, BLAT and other alignment search programs for queries against JGI genomes.
- Genome Browser lets you browse through JGI-predicted genes, view sequences, and study detailed alignments with nucleotide and amino acid sequences from relevant sequence databases.
- GO Browser uses the Gene Ontology Consortium's controlled vocabulary for organisms to present information about JGI-predicted genes that have automatically assigned GO terms.
- KEGG Browser uses the Kyoto Encyclopedia of Genes and Genomes and its hierarchies of metabolic and regulatory pathways to present information about JGI-predicted genes that have automatically assigned KEGG terms.
- KOG Browser uses euKaryotic Orthologous Groups from NCBI, a classification system based on orthologous relationships between genes in eukaryotes, to present information about JGI-predicted genes that have automatically assigned KOG identification numbers.
wFleaBase annotation plan
Don Gilbert and the wFleaBase team will provide the following automated annotations:
- Nine eukaryote proteome annotations (tBLASTn) (1) including Human, Mouse, Fruitfly, Worm, Arabidopsis, Yeast and other organisms
- EST and genetic marker locations on genome (BLASTn) (1)
- Various experimental data: Predicted protein mass spec (from J. Krijgsveld lab), SNPs (from K. Thomas lab)
- Gene predictor (SNAP/homology with location, transcripts, proteins) (1,3)
- Gene mapper (Exonerate in GeneWise mode) (2,3)
- Gene prediction combiner (Glean or Jigsaw) (2,3)
- Protein motif prediction: InterPro Scan analysis (2,3)
- GO gene function/process/location assignments (1)
- KEGG metabolic pathway assignments (2)
- UniProt annotations for matching eukaryote proteomes (2)
(1) Finished, but may need updating
(2) Under consideration; depends on collaborators' contributions
(3) Gene Predictions/Mappings. Recent gene prediction qualities for twelve Drosophila genomes have been assessed and summarized here, forming the basis for these choices. SNAP (Korf 2004) - guided by protein homology evidence - is one of the best ab initio predictors when (a) new genes are sought; (b) there are no close relatives with an experimentally verified genome annotation. SNAP works well on the range of eukaryote genomes (plant to animal, small to big) with minimal homology data. The drawback is that SNAP overpredicts, but in a way that identifies gene-like features better than other predictors. Exonerate (Slater & Birney 2005) is a recent improvement on GeneWise; both essentially map homologous genes to a genome. Exonerate will locate proteins of related organisms and organism ESTs with good fidelity. Other prediction tools have various qualities and may be better or worse for a given genome. The group of gene-prediction combiners, Glean, Jigsaw and others, can produce a better consensus gene prediction set given input from several predictors. InterPro Scan provides protein motif classifications; this is useful for identifying potential partial functions for predicted genes that lack strong homology.
CGB annotation plan
- These data will be analyzed with respect to the distribution of clones across the libraries representing a variety of ecological conditions and developmental stages and, with Justen Andrews and group, correlated with microarray experiments. The results will be archived in a relational database for querying.
- Using the assembled sequence data aligned to the genome, transcription factor binding sites will be identified and mapped in promoter regions. Their distribution across the genome will be outlined on the genome browser.
- ncRNA (e.g., miRNA) will be identified by combining - if possible - both wet-lab experimentation (size-fractionated cDNA library construction and sequencing) and bioinformatics.
Sun Kim and his group will annotate the sequence in relation to several pathway databases including KEGG and further characterize enzymes with motif tools developed by the group as well as those in the public domain. They will use three systems: compath.org, platcom.org/CGAS, and platom.org/CLASSEQ, which will be reconfigured for annotation. The motif tools can be found at: http://bio.informatics.indiana.edu/projects/MOTIF/
HCGS annotation plan
Kelley Thomas and his group will produce the following automated genome sequence descriptions.
- Polymorphisms and their distributions (SNPs, indels, etc.)
- Size and distribution of repeated DNA, including simple repeats and segmental duplications
EMBL and NCBI annotation plans
- (1) Delineation of orthologous groups based on an 'all against all' sequence similarity search across various eukaryotic genomes. This will be the basis for a number of follow-up analyses. This analysis will take 2-3 weeks to complete after a consensus gene set has been agreed on.
- (2) Ortholog distribution analysis across selected metazoans. This study will form the basis of gene gain/loss and expansion/reduction discussions, which are a key part of functional interpretations. This analysis will take 1 week to complete after (1) is done.
- (3) Computation of a phylogenetic tree based on single copy orthologs across metazoans. This study will require 4 weeks after (1) is completed.
- (4) An Interpro analysis. Evgeny Zdobnov, who will do this, developed Interpro and Interpro Scan and knows how to overcome some of the flaws with this approach. This part will take 2-3 weeks after (1) is done.
- (5) Interpro expansion analysis. This is based on a hierarchical Interpro Scan and gives p-values to family expansions etc. For Daphnia, we expect this analysis to be very useful as no close relative is sequenced. This section will take 1 week to complete after (4) is done.
- (6) Synteny analysis based on single copy orthologs, which might reveal some microsyntenies and even functionally constrained regions, although given the evolutionary distance this remains to be seen. This analysis will take 4 weeks to complete.
- (7) Reconstruction of gene and intron loss and gain processes in Crustacea, and analysis of the biological implications of some of the most suggestive gene families.
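To make the notion of single copy orthologs used in the list above concrete, here is a minimal Python sketch that filters orthologous groups down to those containing exactly one gene from every species considered. The input layout (a dictionary mapping a group identifier to a list of (species, gene) pairs), the function name, and the toy identifiers are all hypothetical. The actual grouping method used by the EMBL and NCBI teams is not described in this document.

```python
from collections import Counter


def single_copy_groups(groups, species):
    """Return IDs of groups with exactly one gene from each listed species.

    groups  -- dict: group ID -> list of (species, gene) pairs
               (hypothetical layout, chosen for illustration)
    species -- iterable of species names that must each appear exactly once
    """
    required = set(species)
    kept = []
    for group_id, members in groups.items():
        # Count how many genes each species contributes to this group.
        counts = Counter(sp for sp, _gene in members)
        if set(counts) == required and all(n == 1 for n in counts.values()):
            kept.append(group_id)
    return sorted(kept)


# Toy data (hypothetical identifiers, not real gene models).
toy_groups = {
    "OG1": [("daphnia", "d1"), ("fly", "f1"), ("human", "h1")],
    "OG2": [("daphnia", "d2"), ("daphnia", "d3"), ("fly", "f2"), ("human", "h2")],
    "OG3": [("daphnia", "d4"), ("fly", "f3")],
}
result = single_copy_groups(toy_groups, ["daphnia", "fly", "human"])
```

In this toy input only OG1 qualifies: OG2 contains a duplicated Daphnia gene and OG3 lacks a human member. Groups of this kind would be the input to the tree-building and synteny steps.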
University of Edinburgh annotation plan
Mark Blaxter and his group
More help needed - please
We still need to identify a group that will focus on a manual annotation of a selected 10 Mb region of the genome to validate the automated annotations.
4. Analysis support for a consortium-wide annotation
The Steering Committee is leading the community effort to analyze and annotate the genome. This committee has identified topics of interest based on Daphnia's biology and its evolutionary position among other sequenced organisms. Section leaders coordinate the analysis of the data along these topics, develop the text for sections of the main paper, and act as section editors for a special volume of related manuscripts that will accompany the first description of the Daphnia genome sequence (see below). Therefore, the annotation process is primarily powered by individual investigators (you) who are most familiar with and interested in a particular aspect of the genome. The JGI and wFleaBase will facilitate this annotation process by providing:
- FTP sites for retrieving the data
- Blast databases
- Gene predictions
- Results from similarity searches against other genomes
- Maps of the Daphnia genome that display various annotated features
- Computational infrastructure for making and recording discoveries
- Means to communicate with other participants for the annotation
- Online tutorials and documentation on how to use these tools
The most interesting findings and summaries of this annotation effort will be published in a main paper that also reports data relating to the genome sequencing and assembly. By virtue of our collaborative effort, this main paper targeting a high profile journal will be multi-authored by the major contributors to this project. Individual investigators will also report in full their findings within companion papers that will be published simultaneously with the main paper. The collection of research articles will reinforce the significance of this project for a variety of disciplines, including ecology, evolution, ecotoxicology and arthropod biology. wFleaBase will archive and support the assembly/annotation by creating "frozen versions" of the data throughout the process of refining the assembly and annotation long after this first release.
5. Support for publications describing the first analysis of the genome sequence
The Steering Committee will provide information about who is planning to do what analyses and coordinate the publication of manuscripts. It will be up to each participating group to meet deadlines for completing their analyses and for submitting their papers. Outstanding manuscripts will be considered for publication as companions to the main paper, printed within the same journal. Negotiations are also underway to identify an appropriate publisher for a special issue of peer-reviewed research papers entitled "Genome Biology of the Model Crustacean Daphnia". The Tome is currently partitioned into five sections and can be viewed and edited on this DGC Wiki site. This webpage is updated as topics of interest are added and people are assigned to them, with the aim of providing a thorough analysis of the genomic data.
At this time, we are requesting information about the analysis that your group wants to contribute to this genome annotation process. Please complete and submit the Annotation Plan form. The five sections are currently as follows.
6. Timeline and description of future releases
The following is a timeline for the annotation work leading to the publication of manuscripts describing the Daphnia genome sequence. This plan maximizes community participation and is possibly an ideal way to help build the research community around a new model system. The main paper serves to highlight findings that are detailed in companion papers and in studies simultaneously published in a special issue of a journal devoted to this project.
February 2007 - JGI provides early access to the Dpul JAZZ 1.0 assembly of the genome sequence with automated annotations, which is made available on the Genome Portal website. The DGC will have preliminary access to these data in order to return feedback on the utility of the genome browser and annotation tools and to design projects for analyzing and annotating the genome. Investigators are requested to communicate their research plans to coordinate the community-wide annotation effort.
March 2007 - One month after the prerelease, and depending on the feedback, the Genome Portal will be made public. The data with additional tools for analyses will also be made available on wFleaBase. Investigators submit Genome Annotation Forms describing their research plans. First deadline is April 1, 2007.
April 14, 2007 - The Steering Committee reviews the proposed research plans and informs the investigators on who is doing what analyses. Revisions are made to the Tome's outline for final negotiations with the publishers.
May 2007 - Upon finalizing agreements with a publisher, completed manuscripts can be submitted to begin the peer review process leading to a manuscript in press. Pre-prints will be published on the journal's official website until all manuscripts are ready to be collated for the special issue on the genome sequence.
September 2007 - Manuscripts for the special issue of peer-reviewed research papers are due. The manual annotations are frozen to produce Dpul JAZZ 1.1. A Jamboree is held at a location to be determined, where the Steering Committee plus around 20 invited experts will finalize the annotation by offering corrections and analyzing portions of the genome data that need additional attention. The main paper will be drafted at this time for distribution to all of the other co-authors for revisions.
We aim to publish all works with the release of Dpul JAZZ 1.1 and submission to GenBank in December, 2007.