Joshua Fortriede
Xenbase Gene Nomenclature Administrator

This readme file gives information on the creation and statistics of gene model annotation 1.8.3 (annotation version 3 of the 9.1 X. laevis genome build).

Files:
XL9.1_v3.genes.gff3.gz - contains all genes (45099) and ALL transcript models (46813)
XL9.1_v3.transcripts.primary.gff3.gz - contains all genes (45099) and primary transcript models (45099)
XL9.1_v3.transcripts.primary.cds.fa.gz - CDS FASTA file of primary transcript models
XL9.1_v3.transcripts.primary.mRNA.fa.gz - mRNA FASTA file of primary transcript models
XL9.1_v3.transcripts.primary.peptide.fa.gz - peptide FASTA file of primary transcript models
XL9.1_v3.transcripts.alternate.gff3.gz - contains alternate transcripts (1714) and their parent genes (1617)

=====================================================================

Description: Xenbase updates to X. laevis Gene Model Symbol Annotation

Merging and Validation:
Gene model symbol annotations were merged and validated from the following sources:
1) Xenopus Genome Consortium (XGC),
2) Xenbase,
3) Dr. Anne-Hélène Monsoro-Burq (anne-helene.monsoro-burq@curie.fr) and Jean-Louis Plouhinec, and
4) Dr. Emil Karaulanov (E.Karaulanov@imb-mainz.de).
These combined efforts have named 27325 of the 45099 (61%) gene models, an increase of 5624 over the original XGC annotation.

Method of Annotation:
1) The XGC method of annotation involved both a semi-automated pipeline written by Adam Session and over 1000 manual annotations. This resulted in 21701 gene models annotated with gene symbols, and an additional 2712 having X. tropicalis 9.0 gene model IDs.
2) The Xenbase semi-automated pipeline involved BLAST of gene mRNAs against the genome, pairwise protein alignment of overlapping genes and gene models, and recursive addition of genes and gene models to distinct clusters based on high-quality pairwise alignments.
Each cluster was then screened for either manual or automated annotation, based on its constituent gene models and genes. This method allowed a high level of confidence and resulted in updating 3885 annotations, adding 3434 annotations to unannotated models, and removing 23 erroneous annotations. This included updating symbols to official nomenclature.
3-4) The protocols of Dr. Monsoro-Burq and Dr. Karaulanov used only best-hit BLAST assignments. Collectively they updated another 825 gene model annotations and added 1210 annotations to unannotated models. Because the best-hit BLAST assignment method carries a lower level of confidence, all of these annotations were verified by low-level automated synteny with human.

Method of Data Validation and Merging:
The compendium of provisional annotation data from all groups was aggregated and analyzed to identify and verify symbol assignment to the gene models. Annotations that were congruent among all annotators were automatically assigned. For all other cases, each annotation was assessed to identify the human equivalent. For each of these annotations, a low-level, automated symbol-matching synteny analysis was performed between the human and Xenopus loci, using the symbols of up to fifteen neighbors both upstream and downstream of the target gene model in both species. Annotations that were confirmed by at least low-level synteny were considered verified and were preferred over annotations without synteny support. In cases where a model had more than one annotation with synteny support, manual validation was used to identify the correct assignment (324 models). For models where no annotations had synteny support, assignments from the XGC and Xenbase were used. Differences between the XGC and Xenbase were resolved via manual annotation (>400 models). All 1709 models annotated with an X. tropicalis 9.0 gene model ID are provisional and will require future work.
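
The neighbor-based symbol-matching check described above can be sketched roughly as follows. This is only an illustration and not the pipeline that was actually run. The function names, the input layout (one positionally ordered list of symbols per chromosome or scaffold, with None for unnamed models), the symbol normalization, and the min_shared threshold are all assumptions. The readme does not state how many shared neighbors count as "low level" synteny, so the default of 1 is a placeholder.

```python
def normalize(symbol):
    """Lowercase a symbol and drop a trailing .L/.S subgenome suffix (assumed convention)."""
    s = symbol.strip().lower()
    if s.endswith(".l") or s.endswith(".s"):
        s = s[:-2]
    return s


def neighbor_symbols(ordered_symbols, index, flank=15):
    """Normalized symbols of up to `flank` neighbors on each side of `index`.

    The target itself is excluded; unnamed neighbors (None or "") are skipped.
    """
    start = max(0, index - flank)
    upstream = ordered_symbols[start:index]
    downstream = ordered_symbols[index + 1:index + 1 + flank]
    return {normalize(s) for s in upstream + downstream if s}


def shared_neighbors(xen_symbols, xen_index, human_symbols, human_index, flank=15):
    """Symbols found near both the Xenopus gene model and the human candidate."""
    xen = neighbor_symbols(xen_symbols, xen_index, flank)
    human = neighbor_symbols(human_symbols, human_index, flank)
    return xen & human


def has_synteny_support(xen_symbols, xen_index, human_symbols, human_index,
                        flank=15, min_shared=1):
    """True if at least `min_shared` neighbor symbols match (threshold is hypothetical)."""
    shared = shared_neighbors(xen_symbols, xen_index, human_symbols, human_index, flank)
    return len(shared) >= min_shared
```

A candidate annotation for a gene model at position i of a scaffold would be checked with has_synteny_support(scaffold_symbols, i, human_chrom_symbols, j), where j is the position of the proposed human equivalent. Matching on symbols alone keeps the check cheap, which fits its role as a low-level filter rather than a full synteny analysis.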

Method of Assigning .L or .S:
Homeologs were used to identify the correct subgenome for scaffolds without L or S designations. Two scaffolds or chromosomes were considered homeologous if they shared at least three homeologous genes. Once a pair was linked, if one member had a known subgenome, the other could be assigned the opposite subgenome. For instance, chr9_10S and scaffold22 share at least 10 homeologs. Because chr9_10S belongs to the S subgenome, we can conclude that scaffold22 belongs to the L subgenome. Further, scaffold22 is homeologous with scaffold29 and scaffold87, with 18 and 10 homeologs, respectively, allowing S designations for scaffold29 and scaffold87. This iterative, daisy-chain process assigned subgenome designations to over 100 scaffolds.
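
The daisy-chain assignment above can be sketched as a breadth-first propagation over homeologous scaffold pairs. This is a minimal illustration, not the procedure actually used. The function name, the input layout (a dictionary of pair counts), and the conflict reporting are assumptions; only the threshold of three shared homeologs and the example scaffolds come from the text.

```python
from collections import deque

OPPOSITE = {"L": "S", "S": "L"}


def propagate_subgenomes(known, shared_homeologs, min_shared=3):
    """Spread L/S designations across homeologous scaffold pairs.

    known: dict mapping scaffold name -> "L" or "S"
    shared_homeologs: dict mapping (scaffold_a, scaffold_b) -> number of shared homeologs
    Returns (assigned, conflicts). Conflict reporting is an assumption of this sketch.
    """
    # Link only pairs that share at least `min_shared` homeologs.
    neighbors = {}
    for (a, b), count in shared_homeologs.items():
        if count >= min_shared:
            neighbors.setdefault(a, set()).add(b)
            neighbors.setdefault(b, set()).add(a)

    assigned = dict(known)
    conflicts = set()
    queue = deque(known)
    while queue:
        current = queue.popleft()
        expected = OPPOSITE[assigned[current]]
        for other in sorted(neighbors.get(current, ())):
            if other not in assigned:
                # A homeologous partner must sit on the other subgenome.
                assigned[other] = expected
                queue.append(other)
            elif assigned[other] != expected:
                conflicts.add(frozenset((current, other)))
    return assigned, conflicts


# Example from the text; the readme says chr9_10S and scaffold22 share
# "at least 10" homeologs, so 10 is used here as a stand-in.
known = {"chr9_10S": "S"}
pairs = {
    ("chr9_10S", "scaffold22"): 10,
    ("scaffold22", "scaffold29"): 18,
    ("scaffold22", "scaffold87"): 10,
}
assigned, conflicts = propagate_subgenomes(known, pairs)
# scaffold22 is assigned L; scaffold29 and scaffold87 are assigned S.
```

Any pair returned in conflicts links two scaffolds that ended up with the same designation and would need manual review.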