Lytechinus variegatus v. 3.0 Genome Assembly README

This README describes the files associated with the sea urchin Lytechinus variegatus v. 3.0 genome assembly, available for download via Echinobase's (www.echinobase.org) file transfer protocol (FTP). The files include the assembly itself as well as gene models, protein models, and functional annotations of these models through three pre-existing databases:
1) Strongylocentrotus purpuratus v. 4.2 gene models ("SPUs");
2) UniProt Knowledgebase (UniProtKB) protein database;
3) NCBI RefSeq Invertebrate protein database.

CONTACT

If you have questions about the assembly or datasets, please direct them to:
Phillip Davidson, phillip.davidson@duke.edu
Greg Wray, gwray@duke.edu

CITATION

If you use these data for your own work, please cite the following publication:

Phillip L Davidson, Haobing Guo, Lingyu Wang, Alejandro Berrio, He Zhang, Yue Chang, Andrew L Soborowski, David R McClay, Guangyi Fan, Gregory A Wray, Chromosomal-Level Genome Assembly of the Sea Urchin Lytechinus variegatus Substantially Improves Functional Genomic Analyses, Genome Biology and Evolution, Volume 12, Issue 7, July 2020, Pages 1080–1086, https://doi.org/10.1093/gbe/evaa101

FILE DESCRIPTIONS

The following files are available for download:

1) Lvar_3.0_scaffolds.fasta.gz
   Scaffold-level assembly of the L. variegatus v. 3.0 genome

2) Lvar_3.0_contigs.fasta.gz
   Contig-level assembly of the L. variegatus v. 3.0 genome

3) Lvar_3.0_annotations_v1.0.tar.gz
   Directory containing the version 1.0 gene models, protein models, and model annotations for the L. variegatus v. 3.0 assembly. These are the same models that were included in the print publication of the genome (see above) and were last updated May 2020.
   This directory includes:
   a) Lvar_3.0_v1.0.gff
      GFF file describing the feature locations of the gene models
   b) Lvar_3.0_v1.0.transcripts.noUTR.fasta
      FASTA file of the gene models with UTR sequence excluded
   c) Lvar_3.0_v1.0.transcripts.fasta
      FASTA file of the gene models with UTR sequence included when available
   d) Lvar_3.0_v1.0.proteins.fasta
      FASTA file of the translated gene models
   e) Lvar_3.0_v1.0.topSPUhit.txt
      Single best BLAST-P match of each gene model to the S. purpuratus v. 4.2 gene models
   f) Lvar_3.0_v1.0.topUNIPROThit.txt
      Single best BLAST-P match of each gene model to the UniProtKB protein models
   g) Lvar_3.0_v1.0.topREFSEQhit.txt
      Single best BLAST-P match of each gene model to the RefSeq Invertebrate protein models

4) Lvar_3.0_annotations_v2.0.tar.gz
   Directory containing the version 2.0 gene models, protein models, and model annotations for the L. variegatus v. 3.0 assembly. These models were last updated July 2020.
   This directory includes:
   a) Lvar_3.0_v2.0.gff
      GFF file describing the feature locations of the gene models
   b) Lvar_3.0_v2.0.transcripts.noUTR.fasta
      FASTA file of the gene models with UTR sequence excluded
   c) Lvar_3.0_v2.0.transcripts.fasta
      FASTA file of the gene models with UTR sequence included when available
   d) Lvar_3.0_v2.0.proteins.fasta
      FASTA file of the translated gene models
   e) Lvar_3.0_v2.0.topSPUhit.txt
      Single best BLAST-P match of each gene model to the S. purpuratus v. 4.2 gene models
   f) Lvar_3.0_v2.0.topUNIPROThit.txt
      Single best BLAST-P match of each gene model to the UniProtKB protein models
   g) Lvar_3.0_v2.0.topREFSEQhit.txt
      Single best BLAST-P match of each gene model to the RefSeq Invertebrate protein models

5) README.txt

ASSEMBLY BASIC STATISTICS

ASSEMBLY
  Assembly size              870.4 Mb
  No. scaffolds              104
  N50 scaffold length        45.6 Mb
  Longest scaffold           96.7 Mb
  No. scaffolds >10 Mb       19
  No. contigs                466
  N50 contig length          5.85 Mb
  No. contigs >1 Mb          285
  N (%)                      0.02
  GC (%)                     36.31
  BUSCO
    Complete                 95.50%
    Duplicated               0.60%
    Fragmented               0.80%
    Missing                  3.40%

ANNOTATIONS
  VERSION 1.0 (from publication, 5/20)
    No. gene models          27,232
    Average gene length      12.6 kb
    % Start codon            93.4
    % Start and stop codon   90.3
  VERSION 2.0 (from 7/20 update)
    No. gene models          29,837
    Average gene length      15.2 kb
    % Start codon            99.8
    % Start and stop codon   99.6

METHODOLOGY

ASSEMBLY

For a detailed description of the assembly process, please see the publication referenced above (Davidson et al. 2020, GBE, https://doi.org/10.1093/gbe/evaa101). Briefly, PacBio long-read sequences were used to initially assemble the genome at the contig level using Canu (Koren et al. 2017). Redundant (diploid) regions of the genome were removed, and the near-haploid assembly was polished for accuracy with Pilon (Walker et al. 2014) using 10x Genomics short-read sequencing data. Lastly, Hi-C chromatin conformation sequencing data were used to assemble the contigs into chromosome-level scaffolds using HiC-Pro (Servant et al. 2015), Juicer (Durand et al. 2016), and 3D-DNA (Dudchenko et al. 2017).

ANNOTATIONS

For a detailed description of the annotation process for the v. 1.0 gene models, please see the publication referenced above (Davidson et al. 2020, GBE, https://doi.org/10.1093/gbe/evaa101). Briefly, Maker (Campbell et al. 2014) was used in combination with gene prediction tools including Augustus (Stanke et al. 2006) and SNAP (Korf 2004). These tools were informed by pre-existing RNAseq (Israel et al. 2016) and protein (S. purpuratus v. 5.0) datasets (www.echinobase.org). Models were annotated with the three pre-existing protein databases mentioned above using BLAST-P (Camacho et al. 2008):
1) Strongylocentrotus purpuratus v. 4.2 gene models ("SPUs");
2) UniProt Knowledgebase (UniProtKB) protein database;
3) NCBI RefSeq Invertebrate protein database.
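
As a rough illustration of what "single best match" means for the top-hit files, the sketch below keeps the highest-scoring hit per gene model from BLAST tabular output. It assumes the standard 12-column tabular layout (query ID, subject ID, ..., e-value, bit score); this README does not state the column layout of the distributed top*hit.txt files or the exact ranking criterion used, and the function name and sample rows are hypothetical.

```python
import io

def top_hits(lines):
    """Keep the hit with the highest bit score for each query.

    Assumption: rows are BLAST tabular output with the query ID in
    column 1, subject ID in column 2, e-value in column 11, and
    bit score in column 12.
    """
    best = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        # Replace the stored hit only on a strictly higher score,
        # so the first of any tied hits is the one kept.
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best

# Hypothetical rows for illustration only (not taken from the dataset).
example = io.StringIO(
    "geneA\tSPU_000001\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t180.0\n"
    "geneA\tSPU_000002\t70.0\t100\t30\t0\t1\t100\t1\t100\t1e-20\t95.0\n"
    "geneB\tSPU_000003\t60.0\t80\t32\t0\t1\t80\t1\t80\t1e-10\t60.0\n"
)
hits = top_hits(example)
for query, (subject, evalue, bitscore) in hits.items():
    print(query, subject, evalue, bitscore)
```

Ranking by bit score is one common choice; ranking by e-value would be an equally reasonable reading of "best match".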

For the version 2.0 gene models from the July 2020 update, a different approach to gene model assembly and annotation was taken. Briefly, pre-existing RNAseq (Israel et al. 2016) and protein (S. purpuratus v. 5.0) datasets (www.echinobase.org) were integrated into the BRAKER genome annotation pipeline (Hoff et al. 2019). Next, gene models annotated as transposable elements were filtered out. Gene models were then corrected and extended with Trinity-assembled transcripts (Grabherr et al. 2011) using PASA (Haas et al. 2003). Finally, as before, models were annotated with the three pre-existing protein databases mentioned above using BLAST-P (Camacho et al. 2008):
1) Strongylocentrotus purpuratus v. 4.2 gene models ("SPUs");
2) UniProt Knowledgebase (UniProtKB) protein database;
3) NCBI RefSeq Invertebrate protein database.
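
For readers who want a quick sanity check on a downloaded GFF file, the sketch below counts gene features and computes their mean length. It assumes the usual nine tab-separated GFF columns and that gene models appear as features of type "gene"; neither the feature naming in the distributed files nor the way the average gene length in this README was calculated is stated here, so the output may not match the reported figures exactly. The function name and sample rows are hypothetical.

```python
import io

def gene_stats(lines, feature_type="gene"):
    """Return (count, mean length in bp) for one feature type.

    Assumption: nine tab-separated GFF columns, with the feature type
    in column 3 and 1-based inclusive start/end in columns 4 and 5.
    """
    lengths = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != feature_type:
            continue
        start, end = int(fields[3]), int(fields[4])
        lengths.append(end - start + 1)  # inclusive coordinates
    count = len(lengths)
    mean = sum(lengths) / count if count else 0.0
    return count, mean

# Hypothetical rows for illustration only (not taken from the dataset).
example = io.StringIO(
    "##gff-version 3\n"
    "chr1\tsrc\tgene\t1001\t3000\t.\t+\t.\tID=g1\n"
    "chr1\tsrc\tmRNA\t1001\t3000\t.\t+\t.\tID=g1.t1;Parent=g1\n"
    "chr1\tsrc\tgene\t5001\t6000\t.\t-\t.\tID=g2\n"
)
count, mean_len = gene_stats(example)
print(count, mean_len)
```

To run it on a real file, pass an open file handle instead of the in-memory example, for instance `gene_stats(open("Lvar_3.0_v2.0.gff"))` after extracting the annotation archive.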