The draft genome of horseshoe crab Tachypleus tridentatus reveals its evolutionary scenario and well-developed innate immunity [post]

2019 unpublished
Background: Horseshoe crabs are ancient marine arthropods with an evolutionary history extending back approximately 450 million years. They rely entirely on their innate immune system and have developed multiple defence systems. However, the genetic mechanisms underlying their ability to distinguish and defend against invading microbes are still unclear.
Results: Here, we describe the 2.06 Gbp genome assembly of Tachypleus tridentatus with 24,222 predicted protein-coding genes. Comparative genomics shows that T. tridentatus and the Atlantic horseshoe crab Limulus polyphemus have the most orthologues shared among two species, including genes involved in the immune-related JAK-STAT signalling pathway. Divergence time dating shows that the last common ancestor of Asian horseshoe crabs (including T. tridentatus) and L. polyphemus appeared approximately 130 Mya, and the split of the two Asian horseshoe crabs was dated to approximately 63 Mya (57-69). Hox gene analysis suggests two clusters in both horseshoe crab assemblies. Surprisingly, selective analysis of immune-related gene families revealed a high expansion of conserved pattern recognition receptors. Genes involved in the IMD and JAK-STAT signal transduction pathways also exhibited a certain degree of expansion in both genomes. Intact coagulation cascade-related genes were present in the T. tridentatus genome, with a higher number of coagulation factor genes. Moreover, most reported antibacterial peptides were identified in T. tridentatus, together with their potentially effective antimicrobial sites.
Conclusions: The draft genome of T. tridentatus would provide an important resource for reducing the uncertainty in the evolutionary relationships of Chelicerata. The expansion of conserved immune signalling pathway genes, coagulation factors and intact antimicrobial peptides in T. tridentatus underlies its robust and effective innate immunity for self-defence in marine environments with an enormous number of invading pathogens, and may influence how well it adapts to complex marine environments.
Background
Horseshoe crabs are marine arthropods, representing an ancient family with an evolutionary record extending back approximately 450 million years (1). Based on their static morphology and their position in the arthropod family tree, they have long been labelled "living fossils" (2).
Few horseshoe crab species survive today, and their distributions are narrow. Tachypleus tridentatus (Leach, 1819), an extant horseshoe crab species, is mainly distributed from coastal Southeast China to western Japan and on a few islands in Southeast Asia (3). Similar to other invertebrates, T. tridentatus relies entirely on its innate immune system, including haemolymph coagulation, phenoloxidase activation, cell agglutination, release of antibacterial substances, active oxygen formation and phagocytosis (4-8). This system relies on pattern-recognition receptors (PRRs) that detect pathogen-associated molecular patterns (PAMPs) present on the surface of microbes, such as lipopolysaccharides, lipoproteins and mannans (9). Upon recognition, PRRs trigger diverse signal transduction pathways, including the Toll, IMD, JAK-STAT and JNK pathways, that can produce immune-related effectors (10). Previous studies have investigated important signalling pathways and gene families in other arthropods, such as insects, crustaceans and myriapods, revealing extensive conservation and functional diversity among innate immune components across arthropods (11, 12). The molecular mechanisms by which horseshoe crabs distinguish "self" from "non-self" antigenic epitopes, the latter also known as PAMPs, remain to be explored.
The Atlantic horseshoe crab, Limulus polyphemus (Linnaeus 1758), is the most extensively investigated horseshoe crab species. It occupies a large latitudinal range of coastal and estuarine habitats along the west Atlantic coast from Maine to Florida in eastern North America, and along the eastern Gulf and around the Yucatán peninsula of Mexico (3, 13, 14).
Considerable research effort has been devoted to the structure and physiology of the Limulus visual system, including dark and light adaptation (15, 16) and the effects of central circadian clock regulation on visual sensitivity (17). Previous work has also focused on its opsin repertoire to further understand the evolution and diversification of visual systems in arthropods (18). However, a comprehensive comparative genomic study of immunity in T. tridentatus and L. polyphemus has not yet been carried out. Here, we present an analysis of the T. tridentatus genome sequence together with comparative [...]
[...] of 78×. The final draft assembly consists of 143,932 scaffolds with an N50 scaffold size of 165 kb, among which the longest scaffold is 5.28 Mb and the shortest is 1 kb. The GC content of the genome is 32.03% (Table 1). A total of 24,222 protein-coding genes were conservatively predicted in the T. tridentatus genome in this study. The average exon and intron lengths predicted for the assembly are 333 bp and 3,792 bp, respectively. A total of 88.25% of the predicted genes were annotated by comparison with the NCBI non-redundant database, the KEGG database (19) and the InterPro database (20).
Repeat annotation
Similarity-based screening with RepeatMasker (21) identified 20.29 Mb of repeats in T. tridentatus, representing 0.99% of the genome size. Most of the identified repeat sequences were simple repeats (0.77%). To estimate repeat sequences that are more difficult to detect in the draft assembly, RepeatModeler (22) was used to predict potential but previously unidentified repeats. Based on this analysis, repeat elements totalled 34.83% of the T. tridentatus genome, including 13.26% transposable elements. Long interspersed elements (LINEs) composed the largest portion at 6.21%. LTR elements (1.72%) and DNA elements (5.33%) were also detected in the T. tridentatus genome.
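As a quick consistency check on the repeat figures reported above, the following minimal Python sketch recomputes the RepeatMasker proportion from the quoted sizes and sums the three transposable element classes. It assumes the rounded values given in the text (2.06 Gbp genome, 20.29 Mb of repeats); the variable and function names are illustrative and are not taken from the original analysis.

```python
# Rounded values quoted in the text (assumed exact for this check).
GENOME_SIZE_BP = 2.06e9   # 2.06 Gbp assembly
MASKED_BP = 20.29e6       # 20.29 Mb identified by RepeatMasker


def percent(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Share of the genome covered by similarity-based repeat hits.
repeatmasker_pct = percent(MASKED_BP, GENOME_SIZE_BP)

# Transposable element classes reported from the RepeatModeler-based run,
# as percentages of the genome.
te_classes = {"LINE": 6.21, "LTR": 1.72, "DNA": 5.33}
te_total = sum(te_classes.values())

print(f"RepeatMasker repeats: {repeatmasker_pct:.2f}% of the genome")
print(f"Transposable elements: {te_total:.2f}% of the genome")
```

With these inputs the first value comes out close to the reported 0.99% (the small difference is due to rounding of the genome size), and the three element classes add up to the reported 13.26%.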
To assess the reliability of the repeat screening by RepeatMasker and RepeatModeler, we also performed repeat analysis of the L. polyphemus genome for reference. Similar results were obtained: repeat sequences represented 1.11% and 34.24% of the L. polyphemus genome, respectively. Given that RepeatMasker uses similarity to known repeat sequences in the Repbase database to identify repeats in the input sequence, this suggests that the repeat sequences of horseshoe crabs differ greatly from known homologous repeats.
Assembly assessment
The completeness of the T. tridentatus genome assembly was assessed using the transcriptome data of the embryonic sample at Stage 21 (the hatch-out stage) of T. tridentatus (23). It was found that 99.04% of the transcriptome contigs aligned to the assembly scaffolds, with an e-value cut-off of 10⁻³⁰. To further confirm the completeness of the predicted genes, the commonly used genome assembly validation pipeline BUSCO (24) was run with the 1,066-gene BUSCO Arthropoda set. The predicted genes of T. tridentatus recovered 98.7% of the conserved proteins, with 1,052 BUSCOs (76.6% complete single-copy BUSCOs, 10.8% complete duplicated BUSCOs and 11.3% fragmented BUSCOs). Only 1.3% of the benchmarked universal single-copy orthologous groups of arthropod genes were missing in the assembly. This demonstrates that most of the evolutionarily conserved core genes are present in the T. tridentatus genome, suggesting a high degree of completeness of the genome assembly and predicted gene repertoire of T. tridentatus.
Phylogeny analysis and divergence time dating
Two L. polyphemus assemblies have been previously documented (18, 25); the one with the higher assembly level was selected for comparative genomics. The OrthoMCL (26) calculation resulted in a total of 12,116 orthologous groups in the genomes of T. tridentatus and L. polyphemus.
Of these, 10,968 orthologous groups contained genes found in both horseshoe crab genomes, with 15,905 T. tridentatus and 20,390 L. polyphemus genes included; moreover, approximately 6,880 of the shared genes were single-copy. Functional enrichment analysis showed that these shared genes were involved in several important pathways (p-value < 0.05), such as metabolic pathways (pyruvate, glycerolipid, amino sugar, nucleotide sugar and so on), ribosome biogenesis and DNA replication. The analysis also identified 1,418 protein-coding genes that were present only in T. tridentatus and 1,956 genes that were specific to L. polyphemus.
To place T. tridentatus within the current understanding of the evolution of Chelicerata species, phylogenetic and comparative genomic analyses of T. tridentatus and 11 other Chelicerata as well as one Myriapoda outgroup were conducted. The phylogenetic tree was rooted using the centipede S. maritima as the outgroup (Figure 1a). Strong bootstrap support was obtained for the spider, mite and tick clades, which formed a monophyletic group. T. tridentatus and L. polyphemus were grouped together, forming the Xiphosura clade. The comparative genomic analysis of the 14 species revealed 14,479 orthologous groups containing genes in at least two different species, among which 1,993 shared groups were commonly distributed in all sampled species, with 111 single-copy orthologues (Figure 1b). The single-copy genes were enriched for KEGG pathways such as ribosome, oxidative phosphorylation, proteasome, metabolic pathways and carbon metabolism. Additionally, T. tridentatus and L. polyphemus had the most orthologues shared between two species (2,720 (22.2%) and 2,648 (21.5%)). Pathway enrichment of these genes showed significant enrichment (p-value < 0.01) for neuroactive ligand-receptor interaction, FoxO signalling pathway and AGE-RAGE signalling pathway in diabetic complications.
The latter two KEGG pathways include the important JAK-STAT signalling pathway genes related to innate immunity in arthropods. With respect to species-specific genes, C. sculpturatus had the most (7,328) expanded species-unique genes, followed by 6,247 N. clavipes-specific gene families. In contrast, only 161 genes were unique to T. mercedesae. The numbers of species-specific genes in T. tridentatus and L. polyphemus fell in between, at 1,124 and 857, respectively. Nevertheless, considering the fragmentation of the draft genomes, the analysed genomes may contain more coding genes. The species-specific genes described here therefore reflect only the results based on the draft genomes.
The divergence time estimate results for the 7 Chelicerata species showed that the last common [...]
[...] 3.3) (89). The predicted genes were annotated by comparison with the NCBI non-redundant database (NR), the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (19) and the InterPro database (20) with an E-value threshold of 10⁻⁵. The three annotation results were combined as the annotation of the predicted genes. The benchmarking sets of universal single-copy orthologues (BUSCOs) (24) were used to assess the completeness of the predicted genes with the 1,066-gene Arthropoda dataset. The repeat contents of T. tridentatus and L. polyphemus were first analysed using RepeatMasker (version open-4.0.5) (21) with Merostomata as the query species, run with rmBLASTn (version 2.2.27+) (90) and RepBase (version 20140131) (91). RepeatModeler (version open-1.0.11) (22) was used to build the repeat database, which was then used for masking with RepeatMasker.
Transcriptome analysis
RNA-seq raw data obtained from the embryonic sample at Stage 21 (the hatch-out stage) of T. tridentatus were downloaded from the NCBI SRA database (accession number SRX330201) (23). De novo assembly of the transcriptome was performed with Trinity (92) with default parameters.
Protein-coding sequences and the longest transcript ORFs were predicted via TransDecoder (92).
Orthology and phylogeny analysis
The published genome assembly, coding sequences and protein sequences for the Atlantic horseshoe crab L. polyphemus submitted by Washington University were downloaded from NCBI with RefSeq ID 2304488, accession GCF_000517525.1. The non-redundant protein sequences in L. polyphemus were selected by sorting the proteins by scaffold position and filtering out overlapping proteins. Protein sequences of the other 11 Chelicerata species, including three Araneae (Nephila clavipes, Parasteatoda tepidariorum, and Stegodyphus mimosarum), seven Acari (Varroa jacobsoni, Tropilaelaps mercedesae, Sarcoptes scabiei, Metaseiulus occidentalis, Tetranychus urticae, Varroa destructor, and Ixodes scapularis), and one scorpion (Centruroides sculpturatus), were downloaded from the GenBank database. Additional outgroup protein sequences from the centipede Strigamia maritima were [...]
Declarations
Ethics approval and consent to participate
Not applicable
doi:10.21203/rs.2.9427/v3 fatcat:hprvcct2unahjjanmqxti7uztu