2 Systematic prediction and annotation of included genomes

Many web services support researchers working with model organisms, but such platforms are scarce for non-model organisms, particularly medicinal plants. Although the genomes of more than 100 medicinal plants have been sequenced, only 20 of them have been released with gene annotations in NCBI, and even some of these annotations show obvious information loss. This makes it difficult to use the reference genomes consistently, and harder still to identify shared gene IDs for comparing data across studies.

Here we built a Nextflow-based pipeline for systematic gene prediction that integrates ab initio, RNA-seq-based, and homologous-protein-based gene prediction. The final gene sets were then passed to EggNOG and PFAM for functional annotation, producing Gene Ontology and KEGG annotations for these genes.


Figure 2.1: The basic workflow shows the process of gene annotation.
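How the three evidence types are combined is not detailed in this section. As a toy illustration of what such an integration step can look like, the sketch below merges overlapping predictions into loci and keeps only loci backed by a minimum number of distinct evidence sources. The overlap rule, the `min_support` threshold, the tuple layout, and the function name `supported_loci` are all assumptions made for illustration. This is not the pipeline's actual logic, and strand is ignored for brevity.

```python
from collections import defaultdict


def supported_loci(predictions, min_support=2):
    """Merge overlapping predictions and keep well-supported loci.

    predictions: iterable of (source, chrom, start, end) tuples with
    1-based inclusive coordinates (hypothetical layout for this sketch).
    Returns a list of (chrom, start, end, sorted_sources).
    """
    by_chrom = defaultdict(list)
    for source, chrom, start, end in predictions:
        by_chrom[chrom].append((start, end, source))

    loci = []
    for chrom in sorted(by_chrom):
        cur_start = cur_end = None
        sources = set()
        for start, end, source in sorted(by_chrom[chrom]):
            if cur_end is not None and start <= cur_end:
                # Overlaps the current locus: extend it and record the source.
                cur_end = max(cur_end, end)
                sources.add(source)
            else:
                # Close the previous locus if enough sources support it.
                if cur_end is not None and len(sources) >= min_support:
                    loci.append((chrom, cur_start, cur_end, sorted(sources)))
                cur_start, cur_end, sources = start, end, {source}
        if cur_end is not None and len(sources) >= min_support:
            loci.append((chrom, cur_start, cur_end, sorted(sources)))
    return loci


# Made-up example predictions, not real data.
example = [
    ("ab_initio", "chr1", 100, 500),
    ("rnaseq", "chr1", 120, 480),
    ("homology", "chr1", 900, 1200),
    ("ab_initio", "chr2", 10, 50),
    ("homology", "chr2", 40, 90),
    ("rnaseq", "chr2", 45, 60),
]
loci = supported_loci(example)
```

In this example the `chr1` locus at 900 to 1200 is dropped because only the homology source supports it, while the other two loci are kept.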

2.1 Species with multiple genome assemblies

Currently, we choose only one assembly per species because of time limitations. Normally, the assembly labeled as the reference in NCBI is chosen.
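The selection rule above can be sketched as a small function that keeps one assembly per species and prefers the one flagged as reference. This is a minimal sketch, not the procedure actually used: the record fields (`species`, `accession`, `refseq_category`) are modeled on the columns of NCBI assembly summary files, the fallback to "representative genome" and the accession-based tie-break are assumptions, and the accessions shown are placeholders.

```python
# Lower rank is preferred. The category strings are assumed to follow the
# refseq_category column of NCBI assembly summary files.
CATEGORY_RANK = {"reference genome": 0, "representative genome": 1}


def pick_assembly(records):
    """Return {species: accession}, choosing one assembly per species.

    records: iterable of dicts with the keys 'species', 'accession',
    and 'refseq_category' (hypothetical layout for this sketch).
    Ties are broken by accession so that the result is deterministic.
    """
    best = {}
    for rec in records:
        rank = CATEGORY_RANK.get(rec["refseq_category"], 2)
        key = (rank, rec["accession"])
        species = rec["species"]
        if species not in best or key < best[species]:
            best[species] = key
    return {species: key[1] for species, key in best.items()}


# Placeholder records, not real NCBI entries.
records = [
    {"species": "Species A", "accession": "GCA_000000001.1", "refseq_category": "na"},
    {"species": "Species A", "accession": "GCA_000000002.1", "refseq_category": "reference genome"},
    {"species": "Species B", "accession": "GCA_000000004.1", "refseq_category": "na"},
    {"species": "Species B", "accession": "GCA_000000003.1", "refseq_category": "na"},
]
chosen = pick_assembly(records)
```

For species with no flagged assembly, this sketch simply falls back to the first accession in sort order; a real selection would more likely weigh assembly quality metrics.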

Next, we will add other high-quality assemblies to provide more information about these species.

2.2 Species without published genome assemblies

  1. Perform de novo RNA-seq assembly to generate reference genes, followed by annotation and quantification, so that these species can be included in IMP.