D3G Release 22.02

D3G provides a set of genomic data to support the development of oligonucleotide therapeutics. It consists of data derived from our own experiments as well as from publicly available databases. Genome sequences, gene models, and expression data sets are compiled for five species: human, crab-eating macaque, common marmoset, mouse, and rat.

What’s new

- All of the genome assemblies and gene models are updated
- Our genome assembly of crab-eating macaque was additionally polished to serve as the reference in RefSeq
- An experimental track of human long-read RNA-seq is included in the genome browser visualization

Data use / embargo

Here we release our original data sets under CC-BY to support the development of oligonucleotide therapeutics. We encourage anyone (for example, at academic institutes, commercial companies, regulatory agencies, or elsewhere) to use the data sets freely for the development or assessment of drugs.

Meanwhile, we ask users to respect the embargo on publishing genome-wide analyses based on these data sets, because we are preparing manuscripts that will provide genome-scale analyses of the non-human primates, with a complete description of the experimental and computational details and the raw data. The embargo does not apply to analyses of a few loci, gene families, or oligonucleotide sequences, as opposed to comprehensive large-scale analyses.

Data source

Human
- Genome assembly: GRCh38/hg38.p13
- Gene models
  - RefSeq: 109.20211119 (2021-11-23)
  - Gencode: V39
Crab-eating macaque (cynomolgus macaque)
- Genome assembly: macFasRKS1912v2 (GCF_012559485.2)
- Gene models
  - RefSeq: NCBI Macaca fascicularis Annotation Release 102 (direct download from NCBI)
  - TRaC: 22.02
    - built from CAGE and RNA-seq profiles (data from PRJDB9546 and others), and MinION sequencing of full-length cDNAs prepared by cap-trapping.
- Expression: CAGE profiles obtained from the above
Common marmoset
- Genome assembly: calJac4
- Gene models
  - RefSeq: NCBI Callithrix jacchus Annotation Release 105 (2020-07-11)
  - TRaC: 22.02
    - built from CAGE profiles and RNA-seq profiles (data from PRJDB9547 and others), and MinION sequencing of full-length cDNAs prepared by cap-trapping.
- Expression: CAGE profiles obtained from the above
Mouse
- Genome assembly: GRCm39/mm39
- Gene models
  - RefSeq: NCBI Mus musculus Annotation Release 109 (2020-09-23)
  - Gencode: VM28
Rat
- Genome assembly: mRatBN7.2/rn7
- Gene models
  - RefSeq: NCBI Rattus norvegicus Annotation Release 108 (2021-01-21)

The genome assembly of crab-eating macaque, a polished version of the one previously described ¹, has been registered as GCF_012559485.2, and its gene models were constructed by the RefSeq team. These data files (genome assembly and gene models) were downloaded directly from NCBI and compiled for D3G. Expression data for crab-eating macaque and common marmoset are based on our own CAGE profiles. The other data sets, namely the reference genomes ² and gene models (RefSeq ³ and Gencode ⁴) for human, mouse, and rat, were obtained from the UCSC Genome Browser Database in February 2022.

Data files

|-- calJac4
|   |-- expression
|   |   |-- calJac4_ncbiRefseqPromoterExpCounts.txt.gz
|   |   `-- calJac4_ncbiRefseqPromoterExpCpm.txt.gz
|   |-- gene
|   |   |-- calJac4_Trac2202.bed12.gz
|   |   |-- calJac4_ncbiRefSeq.bed12.gz
|   |   |-- calJac4_ncbiRefSeq_prespliced.fa.gz
|   |   `-- calJac4_ncbiRefSeq_spliced.fa.gz
|   `-- genome
|       `-- calJac4.fa.gz
|-- hg38
|   |-- gene
|   |   |-- hg38_ncbiRefSeqCurated.bed12.gz
|   |   |-- hg38_ncbiRefSeqCuratedProteinCoding.bed12.gz
|   |   |-- hg38_ncbiRefSeqCuratedProteinCoding_prespliced.fa.gz
|   |   |-- hg38_ncbiRefSeqCuratedProteinCoding_spliced.fa.gz
|   |   |-- hg38_ncbiRefSeqCurated_prespliced.fa.gz
|   |   |-- hg38_ncbiRefSeqCurated_spliced.fa.gz
|   |   |-- hg38_ncbiRefSeqPredicted.bed12.gz
|   |   `-- hg38_wgEncodeGencodeCompV39.bed12.gz
|   `-- genome
|       `-- hg38.p13.fa.gz
|-- macFasRKS1912v2
|   |-- expression
|   |   |-- macFasRKS1912v2_ncbiRefseqPromoterExpCounts.txt.gz
|   |   `-- macFasRKS1912v2_ncbiRefseqPromoterExpCpm.txt.gz
|   |-- gene
|   |   |-- macFasRKS1912v2_Trac2202.bed12.gz
|   |   |-- macFasRKS1912v2_ncbiRefSeq.bed12.gz
|   |   |-- macFasRKS1912v2_ncbiRefSeq_prespliced.fa.gz
|   |   `-- macFasRKS1912v2_ncbiRefSeq_spliced.fa.gz
|   `-- genome
|       `-- macFasRKS1912v2.fa.gz
|-- md5sum.txt
|-- mm39
|   |-- gene
|   |   |-- mm39_ncbiRefSeqCurated.bed12.gz
|   |   |-- mm39_ncbiRefSeqCuratedProteinCoding.bed12.gz
|   |   |-- mm39_ncbiRefSeqCuratedProteinCoding_prespliced.fa.gz
|   |   |-- mm39_ncbiRefSeqCuratedProteinCoding_spliced.fa.gz
|   |   |-- mm39_ncbiRefSeqCurated_prespliced.fa.gz
|   |   |-- mm39_ncbiRefSeqCurated_spliced.fa.gz
|   |   |-- mm39_ncbiRefSeqPredicted.bed12.gz
|   |   `-- mm39_wgEncodeGencodeCompVM28.bed12.gz
|   `-- genome
|       `-- mm39.fa.gz
|-- release_22.02.txt
`-- rn7
    |-- gene
    |   |-- rn7_ncbiRefSeqCurated.bed12.gz
    |   |-- rn7_ncbiRefSeqCuratedProteinCoding.bed12.gz
    |   |-- rn7_ncbiRefSeqCuratedProteinCoding_prespliced.fa.gz
    |   |-- rn7_ncbiRefSeqCuratedProteinCoding_spliced.fa.gz
    |   |-- rn7_ncbiRefSeqCurated_prespliced.fa.gz
    |   |-- rn7_ncbiRefSeqCurated_spliced.fa.gz
    |   `-- rn7_ncbiRefSeqPredicted.bed12.gz
    `-- genome
        `-- rn7.fa.gz

Files with the .fa.gz suffix contain nucleotide sequences in FASTA format, files with the .bed12.gz suffix contain the exon-intron structure of gene models in BED12 format (BED format with 12 columns), and files with the .txt.gz suffix contain tab-delimited text of expression intensities.
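
As a minimal sketch of how a BED12 record encodes exon-intron structure, the Python snippet below parses one line into exon intervals and derives the introns between them. The column layout follows the standard BED convention (blockSizes and blockStarts in columns 11 and 12, with block starts relative to the transcript start). The function name parse_bed12 and the example record are hypothetical and are not taken from the D3G files.

```python
from typing import NamedTuple


class Bed12(NamedTuple):
    chrom: str
    start: int    # 0-based start of the gene model
    end: int      # end of the gene model (exclusive)
    name: str
    strand: str
    exons: list   # (start, end) genomic intervals, 0-based half-open


def parse_bed12(line: str) -> Bed12:
    """Parse one tab-delimited BED12 line into a Bed12 record."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 12:
        raise ValueError("expected 12 tab-delimited columns")
    start, end = int(f[1]), int(f[2])
    block_count = int(f[9])
    # blockSizes / blockStarts are comma-separated and may end with a comma
    sizes = [int(x) for x in f[10].split(",") if x]
    offsets = [int(x) for x in f[11].split(",") if x]
    if not (len(sizes) == len(offsets) == block_count):
        raise ValueError("blockCount does not match blockSizes/blockStarts")
    # Block starts are offsets from the start of the gene model
    exons = [(start + o, start + o + s) for o, s in zip(offsets, sizes)]
    return Bed12(f[0], start, end, f[3], f[5], exons)


# Hypothetical record; coordinates and name are invented for illustration
example = "chr1\t1000\t2000\ttx1|GENE1;1\t0\t+\t1100\t1900\t0\t3\t100,200,300,\t0,400,700,"
record = parse_bed12(example)

# Introns are the gaps between consecutive exons
introns = [(a_end, b_start)
           for (_, a_end), (b_start, _) in zip(record.exons, record.exons[1:])]
```

To read a compressed file, each line could be passed to this function while iterating over gzip.open(path, "rt") from the Python standard library.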

Nucleotide sequences are compiled from the genomic coordinates of the gene models and the genome sequences, for a selected subset of the gene models only. Files with prespliced in their names contain immature transcripts before splicing, equivalent to pre-mRNA for protein-coding transcripts (the term “pre-spliced” is used so that it also applies to long noncoding RNAs). Note that the protein-coding transcripts here include those encoded on both nuclear and mitochondrial DNA. As a result, the protein-coding transcripts compiled from RefSeq ³ include YP_ (for human) or NP_ (for mouse) entries in addition to NM_ ones.
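
The relationship between the prespliced and spliced sequences can be sketched as follows. This is an illustration only, not the pipeline used to build D3G: it assumes exon coordinates are 0-based and half-open (as in BED), that minus-strand transcripts are reported as the reverse complement, and it uses an invented toy chromosome. The function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def transcript_sequences(chrom_seq, exons, strand):
    """Return (prespliced, spliced) sequences for one gene model.

    chrom_seq: sequence of the chromosome the model lies on
    exons:     sorted list of (start, end), 0-based half-open, genomic order
    strand:    "+" or "-"
    """
    tx_start, tx_end = exons[0][0], exons[-1][1]
    prespliced = chrom_seq[tx_start:tx_end]               # exons and introns
    spliced = "".join(chrom_seq[s:e] for s, e in exons)   # exons only
    if strand == "-":
        prespliced = reverse_complement(prespliced)
        spliced = reverse_complement(spliced)
    return prespliced, spliced


# Toy chromosome (invented); lowercase marks the intron for readability
toy_chrom = "NNATGCCCggttGGGTAANN"
toy_exons = [(2, 8), (12, 18)]
pre_plus, spl_plus = transcript_sequences(toy_chrom, toy_exons, "+")
pre_minus, spl_minus = transcript_sequences(toy_chrom, toy_exons, "-")
```

On the plus strand, the prespliced sequence here keeps the lowercase intron while the spliced sequence joins the two exons directly.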

The identifier (ID) of a RefSeq gene model has the form REFSEQ_TRANSCRIPT_ID|GENE_SYMBOL;ENTREZ_GENE_ID, as in the example NM_000454.4|SOD1;6647 from the hg38_ncbiRefSeqCurated.bed12.gz file. In the FASTA files of nucleotide sequences, further information such as genomic coordinates is appended after an additional |.
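
A small Python sketch of splitting such an identifier into its parts is given below. The function name parse_refseq_id is hypothetical. Because the exact layout of the extra fields in FASTA headers is not specified here, anything after a second | is kept as an unparsed string.

```python
def parse_refseq_id(identifier: str) -> dict:
    """Split 'TRANSCRIPT_ID|GENE_SYMBOL;ENTREZ_GENE_ID[|extra]' into its parts."""
    if "|" not in identifier:
        raise ValueError("identifier lacks the '|' separator")
    transcript_id, rest = identifier.split("|", 1)
    # FASTA headers may carry further fields after another '|'; keep them raw
    gene_part, _, extra = rest.partition("|")
    if ";" not in gene_part:
        raise ValueError("identifier lacks the ';' separator")
    gene_symbol, _, entrez_gene_id = gene_part.rpartition(";")
    return {
        "transcript_id": transcript_id,
        "gene_symbol": gene_symbol,
        "entrez_gene_id": entrez_gene_id,
        "extra": extra,
    }


parsed = parse_refseq_id("NM_000454.4|SOD1;6647")
```

For the example ID, this gives the transcript NM_000454.4, the gene symbol SOD1, and the Entrez Gene ID 6647 (kept as a string).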

For expression data, files with ExpCounts in their names contain counts of the 5'-ends of CAGE reads aligned to the genome sequences, and files with ExpCpm contain expression intensities in which the read counts are normalized to CPM (counts per million) with the RLE (relative log expression) method ⁵. Files whose names include ‘ncbiRefseqPromoter’ contain promoter-level expression data, derived from CAGE peaks, reported for RefSeq gene models rather than for the CAGE peaks themselves.
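
The exact normalization procedure is not spelled out here, so the following Python sketch shows only one plausible reading of "CPM with RLE": size factors are computed as the median ratio of each sample's counts to the per-feature geometric mean (the idea described in ⁵), converted to factors relative to library size, rescaled so that they multiply to 1, and then used to adjust the library size in the CPM calculation. The rescaling step, the function name rle_cpm, and the toy counts are assumptions for illustration.

```python
import math
from statistics import median


def rle_cpm(counts):
    """counts: dict of sample name -> list of raw counts (same feature order).

    Returns a dict of sample name -> list of CPM values using RLE factors.
    """
    samples = list(counts)
    n_features = len(counts[samples[0]])

    # Log geometric mean per feature, using only features detected in all samples
    log_geo_means = {}
    for i in range(n_features):
        values = [counts[s][i] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means[i] = sum(math.log(v) for v in values) / len(values)
    if not log_geo_means:
        raise ValueError("no feature has non-zero counts in all samples")

    # Size factor: median ratio of a sample's counts to the geometric means
    size_factors = {
        s: math.exp(median(math.log(counts[s][i]) - g
                           for i, g in log_geo_means.items()))
        for s in samples
    }
    lib_sizes = {s: sum(counts[s]) for s in samples}

    # Express factors relative to library size, rescaled to multiply to 1
    raw = {s: size_factors[s] / lib_sizes[s] for s in samples}
    scale = math.exp(sum(math.log(v) for v in raw.values()) / len(raw))
    norm_factors = {s: raw[s] / scale for s in samples}

    # CPM with the effective library size (library size x normalization factor)
    return {
        s: [c / (lib_sizes[s] * norm_factors[s]) * 1e6 for c in counts[s]]
        for s in samples
    }


# Toy counts (invented): sample B is sample A sequenced twice as deeply
toy_counts = {"A": [10, 20, 30, 0], "B": [20, 40, 60, 0]}
toy_cpm = rle_cpm(toy_counts)
```

In this toy case the two samples differ only in sequencing depth, so their normalized values coincide.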

Experimental data and their computational processing

We generated the following data to construct 5'-end complete gene models by a novel approach (TRaC, Transcript models based on RNA-seq, CAGE, and long read sequencing by Oxford Nanopore sequencing).

- RNA-seq
- CAGE, based on the ssCAGE protocol ⁶
- Long-read (Oxford Nanopore) sequencing of full-length cDNA prepared by cap-trapping

Leadership and contact

The original data were produced through a collaboration among RIKEN PMI, RIKEN IMS, Shiga University of Medical Science, CIEA (Central Institute for Experimental Animals), Keio University, RIKEN BDR, DBCLS (Database Center for Life Science), NIBIOHN (National Institutes of Biomedical Innovation, Health and Nutrition), and TMIMS (Tokyo Metropolitan Institute of Medical Science). Please contact us at the e-mail address below with any questions, comments, suggestions, or proposals for collaboration:

d3g@ml.riken.jp

How to cite

Please cite our database as follows:

D3G: Database for Drug Development based on Genome and RNA sequences, https://d3g.riken.jp, 2022

Acknowledgement

We thank AMED (Japan Agency for Medical Research and Development), NIHS (National Institute of Health Sciences), JPMA (Japan Pharmaceutical Manufacturers Association), and the FANTOM consortium for their advice and fruitful discussions. The experiments and the database are financially supported by AMED under Grant Numbers JP17kk0305008 and JP20kk0305013.

¹ Jayakumar V, Nishimura O, Kadota M, Hirose N, Sano H, Murakawa Y, Yamamoto Y, Nakaya M, Tsukiyama T, Seita Y, Nakamura S, Kawai J, Sasaki E, Ema M, Kuraku S, Kawaji H, Sakakibara Y. Chromosomal-scale de novo genome assemblies of Cynomolgus Macaque and Common Marmoset. Sci Data. 2021;8:159. doi:10.1038/s41597-021-00935-6
² Church DM, Schneider VA, Graves T, et al. Modernizing reference genome assemblies. PLoS Biol. 2011;9(7):e1001091. doi:10.1371/journal.pbio.1001091
³ O'Leary NA, Wright MW, Brister JR, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016;44(D1):D733-D745. doi:10.1093/nar/gkv1189
⁴ Harrow J, Frankish A, Gonzalez JM, et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 2012;22(9):1760-1774. doi:10.1101/gr.135350.111
⁵ Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11:R106. doi:10.1186/gb-2010-11-10-r106
⁶ Morioka MS, Kawaji H, Nishiyori-Sueki H, et al. Cap Analysis of Gene Expression (CAGE): A Quantitative and Genome-Wide Assay of Transcription Start Sites. Methods Mol Biol. 2020;2120:277-301. doi:10.1007/978-1-0716-0327-7_20