D3G Release 21.01
==================

D3G provides a set of genomic data to support the development of oligonucleotide therapeutics. It combines data derived from our own experiments with data from publicly available databases: genome sequences, gene models, and expression data sets are compiled for five species (human, crab-eating macaque, common marmoset, mouse, and rat).

What's new
-----------
* Data set for rat (rn6) is included
* Entrez Gene IDs are incorporated in the RefSeq gene models
* Gene models (TRaC) for crab-eating macaque and common marmoset are updated
* Gene-level expression tables are generated (based on promoter-level expression tables)

Data use / embargo
-------------------
We release our original data sets under [CC-BY](https://creativecommons.org/licenses/by/4.0/) to support the development of oligonucleotide therapeutics. We encourage anyone (for example, in academic institutes, commercial companies, regulatory agencies, or elsewhere) to use the data sets freely for the development or assessment of drugs.

Meanwhile, we ask users to respect the embargo on publication of genome-wide analyses based on this data set, as we are currently preparing manuscripts that provide genome-scale analyses of the non-human primates, with a complete description of the experimental and computational details and the raw data. The embargo does not apply to analyses of a few loci, gene families, or oligonucleotide sequences, as opposed to comprehensive large-scale analyses.


Data source
------------
* Human
    - Genome assembly: GRCh38/hg38.p12
    - Gene models
        - RefSeq: 109.20190905 (2019-09-10)
        - Gencode: V34
* Crab-eating macaque (cynomolgus macaque)
    - Genome assembly: macFasRKS1912 (GCA_012559485.2)
    - Gene models: TRaC 21.01
        - built from 826 CAGE profiles and 66 RNA-seq profiles (data from PRJDB9546 and others)
        - annotated according to ENSEMBL release 95 and Gencode V34
    - Expression: CAGE profiles obtained from 285 adult samples
* Common marmoset
    - Genome assembly: calJacRKC1912 (GCA_013373975.1)
    - Gene models: TRaC 21.01
        - built from 467 CAGE profiles and 18 RNA-seq profiles (data from PRJDB9547 and others)
        - annotated according to ENSEMBL release 91 and Gencode V34
    - Expression: CAGE profiles obtained from 258 adult samples
* Mouse
    - Genome assembly: GRCm38/mm10
    - Gene models
        - RefSeq: GCF_000001635.25_GRCm38.p5 (2017-08-04)
        - Gencode: VM25
* Rat
    - Genome assembly: RGSC Rnor_6.0/rn6
    - Gene models
        - RefSeq: Rattus norvegicus Annotation Release 106 (2019-10-28)

Data sets for the non-human primates are based on our own experiments and computational processing, except for the mitochondrial chromosomes, which are derived from GenBank (KF305937.1 for crab-eating macaque and KM588314.1 for common marmoset). The genome assemblies are described in a preprint [^1]. Data sets for the other species (reference genome [^2] and gene models [^3] [^4] for human, mouse, and rat) were obtained from [the UCSC Genome Browser Database](https://hgdownload.soe.ucsc.edu/downloads.html) in January 2021.


[^1]: Jayakumar,V., Nishimura,O., Kadota,M. and Hirose,N. Chromosomal-scale De novo Genome Assemblies of Cynomolgus Macaque and Common Marmoset. bioRxiv 2020.12.04.411207; doi: https://doi.org/10.1101/2020.12.04.411207
[^2]: Church DM, Schneider VA, Graves T, et al. Modernizing reference genome assemblies. PLoS Biol. 2011;9(7):e1001091. doi:10.1371/journal.pbio.1001091
[^3]: O'Leary NA, Wright MW, Brister JR, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016;44(D1):D733-D745. doi:10.1093/nar/gkv1189
[^4]: Harrow J, Frankish A, Gonzalez JM, et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 2012;22(9):1760-1774. doi:10.1101/gr.135350.111


Data files
-----------

```
.
|-- calJacRKC1912
|   |-- expression
|   |   |-- calJacRKC1912CageRobustPeakExpCounts.matrix.txt.gz
|   |   |-- calJacRKC1912CageRobustPeakExpTpm.matrix.txt.gz
|   |   |-- calJacRKC1912Trac2101SymbolExpCounts.matrix.txt.gz
|   |   `-- calJacRKC1912Trac2101SymbolExpTpm.matrix.txt.gz
|   |-- gene
|   |   |-- calJacRKC1912Trac2101-prespliced.fa.gz
|   |   |-- calJacRKC1912Trac2101-spliced.fa.gz
|   |   |-- calJacRKC1912Trac2101.bed12.gz
|   |   |-- calJacRKC1912Trac2101ProteinCoding-prespliced.fa.gz
|   |   |-- calJacRKC1912Trac2101ProteinCoding-spliced.fa.gz
|   |   `-- calJacRKC1912Trac2101ProteinCoding.bed12.gz
|   `-- genome
|       `-- calJacRKC1912_chrM.fa.gz
|-- hg38
|   |-- gene
|   |   |-- hg38NcbiRefSeqCurated-prespliced.fa.gz
|   |   |-- hg38NcbiRefSeqCurated-spliced.fa.gz
|   |   |-- hg38NcbiRefSeqCurated.bed12.gz
|   |   |-- hg38NcbiRefSeqCuratedProteinCoding-prespliced.fa.gz
|   |   |-- hg38NcbiRefSeqCuratedProteinCoding-spliced.fa.gz
|   |   |-- hg38NcbiRefSeqCuratedProteinCoding.bed12.gz
|   |   |-- hg38NcbiRefSeqPredicted.bed12.gz
|   |   `-- hg38WgEncodeGencodeCompV34.bed12.gz
|   `-- genome
|       `-- hg38.p12.fa.gz
|-- macFasRKS1912
|   |-- expression
|   |   |-- macFasRKS1912CageRobustPeakExpCounts.matrix.txt.gz
|   |   |-- macFasRKS1912CageRobustPeakExpTpm.matrix.txt.gz
|   |   |-- macFasRKS1912Trac2101SymbolExpCounts.matrix.txt.gz
|   |   `-- macFasRKS1912Trac2101SymbolExpTpm.matrix.txt.gz
|   |-- gene
|   |   |-- macFasRKS1912Trac2101-prespliced.fa.gz
|   |   |-- macFasRKS1912Trac2101-spliced.fa.gz
|   |   |-- macFasRKS1912Trac2101.bed12.gz
|   |   |-- macFasRKS1912Trac2101ProteinCoding-prespliced.fa.gz
|   |   |-- macFasRKS1912Trac2101ProteinCoding-spliced.fa.gz
|   |   `-- macFasRKS1912Trac2101ProteinCoding.bed12.gz
|   `-- genome
|       `-- macFasRKS1912_chrM.fa.gz
|-- mm10
|   |-- gene
|   |   |-- ncbiRefSeqCurated-prespliced.fa.gz
|   |   |-- ncbiRefSeqCurated-spliced.fa.gz
|   |   |-- ncbiRefSeqCurated.bed12.gz
|   |   |-- ncbiRefSeqCuratedProteinCoding-prespliced.fa.gz
|   |   |-- ncbiRefSeqCuratedProteinCoding-spliced.fa.gz
|   |   |-- ncbiRefSeqCuratedProteinCoding.bed12.gz
|   |   |-- ncbiRefSeqPredicted.bed12.gz
|   |   `-- wgEncodeGencodeCompVM25.bed12.gz
|   `-- genome
|       `-- mm10.fa.gz
|-- rn6
|   |-- gene
|   |   |-- ncbiRefSeqCurated-prespliced.fa.gz
|   |   |-- ncbiRefSeqCurated-spliced.fa.gz
|   |   |-- ncbiRefSeqCurated.bed12.gz
|   |   |-- ncbiRefSeqCuratedProteinCoding-prespliced.fa.gz
|   |   |-- ncbiRefSeqCuratedProteinCoding-spliced.fa.gz
|   |   |-- ncbiRefSeqCuratedProteinCoding.bed12.gz
|   |   `-- ncbiRefSeqPredicted.bed12.gz
|   `-- genome
|       `-- rn6.fa.gz
`-- release_21.01.txt
```

Files with the `.fa.gz` suffix contain nucleotide sequences in FASTA format, files with the `.bed12.gz` suffix contain the exon-intron structure of gene models in [BED12 format (BED format with 12 columns)](https://genome.ucsc.edu/FAQ/FAQformat.html#format1), and files with the `.matrix.txt.gz` suffix contain tab-delimited text of expression intensities.
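
As an illustration of the BED12 layout, the following minimal sketch converts one BED12 line into genomic exon intervals. It relies only on the column definitions of the UCSC BED format; the function name `bed12_exons` and the example record are hypothetical and are not taken from the D3G files.

```python
from typing import List, Tuple


def bed12_exons(line: str) -> Tuple[str, str, str, List[Tuple[int, int]]]:
    """Return (chrom, name, strand, exons) for one BED12 line.

    Exons are 0-based, half-open (start, end) genomic intervals.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, chrom_start, name, strand = fields[0], int(fields[1]), fields[3], fields[5]
    block_count = int(fields[9])
    # blockSizes and blockStarts are comma-separated lists that may end with a comma.
    sizes = [int(x) for x in fields[10].split(",") if x][:block_count]
    starts = [int(x) for x in fields[11].split(",") if x][:block_count]
    # blockStarts are relative to chromStart, so shift them to genomic coordinates.
    exons = [(chrom_start + s, chrom_start + s + n) for s, n in zip(starts, sizes)]
    return chrom, name, strand, exons


# Hypothetical record with three exons (not a real gene model).
example = "\t".join([
    "chr1", "1000", "1900", "TX1|GENE1;1", "0", "+",
    "1000", "1900", "0", "3", "100,200,300,", "0,300,600,",
])
chrom, name, strand, exons = bed12_exons(example)
print(chrom, name, strand, exons)
```

The compressed files can be read line by line with `gzip.open(path, "rt")` from the Python standard library and passed to a function like this one.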

Nucleotide sequences for gene models are compiled from their genomic coordinates and the genome sequences, for a selected subset of the gene models only. Files containing `prespliced` in their names represent immature transcripts before splicing, equivalent to pre-mRNA for protein-coding transcripts (the term "pre-spliced" is used so that it also applies to long noncoding RNAs). Note that the protein-coding transcripts here include those encoded on both nuclear and mitochondrial DNA. Consequently, protein-coding transcripts compiled from RefSeq [^3] include `YP_` (for human) or `NP_` (for mouse) entries in addition to `NM_` ones.
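
The difference between the two kinds of sequence can be sketched as follows, assuming exon intervals in 0-based, half-open genomic coordinates. This is an illustration of the concept, not the pipeline used to build the files; the helper names and the toy sequence are hypothetical.

```python
from typing import List, Tuple


def revcomp(seq: str) -> str:
    """Reverse-complement a DNA sequence (A, C, G, T, N; case is preserved)."""
    table = str.maketrans("ACGTacgtNn", "TGCAtgcaNn")
    return seq.translate(table)[::-1]


def transcript_sequences(
    chrom_seq: str, exons: List[Tuple[int, int]], strand: str
) -> Tuple[str, str]:
    """Return (prespliced, spliced) sequences for one gene model.

    exons: 0-based, half-open genomic intervals sorted by start position.
    """
    tx_start, tx_end = exons[0][0], exons[-1][1]
    # Pre-spliced: the whole transcribed region, introns included.
    prespliced = chrom_seq[tx_start:tx_end]
    # Spliced: exons only, joined in genomic order.
    spliced = "".join(chrom_seq[s:e] for s, e in exons)
    if strand == "-":
        # Minus-strand transcripts are read from the opposite strand.
        prespliced, spliced = revcomp(prespliced), revcomp(spliced)
    return prespliced, spliced


# Toy chromosome: exon (0, 3), intron (3, 6), exon (6, 9).
toy = "AAACCCGGGTTT"
plus = transcript_sequences(toy, [(0, 3), (6, 9)], "+")
minus = transcript_sequences(toy, [(0, 3), (6, 9)], "-")
print(plus, minus)
```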

The identifier (ID) of a RefSeq gene model has the form `REFSEQ_TRANSCRIPT_ID|GENE_SYMBOL;ENTREZ_GENE_ID`, for example `NM_000454.4|SOD1;6647` in the `hg38NcbiRefSeqCurated.bed12.gz` file. In FASTA files of nucleotide sequences, other information such as genomic coordinates is appended after `|`.
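
A minimal sketch of splitting such an identifier into its parts is shown below. The class and function names are hypothetical. It also assumes that, in FASTA headers, the additional information follows a second `|` placed after the Entrez Gene ID; the exact content of that trailing field is not specified here, so it is returned unparsed.

```python
from typing import NamedTuple, Optional


class RefSeqId(NamedTuple):
    transcript_id: str
    gene_symbol: str
    entrez_gene_id: str
    extra: Optional[str]  # unparsed trailing field (FASTA headers only)


def parse_refseq_id(identifier: str) -> RefSeqId:
    """Split 'TRANSCRIPT_ID|GENE_SYMBOL;ENTREZ_GENE_ID[|EXTRA]' into its parts."""
    # Drop a leading '>' in case a raw FASTA header line is passed in.
    parts = identifier.strip().lstrip(">").split("|", 2)
    if len(parts) < 2:
        raise ValueError(f"missing '|' in identifier: {identifier!r}")
    # Split at the last ';' so that a symbol containing ';' is kept whole.
    gene_symbol, sep, entrez_gene_id = parts[1].rpartition(";")
    if not sep:
        raise ValueError(f"missing ';' in identifier: {identifier!r}")
    extra = parts[2] if len(parts) == 3 else None
    return RefSeqId(parts[0], gene_symbol, entrez_gene_id, extra)


parsed = parse_refseq_id("NM_000454.4|SOD1;6647")
print(parsed)
```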

In the expression data, files containing `ExpCounts` in their names hold counts of 5'-ends of CAGE reads aligned to the genome sequences, and files containing `ExpTpm` hold expression intensities in which the read counts are normalized to TPM, tags per million (equivalent to CPM, counts per million). Files whose names include `CageRobustPeak` contain promoter-level expression data (based on CAGE peaks), and those whose names include `Symbol` contain gene-level expression data. To produce the expression intensities per gene, CAGE peaks within 100 bp of TRaC 5'-ends are aggregated.
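
The two operations described above can be sketched as follows. This is an illustration under stated assumptions, not the processing actually used for the D3G tables: the per-sample total is taken as the sum over the rows of the given table, each peak is represented by a single position, the 100 bp window is treated as inclusive, and peaks are matched only to 5'-ends on the same chromosome and strand. All function names and data are hypothetical.

```python
from typing import Dict, List, Tuple

Site = Tuple[str, str, int]        # (chrom, strand, position)
Peak = Tuple[str, str, int, int]   # (chrom, strand, position, count)


def counts_to_tpm(counts: Dict[str, List[int]]) -> Dict[str, List[float]]:
    """Scale each sample (column) so that its values sum to one million."""
    if not counts:
        return {}
    n_samples = len(next(iter(counts.values())))
    totals = [sum(row[i] for row in counts.values()) for i in range(n_samples)]
    return {
        key: [c * 1e6 / t if t else 0.0 for c, t in zip(row, totals)]
        for key, row in counts.items()
    }


def aggregate_peaks(
    peaks: List[Peak], five_prime_ends: Dict[str, List[Site]], window: int = 100
) -> Dict[str, int]:
    """Sum the counts of peaks lying within `window` bp of any 5'-end of a gene."""
    gene_counts = {}
    for gene, ends in five_prime_ends.items():
        total = 0
        for chrom, strand, pos, count in peaks:
            # A peak is counted once per gene, even if it is near several 5'-ends.
            if any(c == chrom and s == strand and abs(pos - p) <= window
                   for c, s, p in ends):
                total += count
        gene_counts[gene] = total
    return gene_counts


# Hypothetical data: two peaks in two samples, then four peaks around one gene.
tpm = counts_to_tpm({"p1": [10, 0], "p2": [30, 5]})
gene_level = aggregate_peaks(
    [("chr1", "+", 1000, 10), ("chr1", "+", 1090, 5),
     ("chr1", "+", 1500, 7), ("chr1", "-", 1000, 3)],
    {"GENE1": [("chr1", "+", 1010)]},
)
print(tpm, gene_level)
```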


Experimental data and their computational processing
-----------------------------------------------------
We generated the following data to construct high-quality genomic references for the non-human primates (the raw data are being submitted to a public repository):

* DNA
    - PacBio SMRT Sequencing
    - Hi-C, based on iconHi-C protocol [^5]
* RNA
    - RNA-seq
    - CAGE, based on ssCAGE protocol [^6]

We performed _de novo_ genome assembly by constructing contigs from PacBio long reads, followed by Hi-C-based scaffolding [^1]. The resulting assemblies reached the chromosome level based solely on experimental data, without extrapolation from other species. We also constructed 5'-end complete gene models with a novel approach (TRaC, Transcript models based on RNA-seq and CAGE), and compiled promoter-level expression profiles from the CAGE data in the same way as in a previous study [^7].


[^5]: Kadota M, Nishimura O, Miura H, Tanaka K, Hiratani I, Kuraku S. Multifaceted Hi-C benchmarking: what makes a difference in chromosome-scale genome scaffolding?. Gigascience. 2020;9(1):giz158. doi:10.1093/gigascience/giz158
[^6]: Morioka MS, Kawaji H, Nishiyori-Sueki H, et al. Cap Analysis of Gene Expression (CAGE): A Quantitative and Genome-Wide Assay of Transcription Start Sites. Methods Mol Biol. 2020;2120:277-301. doi:10.1007/978-1-0716-0327-7_20
[^7]: FANTOM Consortium and the RIKEN PMI and CLST (DGT), Forrest AR, Kawaji H, et al. A promoter-level mammalian expression atlas. Nature. 2014;507(7493):462-470. doi:10.1038/nature13182


Leadership and contact
----------------------
The original data were generated through a collaboration among [RIKEN PMI](https://www.riken.jp/research/labs/pmi/), [RIKEN IMS](https://www.riken.jp/research/labs/ims/), [Shiga University of Medical Science](https://www.shiga-med.ac.jp/), [CIEA (Central Institute for Experimental Animals)](https://www.ciea.or.jp/), [Keio University](https://www.keio.ac.jp/), [RIKEN BDR](https://www.riken.jp/research/labs/bdr/), [DBCLS (Database Center for Life Science)](http://dbcls.rois.ac.jp/), [NIBIOHN (National Institutes of Biomedical Innovation, Health and Nutrition)](https://www.nibiohn.go.jp/), and [TMIMS (Tokyo Metropolitan Institute of Medical Science)](http://www.igakuken.or.jp/). Please contact us at the e-mail address below with any questions, comments, suggestions, or proposals for collaboration:

    d3g@ml.riken.jp


How to cite
------------
Please cite our database as follows:

    D3G: Database for Drug Development based on Genome and RNA sequences, https://d3g.riken.jp, 2020


Acknowledgement
----------------
We thank [AMED (Japan Agency for Medical Research and Development)](https://www.amed.go.jp/), [NIHS (National Institute of Health Sciences)](http://www.nihs.go.jp/), [JPMA (Japan Pharmaceutical Manufacturers Association)](http://www.jpma.or.jp/), and [the FANTOM consortium](https://fantom.gsc.riken.jp/) for helpful advice and fruitful discussion. The experiments and the database are financially supported by AMED under Grant Numbers JP17kk0305008 and JP20kk0305013.