NCI Computer Services

Data Sharing Under the Genomic Data Sharing (GDS) Policy

Data sharing allows data generated from one research study to be used to explore a range of additional research questions. Enabling the combination of data from multiple projects amplifies the scientific value of data.
NCI supports and complies with all NIH data sharing policies. The NIH Genomic Data Sharing (GDS) Policy was issued to:

  • promote broad and robust sharing of human and non-human data from a wide range of genomic research
  • ensure appropriate protections for research involving human data and oversight of research conduct, data quality, data management, data sharing, and data use

Share Genomic Data With NIH/NCI Repositories

Because the NCI intramural and extramural programs operate differently, the data submission process depends on whether you are an intramural investigator or an extramural grantee.

View the process pertinent to you:

Data Sharing Expectations

Data reuse is facilitated when the data conform to accepted GDS data sharing practices, which helps minimize errors that arise from misunderstanding the data or metadata. Those depositing data in GDS repositories are encouraged to use existing, well-documented data standards to help ensure the quality and usefulness of the submitted datasets and to make the submission process more efficient.

GDS data sharing practices

  • Terms for disease, cell type, tissue type, and other annotations should be linked to the NCI Thesaurus (NCIt).
  • If an NCIt identifier is not available, use another identifier, such as one from the Unified Medical Language System (UMLS) or a term from an existing ontology.
  • Wherever possible, use existing common data elements (CDEs). For clinical specimens, the same data elements reported to ClinicalTrials.gov are required.
  • Data should generally be submitted once it has been cleaned (e.g., the analytical dataset is finalized).
  • Data pertinent to the interpretation of genomic data, such as associated phenotype data (e.g., clinical information), exposure data, and descriptive information (e.g., protocols or methodologies used), should be shared. Metadata about the experiment or study and any annotations necessary to reproduce a published table or analysis must be included with genomic data submissions.
  • Descriptions of specimen acquisition, experimental procedures, and data processing and analysis methods (e.g., alignment algorithms and software versions) are required with data submission.
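The identifier preference described in the practices above (NCIt first, then UMLS, then a term from another existing ontology) can be sketched as a small selection routine. This is only an illustration: the `Annotation` class, its field names, and the placeholder identifiers are assumptions made for the example, not part of any NCI submission format or API.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Preference order taken from the practices above: NCIt, then UMLS.
# Any other ontology is used only as a fallback.
PREFERRED_SOURCES = ("NCIt", "UMLS")


@dataclass
class Annotation:
    """One annotated term. The class and its fields are hypothetical."""
    label: str                   # human-readable term, e.g. a tissue type
    identifiers: Dict[str, str]  # source name -> identifier in that source


def pick_identifier(annotation: Annotation) -> Optional[Tuple[str, str]]:
    """Return (source, identifier), preferring NCIt, then UMLS."""
    for source in PREFERRED_SOURCES:
        if source in annotation.identifiers:
            return source, annotation.identifiers[source]
    # Fall back to any other ontology supplied (alphabetical for determinism).
    others = sorted(annotation.identifiers.items())
    return others[0] if others else None  # None means: flag for curation


# Placeholder identifiers only; these are not real NCIt or UMLS codes.
term = Annotation("example tissue type",
                  {"UMLS": "UMLS-PLACEHOLDER", "NCIt": "NCIT-PLACEHOLDER"})
print(pick_identifier(term))
```

A term that carries no identifier at all returns `None`, which a submitter could treat as a signal to look up or request a term before depositing the data.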

Examples of Data Submission Formats

Different data types undergo different levels of data processing, which determine expectations for data submission and data release. Please work with your program officer to determine specific data submission requirements as they may differ based on individual program and data type. 
The Office of Science Policy provides the following guidance by level of genomic data:

  • Level 0: Raw data generated directly from the instrument platform
  • Level 1: Initial sequence reads, the most fundamental form of the data after the basic translation of raw input
  • Level 2: Data after an initial round of analysis or computation to clean the data and assess basic quality measures
  • Level 3: Analysis to identify genetic variants, gene expression patterns, or other features of the dataset
  • Level 4: Final analysis that relates the genomic data to phenotype or other biological states
  • Metadata: Information around the experiment or study

Table 1 describes examples for each level. NIH will review these expectations at regular intervals, and will publish updates on the GDS website and notify the research community through appropriate communication methods (e.g., NIH Guide for Grants and Contracts).
Note that information necessary to interpret controlled-access genomic data, such as study protocols, data instruments, and survey tools, should be submitted for unrestricted-access sharing at the same time as the relevant Level 1, 2, 3, or 4 genomic data.

Table 1. Example file formats by data type and level

SNP array data from > 500K single nucleotide polymorphisms (SNPs) (e.g., GWAS data)

  • Level 1: .CEL, .TXT, .IDAT (note: submission of .IDAT files for human sample data will be decided on a case-by-case basis)
  • Level 2: N/A
  • Level 3: .TXT
  • Level 4: .TXT

DNA sequence data from < 100 genes or regions of interest (e.g., targeted sequencing)

  • Level 1: N/A
  • Level 2: .BAM
  • Level 3: arrays: .TXT; NGS: .MAF, .VCF, .PED
  • Level 4: .TXT

DNA sequence data from ≥ 100 genes, regions of interest (e.g., targeted sequencing, whole exome sequencing, whole genome sequencing)

  • Level 1: N/A
  • Level 2: .BAM
  • Level 3: arrays: .TXT; NGS: .MAF, .VCF, .PED
  • Level 4: .TXT

RNA sequencing (RNA-seq) data (e.g., transcriptomic and targeting RNAseq data)

  • Level 1: .FASTQ, .SFF, .HDFS, Complete genomics native (note: required for human sample data only)
  • Level 2: N/A
  • Level 3: arrays: .TXT; NGS: .WIG, .TXT
  • Level 4: .TXT

Genome-wide DNA methylation data (e.g., bisulfite sequencing data)

  • Level 1: N/A
  • Level 2: .BAM
  • Level 3: arrays: .TXT; NGS: .MAF, .VCF, .TXT, .BED
  • Level 4: none listed

Genome-wide chromatin immunoprecipitation sequencing (ChIP-seq) data (e.g., transcription factor ChIP-seq, histone modification ChIP-seq)

  • Level 1: N/A
  • Level 2: .BAM
  • Level 3: arrays: .TXT; NGS: .WIG, .TXT, .BED
  • Level 4: .TXT

Metagenome (or microbiome) sequencing data (e.g., 16S rRNA sequencing, shotgun metagenomics, whole-genome microbial sequencing)

  • Level 1: N/A
  • Level 2: .BAM
  • Level 3: NGS: .WIG, .TXT
  • Level 4: .TXT

Metatranscriptome sequencing data (e.g., microbial/microbiome transcriptomics)

  • Level 1: N/A
  • Level 2: .BAM
  • Level 3: NGS: .WIG, .TXT
  • Level 4: .TXT

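
As a rough illustration of how the Table 1 examples might be used when organizing files before submission, the sketch below maps a file extension to the data levels at which that format appears for a given data type. The dictionary keys (`snp_array`, `dna_seq`, `chip_seq`) and the function name are shorthand invented for this example, only a subset of the table is transcribed, and the actual requirements should still be confirmed with your program officer.

```python
import os
from typing import Dict, List, Set

# Subset of Table 1 as read from this page. Keys are hypothetical shorthand
# labels, not official data type names. An empty set stands for "N/A".
TABLE_1_EXAMPLES: Dict[str, Dict[int, Set[str]]] = {
    "snp_array": {1: {".CEL", ".TXT", ".IDAT"}, 2: set(),
                  3: {".TXT"}, 4: {".TXT"}},
    "dna_seq":   {1: set(), 2: {".BAM"},
                  3: {".TXT", ".MAF", ".VCF", ".PED"}, 4: {".TXT"}},
    "chip_seq":  {1: set(), 2: {".BAM"},
                  3: {".TXT", ".WIG", ".BED"}, 4: {".TXT"}},
}


def levels_for(data_type: str, filename: str) -> List[int]:
    """Return the levels whose example formats include this file's extension."""
    extension = os.path.splitext(filename)[1].upper()
    formats_by_level = TABLE_1_EXAMPLES[data_type]
    return sorted(level for level, formats in formats_by_level.items()
                  if extension in formats)


print(levels_for("dna_seq", "tumor_sample.bam"))   # Level 2 alignment file
print(levels_for("snp_array", "genotypes.txt"))    # .TXT appears at several levels
```

Because .TXT appears at more than one level, the extension alone cannot determine a file's level; the lookup only narrows the candidates. Compressed files such as `calls.vcf.gz` would need the compression suffix stripped first, which this sketch does not attempt.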
Metadata

Study metadata and annotations necessary to reproduce any published table or analysis must be included with genomic data submissions. In particular, data pertinent to the interpretation of genomic data are expected to be shared, such as:

  • associated phenotype data (e.g., clinical information)
  • exposure data, relevant metadata
  • descriptive information (e.g., protocols or methodologies used)
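
A submitter might run a simple completeness check against the categories listed above before depositing data. The sketch below is a minimal illustration under stated assumptions: the category keys and the dictionary layout are invented for the example and do not correspond to any repository's submission schema.

```python
from typing import Dict, List

# Categories named in the list above; the key names are hypothetical.
EXPECTED_CATEGORIES = ("phenotype", "exposure", "descriptive")


def missing_categories(metadata: Dict[str, List[str]]) -> List[str]:
    """Return the expected categories that have no accompanying files."""
    return [category for category in EXPECTED_CATEGORIES
            if not metadata.get(category)]


# Example: phenotype data is present, exposure data was never listed,
# and the descriptive entry is empty.
draft = {"phenotype": ["clinical_information.tsv"], "descriptive": []}
print(missing_categories(draft))
```

An empty result means every expected category has at least one file; it says nothing about whether the contents are sufficient to reproduce a published analysis.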

NIH/NCI Genomic Data Repositories Help Resources

Investigators may use the following resources to submit datasets to National Institutes of Health (NIH), National Cancer Institute (NCI), and National Center for Biotechnology Information (NCBI) data repositories. For additional questions about data sharing, please contact the NCI Office of Data Sharing (NCIOfficeofDataSharing [at] mail.nih.gov).

Repositories and their help resources:

  • NIH Database of Genotypes and Phenotypes (dbGaP): dbGaP Contact Form
  • NCI Genomic Data Commons (GDC): GDC Help Desk (support [at] nci-gdc.datacommons.io)
  • NCBI Sequence Read Archive (SRA): SRA Help Desk (sra [at] ncbi.nlm.nih.gov)
  • Other NCBI Data Repositories: National Library of Medicine Help Desk

 

Key Documents For Genomic Data Sharing

Data Sharing Plans (DSPs)

Before GDS policy-covered research begins, all investigators must develop and have in place an approved Data Sharing Plan (DSP). NCI expects that DSPs will be collected and reviewed during the research planning process. NCI staff will assess whether the project falls within the scope of the GDS policy and, if so, whether the DSP is adequate based on the NIH Guidance for Investigators in Developing Data Sharing Plans.

Extramural Programs:

Extramural investigators submit their DSP as part of their funding application. DSP requirements should be discussed as early in the research planning process as possible. The approved DSP should be submitted at Just-in-Time (JIT), along with the Institutional Certification. Program Officers must approve the DSP prior to funding.

Intramural Programs:

Intramural investigators submit their DSP in accordance with scientific review. Differences in study type (e.g., studies involving model organisms) and how scientific review takes place within the NCI intramural research programs will dictate when the DSP can be reviewed.

  • Prospective Scientific Review: The DSP should be submitted to, and reviewed by, the scientific director (SD), or delegate, and genomic program administrator (GPA) at the time the funding decision is made.
  • Retrospective Scientific Review (e.g., quadrennial site visits): The DSP should be submitted to, and reviewed by, the SD (or delegate) prior to data generation.
     

Institutional Certifications

The Institutional Certification assures that projects planning to submit genomic data to NIH will meet the expectations of the GDS policy. The certification, provided by the principal investigator and the institutional signing official (SO) of the submitting institution, clearly delineates any “data use limitations (DULs)” on the research use of the data, as agreed to in the informed consent documents signed by study participants.
For multicenter studies (with samples collected at several institutions), NIH understands that the submitting institution is not necessarily the local institution or IRB of record for all sites. However, the submitting institution should assure NIH that it believes, based on either its own review or assurance from other institutions, that the expectations of the policy are met for the entire dataset. Institutions may choose to collect and submit a single-site certification from each site contributing samples or submit a multi-site certification. The Institutional Certifications for both intramural and extramural studies can be found on the GDS website.
An Institutional Certification should be submitted at the earliest possible point in time. The certification should be provided to NCI prior to award, along with any other JIT Information (for extramural researchers) or at the time of scientific review (for intramural researchers).
 

Basic Study Information

The Basic Study Information for intramural, extramural, and non-NIH-funded investigators should be submitted once data generation and cleaning are complete.