NCI Biomedical Informatics Blog
NCI Cancer Research Data Commons
Basic and clinical cancer research is increasingly focused on identifying the molecular basis for disease and matching targeted therapies that factor in each patient’s unique biology. This approach, known as Precision Medicine, received national attention when President Obama announced the Precision Medicine Initiative (PMI) in 2015. To progress towards this goal, the cancer research community will need to access, integrate, and analyze many different types of data, including genomics, proteomics, microbiomics, metabolomics, clinical research and outcomes, multi-resolution, multi-modality imaging data, population-based data, and data contributed by health care providers and patients themselves. Investment in the informatics and infrastructure to fully leverage these diverse data types is imperative.
In 2016, the Beau Biden Cancer Moonshot Blue Ribbon Panel Enhanced Data Sharing Working Group recommended creating the data science infrastructure needed to connect repositories, analytical tools, and knowledge bases, and to allow data to be aggregated, queried, analyzed, and visualized within and across data types. An integrated NCI Cancer Research Data Commons (CRDC) is one element of this infrastructure. The vision for the CRDC is a virtual, expandable infrastructure that provides secure access to diverse data types and allows users to analyze, share, and store results by leveraging the storage and elastic compute of the cloud.
NCI has created components — the Genomic Data Commons (GDC) and the NCI Cloud Resources — that define some of the core capabilities necessary for realizing a CRDC. Building on these components and the experience NCI gained in developing them, NCI is initiating several activities to create the foundational elements of a CRDC.
NCI plans to develop a reusable, expandable framework that will define the core components of the CRDC. This environment will give users workspaces in which to analyze, share, and view data, as well as a platform for bringing their own tools to the data for processing, analysis, and visualization.
The Data Commons Framework will provide components required to stand up and maintain a CRDC node, including:
- secure user authentication and authorization
- metadata validation tools
- an approach for development of consistent, domain-specific data models
- an API and container environment for tools and pipelines
- access to elastic compute resources
- workspaces for storing data, tools, and results and for collaboration among researchers.
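As an illustration of how two of the components listed above (metadata validation and a domain-specific data model) might fit together, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `FieldSpec` class, the `IMAGING_MODEL` fields, and the `validate_record` function are hypothetical and do not describe the actual Data Commons Framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """One field in a hypothetical domain-specific data model."""
    name: str
    required: bool = True
    allowed_values: frozenset = frozenset()  # empty means any value is accepted


# Hypothetical data model for an imaging node; field names are illustrative only.
IMAGING_MODEL = (
    FieldSpec("case_id"),
    FieldSpec("modality", allowed_values=frozenset({"CT", "MR", "PET"})),
    FieldSpec("body_site", required=False),
)


def validate_record(record, model=IMAGING_MODEL):
    """Return a list of problems found in a submitted metadata record.

    An empty list means the record conforms to the model.
    """
    problems = []
    known = {spec.name for spec in model}
    for spec in model:
        value = record.get(spec.name)
        if value is None:
            if spec.required:
                problems.append(f"missing required field: {spec.name}")
            continue
        if spec.allowed_values and value not in spec.allowed_values:
            problems.append(f"invalid value for {spec.name}: {value!r}")
    # Fields the model does not define are reported rather than silently kept.
    for name in sorted(set(record) - known):
        problems.append(f"unknown field: {name}")
    return problems


if __name__ == "__main__":
    print(validate_record({"case_id": "C1", "modality": "CT"}))
    print(validate_record({"modality": "XRAY", "extra": 1}))
```

In this sketch, each node would supply its own model while sharing the same validation routine, which is one way a reusable framework could support consistent, domain-specific data models.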
Each research domain will have its own CRDC "node," a branch of the CRDC where related, harmonized data are brought together with infrastructure for security and elastic compute capability. Each CRDC node will have a data-specific submission and curation process, determined by domain experts, that harmonizes the data and applies the standard metadata necessary for sharing and analysis. Data will be mirrored in commercial clouds for redundancy and stability. Because each node will be centered on a scientific domain, that domain's community will determine the appropriate analytic and visualization tools and will have opportunities to innovate. Support for each node can be tailored to the needs of the researchers who will be accessing and analyzing the data.
NCI is working with the community to stand up several more CRDC nodes in addition to the GDC, which was launched in 2016. Efforts are underway to create CRDC nodes for Imaging and Proteomics.
The NCI Cancer Research Data Commons is envisioned as containing multiple nodes, with researchers, tool developers, clinicians, and patients contributing and accessing tools and data.
Q: What CRDC nodes are available now?
A: Currently, the Genomic Data Commons is the only node that is available. Others are under development.
Q: What CRDC nodes are under development?
A: Work has begun on nodes for multi-modal imaging and proteomics.
Q: How do I learn more about the status of the development of these new CRDC nodes?
A: Information about the new CRDC nodes will be published on this website, as well as publicized through our social media channels. Additionally, NCI is planning outreach and workshops to gather input from the community to help guide the direction and priorities of the CRDC.
Q: How much will it cost for programs to store their data in a CRDC node?
A: Currently, submitting data for inclusion in the GDC is without cost to researchers. It is anticipated that the same will be true for data submitted to future nodes.
Q: Will the data be public access?
A: Yes, although some of the data will be controlled access and will require approval, depending on the Data Use Agreements in place and on whether the node contains de-identified individual-level genotype and phenotype data. Each node will provide information about the data it hosts, such as descriptions and metadata, along with instructions on how to gain access to it.
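The distinction between public and controlled access can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `AccessTier` and `Dataset` types, the `can_access` function, and the idea of tracking approvals as a set of study identifiers are all hypothetical and are not taken from any CRDC node.

```python
from dataclasses import dataclass
from enum import Enum


class AccessTier(Enum):
    OPEN = "open"
    CONTROLLED = "controlled"


@dataclass(frozen=True)
class Dataset:
    name: str
    tier: AccessTier
    study_id: str = ""  # hypothetical identifier of the study governing access


def can_access(dataset, approved_studies):
    """Decide whether a user may read a dataset.

    Open data is readable by anyone; controlled data requires that the
    user has been approved for the governing study.
    """
    if dataset.tier is AccessTier.OPEN:
        return True
    return dataset.study_id in approved_studies


if __name__ == "__main__":
    summary = Dataset("case summaries", AccessTier.OPEN)
    reads = Dataset("aligned reads", AccessTier.CONTROLLED, study_id="study-001")
    print(can_access(summary, set()))          # open data needs no approval
    print(can_access(reads, set()))            # controlled data, no approval yet
    print(can_access(reads, {"study-001"}))    # controlled data, approved
```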
Q: How can I gain access to controlled-access genomic data?
A: If you are an NIH researcher or already have an eRA Commons account, visit https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi to request dbGaP access for yourself or for members of your team who will be working with controlled-access data. If you are not an NIH researcher and do not have an eRA Commons account, first visit the eRA Commons website and complete the registration form. Once you have an eRA Commons account, you can request dbGaP access at the same address. If you have any questions during the application process, please contact the dbGaP Help Desk.
Data Commons Framework
Q: What are the components of a Data Commons Framework?
A: The Data Commons Framework components are: secure user authentication and authorization; metadata validation tools; an approach for the development of consistent, domain-specific data models; an API and container environment for tools and pipelines; access to elastic compute resources; and workspaces for storing data, tools, and results and for collaboration among researchers.
Q: Who is creating the Data Commons nodes?
A: In the near term, we anticipate that CRDC nodes will be deployed and managed by NCI. However, the long-term vision is that groups outside the NCI will be able to add nodes to the CRDC.
NCI Cloud Resources
Q: How do the NCI Cloud Resources relate to the NCI CRDC?
A: The NCI Cloud Resources — platforms that provide researchers with access to genomic and other data along with the elastic compute of the cloud — were developed in parallel with the GDC. The Cloud Resources are a component of the CRDC, providing the tools, pipelines, and compute capability for use with the data stored in the nodes.
Links for more information on NCI initiatives:
- NCI Cloud Resources
- Genomic Data Commons
- NCIP blog post: Towards a Cancer Research Data Commons, by Allen Dearry, Ph.D.