NCI Cloud Resources
The Cloud Resources bring together data and computation in a cloud-based environment to enable cancer research and discovery
The NCI Cloud Resources, launched as a pilot program in 2016, give cancer researchers access to genomic data co-located with elastic compute and a variety of analytic tools and pipelines in a cloud environment. Many in the cancer research community have come to rely on the Cloud Resources built by the Broad Institute, the Institute for Systems Biology (ISB), and Seven Bridges (SB) for their cancer research.
What do the Cloud Resources provide?
The Cloud Resources provide secure access to data from The Cancer Genome Atlas (TCGA) and Therapeutically Applicable Research to Generate Effective Treatments (TARGET). Within the next year, these data will be accessed through a cloud-based version of the Genomic Data Commons (GDC), synchronized with the GDC instance at the University of Chicago. Controlled-access data will remain available to dbGaP-authorized users.
All three Cloud Resources provide support for data access through a web UI and API, access to analytic tools and workflows, and the capability of sharing results with collaborators. Each Cloud Resource is also developing new functionality to improve the user experience and add new tools for researchers.
The Cloud Resources will continue efforts to provide analytic support for data types beyond genomics, including radiologic images, digital pathology slides, and proteomic data. In the past year, some of these additional data types have been incorporated into the Cloud Resources, for example, images from The Cancer Imaging Archive (TCIA) and data from the Clinical Proteomic Tumor Analysis Consortium (CPTAC).
Read the blog post by Dr. Tony Kerlavage about how the CGC Pilots (now the NCI Cloud Resources) and the GDC fit into the National Cancer Data Ecosystem.
Access the Cloud Resources
You can register to use one or more of the Cloud Resources using the instructions below.
This document also has detailed information and links to help you get started.
You can also download the Cloud Resources Overview document for more information.
Broad Institute
The Broad Institute’s FireCloud democratizes access to TCGA data and facilitates collaboration by providing a robust, scalable platform accessible to the community at large. Using the elastic compute capacity of Google Cloud, FireCloud empowers analysts, tool developers and production managers to perform large-scale analysis, engage in data curation, and store or publish results. Users can upload their own analysis methods and data to workspaces or run the Broad Institute’s best practice tools and pipelines on pre-loaded data.
Institute for Systems Biology
The ISB Cancer Genomics Cloud (ISB-CGC) is a cloud-based platform that provides interactive and programmatic access to TCGA data, leveraging many aspects of the Google Cloud Platform (GCP). The ISB-CGC web app allows scientists to interactively define and compare cohorts, examine the underlying molecular data for specific genes or pathways of interest, and share insights with collaborators. For computational users, programmatic interfaces and GCP tools such as BigQuery, Genomics, and Compute Engine make it possible to perform complex queries from R or Python scripts, or to run Dockerized workflows on sequence data available in Cloud Storage.
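As a rough illustration of what a programmatic query from a Python script might look like, the sketch below builds a parameterized SQL statement and, optionally, runs it with the `google-cloud-bigquery` client library. The table name (`my-project.my_dataset.clinical`) and the column names (`case_barcode`, `project_short_name`, `vital_status`) are assumptions made for this example, not the documented ISB-CGC schema; substitute the tables and fields you actually have access to.

```python
from typing import Any, Dict, List

# Hypothetical table name, used only for illustration.
CLINICAL_TABLE = "my-project.my_dataset.clinical"


def build_cohort_query(table: str = CLINICAL_TABLE) -> str:
    """Build SQL that counts distinct cases per vital status for one project.

    The column names are assumptions; the project is passed as the
    named query parameter @project rather than pasted into the SQL.
    """
    return (
        "SELECT vital_status, COUNT(DISTINCT case_barcode) AS n_cases\n"
        f"FROM `{table}`\n"
        "WHERE project_short_name = @project\n"
        "GROUP BY vital_status\n"
        "ORDER BY n_cases DESC"
    )


def run_cohort_query(
    project_short_name: str,
    billing_project: str,
    table: str = CLINICAL_TABLE,
) -> List[Dict[str, Any]]:
    """Run the query in BigQuery and return the rows as plain dictionaries.

    Requires the google-cloud-bigquery package and Google Cloud credentials.
    """
    # Imported lazily so build_cohort_query works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client(project=billing_project)
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("project", "STRING", project_short_name)
        ]
    )
    rows = client.query(build_cohort_query(table), job_config=job_config).result()
    return [dict(row.items()) for row in rows]
```

Calling `run_cohort_query("TCGA-BRCA", "my-billing-project")` would bill the query to the named Google Cloud project; both argument values here are placeholders.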
Seven Bridges
The Seven Bridges Cloud is a platform that enables researchers to collaborate on the analysis of large cancer genomics datasets in a secure, reproducible and scalable manner. A rich query system allows researchers to find exactly the data they are interested in and combine it with their own private data. Native implementation of the Common Workflow Language specification makes it easy for developers, analysts and bench biologists to deploy, customize and run reproducible analysis methods to learn from genomics data faster.
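To give a sense of what a Common Workflow Language (CWL) tool description contains, here is a minimal sketch that assembles one as a Python dictionary and serializes it to JSON, which CWL accepts alongside YAML. The tool itself (counting lines in a file with `wc -l`) and the names `input_file`, `line_count`, and `line_count.txt` are invented for illustration and are not taken from any Seven Bridges pipeline.

```python
import json
from typing import Any, Dict


def make_line_count_tool() -> Dict[str, Any]:
    """Return a minimal CWL v1.0 CommandLineTool description as a dictionary.

    The tool wraps `wc -l`, takes one input file, and captures standard
    output as its result. All names are hypothetical.
    """
    return {
        "cwlVersion": "v1.0",
        "class": "CommandLineTool",
        "baseCommand": ["wc", "-l"],
        "inputs": {
            # The file is passed as the first positional argument.
            "input_file": {"type": "File", "inputBinding": {"position": 1}},
        },
        # Standard output is redirected to this file and exposed as an output.
        "stdout": "line_count.txt",
        "outputs": {"line_count": {"type": "stdout"}},
    }


if __name__ == "__main__":
    # Print the description; saved to a file, it could be handed to a CWL runner.
    print(json.dumps(make_line_count_tool(), indent=2))
```

Because the description is declarative, the same file can in principle be run unchanged by any CWL-compliant engine, which is the reproducibility benefit the paragraph above refers to.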
Anyone interested in working with controlled-access data on any of the cloud platforms will need dbGaP access. All the members of your lab team must have such access, or be authorized downloaders.
What data are available in the Cloud Resources?
The Cloud Resources will have all the data currently stored in the Genomic Data Commons (GDC). Data from The Cancer Imaging Archive (TCIA) and the Clinical Proteomic Tumor Analysis Consortium (CPTAC) program are also available through the Cloud Resources. Users may also upload their own data to the Cloud Resources and may choose to share their data with collaborators or keep their data private.
How do I apply for access to controlled TCGA data?
If you are an NIH researcher or already have an eRA Commons account, visit https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi to request dbGaP access for yourself or for members of your team who will be working with controlled-access data.
If you are not an NIH researcher and do not have an eRA Commons account, you will first need to register for one: visit the eRA Commons website and complete the registration form. Once you have an eRA Commons account, you can visit https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi to request dbGaP access.
If you have any questions during the application process, please contact the dbGaP Help Desk (support [at] nci-gdc.datacommons.io).
What is the Genomic Data Commons (GDC), and how does this relate to the Cancer Genomics Cloud Resources?
The National Cancer Institute has established the NCI Genomic Data Commons to store, analyze and distribute cancer genomics data generated by NCI and other research organizations. The GDC provides an interactive system for researchers to access data, with the goal of advancing the molecular diagnosis of cancer and suggesting potential therapeutic targets based on genomic information. The GDC contains all the data currently stored in The Cancer Genome Atlas (TCGA), as well as other genomic data.
It is important to keep in mind that TCGA data hosted on the GDC and on the Cloud Resources may not currently be completely synchronized. This is because each platform downloads the data at different times, and because the GDC hosts a broader set of data than the Cloud Resources (e.g., archived data). This issue will be addressed in the future, as the Cloud Resources switch from hosting their own set of data to accessing the data maintained by the GDC in a commercial cloud.
Another way to think about the GDC and the Cloud Resources is as components of the NCI Cancer Research Data Commons (CRDC). The GDC is a Data Node of the CRDC that stores genomic data, and the Cloud Resources are portals into the data, providing tools, pipelines and compute capability to act upon that data.