Author: Matillion
Date Posted: Aug 14, 2024
Last Modified: Nov 29, 2024
TCGA Genomic Data Commons ETL pipelines
Prepare raw data to analyze 60,000 gene expressions against clinical diagnoses with this set of Data Productivity Cloud pipelines.
The Cancer Genome Atlas (TCGA) is a collaboration between the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI) that has characterized tumor and normal tissues from 11,000 patients, covering 33 cancer types. The Genomic Data Commons receives, processes, and distributes genomic, clinical, and biospecimen data from cancer research programs.
TCGA Data Model
The data model for this analysis is based on anonymized Cases, with associated genome data from tissue samples available for download in Files.
Every File contains the Gene Expression Quantification data for more than 60,000 individual genes. This data provides a detailed view of gene characteristics and their expression levels across different normalization and sequencing strategies, which is crucial for understanding gene function and regulation within genomic studies.
The data fields are as follows:
- gene_id - A unique identifier assigned to a gene. It is typically provided by a comprehensive database such as Ensembl or NCBI, e.g. ENSG00000139618
- gene_name - The common name or symbol of the gene. This is often a short-hand abbreviation recognized by the scientific community.
- gene_type - The category or classification of the gene based on its function or other characteristics. It indicates whether the gene is a protein-coding gene, a non-coding RNA gene, etc.
- unstranded - Refers to gene expression data obtained without distinction between the sense and antisense strands of genetic material. The values here represent raw gene counts from unstranded RNA-seq. A numerical count indicating the number of reads mapped to a gene.
- stranded_first - Indicates gene expression levels for data obtained with stranded RNA-seq, where reads mapping to the first strand are distinguished.
- stranded_second - Refers to gene expression levels for data captured with stranded RNA-seq, where reads mapping to the second strand (the complementary strand) are distinguished.
- tpm_unstranded - Transcripts Per Million (TPM) calculated from unstranded RNA-seq data. TPM is a normalization method that accounts for both the length of the gene/transcript and sequencing depth. These normalized values indicate the abundance of the transcript.
- fpkm_unstranded - Fragments Per Kilobase of transcript per Million mapped reads (FPKM) derived from unstranded RNA-seq data. This is another normalization method similar to TPM, taking into account gene length and sequencing depth.
- fpkm_uq_unstranded - Upper Quartile normalized FPKM (FPKM-UQ) from unstranded RNA-seq data. This normalization method adjusts for sequencing depth and ensures consistency among samples by normalization to the upper quartile of the data.
Data Architecture
The Data Productivity Cloud pipelines extract, transform and integrate the data using a Medallion architecture, in which the tables and views are organized into three logical layers:
- Bronze - original source data
- Silver - structured and integrated data
- Gold - data optimized for external consumption
How to run the pipelines
Import the TCGA Genomic Data Commons Data Portal connector from the Matillion Exchange.
The pipelines use three project-level variables that you must create next, with default values appropriate to your configuration:
- A Text project variable named
s3_staging_bucket
. Set its default value to the name of one of your S3 buckets that has been set up as a writeable External Location. Do not include the s3:// nor any substructure: just the bucket name. - A Text project variable named
default_catalog
, and set it to the Default Catalog of your Matillion Environment. - A Text project variable named
default_schema
, and set it to the Default Schema of your Matillion Environment.
Bronze layer
Start with the TCGA GDC - case extract
orchestration. This is a paged “Search and Retrieval” request that extracts case data in pages of 1000 records at a time. Expect 40 or 50 rows to be created in the stg_cases
table.
Open the TCGA GDC - case tx bronze
transformation pipeline. Note how the structure in the Explode component is derived from Databricks itself. Run the pipeline and note the rowcount (44637 at the time of writing).
Run the TCGA GDC - files extract
orchestration pipeline. It just loads a single record (comprising a large JSON array) into stg_expressions_info
.
Run the TCGA GDC - files tx bronze
transformation pipeline. At the time of writing it creates 11275 records in bronze_expressions_info
.
The “Explode” and “Relationalize” components flatten and convert the JSON data into records and columns.
Run the TCGA GDC - expressions extract
orchestration pipeline. This downloads some of the expression profile datafiles, and loads them all into bronze_gdc_file. There are a lot of files to download, so this pipeline does not attempt to download them all at once, which would place a sudden high load on the public GDC API. Repeat until no new files have been downloaded. This might take some time to finish! The pipeline is idempotent in case of temporary failures.
Silver layer
Run the TCGA GDC - case expressions
transformation pipeline, to create the silver_case_expressions
table, containing all the gene expression data by case.
To finish the case level tables in the silver layer, run the TCGA GDC - diagnoses demogs exposures
to create:
silver_cases_demographics
- case level data focussing on the demographicssilver_cases_diagnoses
- ditto, for the diagnosessilver_cases_exposures
- ditto, for the exposure history
The silver layer typically contains data in its most normalized representation, meaning that it acts as a source of truth. But the silver layer data structures can be hard to read. They usually benefit from further processing to make the data most consumable.
The gene expression data is a good example of this. The normalized representation is in “narrow” format, with more than 60,000 records per case: one per gene expression.
Now that the silver layer tables have been created, it’s time to move on to the gold layer.
Gold layer
Run TCGA GDC - top 20 diagnoses
to create gold_top_20_diagnosis. This is a summary of the most common diagnoses in the data set.
Run TCGA GDC - gene variance pivot
to create the view silver_cx_pivot
and gold_cx_pivot
tables. These contain the fpkm_unstranded
values of the top 100 gene ids with the most variance. These can be used with dimension reduction for clustering. You can vary the number of genes selected by altering the value in the Query Result To Scalar
component in this pipeline.
Data in the gold_cx_pivot
table is in “wide” format, with one column for each of the 100 most variable genes.
The wide format data is ready for statistical analysis and machine learning techniques such as dimensionality reduction and clustering, to help us understand which gene expressions are most associated with which cancers.
Downloads
Licensed under: Matillion Free Subscription License
- Download TCGA-GDC_Databricks_AWS.zip
- Platform: AWS
- Target: Databricks