Data Productivity Cloud Pipeline

Author: Matillion
Date Posted: Aug 14, 2024
Last Modified: Nov 29, 2024

TCGA Genomic Data Commons ETL pipelines

Prepare raw data to analyze 60,000 gene expressions against clinical diagnoses with this set of Data Productivity Cloud pipelines.

The Cancer Genome Atlas (TCGA) is a collaboration between the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI) that has characterized tumor and normal tissues from 11,000 patients, covering 33 cancer types. The Genomic Data Commons receives, processes, and distributes genomic, clinical, and biospecimen data from cancer research programs.

TCGA Data Model

The data model for this analysis is based on anonymized Cases, with associated genome data from tissue samples available for download in Files.

Image ofTCGA Data Model
TCGA Data Model

Every File contains the Gene Expression Quantification data for more than 60,000 individual genes. This data provides a detailed view of gene characteristics and their expression levels across different normalization and sequencing strategies, which is crucial for understanding gene function and regulation within genomic studies.

The data fields are as follows:

  • gene_id - A unique identifier assigned to a gene. It is typically provided by a comprehensive database such as Ensembl or NCBI, e.g. ENSG00000139618
  • gene_name - The common name or symbol of the gene. This is often a short-hand abbreviation recognized by the scientific community.
  • gene_type - The category or classification of the gene based on its function or other characteristics. It indicates whether the gene is a protein-coding gene, a non-coding RNA gene, etc.
  • unstranded - Refers to gene expression data obtained without distinction between the sense and antisense strands of genetic material. The values here represent raw gene counts from unstranded RNA-seq. A numerical count indicating the number of reads mapped to a gene.
  • stranded_first - Indicates gene expression levels for data obtained with stranded RNA-seq, where reads mapping to the first strand are distinguished.
  • stranded_second - Refers to gene expression levels for data captured with stranded RNA-seq, where reads mapping to the second strand (the complementary strand) are distinguished.
  • tpm_unstranded - Transcripts Per Million (TPM) calculated from unstranded RNA-seq data. TPM is a normalization method that accounts for both the length of the gene/transcript and sequencing depth. These normalized values indicate the abundance of the transcript.
  • fpkm_unstranded - Fragments Per Kilobase of transcript per Million mapped reads (FPKM) derived from unstranded RNA-seq data. This is another normalization method similar to TPM, taking into account gene length and sequencing depth.
  • fpkm_uq_unstranded - Upper Quartile normalized FPKM (FPKM-UQ) from unstranded RNA-seq data. This normalization method adjusts for sequencing depth and ensures consistency among samples by normalization to the upper quartile of the data.

Data Architecture

The Data Productivity Cloud pipelines extract, transform and integrate the data using a Medallion architecture, in which the tables and views are organized into three logical layers:

  • Bronze - original source data
  • Silver - structured and integrated data
  • Gold - data optimized for external consumption
Image ofTCGA Data Architecture
TCGA Data Architecture

How to run the pipelines

Import the TCGA Genomic Data Commons Data Portal connector from the Matillion Exchange.

The pipelines use three project-level variables that you must create next, with default values appropriate to your configuration:

  • A Text project variable named s3_staging_bucket. Set its default value to the name of one of your S3 buckets that has been set up as a writeable External Location. Do not include the s3:// nor any substructure: just the bucket name.
  • A Text project variable named default_catalog, and set it to the Default Catalog of your Matillion Environment.
  • A Text project variable named default_schema, and set it to the Default Schema of your Matillion Environment.

Bronze layer

Start with the TCGA GDC - case extract orchestration. This is a paged “Search and Retrieval” request that extracts case data in pages of 1000 records at a time. Expect 40 or 50 rows to be created in the stg_cases table.

Open the TCGA GDC - case tx bronze transformation pipeline. Note how the structure in the Explode component is derived from Databricks itself. Run the pipeline and note the rowcount (44637 at the time of writing).

Run the TCGA GDC - files extract orchestration pipeline. It just loads a single record (comprising a large JSON array) into stg_expressions_info.

Run the TCGA GDC - files tx bronze transformation pipeline. At the time of writing it creates 11275 records in bronze_expressions_info.

Image ofTCGA GDC - files tx bronze
TCGA GDC - files tx bronze

The “Explode” and “Relationalize” components flatten and convert the JSON data into records and columns.

Image ofExplode and Relationalize
Explode and Relationalize

Run the TCGA GDC - expressions extract orchestration pipeline. This downloads some of the expression profile datafiles, and loads them all into bronze_gdc_file. There are a lot of files to download, so this pipeline does not attempt to download them all at once, which would place a sudden high load on the public GDC API. Repeat until no new files have been downloaded. This might take some time to finish! The pipeline is idempotent in case of temporary failures.

Image ofTCGA GDC - expressions extract
TCGA GDC - expressions extract

Silver layer

Run the TCGA GDC - case expressions transformation pipeline, to create the silver_case_expressions table, containing all the gene expression data by case.

To finish the case level tables in the silver layer, run the TCGA GDC - diagnoses demogs exposures to create:

  • silver_cases_demographics - case level data focussing on the demographics
  • silver_cases_diagnoses - ditto, for the diagnoses
  • silver_cases_exposures - ditto, for the exposure history
Image ofBronze to Silver table joins
Bronze to Silver table joins

The silver layer typically contains data in its most normalized representation, meaning that it acts as a source of truth. But the silver layer data structures can be hard to read. They usually benefit from further processing to make the data most consumable.

The gene expression data is a good example of this. The normalized representation is in “narrow” format, with more than 60,000 records per case: one per gene expression.

Image ofNarrow format - silver layer data
Narrow format - silver layer data

Now that the silver layer tables have been created, it’s time to move on to the gold layer.

Gold layer

Run TCGA GDC - top 20 diagnoses to create gold_top_20_diagnosis. This is a summary of the most common diagnoses in the data set.

Run TCGA GDC - gene variance pivot to create the view silver_cx_pivot and gold_cx_pivot tables. These contain the fpkm_unstranded values of the top 100 gene ids with the most variance. These can be used with dimension reduction for clustering. You can vary the number of genes selected by altering the value in the Query Result To Scalar component in this pipeline.

Image ofPivot most variable genes to wide format
Pivot most variable genes to wide format

Data in the gold_cx_pivot table is in “wide” format, with one column for each of the 100 most variable genes.

Image ofGene expression by case in wide format
Gene expression by case in wide format

The wide format data is ready for statistical analysis and machine learning techniques such as dimensionality reduction and clustering, to help us understand which gene expressions are most associated with which cancers.


Downloads

Licensed under: Matillion Free Subscription License

Installation Instructions

How to Install a Data Productivity Cloud Pipeline