Data Productivity Cloud Pipeline

Author: Matillion
Date Posted: Mar 10, 2025
Last Modified: Mar 12, 2025

Slowly Changing Dimensions (Manual)

Manage Slowly Changing Dimension tables in a Data Productivity Cloud pipeline, including type 1 and type 2 SCDs.

These pipelines maintain Slowly Changing Dimension tables for “event” reference data. Events are identified by an event_nk natural key, and have an event_type and an event_category.

This method of maintaining SCD tables is presented as an alternative to automatically maintaining SCD views.

Start by running Physical SCD begin setup to create and initialize all the tables.

Image ofSlowly Changing Dimension Tables — Slowly Changing Dimension Tables

Afterwards, run Physical SCD run multiple times to try the pipelines with test data. Every run creates new, random data in the input table STG_EVENT_DIM_EXAMPLE, which is fed into the dimension tables.

Image ofRun test data through the Slowly Changing Dimension tables — Run test data through the Slowly Changing Dimension tables

Type 1 Slowly Changing Dimension

A “type 1” SCD table doesn’t record any history of the past values of event_type or event_category: it always contains the latest data.

New column added:

Parameter	Description
event_sk	Surrogate key for the dimension. Take a look at these surrogate key techniques for reference

The Physical SCD EVENT_DIM_T1 pipeline reads from STG_EVENT_DIM_EXAMPLE and adds or updates any changes into the EVENT_DIM_T1 table.

The target dimension table is also used as an input, for multiple purposes:

to detect new event records (the ‘C’ indicator meaning ‘changed’)
to detect updated event records (the ‘N’ indicator meaning ’new’)
to generate new surrogate key values, with the New Row Rank and max_event_sk components. This example shows one of many ways to generate surrogate keys

Image ofMaintain a SCD type 1 table — Maintain a SCD type 1 table

Note that this solution does not consider logical deletes. If one of the events does not appear in the input, the pipeline assumes it has not changed.

Type 2 Slowly Changing Dimension

A “type 2” SCD table records every change to the event_type or event_category columns. Every new or changed value generates a new record in the table.

New columns added:

Parameter	Description
event_sk	Surrogate key for the dimension. Same as above, there are more surrogate key techniques available as alternatives
valid_from	Timestamp when this version of the event became current
valid_to	Timestamp when this version of the event ceased to be current
is_latest	Indicates if this version of the event is the latest known
version	Version number for the event, ascending as changes accumulate

The Physical SCD EVENT_DIM_T2 (Low Code) pipeline reads from STG_EVENT_DIM_EXAMPLE and adds or updates any changes into the EVENT_DIM_T2 table.

In this pipeline the target dimension table is also used as an input, to detect new or updated event records: the ‘C’ or ‘N’ indicator, same as in the type 1 SCD pipeline. Note that the comparison is made against the latest values in the table (where is_latest is true).

Validity time ranges valid_from and valid_to contain the timestamps when each version of the record is (or was) correct:

There is no time information in the input data, so the timestamp of the pipeline run is used
The “previous” version of every record is set to expire one second before the next one
The “current” version of an event expires at a hardcoded date far in the future

The main complexity of maintaining SCD type 2 tables is that changes require DML on two records:

An update to the previous version of the record, to set is_latest attribute to false, and give it a valid_to timestamp
Insert a new record

Brand new events require only one change:

Insert a new record

Image ofMaintain a SCD type 1 table (low code) — Maintain a SCD type 1 table (low code)

Two more possible scenarios are not considered by this pipeline:

Logical deletes are ignored, same as the SCD type 1 above. When an event does not appear in the input, the pipeline assumes it has not changed. Handling logical deletes would involve extra processing: for example requiring that all events are expected every time, and closing the validity of any that are not present.
Historical updates also present an additional complication for manually-maintained type 2 SCDs. Adjustments that need to be backdated involve carefully altering historical record validity ranges in the dimension table.

Type 2 Slowly Changing Dimension (high code version)

On some platforms, the exact same SCD type 2 logic is demonstrated in a high code version of the pipeline.

Image ofMaintain a SCD type 1 table (high code) — Maintain a SCD type 1 table (high code)

Note the sophisticated join clauses in the Type 2 SCD logic which are needed to handle cases where an optional attribute changes from NULL to non-NULL.

Downloads

Licensed under: Matillion Free Subscription License

Download SCD-Physical-Databricks.zip
- Target: Databricks
Download SCD-Physical-Redshift.zip
- Platform: AWS
- Target: Redshift
Download SCD-Physical-Snowflake.zip
- Target: Snowflake

Installation Instructions

How to Install a Data Productivity Cloud Pipeline