Slowly Changing Dimensions (Automated)

Manage Slowly Changing Dimension views in a Data Productivity Cloud pipeline, including type 1, type 2, type 3 and type 6 SCDs.

These pipelines maintain Slowly Changing Dimension views for “event” reference data. Events are identified by an event_nk natural key, and have an event_type and an event_category.

This method of automatically maintaining SCD views from a single reference table is presented as an alternative to manually maintaining SCD tables.

Start by running Logical SCD begin setup to create and initialize all the database objects.

Image ofSlowly Changing Dimension Views
Slowly Changing Dimension Views

Afterwards, run Logical SCD run multiple times to try the pipelines with test data. Every run creates new, random data in the input table STG_EVENT_DIM_EXAMPLE, which is added to NORM_EVENT_REF so it automatically appears in the dimension views.

Image ofRun test data through the Slowly Changing Dimension views
Run test data through the Slowly Changing Dimension views

The backing table NORM_EVENT_REF

This is the only permanent storage, and is used to generate all the different dimension variations.

NORM_EVENT_REF records all available information about the event dimension, and the timestamp when data was recorded. The columns are:

ParameterDescription
event_skSurrogate key column, automatically generated, and used for the surrogate key columns in the dimensions. Take a look at these surrogate key techniques for reference
event_nkThe natural key of the event
event_typeAn attribute of “events”, that may change over time
event_categoryAnother attribute of “events”, that also may change over time
as_atTimestamp when this version of the event was recorded

Storing data in the backing table is a fast, append-only operation. You may choose to add a filter to avoid storing duplicate data.

Image ofAppend to backing table with deduplication
Append to backing table with deduplication

Type 1 Slowly Changing Dimension

A “type 1” SCD doesn’t record any history of the past values of event_type or event_category: it always contains the latest data.

Deriving this from NORM_EVENT_REF is done by filtering out the latest record (by as_at timestamp) for every event_nk. The Data Productivity Cloud components are a Rank and Filter combination.

Image ofSCD type 1 view
SCD type 1 view

Type 2 Slowly Changing Dimension

A “type 2” SCD records every change to the event_type or event_category columns. Every new or changed value generates a new record.

This is the simplest SCD type to derive from the backing table, because the granularity is already correct. No filtering is needed. The time variance metadata columns are derived like this:

  • valid_from - just the as_at timestamp
  • valid_to - the as_at timestamp of the following record with the same event_nk using a LEAD analytic function. If there is none, use a hardcoded date far in the future
  • is_latest - true if this record has the highest as_at timestamp for its event_nk, using a ROW_NUMBER analytic function
  • version - an ascending integer sequence per event_nk, set with another ROW_NUMBER analytic function
Image ofSCD type 2 view
SCD type 2 view

Type 3 Slowly Changing Dimension

A “type 3” SCD is identical to a type 2, with the addition of “prior” values of event_type and event_category. The new columns are:

  • prior_event_type - the event_type of the previous record with the same event_nk calculated using a LAG analytic function
  • prior_event_category - same, but for event_category

These extra columns bring the useful ability for reports to quickly compare against the “previous” situation after a wide-ranging change. A common example is reporting before and after changes to sales region boundaries.

Image ofSCD type 3 view
SCD type 3 view

Type 6 Slowly Changing Dimension

A “type 6” SCD combines the history tracking of a type 2, with the addition of the “current” values of all attributes taken from the type 1 dimension. This is done using a Join component between the two logic branches. This dimension type is harder to implement as a real table, because every update also requires an update to the “current” values of all prior versions of the record.

The new columns are:

  • current_event_type - the event_type from the type 1 dimension
  • current_event_category - same, but for event_category

A type 6 SCD is a good default option, because it gives users a very simple choice between using the “as-was” attributes (correct at the time of the transaction) and the “as-is” attributes (correct right now). No decision ever needs to be made about which attributes are history-tracked.

Image ofSCD type 6 view
SCD type 6 view

SCD types 4 and 5

These are not well defined, and are not implemented among these examples. SCD types 4 and 5 are usually interpreted as deploying two database objects per dimension. Either adding a mini-dimension just for volatile attributes, or having one “current” and one “historical” dimension. Sometimes this includes giving fact tables multiple foreign keys to the dimension: for example one for the current attributes and one for the prior.

Logical deletes and updates out of sequence

These pipelines ignore the possibility of logical deletes. When an event does not appear in the input, the pipeline assumes it has not changed. Handling logical deletes would involve extra processing: for example requiring that all events are expected every time, and closing the validity of any that are not present.

Historical updates - sometimes known as updates out-of-sequence - occur when an earlier update to the dimension is at first omitted for some reason; perhaps an artificial historical adjustment. Adding these changes is no problem: it just means adding a new record to NORM_EVENT_REF with an as_at timestamp that falls somewhere in the past. The logic of the dimension views immediately takes care of adjusting all the validity ranges, current-flag and version numbers. The business may require updates to any related fact tables, since transactions that occurred in the time of the historical change may now link to the wrong dimension record.

Adding fixed “unknown” values

Most dimensions always include one or more records with a fixed surrogate key, for cases where the actual dimension can’t be linked. This usefully allows fact tables to always inner join to all dimensions without losing records. In reports, this manifests as transaction fact records with no “event” information showing “unknown” event_type and event_category, rather than being missing from the report.

The usual convention is that -1 represents “unknown”. This value can be hardcoded in pipelines that create fact table records when the dimension lookup fails.

The fixed records can be introduced with a Fixed Flow and a Unite component in UNION ALL mode (i.e. with “Remove Duplicates” set to No) immediately before the view is defined.

Image ofAdding 'unknown' to an SCD view
Adding 'unknown' to an SCD view

Performance considerations

If your dimensions are large, with either many columns or many records, you may consider materializing them as tables as the final part of an ETL run. This exchanges runtime compute for marginally more storage consumed.


Downloads

Licensed under: Matillion Free Subscription License

Installation Instructions

How to Install a Data Productivity Cloud Pipeline