Data Productivity Cloud Pipeline

Author: Matillion
Date Posted: Nov 22, 2024
Last Modified: Dec 2, 2024

Multi Tier Data Architecture

Build a multi tier data architecture, including raw, staging, ODS, 3NF, Data Vault, Star Schema and aggregations with these demonstration pipelines.

The source data is a small sample taken over 90 seconds from an industrial process. An IoT device recorded the gas output from a catalyzed chemical reaction. The gas was bubbling through an airlock, and the data recorded one “event” every time a bubble passed. Details about the data - including how it was extracted from the original, unstructured source - are in this article on Structured, Unstructured, and Semi-Structured Data.

The events occur in a piece of industrial equipment, which has a status over time.

Image ofIoT Event data model
IoT Event data model

Here’s a summary of how the data flows between tiers:

Image ofData tier progression
Data tier progression

How to use the demonstration pipelines

After import you will find the following pipeline and directory structure:

Image ofMulti tier demonstration pipelines
Multi tier demonstration pipelines

For each part 01, 02 etc, open and run the first pipeline in each subdirectory.

Raw data loading

Open example datamodel load raw iot event and follow the instructions in the notes about configuring your project:

  • Project Variables - to choose working areas in your own cloud storage and cloud data platform
  • Cloud Credentials - mandatory if you are using a Full SaaS Matillion Hosted Agent

Then run the pipeline to load the sample data into the raw_iot_event table.

Staging

This is an example of a raw-to-staging data transformation. Examples are fixing datatypes, and applying column naming standards.

Staging usually only contains a small set of recent changes, whereas an ODS contains all the data.

Run example datamodel raw to staging in this demonstration.

ODS

Run example datamodel setup ODS to create an Operational Data Store (ODS) table, sometimes also known as a Persistent Staging Area table. ODS tables are permanent copies from source. They do not get dropped and recreated, so the option here is “Create if not exists”.

Data structures in an ODS mirror those of the source system, so data is converted from its transport model (which is usually flat or semi-structured).

An ODS is a good source for independent operational reporting, as it should contain all the data from one source system.

3NF - Third Normal Form

Run example datamodel 3NF setup to perform the ODS-to-3NF data transformation.

3NF data structures represent the data in the most highly structured way. Data is converted from its ODS model into this centralized, unified, subject-oriented model.

An 3NF layer should contain all the data from all the relevant source systems.

Data Vault

This is an alternative way to model data in the data integration tier. Like 3NF, Data Vault represents the data in a system-independent, highly structured way.

Run example datamodel orchestrate data vault build to perform the ODS-to-Data Vault 2.0 data transformation.

Creating Data Vault objects tends to be quite technical, so this step uses templates that create Hub and Link tables (plus their Satellite and Error tables). You can find more detail on this in the Data Vault 2.0 Pipelines page.

In order to standardize the input to the templates, the pipelines actually read from a layer of views on top of the ODS data.

Star Schema

If you created the 3NF tables earlier, run example datamodel run 3nf to star now to transform the data from a 3NF model into a Star Schema.

If you created the Data Vault tables earlier, run example datamodel run dv to star now to similarly transformat the data into a Star Schema starting from its Data Vault representation.

Whichever path you chose, the result in the Star Schema ends up the same.

Aggregate

Run example datamodel star to aggregate to create an “Aggregate fact table” - grouping an existing Fact table on a smaller number of Dimensions.

In this demonstration the aggregate is fully recreated every time, using a Rewrite Table component.

Summary

A multi tier data architecture addresses two modern data engineering problems:

  • Application proliferation - data from different sources never works well together without transformation and integration
  • Format proliferation - data can be stored and processed in so many different ways
Data TierAcquiring dataIntegrating dataPresentation
Reason for tierDeal with format proliferationDeal with application proliferationMake data easy to consume
Raw, Staging, ODS3NF, Data VaultStar Schema, Aggregations
Data is integrated?NoYesYes
Data is easy to use?NoNoYes

Downloads

Licensed under: Matillion Free Subscription License

Installation Instructions

How to Install a Data Productivity Cloud Pipeline