Author: Matillion
Date Posted: Nov 22, 2024
Last Modified: Dec 2, 2024
Multi Tier Data Architecture
Build a multi tier data architecture, including raw, staging, ODS, 3NF, Data Vault, Star Schema and aggregations with these demonstration pipelines.
The source data is a small sample taken over 90 seconds from an industrial process. An IoT device recorded the gas output from a catalyzed chemical reaction. The gas was bubbling through an airlock, and the data recorded one “event” every time a bubble passed. Details about the data - including how it was extracted from the original, unstructured source - are in this article on Structured, Unstructured, and Semi-Structured Data.
The events occur in a piece of industrial equipment, which has a status over time.
Here’s a summary of how the data flows between tiers:
How to use the demonstration pipelines
After import you will find the following pipeline and directory structure:
For each part 01
, 02
etc, open and run the first pipeline in each subdirectory.
Raw data loading
Open example datamodel load raw iot event
and follow the instructions in the notes about configuring your project:
- Project Variables - to choose working areas in your own cloud storage and cloud data platform
- Cloud Credentials - mandatory if you are using a Full SaaS Matillion Hosted Agent
Then run the pipeline to load the sample data into the raw_iot_event
table.
Staging
This is an example of a raw-to-staging data transformation. Examples are fixing datatypes, and applying column naming standards.
Staging usually only contains a small set of recent changes, whereas an ODS contains all the data.
Run example datamodel raw to staging
in this demonstration.
ODS
Run example datamodel setup ODS
to create an Operational Data Store (ODS) table, sometimes also known as a Persistent Staging Area table. ODS tables are permanent copies from source. They do not get dropped and recreated, so the option here is “Create if not exists”.
Data structures in an ODS mirror those of the source system, so data is converted from its transport model (which is usually flat or semi-structured).
An ODS is a good source for independent operational reporting, as it should contain all the data from one source system.
3NF - Third Normal Form
Run example datamodel 3NF setup
to perform the ODS-to-3NF data transformation.
3NF data structures represent the data in the most highly structured way. Data is converted from its ODS model into this centralized, unified, subject-oriented model.
An 3NF layer should contain all the data from all the relevant source systems.
Data Vault
This is an alternative way to model data in the data integration tier. Like 3NF, Data Vault represents the data in a system-independent, highly structured way.
Run example datamodel orchestrate data vault build
to perform the ODS-to-Data Vault 2.0 data transformation.
Creating Data Vault objects tends to be quite technical, so this step uses templates that create Hub and Link tables (plus their Satellite and Error tables). You can find more detail on this in the Data Vault 2.0 Pipelines page.
In order to standardize the input to the templates, the pipelines actually read from a layer of views on top of the ODS data.
Star Schema
If you created the 3NF tables earlier, run example datamodel run 3nf to star
now to transform the data from a 3NF model into a Star Schema.
If you created the Data Vault tables earlier, run example datamodel run dv to star
now to similarly transformat the data into a Star Schema starting from its Data Vault representation.
Whichever path you chose, the result in the Star Schema ends up the same.
Aggregate
Run example datamodel star to aggregate
to create an “Aggregate fact table” - grouping an existing Fact table on a smaller number of Dimensions.
In this demonstration the aggregate is fully recreated every time, using a Rewrite Table component.
Summary
A multi tier data architecture addresses two modern data engineering problems:
- Application proliferation - data from different sources never works well together without transformation and integration
- Format proliferation - data can be stored and processed in so many different ways
Data Tier | Acquiring data | Integrating data | Presentation |
---|---|---|---|
Reason for tier | Deal with format proliferation | Deal with application proliferation | Make data easy to consume |
Raw, Staging, ODS | 3NF, Data Vault | Star Schema, Aggregations | |
Data is integrated? | No | Yes | Yes |
Data is easy to use? | No | No | Yes |
Downloads
Licensed under: Matillion Free Subscription License
- Download Multi-Tier-Databricks-AWS.zip
- Platform: AWS
- Target: Databricks
- Download Multi-Tier-Snowflake-AWS.zip
- Platform: AWS
- Target: Snowflake
- Download Multi-Tier-Snowflake-Azure.zip
- Platform: Azure
- Target: Snowflake