Data Productivity Cloud Pipeline

Author: Matillion
Date Posted: Jun 10, 2024
Last Modified: Jun 12, 2024

Data Vault 2.0 Pipelines

Template your Hub, Link and Satellite Data Vault 2.0 tables with these Data Productivity Cloud pipelines.

To use the pipelines:

  • Create and manage a database object (table or view) that contains the source data for the Data Vault 2.0 tables. The data can be an increment rather than the full set (a sketch of such a source view appears after this list).
  • Run these pipelines, having configured them to read from your database object by setting their parameters. Example parameters are shown below.
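For example, a source view exposing a daily increment might look like this sketch (the object, column, and underlying table names are hypothetical; the SQL is Snowflake-style):

    CREATE OR REPLACE VIEW v_customer_source AS
    SELECT customer_id,        -- business key
           customer_name,      -- non-business-key columns end up in the satellite
           customer_region
    FROM   raw_customers
    WHERE  updated_at >= CURRENT_DATE - 1;  -- an increment rather than the full set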

Each Hub and Link table gets an accompanying Satellite table, with a semi-structured column that contains all the non-business-key columns. The pipelines also create an Error table for rows that cannot be added to the Data Vault.

The pipelines create all the tables if they don’t already exist. When running them for the first time, links must be processed after all the hubs. This restriction does not apply once the Data Vault tables have been created, or if you create them beforehand yourself.

The pipelines create a database function named f_dv2_hash that generates an MD5 hash code from an array of IDs. This is an excellent solution in most circumstances, but you may choose to replace this function with your own hash code generation logic.
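For illustration, here is a minimal sketch of such a function in Snowflake-style SQL; the function the pipelines actually create may construct the hash differently:

    -- Concatenate the array of IDs with a delimiter, then MD5-hash the result
    CREATE OR REPLACE FUNCTION f_dv2_hash(ids ARRAY)
    RETURNS VARCHAR
    AS
    $$
        MD5(ARRAY_TO_STRING(ids, '||'))
    $$;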

To generate a Hub:

  • Call Data Vault Quickstart - Stage to Hub
  • Parameters include a “stub” name for the Data Vault objects. If you set the stub to “customer”, for example, the pipeline will create and populate the h_customer hub, the s_customer satellite, and the e_customer error table (sketched after this list).
  • Your source must have a business key column. Rows with duplicate or NULL business keys will be sent to the error table.
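For orientation, the objects created for the “customer” stub might be shaped along these lines (column names and types are illustrative, not the pipeline's exact DDL):

    CREATE TABLE IF NOT EXISTS h_customer (
        customer_hash_key  VARCHAR,    -- f_dv2_hash of the business key
        customer_id        VARCHAR,    -- the business key
        load_datetime      TIMESTAMP,
        record_source      VARCHAR
    );

    CREATE TABLE IF NOT EXISTS s_customer (
        customer_hash_key  VARCHAR,    -- joins to h_customer
        payload            VARIANT,    -- semi-structured column holding all non-business-key columns
        load_datetime      TIMESTAMP,
        record_source      VARCHAR
    );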
[Image: Data Vault 2.0 Create Hub and Satellite tables]

To generate a Link:

  • Call Data Vault Quickstart - Stage to Link
  • You must nominate the hubs to join, by providing a list of stub names
  • Provide a source table that has all the business key columns of the hubs being joined
  • Parameters include a “link_suffix” that allows you to differentiate between multiple link tables that join an identical set of hub tables. For example, if you set the stubs to “customer” and “product”, and the link_suffix to “link1”, the pipeline will create and populate l_customer_product_link1, s_customer_product_link1, and e_customer_product_link1 (sketched after this list).
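For orientation, the link table in that example might be shaped roughly as follows (column names are illustrative):

    CREATE TABLE IF NOT EXISTS l_customer_product_link1 (
        link_hash_key      VARCHAR,    -- f_dv2_hash over both business keys
        customer_hash_key  VARCHAR,    -- joins to h_customer
        product_hash_key   VARCHAR,    -- joins to h_product
        load_datetime      TIMESTAMP,
        record_source      VARCHAR
    );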
[Image: Data Vault 2.0 Create Link and Satellite tables]

Hub and Satellite tables

The Data Vault Quickstart - Stage to Hub Data Productivity Cloud pipeline is made up of Python logic to set the necessary variables, followed by SQL scripts to run the DDL and DML.

[Image: Matillion Pipeline for Data Vault 2.0 Hub and Satellite tables]

Here is an example of how to set the parameters:

[Image: Scalar parameters for Data Vault Quickstart - Stage to Hub]

With these parameters, the pipeline will read from "PROD_DB"."DV2"."v_hub_equipment_source", and use the data to populate h_equipment, s_equipment and e_equipment, creating them if they do not already exist. The record source of the data will be marked as “Operations”.
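The resulting hub load might resemble this sketch (the hash key and business key column names are assumed for illustration; the actual scripts also load the satellite and error tables):

    INSERT INTO h_equipment (equipment_hash_key, equipment_id, load_datetime, record_source)
    SELECT f_dv2_hash(ARRAY_CONSTRUCT(src.equipment_id)),
           src.equipment_id,
           CURRENT_TIMESTAMP,
           'Operations'
    FROM   "PROD_DB"."DV2"."v_hub_equipment_source" src
    WHERE  NOT EXISTS (
               SELECT 1
               FROM   h_equipment h
               WHERE  h.equipment_id = src.equipment_id
           );  -- only business keys not already in the hub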

The grid variable must contain a list of the business key column names. In the screenshot below there is only one: equipment_id. This column must exist on the source object v_hub_equipment_source.

[Image: Grid parameter for Data Vault Quickstart - Stage to Hub]

Link and Satellite tables

The Data Vault Quickstart - Stage to Link Data Productivity Cloud pipeline is made up of Python logic to set the necessary variables, followed by SQL scripts to run the DDL and DML.

[Image: Matillion Pipeline for Data Vault 2.0 Link and Satellite tables]

Here is an example of how to set the parameters:

[Image: Scalar parameters for Data Vault Quickstart - Stage to Link]

With these parameters, the pipeline will read from "PROD_DB"."DV2"."v_bubble_event_equipment_link", and use the data to populate the link table, its satellite, and an associated error table if necessary. The table names will end with _link; the full names are determined by the grid variable settings. The record source of the data will be marked as “Operations”.

The grid variable must contain a list of the hub tables being joined. In the screenshot below there are two: equipment and bubble_event. The pipeline will therefore use the hub tables h_equipment and h_bubble_event, and will require that the business key columns of both tables exist on the source object v_bubble_event_equipment_link.

The names of the database objects associated with the Link will be l_bubble_event_equipment_link, s_bubble_event_equipment_link and e_bubble_event_equipment_link.
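A source object satisfying these requirements might be defined along these lines (the bubble_event_id column and the raw_measurements table are hypothetical):

    CREATE OR REPLACE VIEW "PROD_DB"."DV2"."v_bubble_event_equipment_link" AS
    SELECT m.bubble_event_id,   -- business key of h_bubble_event
           m.equipment_id       -- business key of h_equipment
    FROM   raw_measurements m;  -- hypothetical staging table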

[Image: Grid parameter for Data Vault Quickstart - Stage to Link]

About the example

This Data Vault Quickstart contains a pipeline named Create Sample Data, pictured below. It’s a hypothetical IoT example, in which temperature measurements are taken by pieces of “equipment”.

[Image: Data Vault Quickstart - Create Sample Data]

Run the Data Vault Quickstart - Run Demo pipeline to create the Data Vault objects:

  • A hub table for the “equipment”, with only one row in this example.
  • A hub table for measurements, with 13 rows.
  • A link table joining the two hubs, indicating which piece of equipment took each measurement.

All the contextual information about the hubs and the link is contained in the associated satellite tables. You should not encounter errors when you run this example, so the error tables will be created but should remain empty.
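To confirm the outcome, a quick row-count check along these lines can be run, assuming the measurement hub is named h_bubble_event as in the link example above:

    SELECT 'h_equipment' AS object_name, COUNT(*) AS row_count FROM h_equipment  -- expect 1
    UNION ALL
    SELECT 'h_bubble_event', COUNT(*) FROM h_bubble_event                        -- expect 13
    UNION ALL
    SELECT 'l_bubble_event_equipment_link', COUNT(*) FROM l_bubble_event_equipment_link
    UNION ALL
    SELECT 'e_equipment', COUNT(*) FROM e_equipment;                             -- expect 0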


Downloads

Licensed under: Matillion Free Subscription License

Installation Instructions

How to Install a Data Productivity Cloud Pipeline