Author: Matillion
Date Posted: Jun 10, 2024
Last Modified: Nov 29, 2024
Data Vault 2.0 Pipelines
Template your Hub, Link and Satellite Data Vault 2.0 tables with these Data Productivity Cloud pipelines.
To use the pipelines:
- Create and manage a database object (table or view) that contains the source data for the Data Vault 2.0 tables. The data can be an increment rather than the full set.
- Run these pipelines, having configured them to read from your database object by setting their parameters. Example parameters are shown below.
A Satellite table is added to both Hub and Link tables, with a semi-structured column that contains all the non-business-key columns. The pipelines also create an Error table, for rows that cannot be added to the Data Vault.
The pipelines create all the tables if they don’t already exist. When running them for the first time, links must be processed after all the hubs. This restriction does not apply once the Data Vault tables have been created, or if you create them beforehand yourself.
The pipelines create a database function named f_dv2_hash that generates an MD5 hash code from an array of IDs. This is an excellent solution in most circumstances, but you may choose to replace this function with your own hash code generation logic.
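The Quickstart ships the actual definition of f_dv2_hash, but the idea can be sketched in Python: stringify the IDs, join them with a delimiter, and take the MD5 digest. The delimiter and the NULL handling below are illustrative assumptions, not the function's exact contract.

```python
import hashlib

def dv2_hash(ids):
    """Illustrative stand-in for f_dv2_hash: MD5 over an array of IDs.

    Joining with an explicit delimiter avoids collisions such as
    ['ab', 'c'] vs ['a', 'bc']. Mapping None to an empty string is an
    assumption for this sketch, not the pipeline's actual rule.
    """
    normalized = ["" if value is None else str(value) for value in ids]
    return hashlib.md5("||".join(normalized).encode("utf-8")).hexdigest()

# One business key (a hub), and a composite of two keys (a link):
print(dv2_hash(["customer-001"]))
print(dv2_hash(["customer-001", "product-42"]))
```

Because the digest is taken over the delimited, ordered array, the same IDs always hash to the same value, which is what lets the pipelines use the hash as a stable surrogate key.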
To generate a Hub:
- Call Data Vault Quickstart - Stage to Hub
- Parameters include a “stub” name for the Data Vault objects. If you set the stub to “customer”, for example, it will create and populate the h_customer hub, s_customer satellite and e_customer error table
- Your source must have a business key column. Rows with duplicate values or NULL business keys will be sent to the error table.
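The error-routing rule above can be sketched as a simple split of the staged rows. Whether the pipeline rejects every copy of a duplicated business key or only the extras is not stated here; this sketch rejects them all, and the dict-shaped rows are purely illustrative, not the pipeline's SQL.

```python
from collections import Counter

def split_hub_rows(rows, business_key):
    """Split staged rows into loadable rows and error-table rows.

    Mirrors the rule described above: rows with a NULL business key,
    or whose business key value appears more than once in the batch,
    go to the error table. All duplicates are rejected here, which is
    an assumption for this sketch.
    """
    counts = Counter(r[business_key] for r in rows if r[business_key] is not None)
    good, errors = [], []
    for r in rows:
        key = r[business_key]
        if key is None or counts[key] > 1:
            errors.append(r)
        else:
            good.append(r)
    return good, errors

staged = [
    {"customer_id": "C1", "name": "Ada"},
    {"customer_id": "C2", "name": "Grace"},
    {"customer_id": "C2", "name": "Grace H."},  # duplicate key -> error
    {"customer_id": None, "name": "Unknown"},   # NULL key -> error
]
good, errors = split_hub_rows(staged, "customer_id")
print(len(good), len(errors))  # 1 3
```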
To generate a Link:
- Call Data Vault Quickstart - Stage to Link
- You must nominate the hubs to join, by providing a list of stub names
- Provide a source table that has all the business key columns of the hubs being joined
- Parameters include a “link_suffix” that allows you to differentiate between multiple link tables that join an identical set of hub tables. For example, if you set the stubs to “customer” and “product”, and the link_suffix to “link1”, it will create and populate l_customer_product_link1, s_customer_product_link1 and e_customer_product_link1
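The naming conventions in the Hub and Link sections are mechanical, so they can be summarised in a few lines of Python. The h_/s_/e_/l_ prefixes and the stub-joining behaviour are taken from the examples in this document; whether the pipeline reorders the stubs or uses them exactly as given is an assumption here.

```python
def hub_object_names(stub):
    """Names of the hub, satellite and error tables for one stub."""
    return {"hub": f"h_{stub}", "satellite": f"s_{stub}", "error": f"e_{stub}"}

def link_object_names(stubs, link_suffix):
    """Names of the link, satellite and error tables for a set of stubs."""
    base = "_".join(list(stubs) + [link_suffix])
    return {"link": f"l_{base}", "satellite": f"s_{base}", "error": f"e_{base}"}

print(hub_object_names("customer"))
print(link_object_names(["customer", "product"], "link1"))
```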
Hub and Satellite tables
The Data Vault Quickstart - Stage to Hub Data Productivity Cloud pipeline is made up of Python logic to set the necessary variables, followed by SQL scripts to run the DDL and DML.
Here is an example of how to set the parameters:
With these parameters, the pipeline will read from "PROD_DB"."DV2"."v_hub_equipment_source", and use the data to populate h_equipment, s_equipment and e_equipment, creating them if they do not already exist. The record source of the data will be marked as “Operations”.
The grid variable must contain a list of the business key column names. In the screenshot below there is only one: equipment_id. This column must exist on the source object v_hub_equipment_source.
Link and Satellite tables
The Data Vault Quickstart - Stage to Link Data Productivity Cloud pipeline is made up of Python logic to set the necessary variables, followed by SQL scripts to run the DDL and DML.
Here is an example of how to set the parameters:
With these parameters, the pipeline will read from "PROD_DB"."DV2"."v_bubble_event_equipment_link", and use the data to populate the link table, its satellite, and an associated error table if necessary. The table names will end with _link, the actual name being determined by the grid variable settings. The record source of the data will be marked as “Operations”.
The grid variable must contain a list of the hub tables being joined. In the screenshot below there are two: equipment and bubble_event. So the pipeline will use hub tables h_equipment and h_bubble_event, and will require that the business key columns of both those tables exist on the source object v_bubble_event_equipment_link.
The names of the database objects associated with the Link will be l_bubble_event_equipment_link, s_bubble_event_equipment_link and e_bubble_event_equipment_link.
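Since the link source must carry every joined hub's business key columns, that precondition can be checked mechanically before running the pipeline. A minimal sketch, where the column names and the mapping of stubs to business keys are hypothetical examples:

```python
def missing_link_source_columns(source_columns, hub_business_keys):
    """Business key columns the link source is missing.

    hub_business_keys maps each hub stub to its business key column(s).
    All names here are hypothetical examples, not pipeline output.
    """
    required = [col for keys in hub_business_keys.values() for col in keys]
    present = set(source_columns)
    return [col for col in required if col not in present]

source = ["bubble_event_id", "equipment_id", "recorded_at", "temperature"]
print(missing_link_source_columns(source, {
    "bubble_event": ["bubble_event_id"],
    "equipment": ["equipment_id"],
}))  # [] -> the source satisfies both hubs
```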
About the example
This Data Vault Quickstart contains a pipeline named Create Sample Data, pictured below. It’s a hypothetical IoT example, in which temperature measurements are taken by pieces of “equipment”.
Run the Data Vault Quickstart - Run Demo pipeline to create the Data Vault objects:
- A hub table for the “equipment”, with only one row in this example.
- A hub table for measurements, with 13 rows.
- A link table joining the two hubs, to indicate which piece of equipment took each measurement.
All the contextual information about the hubs and the link is contained in the associated satellite tables. You should not encounter errors when you run this example, so the error tables will be created but should remain empty.
Downloads
Licensed under: Matillion Free Subscription License
- Download Data-Vault-2-Quickstart_Snowflake.zip
- Target: Snowflake