Author: Ian
Date Posted: Feb 2, 2024
Last Modified: Nov 29, 2024
Incremental Load Strategy with a Medallion Schema - Salesforce example
Three pipelines that demonstrate an incremental data refresh strategy, using a Medallion (bronze/silver/gold) data architecture, with Salesforce accounts as the example source.
Open the orchestration pipeline qs-incremental-load-salesforce first.

To configure the orchestration pipeline, open Variables (x) and then click the edit icon to set the default value of the SalesforceOAuthName variable to the name of your own Salesforce OAuth connection.

After updating the default value of the SalesforceOAuthName variable, the pipeline should validate cleanly.
Salesforce Accounts incremental load
This orchestration Data Productivity Cloud pipeline copies all records from your Salesforce Accounts table that have been updated since the last time the pipeline ran. The first time it runs, all the records will be copied.
- Target: bronze layer table mtln_qs_sf_account
- Granularity: one record per Salesforce Account
Note that this table is relatively throwaway: it is dropped and recreated every time the connector component runs. Furthermore, its schema is liable to change according to which columns are chosen in the connector.
The incremental load works in two stages:
- Check the maximum LastModifiedDate that has already been copied. This is done by the Set HighWaterMark component, which updates the HighWaterMark pipeline variable.
- In the Salesforce Query, use HighWaterMark in the Data Source Filter to request only more recently updated records. The HighWaterMark variable is declared as a Text type to avoid implicit format conversions. Note the explicit format mask in the SQL query inside the Set HighWaterMark component.
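The two stages above can be sketched in SQL. This is a sketch only: the lower-case ODS column name and the ISO 8601 format mask are assumptions, not the exact contents of the Set HighWaterMark component.

```sql
-- Sketch of the Set HighWaterMark logic: find the newest LastModifiedDate
-- already copied into the silver ODS table, formatted as text so it can be
-- held in a Text pipeline variable without implicit format conversion.
SELECT COALESCE(
         TO_CHAR(MAX(last_modified_date), 'YYYY-MM-DD"T"HH24:MI:SS"Z"'),
         '1900-01-01T00:00:00Z'   -- first run: no rows yet, so copy everything
       ) AS high_water_mark
FROM mtln_qs_sf_ods_account;

-- The Salesforce Query's Data Source Filter then applies a condition
-- equivalent to:
--   LastModifiedDate > ${HighWaterMark}
```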
The orchestration pipeline ensures that the silver layer ODS table mtln_qs_sf_ods_account exists, along with a sequence to provide a surrogate key. The ODS table is the permanent replicated copy, so is only created if it does not already exist.
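The DDL the pipeline runs is along these lines. It is a minimal sketch: the table and sequence names come from this example, but the column list is abbreviated and the column names are assumptions.

```sql
-- Sequence that supplies the surrogate key. Like the ODS table, it is
-- created only if it does not already exist.
CREATE SEQUENCE IF NOT EXISTS "mtln_qs_sf_ods_account_seq";

CREATE TABLE IF NOT EXISTS mtln_qs_sf_ods_account (
    surrogate_key       NUMBER DEFAULT "mtln_qs_sf_ods_account_seq".NEXTVAL,
    account_id          VARCHAR,         -- Salesforce primary key
    account_name        VARCHAR,
    last_modified_date  TIMESTAMP_NTZ
    -- ...remaining generic account columns
);
```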
To complete the orchestration, the pipeline runs the bronze-to-silver and the silver-to-gold data transformations. Note that these are identical to the transformation pipelines in the full data refresh strategy, thanks to the logic separation that comes from using the medallion data architecture.
Bronze to Silver - Account ODS table
The first Data Productivity Cloud transformation pipeline takes the raw, bronze layer account data and performs some basic transformations to change the data into a more generic operational data store (ODS) schema.

- The Rank and Filter combination ensures there is no duplication in the data
- The Rename component converts the schema from Salesforce Account into a general purpose Account table
- The Table Update performs a MERGE into the target, with either an INSERT or an UPDATE according to the primary key
- The surrogate key of the ODS table is an alternative primary key, and is generated from the sequence database object
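The Table Update's behaviour corresponds to a MERGE along these lines. This is a sketch under assumed column names (id, name, lastmodifieddate on the bronze side), not the component's exact generated SQL.

```sql
MERGE INTO mtln_qs_sf_ods_account tgt
USING mtln_qs_sf_account src             -- deduplicated bronze data
   ON tgt.account_id = src.id            -- match on the Salesforce primary key
WHEN MATCHED THEN UPDATE SET
     tgt.account_name       = src.name,
     tgt.last_modified_date = src.lastmodifieddate
WHEN NOT MATCHED THEN INSERT
     (account_id, account_name, last_modified_date)
     VALUES (src.id, src.name, src.lastmodifieddate);
-- surrogate_key is omitted from the INSERT: its DEFAULT (the sequence)
-- is applied automatically for new rows
```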
Prerequisites for using an incremental load strategy
The table must have an always-ascending attribute such as a last-updated timestamp, which the source system updates every time a record is created or modified. Salesforce implements this pattern consistently and reliably. In the absence of a timestamp you may be able to use an ascending integer such as an audit field if it exists.
The table must have a primary key. This allows the Table Update to distinguish between inserts and updates.
Hard deletes are difficult to track using this technique. If a record has been deleted from the source, it will never appear among the updated records because it no longer exists. There is more discussion of this subject in the article Understanding Full Load Data Replication.
Using a Snowflake database sequence as a surrogate key generator
This is a simple solution, implemented by setting the default value of the surrogate_key column to "mtln_qs_sf_ods_account_seq".NEXTVAL in the Create Silver ODS Table component. After providing a dynamic default value in this way, you can safely omit that column from DML operations. The default will always be applied upon insert.
But sequence objects are known to leave gaps in the generated sequence numbers. This is an unavoidable consequence of the internal performance enhancement of issuing new sequence numbers in batches. Exactly the same applies to AUTOINCREMENT and IDENTITY columns.
As an alternative you may consider using a ROW_NUMBER() function in a Rank component that issues an ordinal number to new rows. This does require at least one extra join back to the source table, so is more sophisticated at the cost of being marginally slower at runtime.
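One way to sketch that alternative is below. It is an assumption of the general shape, not the exact Rank component output: new rows are identified by the extra join back to the target, and each is given an ordinal offset by the current maximum surrogate key.

```sql
-- Gap-free surrogate key assignment for new rows only.
SELECT src.*,
       (SELECT COALESCE(MAX(surrogate_key), 0)
          FROM mtln_qs_sf_ods_account)
       + ROW_NUMBER() OVER (ORDER BY src.id) AS surrogate_key
FROM mtln_qs_sf_account src
LEFT JOIN mtln_qs_sf_ods_account tgt
       ON tgt.account_id = src.id        -- the extra join mentioned above
WHERE tgt.account_id IS NULL;            -- keep only rows not yet in the target
```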
Silver to Gold - Account fact and dimension
The second Data Productivity Cloud transformation pipeline starts with the ODS account table, and transforms it into an easily consumable star schema. It includes an account fact table and an account dimension table.

The dimension table mtln_qs_dim_account has one record per account, with its surrogate key taken from the ODS account table. It is a simple type 1 dimension.
The fact table mtln_qs_fact_account also contains one record per account. Three foreign key columns are added:
- dim_account_sk - foreign key to mtln_qs_dim_account
- dim_date_sk - foreign key to a date dimension (not implemented in this example)
- dim_time_sk - foreign key to a time dimension (not implemented in this example)
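A consumer query against the star schema might look like the following sketch. The dimension's surrogate key column name (account_sk) and the dimension attribute names are assumptions for illustration.

```sql
-- Join the fact to its dimension on the surrogate key.
SELECT d.account_name,
       f.dim_date_sk,
       f.dim_time_sk
FROM mtln_qs_fact_account f
JOIN mtln_qs_dim_account  d
  ON d.account_sk = f.dim_account_sk;
```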
How to run the pipeline
Validate and run qs-incremental-load-salesforce after setting your Salesforce OAuth as described above.
Run the pipeline again some time later and note how the ODS and star schema tables are updated to show all changes and additions that have been made at source.
Downloads
Licensed under: Matillion Free Subscription License
- Download incremental-load-salesforce_Snowflake.zip
- Target: Snowflake