Author: Ian
Date Posted: Feb 2, 2024
Last Modified: Nov 29, 2024
Incremental Load Strategy Medallion Schema - Salesforce example
Incremental data refresh strategy with a Medallion data architecture, demonstrated using Salesforce accounts.
Three pipelines that demonstrate an incremental data replication strategy, using a Medallion (bronze/silver/gold) data architecture.
Open the orchestration pipeline qs-incremental-load-salesforce
first.
To configure the orchestration pipeline, follow Variables (x)
and then the edit icon to set the default value of the SalesforceOAuthName
variable to the name of your own Salesforce OAuth connection.
After updating the default value of the SalesforceOAuthName
variable the pipeline should validate cleanly.
Salesforce Accounts incremental load
This orchestration Data Productivity Cloud pipeline copies all records from your Salesforce Accounts table that have been updated since the last time the pipeline ran. The first time it runs, all the records will be copied.
- Target: bronze layer table
mtln_qs_sf_account
- Granularity: one record per Salesforce Account
Note this table is relatively throwaway: it gets dropped and recreated every time the connector component runs. Furthermore the schema is liable to change according to which columns are chosen in the connector.
The incremental load works in two stages
- Check the maximum
LastModifiedDate
that has already been copied - done by theSet HighWaterMark
component, which udpates theHighWaterMark
pipeline variable - In the Salesforce Query, use the
HighWaterMark
in the Data Source Filter to only request more recently updated records. TheHighWaterMark
variable is declared as aText
type to avoid implicit format conversions. Note the explicit format mask in the SQL query inside theSet HighWaterMark
component.
The orchestration pipeline ensures that the silver layer ODS table mtln_qs_sf_ods_account
exists, along with a sequence to provide a surrogate key. The ODS table is the permanent replicated copy, so is only created if it does not already exist.
To complete the orchestration, the pipeline runs the bronze-to-silver and the silver-to-gold data transformations. Note that these are identical to the transformation pipelines in the full data refresh strategy, thanks to the logic separation that comes from using the medallion data architecture.
Bronze to Silver - Account ODS table
The first Data Productivity Cloud transformation pipeline takes the raw, bronze layer account data and performs some basic transformations to change the data into a more generic operational data store (ODS) schema.
- The Rank and Filter combination ensures there is no duplication in the data
- The Rename component converts the schema from Salesforce Account into a general purpose Account table
- The Table Update performs a
MERGE
into the target, with either anINSERT
or anUPDATE
according to the primary key. - The surrogate key of the ODS table is an alternative primary key, and is generated from the sequence database object
Prerequisites for using an incremental load strategy
The table must have an always-ascending attribute such as a last-updated timestamp, which the source system updates every time a record is created or modified. Salesforce implements this pattern consistently and reliably. In the absence of a timestamp you may be able to use an ascending integer such as an audit field if it exists.
The table must have a primary key. This allows the Table Update to distinguish between inserts and updates.
Hard deletes are difficult to track using this technique. If a record has been deleted from source it will never appear among the updated records because it doesn’t exist any longer. More discussion on this subject in this article on Understanding Full Load Data Replication.
Using a Snowflake database sequence as a surrogate key generator
This is a simple solution, implemented by setting the default value of the surrogate_key
column to "mtln_qs_sf_ods_account_seq".NEXTVAL
in the Create Silver ODS Table
component. After providing a dynamic default value in this way, you can safely omit that column from DML operations. The default will always be applied upon insert.
But sequence objects are known to leave gaps in the generated sequence numbers. This is an unavoidable consequence of the internal performance enhancement of issuing new sequence numbers in batches. Exactly the same applies to AUTOINCREMENT
and IDENTITY
columns.
As an alternative you may consider using a ROW_NUMBER()
function in a Rank component that issues an ordinal number to new rows. This does require at least one extra join back to the source table, so is more sophisticated at the cost of being marginally slower at runtime.
Silver to Gold - Account fact and dimension
The second Data Productivity Cloud transformation pipeline starts with the ODS account table, and transforms it into an easily consumable star schema. It includes an account fact table and an account dimension table.
The dimension table mtln_qs_dim_account
has one record per account, with its surrogate key taken from the ODS account table. It is a simple type 1 dimension
The fact table mtln_qs_fact_account
also contains one record per account. Three foreign key columns are added:
dim_account_sk
- foreign key tomtln_qs_dim_account
dim_date_sk
- foreign key to a date dimension (not implemented in this example)dim_time_sk
- foreign key to a time dimension (not implemented in this example)
How to run the pipeline
Validate and run qs-incremental-load-salesforce
after setting your Salesforce OAuth as described above.
Run the pipeline again some time later and note how the ODS and star schema tables are updated to show all changes and additions that have been made at source.
Downloads
Licensed under: Matillion Free Subscription License
- Download incremental-load-salesforce_Snowflake.zip
- Target: Snowflake