
This article describes the pattern used by many Matillion ETL Shared Jobs to perform incremental loading.

Incremental Loading

Incremental Loading is only supported if the data source supports it.

All Shared Jobs following this pattern use a high water mark approach to loading. This relies on an appropriate column in the source data: after the initial run, only rows whose value in this column is greater than (or, depending on the column's data type, greater than or equal to) the maximum value already held in the target table created by the ETL process are extracted.
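
As an illustration of the high water mark approach, here is a minimal Python sketch. The run_query helper and all table and column names are illustrative assumptions, not Matillion internals:

```python
# Minimal sketch of high-water-mark extraction. `run_query` is a
# hypothetical helper that executes SQL and returns rows; the table
# and column names are placeholders for illustration only.

def build_incremental_extract(source_table: str,
                              target_table: str,
                              incremental_column: str,
                              run_query) -> str:
    """Return the extract SQL for the next incremental load."""
    # Find the high water mark: the maximum value already loaded.
    rows = run_query(f"SELECT MAX({incremental_column}) FROM {target_table}")
    high_water_mark = rows[0][0]

    if high_water_mark is None:
        # First run: nothing loaded yet, so extract everything.
        return f"SELECT * FROM {source_table}"

    # Subsequent runs: extract only rows beyond the mark. Whether to
    # use > or >= depends on the column's data type, as noted above.
    return (f"SELECT * FROM {source_table} "
            f"WHERE {incremental_column} > '{high_water_mark}'")
```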

The incremental column is specified in the "Tables and Columns" parameter and must be one of the supported data types.

For relational or NoSQL databases, the incremental column and primary key(s) are both selected by the user.

For fixed-schema data sources, such as API-based sources, the available incremental columns and primary keys are added by the Matillion ETL process and do not need to be selected. If these columns are not available, a full truncate load of the target table is performed on each run.
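
The decision between the two load modes can be sketched as below. The table_meta dict shape and function name are assumptions for illustration, not the shared job's actual internals:

```python
# Hedged sketch of the load-mode decision described above.

def choose_load_mode(table_meta: dict) -> str:
    """Pick the load strategy for one table."""
    has_incremental = bool(table_meta.get("incremental_column"))
    has_primary_key = bool(table_meta.get("primary_keys"))

    if has_incremental and has_primary_key:
        # New and changed rows are extracted past the high water mark
        # and merged into the target on the primary key(s).
        return "incremental"

    # Without an incremental column or primary keys, the target table
    # is truncated and fully reloaded on every run.
    return "full_truncate_load"

# Example: a table exposing neither column falls back to a full load.
print(choose_load_mode({"name": "orders"}))  # -> full_truncate_load
```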

Schema Drift

Schema drift support is included to accommodate source data changes, such as columns going missing after a source database or API change. If a table is missing from the source, it will no longer be loaded and the shared job will fail; all other tables specified in the shared job configuration will still be loaded. If this occurs, edit the Tables and Columns grid variable in your shared job to remove the missing table.

You can also manually add new tables or columns to, or remove them from, the Tables and Columns grid variable at any time, and the process will handle those changes.

Because the process relies on an explicit user choice of tables and columns, a new table or column added to your source is not picked up by default. However, it can be added manually as described above.

Any missing columns will be loaded as NULL. Data type changes are also accommodated, but if a change is not compatible with the target cloud platform, the existing column is renamed to <column_name>_<datetime> and the original column name is re-purposed for the new data type.

The format of the datetime extension is _yyyymmddhhmmss, e.g. _20210113110334. The suffix is the same for all columns in the same table in the same shared job configuration. The re-purposed column will be NULL for all rows loaded before the change, which should be considered for downstream dependencies such as views and reports.
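
The renaming rule can be sketched as follows. The function name, argument shape, and return value are assumptions for illustration; only the suffix format comes from the behaviour described above:

```python
from datetime import datetime

# Illustrative sketch of the rename-on-incompatible-type-change rule.

def plan_type_change(columns_to_repurpose: list[str],
                     run_started_at: datetime) -> dict[str, str]:
    """Map each affected column to its archived (renamed) column name."""
    # One timestamp per run, so every repurposed column in the same
    # table and configuration gets the same _yyyymmddhhmmss suffix.
    suffix = run_started_at.strftime("_%Y%m%d%H%M%S")
    return {col: f"{col}{suffix}" for col in columns_to_repurpose}

print(plan_type_change(["amount"], datetime(2021, 1, 13, 11, 3, 34)))
# -> {'amount': 'amount_20210113110334'}
```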

Additional Audit Columns

Two audit columns are added as part of the ETL process, for traceability and audit.

matillion_batch_id: The run_history_id associated with the orchestration process that invokes the shared job. If you have multiple processes executed within a parent orchestration, they will share the same batch ID. Different runs of the same process will have different batch IDs.

matillion_updated_timestamp: The timestamp when the data is inserted or updated into the target table.
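
A sketch of how these two columns might be stamped onto each row before it is written. The batch_id argument stands in for Matillion's run_history_id, and the row shape is an assumption for illustration:

```python
from datetime import datetime, timezone

# Sketch of attaching the two audit columns described above.

def add_audit_columns(rows: list[dict], batch_id: int) -> list[dict]:
    """Attach traceability columns before the rows are written."""
    loaded_at = datetime.now(timezone.utc)
    for row in rows:
        row["matillion_batch_id"] = batch_id            # same for the whole batch
        row["matillion_updated_timestamp"] = loaded_at  # insert/update time
    return rows
```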

Common Parameters

All Shared Jobs using this incremental load pattern use the parameters below, with slight variations.

Authentication Method: The type of authentication to use for the job. Options are OAuth (the default) and User/Password.

OAuth: The name of the OAuth entry in the Matillion OAuth Manager used to connect to the source. These must be set up in advance.

Username: (Username/password authentication only) The username supplied to connect to the source.

Password Manager Entry: (Username/password authentication only) The name of the entry in the Matillion Password Manager holding the password for the authentication method.

Tables and Columns: A grid variable containing the list of tables and columns to be processed, and in some cases an incremental_column flag (accepting values of 0 or 1) to specify which column the load should be incremented on. A sample layout is sketched after this list.

Connection Options: A grid variable of parameters and their values. Parameters and their allowed values are database/driver specific; refer to the driver documentation for what can be supplied here.

Load Type: Choose Sequential (the default) to run iterations in sequence, waiting for each to complete before starting the next. Choose Concurrent to run iterations concurrently. This requires all "Variables to Iterate" to be defined as copied variables, so that each iteration gets its own copy of the variable, isolated from the same variable being used by other concurrent executions.

Stage Prefix: A prefix added to the start of the stage table names. For example, if a Stage Prefix of stage_ is specified and the table being processed is named test_data, the stage table will be named stage_test_data.

Stage Database: The database where the staging data will be stored.

Stage Schema: The schema where the staging data will be stored.

Stage Warehouse: The warehouse used to load the staging data.

Target Prefix: A prefix added to the start of the target table names. For example, if a Target Prefix of target_ is specified and the table being processed is named test_data, the target table will be named target_test_data.

Target Database: The database where the target data will be stored.

Target Schema: The schema where the target data will be stored.

Target Warehouse: The warehouse used to load the target data.

Encryption: (AWS only) How the files are encrypted inside the S3 bucket. This property is available when using an Existing Amazon S3 Location for staging. None: no encryption. SSE KMS: encrypt the data using a key stored in KMS. SSE S3: encrypt the data using a key managed by S3.

KMS Key ID: The ID of the KMS encryption key chosen in the Encryption property.

Staging Variable: (AWS only) Specify Snowflake Managed to let Matillion ETL create and use a temporary internal stage on Snowflake for staging the data; this stage, along with the staged data, ceases to exist after loading is complete. Specify Existing Amazon S3 Location to expose properties for specifying a custom staging area on S3.

S3 Bucket Name: (AWS only) The name of an S3 bucket for temporary storage. Ensure your access credentials have S3 access and permission to write to the bucket. For more information, refer to the "IAM in AWS" section in this article on RBAC in the Cloud. The temporary objects created in this bucket are removed after the load completes; they are not kept. This property is available when using an Existing Amazon S3 Location for staging.

Azure Storage Account: (Azure only) The Azure storage account that allows Matillion ETL to specify a blob container object on Snowflake for staging data.

Azure Blob Container: (Azure only) The Azure blob container that allows Matillion ETL to use the blob storage container object on Snowflake for staging data.

Log Metrics Variable: Option to switch off metrics logging in the full product. The default is TRUE, which records the metrics for the batch run.
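
As referenced in the Tables and Columns entry above, here is a sketch of how that grid variable and the stage/target prefixes fit together. The exact column layout of the grid rows is an assumption for illustration; the prefix behaviour follows the Stage Prefix and Target Prefix descriptions:

```python
# Hypothetical grid rows: (table, column, incremental_column flag),
# where a flag of 1 marks the column the load is incremented on.
tables_and_columns = [
    ("test_data", "id",         0),
    ("test_data", "updated_at", 1),
    ("test_data", "amount",     0),
]

def staged_and_target_names(table: str,
                            stage_prefix: str = "stage_",
                            target_prefix: str = "target_") -> tuple[str, str]:
    """Apply the Stage Prefix and Target Prefix to a source table name."""
    return f"{stage_prefix}{table}", f"{target_prefix}{table}"

print(staged_and_target_names("test_data"))
# -> ('stage_test_data', 'target_test_data')
```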

Customizing a Shared Job

You can unpack the shared job and modify it with your own customizations.

Please note that if you unpack and re-generate a shared job, it will be generated as user-defined rather than read-only. If you do not increment the revision, the existing revision will be updated.


Author: Matillion