This article describes the pattern used by many Matillion ETL Shared Jobs to perform incremental loading.
Incremental Loading is only supported if the data source supports it.
All Shared Jobs following this pattern use a high-water-mark approach to loading. An appropriate column in the source data is chosen; after the initial run, only rows whose value in this column is greater than (or, depending on the column’s data type, greater than or equal to) the maximum value held in the target table created by the ETL process are extracted.
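As an illustration, the high-water-mark extraction can be sketched as follows. This is a minimal sketch with hypothetical table and column names; the shared jobs generate their extraction queries internally.

```python
def build_incremental_query(source_table, inc_column, high_water_mark,
                            inclusive=False):
    """Build the extraction query for one run of an incremental load.

    `inclusive` uses >= instead of >, which some column data types need
    (rows equal to the stored maximum may still be new).
    """
    if high_water_mark is None:
        # First run: a full extract establishes the initial high water mark.
        return f"SELECT * FROM {source_table}"
    op = ">=" if inclusive else ">"
    return (f"SELECT * FROM {source_table} "
            f"WHERE {inc_column} {op} {high_water_mark!r}")

# The next run's high water mark is simply the maximum value of the
# incremental column already present in the target table, e.g.:
#   SELECT MAX(updated_at) FROM target_table
```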
The incremental column is specified in the “Tables and Columns” parameter, and must be one of the following data types:
- Numeric or Integer
For relational or NoSQL databases, the incremental column and primary key(s) are both selected by the user.
For fixed schema data sources, such as API-based sources, available incremental columns and primary keys are added by the Matillion ETL process and do not need to be selected. If these columns are not available, a full truncate load of the target table will be performed during each run.
Schema drift support is included, which accommodates source data changes, such as columns going missing after a source database or API change. A missing table will no longer be loaded and will cause the shared job to fail; all other tables specified in the shared job configuration will still be loaded. If this scenario occurs, edit the Tables and Columns grid variable in your shared job to remove the missing table.
You can also manually add or remove tables or columns in the Tables and Columns grid variable at any time, and the process will handle those changes.
Because the process relies on a deliberate user choice of tables and columns, a new table or column added to your source is not picked up by default. It can, however, be added manually as described above.
Any missing columns will be loaded as NULL. Data type changes will also be accommodated, but if a change is not compatible with the target cloud platform, the current column will be renamed as <column_name>_datetime and the column re-purposed for the new data type. The datetime extension has the format _20210113110334 and will be the same for all columns in the same table within the same shared job configuration. The new column will be NULL up to the date of the change, which should be considered for downstream dependencies such as views and reports.
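The renaming behavior described above can be sketched as follows. This is illustrative only; the function name is hypothetical, but the suffix format matches the article’s example.

```python
from datetime import datetime

def drift_rename(column, changed_at):
    """Name given to the old column when its data type changes incompatibly.

    The suffix format matches the article's example (_20210113110334) and is
    shared by every renamed column in the same table for one configuration.
    """
    return f"{column}_{changed_at.strftime('%Y%m%d%H%M%S')}"

# drift_rename("amount", datetime(2021, 1, 13, 11, 3, 34))
# -> "amount_20210113110334"
```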
Additional Audit Columns
Two audit columns are added as part of the ETL process, for traceability and audit.
- The run_history_id associated with the orchestration process that invoked the shared job (the batch ID). If multiple processes are executed within one parent orchestration, they share the same batch ID; different runs of the same process have different batch IDs.
- The timestamp at which the data was inserted or updated in the target table.
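As a sketch, attaching the two audit values to a row might look like this. The audit column names used here are hypothetical placeholders, not necessarily the names the shared job actually writes.

```python
from datetime import datetime, timezone

def add_audit_columns(row, batch_id):
    """Attach the two audit values described above to a target row.

    Column names here are illustrative placeholders; the shared job's
    actual audit column names may differ.
    """
    return {**row,
            "batch_id": batch_id,                     # run_history_id of the parent orchestration
            "loaded_at": datetime.now(timezone.utc)}  # insert/update timestamp
```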
All Shared Jobs using this incremental load pattern use the parameters below, with slight variations.
- The type of authentication to be used for the job. Options are OAuth (the default) and username/password.
- The name of the OAuth entry from the Matillion OAuth Manager to be used to connect to the source. These entries must be set up in advance.
- (For username/password authentication only) The username to be supplied to connect to the source.
- Password Manager Entry: (For username/password authentication only) The name of the entry in the Matillion Password Manager holding the password for the authentication method.
- Tables and Columns: A grid variable containing the list of tables and columns to be processed and, in some cases, an incremental_column flag (accepting 0 or 1) specifying which column the load should be incremented on.
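One way to picture the Tables and Columns grid variable is as rows of table, column, and incremental flag. The table and column names here are made up for illustration.

```python
# Illustrative representation of a "Tables and Columns" grid variable:
# one row per column, with incremental_column = 1 marking the column the
# load increments on (0 otherwise).
tables_and_columns = [
    # (table,     column,       incremental_column)
    ("orders",    "order_id",   0),
    ("orders",    "updated_at", 1),
    ("customers", "id",         0),
]

def incremental_column_for(table):
    """Return the incremental column chosen for a table, or None."""
    for t, col, flag in tables_and_columns:
        if t == table and flag == 1:
            return col
    return None
```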
- A grid variable of parameter/value pairs. Parameters and their allowed values are database/driver specific; refer to the driver documentation for what can be provided here.
- Choose Sequential (the default) to run iterations in sequence, waiting for each to complete before starting the next, or Concurrent to run iterations concurrently. Concurrent execution requires all “Variables to Iterate” to be defined as copied variables, so that each iteration gets its own copy of the variable, isolated from the same variable being used by other concurrent executions.
- Stage Prefix: A prefix value added to the start of the stage table names. For example, if a Stage Prefix of ‘stage_’ is specified and the table being processed is named test_data, the stage table will be named stage_test_data.
- The database name where the staging data will be stored.
- The schema name where the staging data will be stored.
- The warehouse name where the staging data will be stored.
- Target Prefix: A prefix value added to the start of the target table names. For example, if a Target Prefix of target_ is specified and the table being processed is named test_data, the target table will be named target_test_data.
- The database name where the target data will be stored.
- The schema name where the target data will be stored.
- The warehouse name where the target data will be stored.
- Encryption: (AWS only) How the files are encrypted inside the S3 bucket. This property is available when using an Existing Amazon S3 Location for staging. Options:
  - None: no encryption.
  - SSE KMS: encrypt the data using a key stored in KMS.
  - SSE S3: encrypt the data using S3-managed encryption keys.
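For illustration, the three options map onto S3 server-side encryption parameters roughly as follows when staging files with boto3-style ExtraArgs. The helper function is hypothetical and not part of the shared job.

```python
def s3_encryption_args(encryption, kms_key_id=None):
    """Map the Encryption options above onto boto3 upload ExtraArgs.

    Illustrative helper only; the shared job applies encryption internally.
    """
    if encryption == "None":
        return {}
    if encryption == "SSE KMS":
        if not kms_key_id:
            raise ValueError("SSE KMS requires a KMS Key Id")
        return {"ServerSideEncryption": "aws:kms", "SSEKMSKeyId": kms_key_id}
    if encryption == "SSE S3":
        return {"ServerSideEncryption": "AES256"}  # S3-managed keys
    raise ValueError(f"unknown encryption option: {encryption}")
```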
- KMS Key Id: The ID of the KMS encryption key you have chosen to use in the ‘Encryption’ property.
- (AWS only) Specify Snowflake Managed to allow Matillion ETL to create and use a temporary internal stage on Snowflake for staging the data; this stage, along with the staged data, ceases to exist after loading completes. Specify Existing Amazon S3 Location to expose properties for specifying a custom staging area on S3.
- S3 Bucket Name: (AWS only) The name of an S3 bucket for temporary storage. Ensure your access credentials have S3 access and permission to write to the bucket; for more information, refer to the “IAM in AWS” section of the article on RBAC in the Cloud. The temporary objects created in this bucket are removed after the load completes; they are not kept. This property is available when using an Existing Amazon S3 Location for staging.
- Azure Storage Account: (Azure only) The Azure storage account that allows Matillion ETL to specify a blob container object on Snowflake for staging data.
- Azure Blob Container: (Azure only) The Azure blob container that allows Matillion ETL to use the blob storage container object on Snowflake for staging data.
- Log Metrics Variable: Option to switch off metrics logging in the full product. The default is TRUE, which records the metrics for the batch run.
Customizing a Shared Job
You can unpack the shared job and modify it with your own customizations. Possible examples include:
- To switch off schema drift support, alter the value of fix_target_data_types in the call to Decide Load Strategy in the Stage 3 orchestration job, changing it from True to False.
- To use another data type for the incremental column, alter the job variable skip_inc_col_check set in the call to Decide Load Strategy in the Stage 3 orchestration job, setting it to True.
Please note that if you unpack and re-generate a shared job, it is generated as user-defined rather than read-only. If you don’t increment the revision, the existing revision is updated.