Author: Matillion
Date Posted: Nov 7, 2023
Last Modified: Mar 21, 2024
Incremental Load Shared Job Pattern
This article describes the pattern used by many Matillion ETL Shared Jobs to perform incremental loading.
Incremental Loading
Incremental loading is only available where the data source supports it.
All Shared Jobs following this pattern use a high water mark approach to loading. This relies on an appropriate column in the source data: after the initial run, only rows whose value in this column is greater than (or, depending on the column's data type, greater than or equal to) the maximum value already held in the target table created by the ETL process are extracted (see the sketch after the list below).
The incremental column is specified in the “Tables and Columns” parameter, and must be one of the following data types:
- Date
- Datetime
- Numeric or Integer
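As an illustrative sketch of the high water mark logic described above (not the shared job's internal implementation; the table and column names are hypothetical):

```python
from typing import Optional

def build_extract_query(source_table: str, incremental_column: str,
                        high_water_mark: Optional[str]) -> str:
    """Build the source extraction query for one run of the load."""
    if high_water_mark is None:
        # Initial run: no high water mark exists yet, so extract everything.
        return f"SELECT * FROM {source_table}"
    # Subsequent runs: only extract rows beyond the last loaded value.
    # Depending on the column's data type, '>' may instead be '>='.
    return (f"SELECT * FROM {source_table} "
            f"WHERE {incremental_column} > '{high_water_mark}'")

# The high water mark itself is read from the target table, e.g.
#   SELECT MAX(updated_at) FROM target_test_data
print(build_extract_query("test_data", "updated_at", "2024-03-01 00:00:00"))
```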
For relational or NoSQL databases, the incremental column and primary key(s) are both selected by the user.
For fixed-schema data sources, such as API-based sources, the available incremental columns and primary keys are added by the Matillion ETL process and do not need to be selected. If these columns are not available, a full truncate load of the target table is performed on each run.
Schema Drift
Schema drift support is included, which accommodates source data changes such as missing columns resulting from a change in the source database or API. A missing table will no longer be loaded and will cause the shared job to fail, although all other tables specified in the shared job's configuration will still be loaded. If this scenario occurs, edit the Tables and Columns grid variable in your shared job to remove the missing table.
You can also manually add tables or columns to, or remove them from, the Tables and Columns grid variable at any time, and the process will handle those changes.
Because the process is driven by a conscious user choice of tables and columns, a new table or column added to your source is not picked up by default; it can, however, be added manually as described above.
Any missing columns will be loaded as `NULL`. Data type changes will also be accommodated, but if a change is not compatible with the target cloud platform, the current column will be renamed as `<column_name>_datetime` and the column re-purposed as the new data type. The format of the datetime extension is `_yyyymmddhhmmss`, e.g. `_20210113110334`, and will be the same for all columns in the same table in the same shared job configuration. The new column will be `NULL` up to the date of change, which should be considered for downstream dependencies such as views and reports.
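For illustration, a minimal sketch of how such a suffix could be derived, assuming the rename is stamped with the time of the change (the function and names here are hypothetical, not the shared job's internal code):

```python
from datetime import datetime

def datetime_suffix(change_time: datetime) -> str:
    """Format the _yyyymmddhhmmss extension applied to a renamed column."""
    return change_time.strftime("_%Y%m%d%H%M%S")

# All re-typed columns in the same table and shared job configuration
# receive the same suffix, taken from the time of the change.
print("my_column" + datetime_suffix(datetime(2021, 1, 13, 11, 3, 34)))
# -> my_column_20210113110334
```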
Additional Audit Columns
Two audit columns are added as part of the ETL process for traceability and auditing.
Column | Description |
---|---|
matillion_batch_id | The run_history_id associated with the orchestration process that invokes the shared job. Multiple processes executed within the same parent orchestration share the same batch id; different runs of the same process have different batch ids |
matillion_updated_timestamp | The timestamp at which the data was inserted or updated in the target table |
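As an illustration only (the shared job populates these columns internally; the table and column names below are hypothetical), the effect on the target table is roughly:

```python
# Sketch of the statement the load effectively performs, with audit columns.
# run_history_id comes from the orchestration run that invoked the shared job.
run_history_id = 123456789

insert_sql = f"""
INSERT INTO target_test_data
    (id, value, matillion_batch_id, matillion_updated_timestamp)
SELECT id, value, {run_history_id}, CURRENT_TIMESTAMP
FROM stage_test_data
"""
print(insert_sql)
```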
Common Parameters
All Shared Jobs using this incremental load pattern use the parameters below, with slight variations.
Parameter | Description |
---|---|
Authentication Method | The type of authentication desired for the job. Options are OAuth (the default) and User/Password |
OAuth | The name of the OAuth entry to be used from the Matillion OAuth Manager to connect to the source. These must be set up in advance |
Username | (For username/password authentication only) The username to be supplied to connect to the source |
Password Manager Entry | (For username/password authentication only) The name in the Matillion Password Manager for the password relating to the authentication method |
Tables and Columns | A grid variable containing the list of tables and columns to be processed. In some cases it also includes an incremental_column flag (accepting values of 0 or 1) to specify which column the load should increment on; see the sketch after this table |
Connection Options | A grid variable list of parameters and values. Parameters and their allowed values are database/driver specific; refer to the driver documentation for insight into what can be provided here |
Load Type | Choose Sequential (the default) to run iterations in sequence, waiting for each to complete before starting the next. Choose Concurrent to run iterations concurrently. This requires all “Variables to Iterate” to be defined as copied variables, so that each iteration gets its own copy of the variable, isolated from the same variable in other concurrent executions |
Stage Prefix | A prefix value that will be added to the start of the stage table names. E.g. if a Stage Prefix of ‘stage_’ is specified and the table being processed is named test_data, then the stage table will be named stage_test_data |
Stage Database | The database name where the staging data will be stored |
Stage Schema | The schema name where the staging data will be stored |
Stage Warehouse | The warehouse name where the staging data will be stored |
Target Prefix | A prefix value that will be added to the start of the target table names. E.g. if a Target Prefix of target_ is specified and the table being processed is named test_data, then the target table will be named target_test_data |
Target Database | The database name where the target data will be stored |
Target Schema | The schema name where the target data will be stored |
Target Warehouse | The warehouse name where the target data will be stored |
Encryption | (AWS only) Decide how the files are encrypted inside the S3 bucket. This property is available when using an Existing Amazon S3 Location for staging. None: no encryption. SSE KMS: encrypt the data according to a key stored on KMS. SSE S3: encrypt the data according to a key stored on an S3 bucket |
KMS Key Id | The ID of the KMS encryption key you have chosen to use in the ‘Encryption’ property |
Staging Variable | (AWS only) Specify Snowflake Managed to have Matillion ETL create and use a temporary internal stage on Snowflake for staging the data; this stage, along with the staged data, ceases to exist after loading is complete. Specify Existing Amazon S3 Location to expose properties for specifying a custom staging area on S3 |
S3 Bucket Name | (AWS only) The name of an S3 bucket for temporary storage. Ensure your access credentials have S3 access and permission to write to the bucket. For more information, refer to the “IAM in AWS” section in this article on RBAC in the Cloud. The temporary objects created in this bucket are removed after the load completes; they are not kept. This property is available when using an Existing Amazon S3 Location for staging |
Azure Storage Account | (Azure only) The Azure Storage Account that allows Matillion ETL to specify a blob container object on Snowflake for staging data |
Azure Blob Container | (Azure only) The Azure Blob Container that allows Matillion ETL to use the blob storage container object on Snowflake for staging data |
Log Metrics Variable | Option to switch off metrics logging in the full product. The default is TRUE, which records the metrics for the batch run |
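As a sketch of the “Tables and Columns” grid variable referenced above, a two-table configuration might look like the following (all table and column names are hypothetical, and the exact grid columns vary between shared jobs):

```python
# Hypothetical contents of the "Tables and Columns" grid variable.
# Each row: (table_name, column_name, incremental_column flag of 0 or 1).
tables_and_columns = [
    ("test_data",  "id",          0),
    ("test_data",  "value",       0),
    ("test_data",  "updated_at",  1),  # incremental column for test_data
    ("other_data", "id",          0),
    ("other_data", "modified_on", 1),  # incremental column for other_data
]
```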
Customizing a Shared Job
You can unpack the shared job and modify it with your own customizations. Possible examples include:
- To switch off schema drift support, alter the value of `fix_target_data_types` in the call to Decide Load Strategy in the Stage 3 orchestration job, changing it from `Yes` to `No`.
- To use another datatype for the incremental column, alter the job variable `skip_inc_col_check` set in the call to Decide Load Strategy in the Stage 3 orchestration job, setting it to `Y`.
Please note that if you unpack and re-generate a shared job, it will be generated as user-defined rather than read-only. If you do not increment the revision, regenerating will update the existing revision.