
This article describes the pattern used by many Matillion ETL Shared Jobs to perform incremental loading.

Incremental Loading

Incremental Loading is only supported if the data source supports it.

All Shared Jobs following this pattern use a high water mark approach to loading. This relies on an appropriate column in the source data: after the initial run, only rows whose value in this column is greater than (or, depending on the column's data type, greater than or equal to) the maximum value already held in the target table created by the ETL process are extracted.
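
As an illustration of the high water mark approach, here is a minimal Python sketch. The run_query helper and all table and column names are illustrative assumptions, not Matillion internals:

```python
# Minimal sketch of high-water-mark extraction. `run_query` is a
# hypothetical helper that executes SQL and returns rows; the table
# and column names are placeholders for illustration only.

def build_incremental_extract(source_table: str,
                              target_table: str,
                              incremental_column: str,
                              run_query) -> str:
    """Return the extract SQL for the next incremental load."""
    # Find the high water mark: the maximum value already loaded.
    rows = run_query(f"SELECT MAX({incremental_column}) FROM {target_table}")
    high_water_mark = rows[0][0]

    if high_water_mark is None:
        # First run: nothing loaded yet, so extract everything.
        return f"SELECT * FROM {source_table}"

    # Subsequent runs: extract only rows beyond the mark. Whether to
    # use > or >= depends on the column's data type, as noted above.
    return (f"SELECT * FROM {source_table} "
            f"WHERE {incremental_column} > '{high_water_mark}'")
```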

The incremental column is specified in the "Tables and Columns" parameter and must be one of the supported data types.

For relational or NoSQL databases, the incremental column and primary key(s) are both selected by the user.

For fixed-schema data sources, such as API-based sources, the available incremental columns and primary keys are added by the Matillion ETL process and do not need to be selected. If these columns are not available, a full truncate load of the target table is performed on each run.
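
The decision between the two load modes can be sketched as below. The table_meta dict shape and function name are assumptions for illustration, not the shared job's actual internals:

```python
# Hedged sketch of the load-mode decision described above.

def choose_load_mode(table_meta: dict) -> str:
    """Pick the load strategy for one table."""
    has_incremental = bool(table_meta.get("incremental_column"))
    has_primary_key = bool(table_meta.get("primary_keys"))

    if has_incremental and has_primary_key:
        # New and changed rows are extracted past the high water mark
        # and merged into the target on the primary key(s).
        return "incremental"

    # Without an incremental column or primary keys, the target table
    # is truncated and fully reloaded on every run.
    return "full_truncate_load"

# Example: a table exposing neither column falls back to a full load.
print(choose_load_mode({"name": "orders"}))  # -> full_truncate_load
```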

Schema Drift

Schema drift support is included to accommodate source data changes, such as columns going missing after a source database or API change. If a table is missing from the source, it will no longer be loaded and the shared job will fail; all other tables specified in the shared job configuration will still be loaded. If this occurs, edit the Tables and Columns grid variable in your shared job to remove the missing table.

You can also manually add new tables or columns to, or remove them from, the Tables and Columns grid variable at any time, and the process will handle those changes.

Because the process relies on an explicit user choice of tables and columns, a new table or column added to your source is not picked up by default. However, it can be added manually as described above.

Any missing columns will be loaded as NULL. Data type changes are also accommodated, but if a change is not compatible with the target cloud platform, the existing column is renamed to <column_name>_<datetime> and the original column name is re-purposed for the new data type.

The format of the datetime extension is _yyyymmddhhmmss, e.g. _20210113110334. The suffix is the same for all columns in the same table in the same shared job configuration. The re-purposed column will be NULL for all rows loaded before the change, which should be considered for downstream dependencies such as views and reports.
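
The renaming rule can be sketched as follows. The function name, argument shape, and return value are assumptions for illustration; only the suffix format comes from the behaviour described above:

```python
from datetime import datetime

# Illustrative sketch of the rename-on-incompatible-type-change rule.

def plan_type_change(columns_to_repurpose: list[str],
                     run_started_at: datetime) -> dict[str, str]:
    """Map each affected column to its archived (renamed) column name."""
    # One timestamp per run, so every repurposed column in the same
    # table and configuration gets the same _yyyymmddhhmmss suffix.
    suffix = run_started_at.strftime("_%Y%m%d%H%M%S")
    return {col: f"{col}{suffix}" for col in columns_to_repurpose}

print(plan_type_change(["amount"], datetime(2021, 1, 13, 11, 3, 34)))
# -> {'amount': 'amount_20210113110334'}
```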

Additional Audit Columns

Two audit columns are added as part of the ETL process, for traceability and audit.

matillion_batch_id: The run_history_id associated with the orchestration process that invokes the shared job. If you have multiple processes executed within a parent orchestration, they will share the same batch ID. Different runs of the same process will have different batch IDs.

matillion_updated_timestamp: The timestamp when the data is inserted or updated into the target table.
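
A sketch of how these two columns might be stamped onto each row before it is written. The batch_id argument stands in for Matillion's run_history_id, and the row shape is an assumption for illustration:

```python
from datetime import datetime, timezone

# Sketch of attaching the two audit columns described above.

def add_audit_columns(rows: list[dict], batch_id: int) -> list[dict]:
    """Attach traceability columns before the rows are written."""
    loaded_at = datetime.now(timezone.utc)
    for row in rows:
        row["matillion_batch_id"] = batch_id            # same for the whole batch
        row["matillion_updated_timestamp"] = loaded_at  # insert/update time
    return rows
```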

Common Parameters

All Shared Jobs using this incremental load pattern use the parameters below, with slight variations.

Authentication Method: The type of authentication to use for the job. Options are OAuth (the default) and User/Password.

OAuth: The name of the OAuth entry in the Matillion OAuth Manager used to connect to the source. These must be set up in advance.

Username: (Username/password authentication only) The username supplied to connect to the source.

Password Manager Entry: (Username/password authentication only) The name of the entry in the Matillion Password Manager holding the password for the authentication method.

Tables and Columns: A grid variable containing the list of tables and columns to be processed, and in some cases an incremental_column flag (accepting values of 0 or 1) to specify which column the load should be incremented on. A sample layout is sketched after this list.

Connection Options: A grid variable of parameters and their values. Parameters and their allowed values are database/driver specific; refer to the driver documentation for what can be supplied here.

Load Type: Choose Sequential (the default) to run iterations in sequence, waiting for each to complete before starting the next. Choose Concurrent to run iterations concurrently. This requires all "Variables to Iterate" to be defined as copied variables, so that each iteration gets its own copy of the variable, isolated from the same variable being used by other concurrent executions.

Stage Prefix: A prefix added to the start of the stage table names. For example, if a Stage Prefix of stage_ is specified and the table being processed is named test_data, the stage table will be named stage_test_data.

Stage Database: The database where the staging data will be stored.

Stage Schema: The schema where the staging data will be stored.

Stage Warehouse: The warehouse used to load the staging data.

Target Prefix: A prefix added to the start of the target table names. For example, if a Target Prefix of target_ is specified and the table being processed is named test_data, the target table will be named target_test_data.

Target Database: The database where the target data will be stored.

Target Schema: The schema where the target data will be stored.

Target Warehouse: The warehouse used to load the target data.

Encryption: (AWS only) How the files are encrypted inside the S3 bucket. This property is available when using an Existing Amazon S3 Location for staging. None: no encryption. SSE KMS: encrypt the data using a key stored in KMS. SSE S3: encrypt the data using a key managed by S3.

KMS Key ID: The ID of the KMS encryption key chosen in the Encryption property.

Staging Variable: (AWS only) Specify Snowflake Managed to let Matillion ETL create and use a temporary internal stage on Snowflake for staging the data; this stage, along with the staged data, ceases to exist after loading is complete. Specify Existing Amazon S3 Location to expose properties for specifying a custom staging area on S3.

S3 Bucket Name: (AWS only) The name of an S3 bucket for temporary storage. Ensure your access credentials have S3 access and permission to write to the bucket. For more information, refer to the "IAM in AWS" section in this article on RBAC in the Cloud. The temporary objects created in this bucket are removed after the load completes; they are not kept. This property is available when using an Existing Amazon S3 Location for staging.

Azure Storage Account: (Azure only) The Azure storage account that allows Matillion ETL to specify a blob container object on Snowflake for staging data.

Azure Blob Container: (Azure only) The Azure blob container that allows Matillion ETL to use the blob storage container object on Snowflake for staging data.

Log Metrics Variable: Option to switch off metrics logging in the full product. The default is TRUE, which records the metrics for the batch run.
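
As referenced in the Tables and Columns entry above, here is a sketch of how that grid variable and the stage/target prefixes fit together. The exact column layout of the grid rows is an assumption for illustration; the prefix behaviour follows the Stage Prefix and Target Prefix descriptions:

```python
# Hypothetical grid rows: (table, column, incremental_column flag),
# where a flag of 1 marks the column the load is incremented on.
tables_and_columns = [
    ("test_data", "id",         0),
    ("test_data", "updated_at", 1),
    ("test_data", "amount",     0),
]

def staged_and_target_names(table: str,
                            stage_prefix: str = "stage_",
                            target_prefix: str = "target_") -> tuple[str, str]:
    """Apply the Stage Prefix and Target Prefix to a source table name."""
    return f"{stage_prefix}{table}", f"{target_prefix}{table}"

print(staged_and_target_names("test_data"))
# -> ('stage_test_data', 'target_test_data')
```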

Customizing a Shared Job

You can unpack the shared job and modify it with your own customizations.

Please note that if you unpack and re-generate a shared job, it will be generated as user-defined rather than read-only. If you do not increment the revision, the existing revision will be updated.


Author: Matillion