Matillion ETL Shared Job

Author: Matillion
Date Posted: Nov 15, 2023
Last Modified: Nov 16, 2023

Split S3 File

Split a large file in S3 into smaller files.

This utility is for textual, line-oriented data files such as CSV and TSV. It is not suitable for semi-structured formats such as JSON, Avro, ORC, Parquet, or XML.

It splits the input file on line boundaries and creates smaller output files that, together, contain the same data. The input file may optionally be gzipped. Line boundaries within the source file can be either a single LF (Linux/Unix style) or a CRLF (Windows style). If some lines in the input file are very long, the sizes of the generated files will be slightly uneven.
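The splitting behavior can be illustrated locally with a minimal sketch (assuming GNU coreutils; this is not the shared job's actual implementation). GNU split's -C option fills each output file with whole lines up to the given byte size, which is exactly the line-boundary behavior described above:

```shell
# Build a small line-oriented sample file (200 rows).
printf 'row %s\n' $(seq 1 200) > input.csv

# Split on line boundaries: each chunk holds whole lines, at most 512 bytes.
split -C 512 -d input.csv chunk_

# The chunks concatenate back to the original file, byte for byte.
cat chunk_* > rejoined.csv
cmp input.csv rejoined.csv && echo "identical"
```

Because chunks are filled with whole lines, the chunk sizes are slightly uneven when line lengths vary, matching the note above.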

Cloud data platform data loaders have built-in parallelism, so, for example, it is almost always faster to load two 8 MB files than one 16 MB file.

Parameters

Source bucket name: Name of the S3 bucket containing the source file. Do not include the s3:// prefix or the object path.
Source file name: The name of the source file, including its path, if any.
Target bucket name: Name of the target S3 bucket. Do not include the s3:// prefix or a path.
Target path: Optional target path. Leave blank to create the new files in the root folder of the target bucket.
Target file name prefix: File name prefix for the generated files.
Target file size: Bytes per output file, expressed as an integer followed by a unit K, M, or G (for example, 250M or 50K).
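As a sketch of how a size string such as 250M resolves to a byte count, the helper below uses binary units (K = 1024), which is what GNU split itself uses; whether the shared job interprets the units the same way is an assumption here:

```shell
# Hypothetical helper: convert a size string like 250M or 50K to bytes,
# using binary units (K = 1024), as GNU split does.
to_bytes() {
  case "$1" in
    *K) echo $(( ${1%K} * 1024 )) ;;
    *M) echo $(( ${1%M} * 1024 * 1024 )) ;;
    *G) echo $(( ${1%G} * 1024 * 1024 * 1024 )) ;;
    *)  echo "$1" ;;
  esac
}

to_bytes 250M   # prints 262144000
```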

Warnings

If there is a header line in your CSV file, it will not be repeated in the generated files. We strongly recommend that you do not use this utility on a CSV file with a header line.
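If your source file does have a header line, one common workaround (a sketch, not a feature of the shared job) is to strip the header before splitting, then load the files as headerless data:

```shell
# Sample CSV with a header line (hypothetical data).
printf 'id,name\n1,alice\n2,bob\n' > with_header.csv

# tail -n +2 emits everything from the second line onward,
# producing a headerless file suitable for splitting.
tail -n +2 with_header.csv > no_header.csv
cat no_header.csv
```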

You must make sure the target file size is bigger than the maximum length of any one line in the source file. Lines exceeding the target file size will be split, which will almost certainly cause data loads to fail.
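Before running the job, the longest line in the source file can be measured with a one-line awk pre-check (a sketch; the file name here is hypothetical). The target file size, converted to bytes, must exceed this value:

```shell
# Hypothetical sample file standing in for the real source data.
printf 'short\na much longer line of data\nmid line\n' > sample.csv

# Print the length, in characters, of the longest line in the file.
awk '{ if (length($0) > max) max = length($0) } END { print max }' sample.csv
# prints 26
```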

The splitting operation uses the compute power and memory of your Matillion ETL instance. For very large files, other tasks running in parallel may slow down while this shared job is running.

It is possible in some circumstances for this shared job to end successfully even if it could not process the input file.

Prerequisites

The aws command line utility must be installed on your Matillion ETL instance. If the shared job fails with an error such as line X: aws: command not found, please follow this guide to install the aws command.

This shared job attempts to read and write to S3. Ensure that the EC2 instance credentials attached to your Matillion ETL instance include the privilege to do this. For more information, refer to the “IAM in AWS” section in this article on RBAC in the Cloud.

The shared job uses zcat and split internally. These utilities must be installed on the Matillion ETL instance.
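A quick way to verify all three prerequisites at once is a small check loop (a sketch you could run over SSH on the instance), which reports each required utility as found or missing:

```shell
# Report whether each utility the shared job depends on is on the PATH.
for cmd in aws zcat split; do
  if command -v "$cmd" >/dev/null 2>&1; then
    echo "$cmd: found"
  else
    echo "$cmd: MISSING"
  fi
done
```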


Downloads

Licensed under: Matillion Free Subscription License

Installation Instructions

How to Install a Matillion ETL Shared Job