Author: Matillion
Date Posted: Nov 15, 2023
Last Modified: Nov 16, 2023
Split S3 File
Split a large file in S3 into smaller files.
This utility is for textual, line-oriented data files such as CSV and TSV. It is not suitable for semi-structured or binary formats such as JSON, Avro, ORC, Parquet, or XML.
It splits the input file on line boundaries and creates smaller output files that, together, contain the same data. The input file may optionally be gzipped. Line boundaries in the source file can be either a single LF (Linux/Unix style) or a CRLF (Windows style). If some lines in the input file are very long, the sizes of the generated files will be slightly uneven.
Cloud data platform loaders have built-in parallelism, so, for example, it is almost always faster to load two 8 MB files than one 16 MB file.
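As a rough illustration of the mechanism (not the shared job itself), GNU `split -C` performs exactly this kind of line-boundary splitting. The file names and the 1K chunk size below are invented for the demo:

```shell
# Build a small gzipped, line-oriented input, then split it on line
# boundaries into files of at most 1 KB each. Rejoining the pieces
# reproduces the original data byte for byte.
seq 1 500 > rows.csv
gzip -c rows.csv > rows.csv.gz            # the gzipped "input file"
zcat rows.csv.gz | split -C 1K -d - chunk_
cat chunk_* | cmp -s - rows.csv && echo "same data, smaller files"
```

Because `split -C` only breaks between lines (when lines fit), each `chunk_NN` file is itself a valid, loadable slice of the original data.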
Parameters
Parameter | Description |
---|---|
Source bucket name | Name of the S3 bucket containing the source file. Do not include the s3:// prefix. Do not include the object path |
Source file name | The source file name, including path if any |
Target bucket name | Name of the target S3 bucket. Do not include the s3:// prefix. Do not include a path |
Target path | Optional target path. Leave blank if you want the new files to be created in the root folder of the target bucket |
Target file name prefix | File name prefix for the generated files |
Target file size | Bytes per output file: an integer followed by a unit K, M, or G, e.g. 250M or 50K |
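The parameters above map onto a download-split-upload sequence. Below is a local sketch, assuming GNU `split`, with directories standing in for the S3 buckets; in the real job the transfers would be `aws s3 cp` calls, and every name and size here is a made-up example:

```shell
# Local stand-ins for the job's parameters; every value is an example.
SOURCE_BUCKET=./source-bucket        # Source bucket name
SOURCE_FILE=data/big.csv             # Source file name (with its path)
TARGET_BUCKET=./target-bucket        # Target bucket name
TARGET_PATH=splits                   # Target path
PREFIX=big_part_                     # Target file name prefix
SIZE=1K                              # Target file size

mkdir -p "$SOURCE_BUCKET/data" "$TARGET_BUCKET/$TARGET_PATH"
seq 1 1000 > "$SOURCE_BUCKET/$SOURCE_FILE"   # a sample source file

# Split on line boundaries, at most $SIZE bytes per generated file,
# writing the pieces under the target path with the chosen prefix.
split -C "$SIZE" -d "$SOURCE_BUCKET/$SOURCE_FILE" \
      "$TARGET_BUCKET/$TARGET_PATH/$PREFIX"

ls "$TARGET_BUCKET/$TARGET_PATH"
```

The `-d` flag gives numeric suffixes (`big_part_00`, `big_part_01`, ...), so concatenating the generated files in name order reproduces the source file.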
Warnings
If there is a header line in your CSV file, it will not be repeated: only the first generated file will contain it. We strongly recommend that you do not use this utility on a CSV file with a header line.
You must make sure the target file size is larger than the longest line in the source file. Lines exceeding the target file size will be split mid-line, which will almost certainly cause data loads to fail.
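This warning is easy to reproduce locally, assuming GNU `split`: a single line longer than the target size (1 KB in this invented example) cannot fit in any one output file, so it gets cut mid-line, and a loader would then see malformed rows:

```shell
# One short line followed by one 3000-character line, split with a
# 1 KB cap: the long line is spread across several output files.
printf 'short line\n' > wide.csv
printf '%3000s\n' | tr ' ' 'x' >> wide.csv   # one 3000-character line
split -C 1K -d wide.csv wide_
ls wide_*    # several files, most ending mid-line with no newline
```

The pieces still concatenate back to the original file, but individually they are no longer valid line-oriented data.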
The splitting operation uses the compute power and memory of your Matillion ETL instance. For very large files, other tasks running in parallel may slow down while this shared job is running.
In some circumstances this shared job can end successfully even though it could not process the input file, so check that the expected output files were actually created.
Prerequisites
The `aws` command-line utility must be installed on your Matillion ETL instance. If the shared job fails with an error `line X: aws: command not found`, then please follow this guide to installing the `aws` command.
This shared job attempts to read and write to S3. Ensure that the EC2 instance credentials attached to your Matillion ETL instance include the privilege to do this. For more information, refer to the “IAM in AWS” section in this article on RBAC in the Cloud.
The shared job uses `zcat` and `split` internally. These utilities must be installed on the Matillion ETL instance.
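A quick preflight check, run on the Matillion ETL instance, confirms that the commands this shared job shells out to are on the PATH:

```shell
# Report which of the required command-line tools are installed.
for cmd in aws zcat split; do
    if command -v "$cmd" >/dev/null 2>&1; then
        echo "found: $cmd"
    else
        echo "MISSING: $cmd"
    fi
done
```

`zcat` and `split` ship with the gzip and coreutils packages on virtually every Linux image; `aws` is the tool most often missing on a fresh instance.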
Downloads
Licensed under: Matillion Free Subscription License
- Download METL-aws-1.72.1-split-s3-file.melt
- Platform: AWS
- Target: Any target cloud data platform
- Version: 1.72.1 or higher