Author: Matillion
Date Posted: Nov 17, 2023
Last Modified: Nov 17, 2023
Table Data Profiler
Automated column level data profiling for any table.
This shared job allows you to profile the data in any table, from a Matillion ETL orchestration job.
The information is captured into a table named metl_profiling_data
, which will be created if necessary. Columns include:
- schema_name
- table_name
- collection_date
- metric
- value
Several metrics are collected for every column, depending on the datatype. Built-in metrics include count, count distinct, minimum and maximum values, and the number of missing (NULL) values. For numeric columns the sum, mean, standard deviation and median values are collected per column, along with the upper and lower interquartile ranges.
Run this job every day to build up profiling metrics over time. You will then be able to detect trends. This can be helpful for data quality monitoring, preemptive error detection, and for runtime prediction.
For tables with many rows, a random sample of one million records is evaluated.
You can find much more information in this article on Data Profiling using Matillion Grid Variables.
Parameters
Parameter | Description |
---|---|
Table Name | Name of table to be profiled |
Schema Name | Schema Name, or ${environment_default_schema} to use the environment default |
Value Length | The maximum length that will be recorded in the MIN and MAX metrics. Applies to long string columns |
Downloads
Licensed under: Matillion Free Subscription License
- Download METL-aws-rs-1.66.7-table-data-profiler.melt
- Platform: AWS
- Target: Redshift
- Version: 1.66.7 or higher
- Download METL-az-sy-1.66.7-table-data-profiler.melt
- Platform: Azure
- Target: Synapse
- Version: 1.66.7 or higher
- Download METL-sf-1.66.7-table-data-profiler.melt
- Target: Snowflake
- Version: 1.66.7 or higher