Author: Matillion
Date Posted: Feb 26, 2024
Last Modified: Mar 5, 2024
XML API data in Snowflake using a UDF to convert to JSON
Load, flatten and transform XML data using a Snowflake UDF to convert XML to JSON.
These Data Productivity Cloud pipelines extract XML data from a public RSS feed, load it into Snowflake using a File Format, and then transform and flatten it by converting the XML to JSON using a User Defined Function (UDF). An RSS feed is a simple REST API.
Extract and Load XML data
Start with the XML RSS demo - Extract and Load
pipeline (exactly as used in the XML File Format example). It extracts data from a public AWS RSS blog feed https://aws.amazon.com/blogs/aws/feed/
, and creates a file in S3 storage containing the XML. Before running the pipeline you must edit it to replace your-bucket-name
in the properties of the Data Transfer and the S3 Load components.
The File Format object FF_XML_STRIPOUTER
parses the incoming XML and strips the outermost <rss>
element, leaving a list of <item>
s inside a <channel>
The destination Snowflake table is named stg_aws_rss_feed
. It contains semi-structured data, and so has just one VARIANT column named channel
.
An S3 Load component references the File Format, and loads the data into Snowflake. The entire XML document is loaded into one record, and the Assert verifies that this has successfully happened at runtime.
Every few days the RSS feed will update and the source data will change. So the stg_aws_rss_feed
table is relatively short term. The permanent destination table is named aws_rss_feed
. It has relational columns GUID, TITLE, DESCRIPTION and PUBDATE so a transformation pipeline is needed to feed data into it.
Flatten and Transform XML data
Data transformation - including flattening - happens inside Snowflake. This example takes a more low code approach than the LATERAL FLATTEN from the companion XML File Format example
The first step is to create a Snowflake UDF that can convert XML to JSON. A SQL component in the XML RSS demo - UDF Conversion
orchestration pipeline creates the UDF XML_TO_JSON
to do this.
Next the transformation pipeline XML RSS demo - UDF Transformation
takes the single record from stg_aws_rss_feed
, passes the data through the UDF, and includes an Extract Nested Data component to un-nest the item
array.
Note the following in the Extract Nested Data
component properties:
- It changes the granularity from one record to however many
item
s are in the JSON array (20 in the screenshot above) - The
Columns
property enables you to choose the items of interest from a graphical tree
Now that the semi-structured data has been relationalized, a Table Update component performs the MERGE to the permanent destination table aws_rss_feed
. Every <item
> in the RSS contains a guid
identifier which is used to distinguish between inserts and updates.
Downloads
Licensed under: Matillion Free Subscription License
- Download XML RSS demo - Extract and Load.orch.yaml
- Platform: AWS
- Target: Snowflake
- Download XML RSS demo - UDF Conversion.orch.yaml
- Target: Snowflake
- Download XML RSS demo - UDF Transformation.tran.yaml
- Target: Snowflake
- Download XML_TO_JSON-UDF.sql
- Description: UDF XML_TO_JSON DDL
- Target: Snowflake