Data Productivity Cloud Pipeline

Author: Matillion
Date Posted: Feb 26, 2024
Last Modified: Nov 29, 2024

XML API data in Snowflake using a UDF to convert to JSON

Load, flatten and transform XML data using a Snowflake UDF to convert XML to JSON.

These Data Productivity Cloud pipelines extract XML data from a public RSS feed, load it into Snowflake using a File Format, and then transform and flatten it by converting the XML to JSON using a User Defined Function (UDF). An RSS feed is a simple REST API.

Extract and Load XML data

Start with the XML RSS demo - Extract and Load pipeline (exactly as used in the XML File Format example). It extracts data from a public AWS RSS blog feed https://aws.amazon.com/blogs/aws/feed/, and creates a file in S3 storage containing the XML. Before running the pipeline you must edit it to replace your-bucket-name in the properties of the Data Transfer and the S3 Load components.

Image ofExtract and Load XML API data into Snowflake using a File Format
Extract and Load XML API data into Snowflake using a File Format

The File Format object FF_XML_STRIPOUTER parses the incoming XML and strips the outermost <rss> element, leaving a list of <item>s inside a <channel>

The destination Snowflake table is named stg_aws_rss_feed. It contains semi-structured data, and so has just one VARIANT column named channel.

An S3 Load component references the File Format, and loads the data into Snowflake. The entire XML document is loaded into one record, and the Assert verifies that this has successfully happened at runtime.

Every few days the RSS feed will update and the source data will change. So the stg_aws_rss_feed table is relatively short term. The permanent destination table is named aws_rss_feed. It has relational columns GUID, TITLE, DESCRIPTION and PUBDATE so a transformation pipeline is needed to feed data into it.

Flatten and Transform XML data

Data transformation - including flattening - happens inside Snowflake. This example takes a more low code approach than the LATERAL FLATTEN from the companion XML File Format example

The first step is to create a Snowflake UDF that can convert XML to JSON. A SQL component in the XML RSS demo - UDF Conversion orchestration pipeline creates the UDF XML_TO_JSON to do this.

Image ofConvert XML to JSON using a Snowflake UDF
Convert XML to JSON using a Snowflake UDF

Next the transformation pipeline XML RSS demo - UDF Transformation takes the single record from stg_aws_rss_feed, passes the data through the UDF, and includes an Extract Nested Data component to un-nest the item array.

Image ofUn-nest and flatten a JSON array in Snowflake
Un-nest and flatten a JSON array in Snowflake

Note the following in the Extract Nested Data component properties:

  • It changes the granularity from one record to however many items are in the JSON array (20 in the screenshot above)
  • The Columns property enables you to choose the items of interest from a graphical tree
Image ofExtract Nested Data component
Extract Nested Data component

Now that the semi-structured data has been relationalized, a Table Update component performs the MERGE to the permanent destination table aws_rss_feed. Every <item> in the RSS contains a guid identifier which is used to distinguish between inserts and updates.


Downloads

Licensed under: Matillion Free Subscription License

Installation Instructions

How to Install a Data Productivity Cloud Pipeline