Data Productivity Cloud Pipeline

Author: Matillion
Date Posted: Feb 20, 2024
Last Modified: Nov 29, 2024

Barista demo - using AI to process unstructured data

Use a Large Language Model (LLM) to process unstructured data with this set of Data Productivity Cloud pipelines.

The pipelines demonstrate how to use the Data Productivity Cloud AI Prompt components to convert unstructured text into structured data, to make it ready for analysis. The sample data is a set of imaginary barista coffee reviews.

Image ofUsing AI to process unstructured data - sample record
Using AI to process unstructured data - sample record

Start by opening the orchestration pipeline to load the sample data.

Load the unstructured text data

The first pipeline in the set is Barista Reviews - Extract and Load which copies the unstructured barista review data into a table named STG_BARISTA_REVIEWS in your cloud data platform.

Image ofLoading unstructured text data
Loading unstructured text data

The original data is held in cloud storage. There are many records, and every one includes the identity of the reviewer alongside their comment, plus various other fields such as the review date and location.

Processing text reviews with a Large Language Model (LLM)

The Data Productivity Cloud pipeline Barista Reviews - OpenAI that runs the review comments through your LLM is simple. The only active component in the pipeline is an LLM prompt, which feeds the unstructured text record by record through your LLM. The example in the screenshot below uses OpenAI but you can use alternatives including Azure-managed OpenAI or Amazon Bedrock.

Note that before running this pipeline, OpenAI users must first create a Secret to store your LLM credentials securely. The secret is named OpenAI-Key, which you will find among the OpenAI properties of the component.

Azure OpenAI users must also set the Endpoint and Deployment Name in the component.

Amazon Bedrock users will need to set the AWS region, model category and model name.

Image ofProcessing text reviews with an LLM - OpenAI example
Processing text reviews with an LLM - OpenAI example

Running the pipeline creates a new table named BARISTA_REVIEWS_PROCESSED_OPENAI with new columns calculated by the LLM:

New columnDescription
anger_scoreA score between 0 and 10 depending on the level of anger in the review
anger_reasonOne word describing the reason why the customer is angry
sentimentSentiment analysis, defined here as Positive, Neutral or Negative
anger_summaryA deliberately humorous summary of the customer’s comment
swear_wordsSet to Yes if there were swear words in the review. Otherwise No
product_involvedThe product name involved (if any). This is an example of Named Entity Recognition (NER)
wont_returnSet to Yes if the customer indicates that they will never come again to the shop. Otherwise No

The results from this pipeline will vary from run to run because at normal temperatures the output from an LLM is nondeterministic. This especially applies to the anger_summary text. The sentiment analysis scores should remain broadly consistent, however.

Integrating the LLM output

The new columns added by the LLM are based only on the review comment. A final transformation pipeline Barista Reviews - Transform and Integrate is used to integrate them back together with all the original fields, including the review date and location.

Remember to adjust the name of the AI-processed Barista Reviews component depending which LLM provider you used.

Image ofIntegrating AI-generated content with the original data
Integrating AI-generated content with the original data

Output from this transformation pipeline is another new database table, named BARISTA_REVIEWS_FINAL. It contains all the original columns from the data source, plus the new columns added by the AI. This table is the one to use for downstream analysis and dashboarding. For example:

  • The review date is useful to look for trends over time
  • The location can help spot geospatial variation, or differences between stores

Downloads

Licensed under: Matillion Free Subscription License

Installation Instructions

How to Install a Data Productivity Cloud Pipeline