Data Productivity Cloud Pipeline

Author: Matillion
Date Posted: Jun 17, 2024
Last Modified: Jun 21, 2024

Olympic Athlete Data Analysis

Analyze historical Olympic athlete data with this set of Data Productivity Cloud pipelines.

These pipelines find the most successful athletes, by medal count, in a historical Olympic Games of your choice.

[Image: Olympic Athlete Data Analysis]

Load Historical Olympic Athlete Data

Two datafiles are used:

  • s3://devrel.matillion.com/data/structured/olympics/athlete_events.csv.gz - every athlete in every Olympic event, a gzipped CSV with \r\n record separators and one header line
  • s3://devrel.matillion.com/data/structured/olympics/noc_regions.csv.gz - the National Olympic Committee country codes, a gzipped CSV with \r record separators and one header line
[Image: Load Historical Olympic Athlete Data]

The pipeline Olympic Data - Extract Load loads these two datafiles into database tables olympic_athletes and noc_region.
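Outside the pipeline, the same load can be reproduced with pandas. The sketch below uses hypothetical in-memory sample rows instead of fetching the s3:// objects, but it shows the one detail that matters when reading these files: the athlete file uses \r\n record separators, which pandas handles by default, while the NOC file uses bare \r and needs an explicit line terminator.

```python
import gzip
import io
import pandas as pd

# Hypothetical sample rows mirroring the two datafiles' layouts;
# in the real pipeline the bytes would come from the s3:// objects.
athletes_gz = gzip.compress(
    "ID,Name,NOC,Games,Medal\r\n1,Simone Biles,USA,2016 Summer,Gold\r\n".encode()
)
regions_gz = gzip.compress("NOC,region\rUSA,United States\r".encode())

# \r\n-separated records: pandas' default line handling is enough.
athletes = pd.read_csv(io.BytesIO(athletes_gz), compression="gzip")

# \r-separated records: pass the terminator explicitly.
regions = pd.read_csv(io.BytesIO(regions_gz), compression="gzip", lineterminator="\r")

print(athletes.iloc[0]["Name"], regions.iloc[0]["region"])
```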

Top 10 athletes from the 2016 Summer Games

The pipeline Olympic Data - Transform (in the screenshot at the top) performs the data analysis:

  • Athletes by country joins the two tables on the common NOC column
  • Filter Medal Winners filters only medal winners from the Summer 2016 games
  [Image: Medal Winners Summer 2016]
  • Calculate Medal Points is an expression that allocates 3 points for a gold medal, 2 for a silver and 1 for a bronze
  • The Aggregate component groups the results by athlete, summing their medal points
  • Rank Athletes performs a RANK() windowing function to rank the athletes by total medal points descending
  • Top 10 Athletes filters only the top 10 athletes by rank

The top 10 athletes are saved into a table named top_10_athletes_summer_2016.
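The steps above can be sketched as a standalone pandas script. The sample rows and medal counts below are illustrative, not real results; the column names follow the source files' conventions.

```python
import pandas as pd

# Hypothetical sample data standing in for the two loaded tables.
athletes = pd.DataFrame({
    "Name": ["Simone Biles"] * 5 + ["Michael Phelps"] * 6 + ["Usain Bolt"] * 3,
    "NOC": ["USA"] * 11 + ["JAM"] * 3,
    "Games": "2016 Summer",
    "Medal": ["Gold"] * 4 + ["Bronze"] + ["Gold"] * 5 + ["Silver"] + ["Gold"] * 3,
})
noc_regions = pd.DataFrame({"NOC": ["USA", "JAM"],
                            "region": ["United States", "Jamaica"]})

# Athletes by country: join the two tables on the common NOC column.
joined = athletes.merge(noc_regions, on="NOC")

# Filter Medal Winners: only medallists from the Summer 2016 games.
winners = joined[(joined["Games"] == "2016 Summer") & joined["Medal"].notna()]

# Calculate Medal Points: 3 for gold, 2 for silver, 1 for bronze.
points = winners.assign(
    medal_points=winners["Medal"].map({"Gold": 3, "Silver": 2, "Bronze": 1})
)

# Aggregate: group by athlete, summing medal points.
totals = points.groupby(["Name", "region"], as_index=False)["medal_points"].sum()

# Rank Athletes / Top 10 Athletes: RANK() equivalent, descending by points.
totals["rank"] = totals["medal_points"].rank(method="min", ascending=False).astype(int)
top_10 = totals[totals["rank"] <= 10].sort_values("rank")
print(top_10)
```

In the pipeline each comment above corresponds to one transformation component; the final frame is what lands in top_10_athletes_summer_2016.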

Generate bio using OpenAI, Amazon Bedrock or Cortex Completions

Having saved the top 10 athletes into a new database table, you can use generative AI to create a short bio for each of them. The Large Language Model (LLM) prompt used in these examples is as follows:

You are a reporter on athletic events. Write a short bio on how the named athlete performed in the Summer 2016 Olympic Games
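Before sending the prompt, the pipeline combines it with each athlete's name. The sketch below shows one possible way to assemble a chat-completions-style payload per athlete; the split into system and user roles, and all names in the code, are assumptions for illustration rather than the pipeline's actual internals.

```python
# Assumed split of the documented prompt into system and user roles.
SYSTEM_PROMPT = "You are a reporter on athletic events."
USER_TEMPLATE = ("Write a short bio on how the named athlete performed "
                 "in the Summer 2016 Olympic Games: {name}")

def build_messages(athlete_name: str) -> list[dict]:
    # Chat-completions-shaped payload; the pipeline would send this to
    # OpenAI, Amazon Bedrock, or Cortex Completions.
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_TEMPLATE.format(name=athlete_name)},
    ]

messages = build_messages("Simone Biles")
print(messages[1]["content"])
```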

All Data Productivity Cloud users can choose between providers including OpenAI and Amazon Bedrock:

[Image: Bio Generation with OpenAI or Amazon Bedrock]
  • To use OpenAI, save your API Key as a secret named OpenAI-Key
  • To use Amazon Bedrock, ensure that your Agent has IAM permission for Bedrock, and that you have requested access to the foundation model you intend to use

Snowflake users can additionally choose Cortex Completions for the same purpose, directly within the transformation pipeline:

[Image: Bio Generation with Snowflake Cortex Completions]

Downloads

Licensed under: Matillion Free Subscription License

Installation Instructions

How to Install a Data Productivity Cloud Pipeline