Data Productivity Cloud Pipeline

Author: Matillion
Date Posted: Jun 17, 2024
Last Modified: Jun 21, 2024

Olympic Athlete Data Analysis

Analyze historical Olympic athlete data with this set of Data Productivity Cloud pipelines.

These pipelines find the most successful athletes, by medal count, in a historical Olympic Games of your choice.

[Image: Olympic Athlete Data Analysis]

Load Historical Olympic Athlete Data

Two datafiles are used:

  • s3://devrel.matillion.com/data/structured/olympics/athlete_events.csv.gz - every athlete in every Olympic event, a gzipped CSV with \r\n record separators and one header line
  • s3://devrel.matillion.com/data/structured/olympics/noc_regions.csv.gz - the National Olympic Committee country codes, a gzipped CSV with \r record separators and one header line
[Image: Load Historical Olympic Athlete Data]

The pipeline Olympic Data - Extract Load loads these two datafiles into database tables olympic_athletes and noc_region.
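Outside the pipeline, the same load can be reproduced with pandas. The sketch below uses hypothetical in-memory sample rows instead of fetching the s3:// objects, but it shows the one detail that matters when reading these files: the athlete file uses \r\n record separators, which pandas handles by default, while the NOC file uses bare \r and needs an explicit line terminator.

```python
import gzip
import io
import pandas as pd

# Hypothetical sample rows mirroring the two datafiles' layouts;
# in the real pipeline the bytes would come from the s3:// objects.
athletes_gz = gzip.compress(
    "ID,Name,NOC,Games,Medal\r\n1,Simone Biles,USA,2016 Summer,Gold\r\n".encode()
)
regions_gz = gzip.compress("NOC,region\rUSA,United States\r".encode())

# \r\n-separated records: pandas' default line handling is enough.
athletes = pd.read_csv(io.BytesIO(athletes_gz), compression="gzip")

# \r-separated records: pass the terminator explicitly.
regions = pd.read_csv(io.BytesIO(regions_gz), compression="gzip", lineterminator="\r")

print(athletes.iloc[0]["Name"], regions.iloc[0]["region"])
```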

Top 10 athletes from the 2016 Summer Games

The pipeline Olympic Data - Transform (in the screenshot at the top) performs the data analysis:

  • Athletes by country joins the two tables on the common NOC column
  • Filter Medal Winners filters only medal winners from the Summer 2016 games
  [Image: Medal Winners Summer 2016]
  • Calculate Medal Points is an expression that allocates 3 points for a gold medal, 2 for a silver and 1 for a bronze
  • The Aggregate component groups the results by athlete, summing their medal points
  • Rank Athletes performs a RANK() windowing function to rank the athletes by total medal points descending
  • Top 10 Athletes filters only the top 10 athletes by rank

The top 10 athletes are saved into a table named top_10_athletes_summer_2016.
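The steps above can be sketched as a standalone pandas script. The sample rows and medal counts below are illustrative, not real results; the column names follow the source files' conventions.

```python
import pandas as pd

# Hypothetical sample data standing in for the two loaded tables.
athletes = pd.DataFrame({
    "Name": ["Simone Biles"] * 5 + ["Michael Phelps"] * 6 + ["Usain Bolt"] * 3,
    "NOC": ["USA"] * 11 + ["JAM"] * 3,
    "Games": "2016 Summer",
    "Medal": ["Gold"] * 4 + ["Bronze"] + ["Gold"] * 5 + ["Silver"] + ["Gold"] * 3,
})
noc_regions = pd.DataFrame({"NOC": ["USA", "JAM"],
                            "region": ["United States", "Jamaica"]})

# Athletes by country: join the two tables on the common NOC column.
joined = athletes.merge(noc_regions, on="NOC")

# Filter Medal Winners: only medallists from the Summer 2016 games.
winners = joined[(joined["Games"] == "2016 Summer") & joined["Medal"].notna()]

# Calculate Medal Points: 3 for gold, 2 for silver, 1 for bronze.
points = winners.assign(
    medal_points=winners["Medal"].map({"Gold": 3, "Silver": 2, "Bronze": 1})
)

# Aggregate: group by athlete, summing medal points.
totals = points.groupby(["Name", "region"], as_index=False)["medal_points"].sum()

# Rank Athletes / Top 10 Athletes: RANK() equivalent, descending by points.
totals["rank"] = totals["medal_points"].rank(method="min", ascending=False).astype(int)
top_10 = totals[totals["rank"] <= 10].sort_values("rank")
print(top_10)
```

In the pipeline each comment above corresponds to one transformation component; the final frame is what lands in top_10_athletes_summer_2016.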

Generate bio using OpenAI, Amazon Bedrock or Cortex Completions

Having saved the top 10 athletes into a new database table, you can use generative AI to create a short bio for each of them. The Large Language Model (LLM) prompt used in these examples is as follows:

You are a reporter on athletic events. Write a short bio on how the named athlete performed in the Summer 2016 Olympic Games
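Before sending the prompt, the pipeline combines it with each athlete's name. The sketch below shows one possible way to assemble a chat-completions-style payload per athlete; the split into system and user roles, and all names in the code, are assumptions for illustration rather than the pipeline's actual internals.

```python
# Assumed split of the documented prompt into system and user roles.
SYSTEM_PROMPT = "You are a reporter on athletic events."
USER_TEMPLATE = ("Write a short bio on how the named athlete performed "
                 "in the Summer 2016 Olympic Games: {name}")

def build_messages(athlete_name: str) -> list[dict]:
    # Chat-completions-shaped payload; the pipeline would send this to
    # OpenAI, Amazon Bedrock, or Cortex Completions.
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_TEMPLATE.format(name=athlete_name)},
    ]

messages = build_messages("Simone Biles")
print(messages[1]["content"])
```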

All Data Productivity Cloud users can choose between providers including OpenAI and Amazon Bedrock:

[Image: Bio Generation with OpenAI or Amazon Bedrock]
  • To use OpenAI, save your API Key as a secret named OpenAI-Key
  • To use Amazon Bedrock, ensure that your Agent has IAM permission for Bedrock, and that you have requested access to the foundation model you intend to use

Snowflake users can additionally choose Cortex Completions for the same purpose, directly within the transformation pipeline:

[Image: Bio Generation with Snowflake Cortex Completions]

Downloads

Licensed under: Matillion Free Subscription License

Installation Instructions

How to Install a Data Productivity Cloud Pipeline