Athena Batch Partition Insert

This job type circumvents Athena's limit of 100 partitions per INSERT INTO query. See https://docs.aws.amazon.com/athena/latest/ug/ctas-insert-into.html

It does this by calculating how many partitions in total will be affected by your query.

If there are more than 100 partitions to insert into, it will re-run the query, iterating until all partitions have been inserted into. Each run affects a maximum of 100 partitions.
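Conceptually, the partition count can be derived by counting the distinct combinations of partition-column values your SELECT would produce. A minimal sketch of that idea, assuming hypothetical source table and partition column names:

    -- Hypothetical sketch: count the distinct (year, month, day) combinations
    -- the SELECT would write to. If this count exceeds 100, the query is
    -- re-run in batches of at most 100 partitions each.
    SELECT count(*) AS affected_partitions
    FROM (
        SELECT DISTINCT year, month, day
        FROM source_dataset.source_table
    ) AS t;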

The query you specify here must be an “INSERT INTO…SELECT” query. See: https://docs.aws.amazon.com/athena/latest/ug/insert-into.html and https://docs.aws.amazon.com/athena/latest/ug/ctas-insert-into-etl.html
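For reference, a minimal query of this shape looks like the following (the table and column names here are hypothetical):

    INSERT INTO staging.destination_table
    SELECT
        order_id,
        amount,
        year,
        month,
        day
    FROM raw.source_table;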

This job type allows you to run multiple INSERT INTO…SELECT statements. Note that the same partitioning formula(s) will apply to all the statements you specify. So group statements that share a partitioning scheme together, and use a new “Athena Batch Partition Insert” job for each case where the partitioning scheme differs.
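For example, these two hypothetical statements both partition by year, month and day, so they could be grouped in the same job:

    -- Both statements share the same year/month/day partitioning scheme
    INSERT INTO staging.orders
    SELECT order_id, amount, year, month, day
    FROM raw.orders;

    INSERT INTO staging.refunds
    SELECT refund_id, amount, year, month, day
    FROM raw.refunds;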

“Run Time” is the time of day, in the UTC time zone, when you want the job to run.

“Number of Runs per day” is the number of times per day you want the job to run. This can be anywhere from 1 to 1440 times per day (once per minute is currently the maximum).

Fill in the “Email address” field to receive job run notifications by email. This only works if you’ve signed up for SendGrid and completed the details in the SendGrid section.

“Pipeline Name” groups jobs together to run consecutively and dependently. This means a subsequent job step (eg: Step 2) will only run once the previous step (Step 1) has completed successfully. To change the pipeline that a job belongs to, simply change its Pipeline Name field.

“Region” is the AWS region where you want the query to run.

“Job Name” is the unique name of the job. The name must be unique across all pipelines; it can’t be reused even in a different pipeline.

“Prefix”, “Middle variable” and “Suffix” let you specify where your query files are located in S3.

“Prefix” should be entered once. This is the bucket name together with any “folders” or prefix where all the query files are located.

All files containing the queries you want to run should be stored in the same place, ie: the same bucket and “folder”.

“Middle variable” is where you specify the part that differs in each file name. These differences must be separated with a pipe, ie: “|” (Shift+Backslash on a Windows keyboard).

“Suffix” is the static part that every file ends in.

eg: a hypothetical setup (the bucket, prefix and file names below are illustrative):
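    Prefix:          s3://my-bucket/athena-queries/
    Middle variable: orders|refunds|customers
    Suffix:          .sql

This resolves to three query files:

    s3://my-bucket/athena-queries/orders.sql
    s3://my-bucket/athena-queries/refunds.sql
    s3://my-bucket/athena-queries/customers.sql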

The insert statement must start with “INSERT INTO” followed by the destination dataset, a full stop, and the destination table name.

eg: INSERT INTO staging.destination_table

“Source Dataset” is the name of the Athena dataset housing the source tables you are selecting from.

“Source Table Names” are the tables in the “Source Dataset” you are selecting from.

The source tables are referenced in the SELECT part of your “INSERT INTO…SELECT” queries. See: https://docs.aws.amazon.com/athena/latest/ug/insert-into.html and https://docs.aws.amazon.com/athena/latest/ug/ctas-insert-into-etl.html

Your source tables are used to calculate the number of partitions that will be affected.

“Name of the Partitioning Columns” is the name and formula (if applicable) of the partitioning columns. Each formula must end with the word “AS” followed by the column name.

eg: if partitioning by year, month and day:

NOTE: the columns must be separated with a pipe, ie: the “|” delimiter. See the “|” after “AS year” below, which separates the “year” and “month” formulas:

     substr(cast("date_parse"("substr"(NULLIF(cast(now() AS varchar(10)),''), 1, 10), '%Y-%m-%d') AS varchar(10)),1,4) AS year| 
     substr(cast("date_parse"("substr"(NULLIF(cast(now() AS varchar(10)), ''), 1, 10), '%Y-%m-%d') AS varchar(10)),6,2) AS month|
     substr(cast("date_parse"("substr"(NULLIF(cast(now() AS varchar(10)), ''), 1, 10), '%Y-%m-%d') AS varchar(10)),9,2) AS day

“Athena Query Result Location” is the S3 location where you want the query results saved.

“Athena Database” is the database to use in Athena.
