Multiple Partitioning columns with Athena Batch Partitioning Insert Job

This time let’s create a partitioned table by multiple partitioning columns.

We’ll use f3 and f6. For f6 which is a date string we’ll use a formula to get year, month and day:

Let’s profile the data first:

select F3, F6,
substr(cast(date_parse(substr(NULLIF(cast(F6 AS varchar(10)),
''), 1, 10), '%Y-%m-%d') AS varchar(10)),1,4) AS year ,substr(cast(date_parse(substr(NULLIF(cast(F6 AS varchar(10)), ''), 1, 10), '%Y-%m-%d') AS varchar(10)),6,2) AS month ,substr(cast(date_parse(substr(NULLIF(cast(F6 AS varchar(10)),''), 1, 10), '%Y-%m-%d') AS varchar(10)),9,2) AS day
FROM headersplitcompressed
WHERE f3 = 'AD';

Then we need to create the new table:

CREATE TABLE headermultiplepartitions
WITH ( format = 'Parquet', parquet_compression = 'SNAPPY', partitioned_by = ARRAY['F3','year','month','day'], external_location = 's3://rockstaretltesting1/sample_data/shipping/d0e9bd6148e8f14889980954017b0927/83bdd5c3225a70b48acff63e86c74193/ams/2020/202009291500/headermultiplepartitions/' ) AS
SELECT F1,
F2,
F4,
F5,
F6,
F7,
F8,
F9,
F10,
F11,
F12,
F13,
F14,
F15,
F16,
F17,
F18,
F19,
F20,
F21,
F22,
F23,
F24,
F25,
F26,
F27,
F28,
F29,
F30,
F31,
F32,
F33,
F34,
F3,
substr(cast(date_parse(substr(NULLIF(cast(F6 AS varchar(10)),
''), 1, 10), '%Y-%m-%d') AS varchar(10)),1,4) AS year ,substr(cast(date_parse(substr(NULLIF(cast(F6 AS varchar(10)), ''), 1, 10), '%Y-%m-%d') AS varchar(10)),6,2) AS month ,substr(cast(date_parse(substr(NULLIF(cast(F6 AS varchar(10)),''), 1, 10), '%Y-%m-%d') AS varchar(10)),9,2) AS day
FROM headersplitcompressed
WHERE F3 = 'AD'
and F6 is not null and F6 != ''
and substr(cast(date_parse(substr(NULLIF(cast(F6 AS varchar(10)),
''), 1, 10), '%Y-%m-%d') AS varchar(10)),1,4) = '2020'
and substr(cast(date_parse(substr(NULLIF(cast(F6 AS varchar(10)), ''), 1, 10), '%Y-%m-%d') AS varchar(10)),6,2) = '07'
and substr(cast(date_parse(substr(NULLIF(cast(F6 AS varchar(10)),''), 1, 10), '%Y-%m-%d') AS varchar(10)),9,2) = '27';

Notice we had to use an elaborate where clause to filter the data to not create too many partitions due to the 100 partition limit.

Now if we look in S3, we can see the partitions that Athena created:

Now since the table is created in Athena we can clear the contents in S3 and create the Athena insert into statement:

insert into headermultiplepartitions (F1, F2, F4, F5, F6, F7, F8, F9, F10, F11, F12, F13, F14, F15, F16, F17, F18, F19, F20, F21, F22, F23, F24, F25, F26, F27, F28, F29, F30, F31, F32, F33, F34, F3,year, month,day)
SELECT F1,
F2,
F4,
F5,
F6,
F7,
F8,
F9,
F10,
F11,
F12,
F13,
F14,
F15,
F16,
F17,
F18,
F19,
F20,
F21,
F22,
F23,
F24,
F25,
F26,
F27,
F28,
F29,
F30,
F31,
F32,
F33,
F34,
F3,
substr(cast(date_parse(substr(NULLIF(cast(F6 AS varchar(10)),
''), 1, 10), '%Y-%m-%d') AS varchar(10)),1,4) AS year ,substr(cast(date_parse(substr(NULLIF(cast(F6 AS varchar(10)), ''), 1, 10), '%Y-%m-%d') AS varchar(10)),6,2) AS month ,substr(cast(date_parse(substr(NULLIF(cast(F6 AS varchar(10)),''), 1, 10), '%Y-%m-%d') AS varchar(10)),9,2) AS day
FROM headersplitcompressed

We save the above in a text file and upload to S3.

Then we also need to specify the partitioning columns (and formulas) in a separate file:

eg:

The “Prefix”, “Middle variable” and “Suffix” allow you to specify multiple insert into statements as explained here:

https://birockstar.com/athena-batch-partition-insert/

The “Name of the Partitioning Column(s)” field should refer to the partitioning fields that are common to all the insert statements referenced in the “Prefix”, “Middle variable” and “Suffix” fields.

ie you would have to use the same partitioning schema for all these tables.

The rest of the fields:

We can clear our S3 location before we run the partition insert job by creating and scheduling a delete job:

After it’s run successfully:

We can then schedule our partition insert job:

So we get a failure message we need to troubleshoot:

Using the command line I’d like to see more info:

aws athena get-query-execution –query-execution-id 52e99585-da24-4889-bcc3-5645167119fe

The result:

The command also returns the query that was run:

I clean it up a little in notepad++ first by removing the “\n” and run it in Athena:

I get the same error message:

So this most likely means we have some dirty data in our day date column:

We do some light regular expressions (fortunately Athena’s presto functions come in handy here)

   select * from headersplitcompressed where not regexp_like(F6,'\d\d\d\d-\d\d-\d\d') limit 10; 

And we get the field names which is the one row causing the problem:

So now we need to exclude the row in order to proceed:

So I just drop and recreate the table:

drop table headersplitcompressed;
CREATE TABLE headersplitcompressed
WITH (
format = 'Parquet',
parquet_compression = 'SNAPPY',
external_location = 's3://rockstaretltesting1/athena-queries/Unsaved/2021/07/06/tables2/')
AS SELECT *
FROM headersplit
where f3 != '' and f3 is not null
and f6 != 'estimated_arrival_date';

So now we should have no more issues: