Let’s say we want to update the data in our Athena table.
We know we can’t do this directly using Athena as update and delete statements are not supported.
However we can achieve this indirectly by removing the data in S3 then running an Athena insert query.
Starting off, let’s inspect our table in Athena to check how many rows there are:

In S3:

Now let’s profile our largest partitions:

Suppose we want to remove these partitions. First let’s specify a S3 Delete Objects job. Note the subfolders specified in the “Sub Folder Names” field:


Our log indicates completion:

And we get our email notification:

Notice in S3 after refreshing, we have 18 fewer Objects now. 197 originally, now 179.

Querying the table in Athena also shows the change now:

As expected our total row count has also decreased significantly:

Right so now let’s build our query to insert these partitions:
insert into shipping.headeraltpartitions5 (f1,
f2,
f4,
f5,
f6,
f7,
f8,
f9,
f10,
f11,
f12,
f13,
f14,
f15,
f16,
f17,
f18,
f19,
f20,
f21,
f22,
f23,
f24,
f25,
f26,
f27,
f28,
f29,
f30,
f31,
f32,
f33,
f34,
f3,year, month)
SELECT f1,
f2,
f4,
f5,
f6,
f7,
f8,
f9,
f10,
f11,
f12,
f13,
f14,
f15,
f16,
f17,
f18,
f19,
f20,
f21,
f22,
f23,
f24,
f25,
f26,
f27,
f28,
f29,
f30,
f31,
f32,
f33,
f34,
f3,
substr(cast(date_parse(substr(NULLIF(cast(f6 AS varchar(10)),
''), 1, 10), '%Y-%m-%d') AS varchar(10)),1,4) AS year ,substr(cast(date_parse(substr(NULLIF(cast(f6 AS varchar(10)), ''), 1, 10), '%Y-%m-%d') AS varchar(10)),6,2) AS month
FROM headersplitcompressed a
where not exists (select 1 from shipping.headeraltpartitions5 b where a.f3 = b.f3 and substr(cast(date_parse(substr(NULLIF(cast(f6 AS varchar(10)),
''), 1, 10), '%Y-%m-%d') AS varchar(10)),1,4) = b.year and substr(cast(date_parse(substr(NULLIF(cast(f6 AS varchar(10)), ''), 1, 10), '%Y-%m-%d') AS varchar(10)),6,2) = b.month and a.f6= b.f6)
The key part is the where clause:
where not exists (select 1 from shipping.headeraltpartitions5 b where a.f3 = b.f3 and substr(cast(date_parse(substr(NULLIF(cast(f6 AS varchar(10)),
''), 1, 10), '%Y-%m-%d') AS varchar(10)),1,4) = b.year and substr(cast(date_parse(substr(NULLIF(cast(f6 AS varchar(10)), ''), 1, 10), '%Y-%m-%d') AS varchar(10)),6,2) = b.month and a.f6= b.f6)
This should insert only missing rows without duplicating data.
Now we use a Batch Partition Job Type:


The contents of the file referenced in the “Name of the Partitioning Columns field”. Note the pipe “|” delimited fields:
f3 as f3|
substr(cast(date_parse(substr(NULLIF(cast(f6 AS varchar(10)),
''), 1, 10), '%Y-%m-%d') AS varchar(10)),1,4) AS year |substr(cast(date_parse(substr(NULLIF(cast(f6 AS varchar(10)), ''), 1, 10), '%Y-%m-%d') AS varchar(10)),6,2) AS month
The log entries while the job is running:

On completion:

If we re-profile the table, we are back up to the original row count:

