Say you want to change a table ie recreate it or re-populate the underlying data.
Since Athena doesn’t provide the ability to delete rows, you’re options are to recreate the table using an Athena statement or stick with the original table and perform an S3 copy or delete operation:
So let’s look at just deleting the data in S3:
So our S3 data currently looks like this:
We have two options with our S3 Delete job.
Option 1 is to leave the subfolders field blank:
The contents of the prefix above will be cleared by one thread and this could be slower.
The faster option is to populate the “Sub Folder Names” above. This will each multiple threads in an asynchronous operation which could be faster. (This can also be completed with a smaller EC2 instance type)
First we use the command line to list the subfolders:
Highlight the results then copy and paste to Notepad ++
We can then perform a find and replace operation eg:
We then get:
Then a nice trick is to go to the end of the first line with your cursor and holding down alt and shift then press the down key:
This allows you to edit all the rows at once as if they were one row. We can then add a pipe “|” eg:
We can then copy and paste the result into the “Sub Folder Names” field:
Notice the space in front of each subfolder name. To eliminate, copy and paste the row onto a new sheet then find and replace the space with nothing eg:
Bear in mind, your sub folder list could be too long for the field length limit. In that case you would need to create additional S3 Delete jobs for the remaining items.
This is the case with this example as the field only allows me to enter 32 items which takes us from “AD” to “BY”
So before the replace of the space:
After the space replace:
We need to also remove the trailing pipe delimiter:
Using the nifty trick holding down alt and shift then using the down arrow key this is easy to achieve:
So now our first step looks like:
Let’s run it to test and see how it operates:
So it ran fine. Note the different thread numbers:
This means the subfolders were deleted using different threads.
We also received our notification email since we inserted a receiving email address as well as signed up for sendgrid and populated our mail settings:
Now we can check in S3:
We need to refresh. After refresh we see the subfolders have been correctly removed ie “AD” to “BY” leaving “BZ” as the new first folder:
That was a very quick operation. Let’s see how it runs if we leave out the “Sub Folder Name” field:
So it only deleted from “f3=BZ” to “f3=GY”:
This is caused by running too small an EC2 instance. ie you need to choose a bigger instance with more ram.
Let’s run it again.
Since I have 197 subfolders I would need to create a total of 6 jobs which could all be in the same pipeline or separate pipelines for some parallel processing. It’s a good idea to experiment with which approach would work best for your data. If you were to use the same pipeline, you could either have each of the six jobs as subsequent steps or you could put in the same step number eg:
Option 1, different job numbers all in the same pipeline:
We have to use unique job names:
The logs look good:
And the final test is to inspect S3 which is now clear:
Option 2: Same job step number in the same pipeline: