How to Update Athena tables

First we’ll create a new database to house the final tables

Then we’ll change to the newly created database:

And we’ll insert from our staging database into a table in the final database:

First we’ll just select everything:

Notice how without the partition pruning we scan 6.21 MB:

Whereas when we include the partition filters in the query, the query is faster and scans less data. This confirms everything is working as expected:

Now we can create the new table in the presentation dataset:

The snag with this approach is that Athena automatically chooses the location for us. Notice the s3 location of the table:

A better way is to use a proper create table statement where we specify the location in s3 of the underlying data:

Notice how we can also specify the partitioning scheme we want to use. Much better!

And this is how the underlying data for our table looks in s3:

Now to the crux: let’s say we wanted to update the value of one of the columns. And just to make it interesting, we’ll also make it a partitioning column!

Now an Update is really just an insert followed by a delete.

So all we’re doing here is taking all data from the 30th day partition and inserting it into the 29th day:

So first the insert:

We’ll save this as an external file:

And we’ll upload this to s3:

Now that we have our query in s3:

We can reference it in our rockstarETL job:

If you copy the location from the console:

It has a tendency to copy spaces, (which you’ll need to remove) and insert the forward slash which was left out after the bucket name:

After removing spaces and inserting the forwardslash:

So the final insert looks like:

And then the delete:

So the final pipeline with both the insert then the delete (completing our update) looks like:

So now let’s enable the pipeline by entering a non-zero value into “Number of Runs per day” for Step 1 (Zero disables the job and by extension the pipeline)

Let’s also ssh into the ec2 instance and keep an eye on the logs as it runs:

This is my favourite way to monitor the jobs as they run:

The command is:

 sudo tail -f /var/log/tomcat/catalina.out

We can also log into Mariadb (mysql) and check on things:

We can confirm when the pipeline will run:

This view will be empty until the appointed time:

After run time:

We can see the query being picked up in the logs:

And s3 before, note the 29th partition is not there yet:

And after. Notice the 29th partition now exists, while the 30th has been removed:

We also see our confirmation e-mails one for Step 1:

And one for Step 2:

And the logs for the delete step:

We can also check the JobRunStatus table:

using this query:

select * from JobRunStatus order by 1 desc limit 15;

And to disable our pipeline, set Number of Runs to Zero:

And that’s it!

Leave a Comment