Copying S3 data underlying an Athena table

This approach uses an S3 Copy Objects job type to copy the S3 objects to a new location. We then create an Athena table pointing to the data in the new location, and profile the data with Athena queries to verify it matches the original.

Leaving “Sub Folder Names” blank means all copying happens in a single thread. When the Sub Folder Names are populated, a multithreaded process is used instead: each subfolder is copied by a separate thread asynchronously. This will likely be faster, but it takes a little more work to set up.

There were no errors in our log table entries and the operation took 6.5 minutes:

An alternative is to specify the subfolders explicitly, which may speed things up:

Running it this way completes in 5 minutes, 1.5 minutes faster, but takes a little more effort to set up, so it depends on your use case. (A bonus: you can selectively choose which subfolders to copy.)

Back to S3 to see the result of the copy job:

After examining S3, we now have all 197 subfolders in our destination:

This agrees with our source which also has 197 subfolders:

So let’s now create our new Athena table and compare it to the original:

We can use this command to generate the create statement and then modify it:
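In Athena the command that generates the DDL for an existing table is SHOW CREATE TABLE. A sketch of it (the source table name headeraltpartitions is assumed here; substitute your actual table):

```sql
-- Emits the CREATE EXTERNAL TABLE statement for the existing table,
-- which we can then edit (new table name, new LOCATION) before running.
SHOW CREATE TABLE headeraltpartitions;
```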

So here is our modified create statement:

CREATE EXTERNAL TABLE headeraltpartitions2 (
  f1 string COMMENT '',
  f2 string COMMENT '',
  f4 string COMMENT '',
  f5 string COMMENT '',
  f6 string COMMENT '',
  f7 string COMMENT '',
  f8 string COMMENT '',
  f9 string COMMENT '',
  f10 string COMMENT '',
  f11 string COMMENT '',
  f12 string COMMENT '',
  f13 string COMMENT '',
  f14 string COMMENT '',
  f15 string COMMENT '',
  f16 string COMMENT '',
  f17 string COMMENT '',
  f18 string COMMENT '',
  f19 string COMMENT '',
  f20 string COMMENT '',
  f21 string COMMENT '',
  f22 string COMMENT '',
  f23 string COMMENT '',
  f24 string COMMENT '',
  f25 string COMMENT '',
  f26 string COMMENT '',
  f27 string COMMENT '',
  f28 string COMMENT '',
  f29 string COMMENT '',
  f30 string COMMENT '',
  f31 string COMMENT '',
  f32 string COMMENT '',
  f33 string COMMENT '',
  f34 string COMMENT '')
PARTITIONED BY (
  f3 string COMMENT '',
  year varchar(10) COMMENT '',
  month varchar(10) COMMENT '')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://rockstaretltesting1/sample_data/shipping/d0e9bd6148e8f14889980954017b0927/83bdd5c3225a70b48acff63e86c74193/ams/2020/202009291500/headeraltpartitions2/'
TBLPROPERTIES (
  'has_encrypted_data'='false',
  'parquet.compression'='SNAPPY')

So we’ve created our new Athena table:

Note the above message about having to load the partitions.

To confirm its validity: zero records are returned if we try to query the table immediately, without first loading the partitions:
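For example, a simple query like the following (a sketch; the exact query used may differ) returns no rows at this point, because Athena's metastore has no partitions registered yet:

```sql
-- Returns no rows until the partitions are loaded, even though
-- the underlying Parquet files already exist in S3.
SELECT * FROM headeraltpartitions2 LIMIT 10;
```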

So now we run the MSCK REPAIR TABLE statement:
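For our new table the statement is:

```sql
-- Scans the table's S3 LOCATION and registers any missing
-- partitions (f3/year/month) in the metastore.
MSCK REPAIR TABLE headeraltpartitions2;
```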

This can take a few minutes:

Now if we run a query, we get a result:

Now we can compare this new table to the original:

First basic check is to see how many rows there are in the two tables:
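A sketch of that check, assuming the original table is named headeraltpartitions:

```sql
-- Side-by-side row counts; the two numbers should match.
SELECT 'original' AS tbl, COUNT(*) AS row_count FROM headeraltpartitions
UNION ALL
SELECT 'copy', COUNT(*) FROM headeraltpartitions2;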

So our initial checks of row counts agree.

Let’s perform one more check, this time with the original table first:

Next our new table:
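One reasonable way to run such a check (the actual queries in the screenshots may differ; the original table name headeraltpartitions is assumed) is to compare per-partition row counts, so an incompletely copied partition shows up as a mismatch:

```sql
-- Hypothetical comparison: returns only partitions where the
-- per-partition counts differ, so an empty result means agreement.
SELECT COALESCE(a.year, b.year) AS year,
       COALESCE(a.month, b.month) AS month,
       a.cnt AS original_cnt,
       b.cnt AS copy_cnt
FROM (SELECT year, month, COUNT(*) AS cnt
      FROM headeraltpartitions GROUP BY year, month) a
FULL OUTER JOIN
     (SELECT year, month, COUNT(*) AS cnt
      FROM headeraltpartitions2 GROUP BY year, month) b
  ON a.year = b.year AND a.month = b.month
WHERE a.cnt IS DISTINCT FROM b.cnt;
```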

So at a cursory glance, our tables agree.