This approach uses an S3 Copy Objects job type to copy the S3 objects to a new location. We'll then create an Athena table pointing to the data in the new location, and profile the data with Athena queries to ensure it agrees with the original.

Leaving “Sub Folder Names” blank means all of the copying happens in a single thread. Only when Sub Folder Names is populated is a multithreaded process used, in which each subfolder is copied asynchronously by a separate thread. This is likely to be faster, but it takes a little more work to set up.
There were no errors in our log table entries and the operation took 6.5 minutes:

An alternative is to specify the subfolders, which may speed things up:

Run this way, the job completes in 5 minutes, which is 1.5 minutes faster, at the cost of a little more setup effort. So it depends on your use case. (A bonus is that you can choose which subfolders to copy.)

Back in S3 to examine the result of the copy job, we can see that we now have all 197 subfolders in our destination:

This agrees with our source, which also has 197 subfolders:

So let's now create our new Athena table and compare it to the original:
We can use this command to generate the create statement and then modify it:
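(Presumably this is Athena's SHOW CREATE TABLE statement run against the source table; the table name headeraltpartitions below is an assumption for illustration.)

-- Generate the DDL for the existing source table (name assumed)
SHOW CREATE TABLE headeraltpartitions;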

So here is our modified create statement:
CREATE EXTERNAL TABLE headeraltpartitions2 (
  f1 string COMMENT '',
  f2 string COMMENT '',
  f4 string COMMENT '',
  f5 string COMMENT '',
  f6 string COMMENT '',
  f7 string COMMENT '',
  f8 string COMMENT '',
  f9 string COMMENT '',
  f10 string COMMENT '',
  f11 string COMMENT '',
  f12 string COMMENT '',
  f13 string COMMENT '',
  f14 string COMMENT '',
  f15 string COMMENT '',
  f16 string COMMENT '',
  f17 string COMMENT '',
  f18 string COMMENT '',
  f19 string COMMENT '',
  f20 string COMMENT '',
  f21 string COMMENT '',
  f22 string COMMENT '',
  f23 string COMMENT '',
  f24 string COMMENT '',
  f25 string COMMENT '',
  f26 string COMMENT '',
  f27 string COMMENT '',
  f28 string COMMENT '',
  f29 string COMMENT '',
  f30 string COMMENT '',
  f31 string COMMENT '',
  f32 string COMMENT '',
  f33 string COMMENT '',
  f34 string COMMENT '')
PARTITIONED BY (
  f3 string COMMENT '',
  year varchar(10) COMMENT '',
  month varchar(10) COMMENT '')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://rockstaretltesting1/sample_data/shipping/d0e9bd6148e8f14889980954017b0927/83bdd5c3225a70b48acff63e86c74193/ams/2020/202009291500/headeraltpartitions2/'
TBLPROPERTIES (
  'has_encrypted_data'='false',
  'parquet.compression'='SNAPPY')
So we’ve created our new Athena table:

Note the above message about having to load the partitions.
To confirm this, zero records are returned if we immediately query the table without first loading the partitions:
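As an illustration (not necessarily the exact query from the screenshot), a simple probe like the following returns zero rows at this point:

-- Returns no rows until the partitions have been loaded
SELECT * FROM headeraltpartitions2 LIMIT 10;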

So now we run the MSCK REPAIR TABLE statement:
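Using the table name from the DDL above:

MSCK REPAIR TABLE headeraltpartitions2;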

This can take a few minutes:



Now if we run a query, we get a result:
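Re-running the illustrative probe from above now returns data:

SELECT * FROM headeraltpartitions2 LIMIT 10;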

Now we can compare this new table to the original:
The first basic check is to see how many rows there are in the two tables:
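These are straightforward counts; the original table name headeraltpartitions is an assumption:

-- New table
SELECT COUNT(*) FROM headeraltpartitions2;

-- Original table (name assumed)
SELECT COUNT(*) FROM headeraltpartitions;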


So our initial checks of row counts agree.
Let’s perform one more check, this time with the original table first:
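The check itself appears in the screenshots; a typical choice is a per-partition row count, sketched here against the assumed original table name:

SELECT f3, year, month, COUNT(*) AS row_count
FROM headeraltpartitions  -- original table name assumed
GROUP BY f3, year, month
ORDER BY f3, year, month;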

Next our new table:

So at a cursory glance, our tables agree.