You are currently viewing Querying AWS S3 Inventory with AWS Athena

Querying AWS S3 Inventory with AWS Athena

Spread the love

Querying AWS S3 Inventory with AWS Athena

In this blog post, I’ll be discussing how to use AWS Athena to query AWS S3 Inventory.

AWS S3 Inventory is a feature that allows you to generate reports about your S3 objects and their metadata. These reports can be delivered to an S3 bucket, and can be useful for tracking changes to your objects, or for auditing purposes.

AWS Athena is a serverless query service that makes it easy to analyse data stored in Amazon S3 using SQL. By using Athena to query S3 Inventory, you can quickly and easily get insights into your S3 objects, without the need to set up a separate data warehouse or ETL process.

In this blog post, I’ll show you how to set up an S3 Inventory report, and how to use Athena to query the report to get information about your S3 objects. We’ll also discuss some of the benefits and potential use cases for using Athena to query S3 Inventory.

I hope you’ll find this blog post helpful! If you have any questions, feel free to leave a comment or send me a message.

Setting up an S3 Inventory Report

Using the AWS Management Console

To create an S3 Inventory file using the AWS Management Console:

  1. Navigate to the S3 Management Console.
  2. Select the bucket for which you want to create an inventory file.
  3. Click the Management tab, and then click the Inventory button.
  4. On the Inventory page, click the Create Inventory button.
  5. On the Create Inventory page, enter a name for the inventory file and select the objects that you want to include in the inventory. You can also specify the format of the inventory file (e.g. CSV, ORC, or Parquet) and any optional metadata that you want to include. We will use CSV in this example.
  6. Choose a destination S3 bucket for the inventory file, and then click the Create button to create the inventory.

Using the AWS CLI

To create an S3 Inventory file using the AWS CLI, use the aws s3api create-inventory-configuration command: 

aws s3api create-inventory-configuration \

– bucket my-bucket \
– inventory-configuration file:inventory-config.json

Replace my-bucket with the name of the bucket for which you want to create an inventory file, and inventory-config.json with the path to a file containing the inventory configuration in JSON format.

Here is an example of the JSON configuration file:

“Destination”: {

“S3BucketDestination”: {
“Bucket”: “inventory-bucket”,
“Format”: “CSV”,
“Prefix”: “s3-inventory/”
}
},
“IsEnabled”: true,
“IncludedObjectVersions”: “Current”,
“Id”: “inventory-id”,
“OptionalFields”: [
“Size”,
“LastModifiedDate”,
“StorageClass”,
“ETag”
]
}

This configuration file specifies that the inventory file should be delivered to the inventory-bucket bucket in CSV format, with the prefix s3-inventory/. It also specifies that the inventory should include the current version of each object, and should include the SizeLastModifiedDateStorageClass, and ETag fields.

Inventory files can take some time to generate. The official line at the time of writing is that your first report can take up to 24hrs to appear in the destination bucket. In practice, I’ve experienced this to take only a few hours.

Querying AWS Inventory Report with Athena

Now we have our inventory file, we need to create an Athena table that maps to the S3 Inventory file. To do this, you can use the CREATE EXTERNAL TABLE statement in Athena, specifying the location of the S3 Inventory file and the format of the file (e.g. CSV)

CREATE EXTERNAL TABLE inventory_table (
`Bucket` string,
`Key` string,
`Size` bigint,
`LastModifiedDate` timestamp,
`StorageClass` string,
`ETag` string
)
ROW FORMAT SERDE ‘org.apache.hadoop.hive.serde2.OpenCSVSerde’
WITH SERDEPROPERTIES (
‘separatorChar’ = ‘,’,
‘quoteChar’ = ‘\”‘
)
LOCATION ‘s3://my-inventory-bucket/s3-inventory/’

Next, run a SELECT query on the Athena table to retrieve the data from the S3 Inventory file. For example:

SELECT * FROM inventory_table WHERE LastModifiedDate BETWEEN ‘2022-10-01’ AND ‘2022-12-01’

This query will return all rows from the S3 Inventory file where the LastModifiedDate field is between October 1 and December 1, 2022.

Optionally, you can save the query results to a new table or to a file in S3 by using the CREATE TABLE AS or INSERT INTO statements in Athena.

CREATE TABLE inventory_results AS
SELECT * FROM inventory_table WHERE LastModifiedDate BETWEEN ‘2022-10-01’ AND ‘2022-12-01’

INSERT INTO TABLE inventory_results
SELECT * FROM inventory_table WHERE LastModifiedDate BETWEEN '2022-10-01' AND '2022-12-01'
 

Use-cases

Ok great, so why is this useful? I stumbled across the use of Inventory Reports and querying them with Athena recently when trying to find an efficient way to iterate through almost a billion objects to sort and extract the key for any object updated in the last 90 days, so that the resulting objects could be migrated to a blob storage service in another cloud provider. Some further potential use cases could include:

Auditing and compliance: S3 Inventory can be used to track changes to objects in your S3 bucket, including when objects are added, deleted, or modified. By using Athena to query S3 Inventory, you can easily retrieve this information and use it for auditing and compliance purposes.

Data analysis: By using Athena to query S3 Inventory, you can quickly and easily get insights into your S3 data, such as the number of objects in a bucket, the size of the objects, and the storage class of the objects. This can be useful for data analysis and data warehousing purposes.

Data archiving: If you need to archive data from an S3 bucket, you can use S3 Inventory to generate a list of all the objects in the bucket, and then use Athena to filter the list to include only the objects that need to be archived. You can then use this list to perform the data archiving process.

Wrapping up

In conclusion, AWS Athena is a powerful tool for querying S3 Inventory and getting insights into your S3 data. By setting up an S3 Inventory file and creating an Athena table that maps to the file, you can easily run SQL queries on your S3 data and get the answers you need. Whether you’re using S3 Inventory for auditing and compliance, data analysis, data migration, or data archiving, Athena can help you get the most out of your S3 data.

I hope you found this blog post helpful! If you have any questions or comments, feel free to leave them in the comments section below. Thank you for reading, and happy querying!

Leave a Reply