S3 Server Side Logging query — Part 1 of N

Steven Moy
3 min read · Apr 22, 2022

tl;dr If you want to query a day's worth of your S3 server-side logging data and Athena aborts because the underlying S3 bucket throttles reads, use SymlinkTextInputFormat.

S3 has become a fundamental building block in many AWS deployments. When you run into unexpected issues with S3, you need to analyze the bucket access logs. S3 has built-in server-side access logging that records each web request to the S3 bucket. S3 server-side logging uses the pattern s3://<log-bucket>/<optional-prefix>/YYYY-MM-DD-HH-MM-SS-HASH for the log objects. AWS even has a premium support article that explains how to use Athena to query the log bucket. However, it specifically calls out setting a lifecycle policy on the access log bucket, because Athena still behaves badly when the source data consists of many tiny objects. (My guess: Athena is so scaled out that each object becomes a parallel task, and S3 server-side logging alone can easily produce millions of objects.)
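
For concreteness, each delivered log object is a single flat key directly under that prefix. With the date filled in for the day used later in this post, one object would look roughly like this (the timestamp is illustrative and HASH stays a placeholder):

s3://<log-bucket>/<optional-prefix>/2022-04-22-10-15-32-HASH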

Athena power users understand partitions. Partitioning lets the query engine skip objects it does not need to read, based on the condition clause. However, Athena's implementation dislikes partition locations that do not end in a “/”. If you try to declare the prefix s3://<log-bucket>/<optional-prefix>/YYYY-MM-DD-HH as a partition location, Athena will automatically add a “/” at the end, and of course the S3 prefix is no longer correct.
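
To make that failure mode concrete, here is a hedged sketch; the partitioned table mybucket_logs_partitioned and its dt partition key are hypothetical and not part of the setup below:

ALTER TABLE mybucket_logs_partitioned ADD PARTITION (dt = '2022-04-22-10')
LOCATION 's3://<log-bucket>/<optional-prefix>/2022-04-22-10';
-- Athena stores the location as 's3://<log-bucket>/<optional-prefix>/2022-04-22-10/'
-- (note the appended "/"), which no longer matches flat log keys such as
-- <optional-prefix>/2022-04-22-10-15-32-HASH, so the partition reads nothing.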

You can try to write a script that just copies the log objects you care about into a different location for analysis. However, there is a faster way: teach Athena exactly which objects it needs to read instead of having it list a prefix. Enter SymlinkTextInputFormat. This input format dates back to the Apache Hive days; instead of reading the data files under a location directly, it reads a manifest whose lines are URIs and redirects the reader to those locations. If we generate a manifest that lists the exact log objects, we can have Athena analyze them in place, without copying a day's worth of log objects before analysis.

export SOURCE_BUCKET=smoy-s3-server-side-logging-dev-us-west-2
export YYYY_MM_DD=2022-04-22
export PREFIX=logs/smoy-data-swamp-dev-us-west-2/
# Generate the manifest locally
aws s3api list-objects-v2 --bucket $SOURCE_BUCKET --prefix ${PREFIX}${YYYY_MM_DD} --output=text --query 'Contents[*].[Key]' | awk -v s3_prefix="s3://${SOURCE_BUCKET}" '{print s3_prefix"/"$1}' > ${YYYY_MM_DD}.txt
# You can examine the manifest locally to see if it's sane
# Copy the manifest to an S3 bucket for Athena to use
aws s3 cp ${YYYY_MM_DD}.txt s3://smoy-data-swamp-dev-us-west-2/symlink-test/${YYYY_MM_DD}/
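
If the manifest looks sane, every line should be a full S3 URI pointing at exactly one log object, along these lines (the timestamps and hashes are illustrative):

s3://smoy-s3-server-side-logging-dev-us-west-2/logs/smoy-data-swamp-dev-us-west-2/2022-04-22-00-15-32-HASH
s3://smoy-s3-server-side-logging-dev-us-west-2/logs/smoy-data-swamp-dev-us-west-2/2022-04-22-01-42-07-HASH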

Now I will create an Athena table to analyze the logs:

CREATE EXTERNAL TABLE `mybucket_logs`(
  `bucketowner` STRING,
  `bucket_name` STRING,
  `requestdatetime` STRING,
  `remoteip` STRING,
  `requester` STRING,
  `requestid` STRING,
  `operation` STRING,
  `key` STRING,
  `request_uri` STRING,
  `httpstatus` STRING,
  `errorcode` STRING,
  `bytessent` BIGINT,
  `objectsize` BIGINT,
  `totaltime` STRING,
  `turnaroundtime` STRING,
  `referrer` STRING,
  `useragent` STRING,
  `versionid` STRING,
  `hostid` STRING,
  `sigv` STRING,
  `ciphersuite` STRING,
  `authtype` STRING,
  `endpoint` STRING,
  `tlsversion` STRING)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex'='([^ ]*) ([^ ]*) \\[(.*?)\\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\"|-) (-|[0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\"|-) ([^ ]*)(?: ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*))?.*$')
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://smoy-data-swamp-dev-us-west-2/symlink-test/2022-04-22/'

This statement is very similar to the one in the AWS premium support article; I simply use ‘org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat’ instead of ‘org.apache.hadoop.mapred.TextInputFormat’ as the input format.
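
With the table in place, ordinary Athena SQL runs against only the objects listed in the manifest. As a quick sanity check (this particular aggregation is just an example, not part of the original walkthrough), you can break requests down by operation and HTTP status:

SELECT operation, httpstatus, count(*) AS requests
FROM mybucket_logs
GROUP BY operation, httpstatus
ORDER BY requests DESC;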

In a way, I am pretty annoyed that S3 server-side access logs do not have a better out-of-the-box query experience. In my previous work, I wrote a fairly custom solution for another company. However, many shops simply don't have the luxury of setting up additional components to analyze S3 logs. If I built a software-as-a-service that made your S3 server-side logs query like a groomed data lake, would you use it? Let me know. Cheers.
