ablog

Notes from a clumsy, restless engineer

Counting the API calls issued when accessing S3 from EMR

I checked whether reading a single S3 object with PySpark on EMR issues one API call or several.

Capturing API calls with CloudTrail

$ export LANG=C
$ export TZ=UTC
$ export AWS_DEFAULT_REGION=ap-northeast-1
$ aws s3api create-bucket --bucket az-s3-trail-log \
	--create-bucket-configuration LocationConstraint=ap-northeast-1
$ cat <<EOF > policy.json
{
     "Version": "2012-10-17",
     "Statement": [
          {
               "Sid": "AWSCloudTrailAclCheck20180211",
               "Effect": "Allow",
               "Principal": {
                    "Service": "cloudtrail.amazonaws.com"
               },
               "Action": "s3:GetBucketAcl",
               "Resource": "arn:aws:s3:::az-s3-trail-log"
          },
          {
               "Sid": "AWSCloudTrailRead20180211",
               "Effect": "Allow",
               "Principal": {
                    "Service": "cloudtrail.amazonaws.com"
               },
               "Action": "s3:PutObject",
               "Resource": "arn:aws:s3:::az-s3-trail-log/*",
               "Condition": {
                    "StringEquals": {
                         "s3:x-amz-acl": "bucket-owner-full-control"
                    }
               }
          }
     ]
}
EOF
$ aws s3api put-bucket-policy --bucket az-s3-trail-log --policy file://policy.json
$ aws s3api create-bucket --bucket az-s3-trail-test \
	--create-bucket-configuration LocationConstraint=ap-northeast-1
  • Create the trail
$ aws cloudtrail create-trail --name s3-trail --s3-bucket-name az-s3-trail-log
$ cat <<EOF >event_selector.json
[
    {
        "ReadWriteType": "All",
        "IncludeManagementEvents": false,
        "DataResources": [
            {
                "Type": "AWS::S3::Object",
                "Values": [
                    "arn:aws:s3:::az-s3-trail-test/"
                ]
            }
        ]
    }
]
EOF
$ aws cloudtrail put-event-selectors --trail-name s3-trail \
	--event-selectors file://event_selector.json
{
    "EventSelectors": [
        {
            "IncludeManagementEvents": false,
            "DataResources": [
                {
                    "Values": [
                        "arn:aws:s3:::az-s3-trail-test/"
                    ],
                    "Type": "AWS::S3::Object"
                }
            ],
            "ReadWriteType": "All"
        }
    ],
    "TrailARN": "arn:aws:cloudtrail:ap-northeast-1:**********:trail/s3-trail"
}
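The event selector above can also be generated rather than written by hand, which helps when several buckets need object-level logging. The following is a minimal sketch; the function name s3_data_event_selectors is hypothetical and not part of the AWS CLI or any SDK. It only builds the same JSON structure that is passed to put-event-selectors.

```python
import json


def s3_data_event_selectors(buckets, read_write_type="All"):
    """Build an event-selector list that logs S3 object-level (data) events.

    Hypothetical helper for illustration; it mirrors event_selector.json above.
    """
    if read_write_type not in ("All", "ReadOnly", "WriteOnly"):
        raise ValueError("read_write_type must be All, ReadOnly or WriteOnly")
    return [
        {
            "ReadWriteType": read_write_type,
            "IncludeManagementEvents": False,
            "DataResources": [
                {
                    "Type": "AWS::S3::Object",
                    # A trailing slash after the bucket name selects every object in it.
                    "Values": [f"arn:aws:s3:::{bucket}/" for bucket in buckets],
                }
            ],
        }
    ]


# Prints JSON equivalent to event_selector.json for the test bucket.
print(json.dumps(s3_data_event_selectors(["az-s3-trail-test"]), indent=4))
```

Redirecting the output to event_selector.json gives a file that can be passed to put-event-selectors as in the command above.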
  • Start logging
$ aws cloudtrail start-logging --name arn:aws:cloudtrail:ap-northeast-1:**********:trail/s3-trail

Accessing S3 from EMR

  • Access a file on S3 from PySpark
$ perl -le 'print for 1..100000000' > number.txt
$ aws s3 cp number.txt s3://az-s3-trail-test/
$ pyspark
>>> rdd = sc.textFile("s3://az-s3-trail-test/number.txt")
>>> rdd.count()
>>> exit()
  • Try again with twice as many lines
$ perl -le 'print for 1..200000000' > number2x.txt
$ aws s3 cp number2x.txt s3://az-s3-trail-test/
$ pyspark
>>> rdd = sc.textFile("s3://az-s3-trail-test/number2x.txt")
>>> rdd.count()
>>> exit()

Counting the API calls issued when EMR accessed S3

  • Check the number of API calls in the trail log.
$ aws s3 ls --recursive s3://az-s3-trail-log
$ aws s3 cp s3://az-s3-trail-log/AWSLogs/*********/CloudTrail/ap-northeast-1/2018/02/11/*********_CloudTrail_ap-northeast-1_20180211T1030Z_8UxnbO7KHiDJe42P.json.gz ./
$ gunzip *********_CloudTrail_ap-northeast-1_20180211T1030Z_8UxnbO7KHiDJe42P.json.gz
$ cat *********_CloudTrail_ap-northeast-1_20180211T1030Z_8UxnbO7KHiDJe42P.json|jq -r '.Records[]|select(.userAgent|contains("ElasticMapReduce"))|@text "\(.userAgent)\t\(.eventName)"'|head
[ElasticMapReduce/1.0.0 emrfs/s3n {}, aws-sdk-java/1.11.129 Linux/4.9.70-25.242.amzn1.x86_64 OpenJDK_64-Bit_Server_VM/25.151-b12/1.8.0_151 scala/2.11.8]	HeadBucket
[ElasticMapReduce/1.0.0 emrfs/s3n {}, aws-sdk-java/1.11.129 Linux/4.9.70-25.242.amzn1.x86_64 OpenJDK_64-Bit_Server_VM/25.151-b12/1.8.0_151 scala/2.11.8]	HeadObject
[ElasticMapReduce/1.0.0 emrfs/s3n {}, aws-sdk-java/1.11.129 Linux/4.9.70-25.242.amzn1.x86_64 OpenJDK_64-Bit_Server_VM/25.151-b12/1.8.0_151 scala/2.11.8]	HeadBucket
[ElasticMapReduce/1.0.0 emrfs/s3n {}, aws-sdk-java/1.11.129 Linux/4.9.70-25.242.amzn1.x86_64 OpenJDK_64-Bit_Server_VM/25.151-b12/1.8.0_151 scala/2.11.8]	HeadObject
[ElasticMapReduce/1.0.0 emrfs/s3n {}, aws-sdk-java/1.11.129 Linux/4.9.70-25.242.amzn1.x86_64 OpenJDK_64-Bit_Server_VM/25.151-b12/1.8.0_151 scala/2.11.8]	GetObject
[ElasticMapReduce/1.0.0 emrfs/s3n {}, aws-sdk-java/1.11.129 Linux/4.9.70-25.242.amzn1.x86_64 OpenJDK_64-Bit_Server_VM/25.151-b12/1.8.0_151 scala/2.11.8]	GetObject
[ElasticMapReduce/1.0.0 emrfs/s3n {}, aws-sdk-java/1.11.129 Linux/4.9.70-25.242.amzn1.x86_64 OpenJDK_64-Bit_Server_VM/25.151-b12/1.8.0_151 scala/2.11.8]	HeadBucket
[ElasticMapReduce/1.0.0 emrfs/s3n {}, aws-sdk-java/1.11.129 Linux/4.9.70-25.242.amzn1.x86_64 OpenJDK_64-Bit_Server_VM/25.151-b12/1.8.0_151 scala/2.11.8]	HeadBucket
[ElasticMapReduce/1.0.0 emrfs/s3n {}, aws-sdk-java/1.11.129 Linux/4.9.70-25.242.amzn1.x86_64 OpenJDK_64-Bit_Server_VM/25.151-b12/1.8.0_151 scala/2.11.8]	HeadObject
[ElasticMapReduce/1.0.0 emrfs/s3n {}, aws-sdk-java/1.11.129 Linux/4.9.70-25.242.amzn1.x86_64 OpenJDK_64-Bit_Server_VM/25.151-b12/1.8.0_151 scala/2.11.8]	HeadBucket
$ cat *********_CloudTrail_ap-northeast-1_20180211T1030Z_8UxnbO7KHiDJe42P.json|jq -r '.Records[]|select(.userAgent|contains("ElasticMapReduce"))|@text "\(.userAgent)\t\(.eventName)"'|perl -lane 'print $F[$#F]'|sort|uniq -c
  27 GetObject
  13 HeadBucket
  12 HeadObject
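The same tally can be produced without gunzip, jq and perl. Below is a minimal Python sketch of the pipeline above: it keeps the records whose userAgent contains "ElasticMapReduce" and counts them by eventName. The function names load_records and count_events are hypothetical, and the sample records are hand-made in the CloudTrail record shape, not taken from the real log.

```python
import gzip
import json
from collections import Counter


def load_records(path):
    """Read the Records array from a CloudTrail log file (.json or .json.gz)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        return json.load(f).get("Records", [])


def count_events(records, agent_substring="ElasticMapReduce"):
    """Count eventName values for records whose userAgent contains agent_substring."""
    counts = Counter()
    for record in records:
        # userAgent may be missing or null on some records, so fall back to "".
        if agent_substring in (record.get("userAgent") or ""):
            counts[record.get("eventName", "unknown")] += 1
    return counts


# Hand-made sample records for illustration only (not real log data).
sample = [
    {"userAgent": "[ElasticMapReduce/1.0.0 emrfs/s3n {}]", "eventName": "HeadObject"},
    {"userAgent": "[ElasticMapReduce/1.0.0 emrfs/s3n {}]", "eventName": "GetObject"},
    {"userAgent": "[ElasticMapReduce/1.0.0 emrfs/s3n {}]", "eventName": "GetObject"},
    {"userAgent": "[some-other-client]", "eventName": "GetObject"},
]

for name, count in sorted(count_events(sample).items()):
    print(f"{count:5d} {name}")
```

For a real log, count_events(load_records(path)) with the path of a downloaded .json.gz file should give the same kind of tally as sort|uniq -c.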
  • With twice as many lines, the number of API calls increases.
$ cat *.json|jq -r '.Records[]|select(.userAgent|contains("ElasticMapReduce"))|@text "\(.userAgent)\t\(.eventName)"'|perl -lane 'print $F[$#F]'|sort|uniq -c 
  57 GetObject
  13 HeadBucket
  17 HeadObject
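One possible reading of the GetObject counts: if the file is handed to Spark in 32 MiB input splits and each split is fetched with its own GetObject, the expected counts are exactly 27 and 57. The 32 MiB split size is an assumption here, inferred only because the arithmetic fits; it was not confirmed from the cluster configuration. The sketch below computes the byte size of the files produced by the perl one-liners and divides by the assumed split size. The names text_size_bytes, split_count and ASSUMED_SPLIT_SIZE are hypothetical.

```python
import math

# Assumption: 32 MiB per input split (not verified against the EMR settings).
ASSUMED_SPLIT_SIZE = 32 * 1024 * 1024


def text_size_bytes(n):
    """Size in bytes of the output of `perl -le 'print for 1..n'`."""
    total = 0
    digits = 1
    start = 1
    while start <= n:
        end = min(n, start * 10 - 1)
        # Each line is the digits of the number plus one newline byte.
        total += (end - start + 1) * (digits + 1)
        start *= 10
        digits += 1
    return total


def split_count(size, split_size):
    """Number of fixed-size splits needed to cover size bytes."""
    return math.ceil(size / split_size)


for n in (100_000_000, 200_000_000):
    size = text_size_bytes(n)
    print(f"{n} lines: {size} bytes, {split_count(size, ASSUMED_SPLIT_SIZE)} splits")
# 100000000 lines: 888888898 bytes, 27 splits
# 200000000 lines: 1888888898 bytes, 57 splits
```

Calling rdd.getNumPartitions() in the pyspark session would be one way to check how many splits Spark actually created.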

Counting the API calls issued by s3 cp with the AWS CLI

$ time aws s3 cp s3://az-s3-trail-test/number.txt s3://az-to/
copy: s3://az-s3-trail-test/number.txt to s3://az-to/number.txt

real	0m5.046s
user	0m1.073s
sys	0m0.232s