I checked whether reading a single S3 object from PySpark on EMR issues one API call or several.
Capturing API calls with CloudTrail
- Set the environment variables.
$ export LANG=C
$ export TZ=UTC
$ export AWS_DEFAULT_REGION=ap-northeast-1
- Create the S3 bucket that stores the trail logs.
$ aws s3api create-bucket --bucket az-s3-trail-log \
    --create-bucket-configuration LocationConstraint=ap-northeast-1
- Attach a bucket policy that lets CloudTrail write to the log bucket.
$ cat <<EOF > policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AWSCloudTrailAclCheck20180211",
      "Effect": "Allow",
      "Principal": { "Service": "cloudtrail.amazonaws.com" },
      "Action": "s3:GetBucketAcl",
      "Resource": "arn:aws:s3:::az-s3-trail-log"
    },
    {
      "Sid": "AWSCloudTrailWrite20180211",
      "Effect": "Allow",
      "Principal": { "Service": "cloudtrail.amazonaws.com" },
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::az-s3-trail-log/*",
      "Condition": {
        "StringEquals": { "s3:x-amz-acl": "bucket-owner-full-control" }
      }
    }
  ]
}
EOF
$ aws s3api put-bucket-policy --bucket az-s3-trail-log --policy file://policy.json
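The policy document can also be generated rather than typed by hand, which avoids mismatched bucket names between the two statements. The following is a minimal sketch, not the author's method: the function name `cloudtrail_bucket_policy` and the shortened `Sid` values are hypothetical, and the second statement grants only `s3:PutObject`, the action CloudTrail needs to deliver log files.

```python
import json


def cloudtrail_bucket_policy(bucket: str) -> dict:
    """Build a bucket policy that lets CloudTrail deliver logs to `bucket`.

    Hypothetical helper for illustration; the Sid values are arbitrary labels.
    """
    principal = {"Service": "cloudtrail.amazonaws.com"}
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # CloudTrail checks the bucket ACL before delivering logs.
                "Sid": "AWSCloudTrailAclCheck",
                "Effect": "Allow",
                "Principal": principal,
                "Action": "s3:GetBucketAcl",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Log delivery itself; the condition requires the
                # bucket-owner-full-control ACL on every delivered object.
                "Sid": "AWSCloudTrailWrite",
                "Effect": "Allow",
                "Principal": principal,
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                "Condition": {
                    "StringEquals": {"s3:x-amz-acl": "bucket-owner-full-control"}
                },
            },
        ],
    }


# Print the document; redirect the output to policy.json to use it with
# `aws s3api put-bucket-policy`.
print(json.dumps(cloudtrail_bucket_policy("az-s3-trail-log"), indent=2))
```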
- Create the S3 bucket that the test will read from.
$ aws s3api create-bucket --bucket az-s3-trail-test \
    --create-bucket-configuration LocationConstraint=ap-northeast-1
- Create the trail.
$ aws cloudtrail create-trail --name s3-trail --s3-bucket-name az-s3-trail-log
- Create an event selector so that S3 data events (object-level API calls) on the test bucket are logged.
$ cat <<EOF > event_selector.json
[
  {
    "ReadWriteType": "All",
    "IncludeManagementEvents": false,
    "DataResources": [
      {
        "Type": "AWS::S3::Object",
        "Values": [ "arn:aws:s3:::az-s3-trail-test/" ]
      }
    ]
  }
]
EOF
$ aws cloudtrail put-event-selectors --trail-name s3-trail \
    --event-selectors file://event_selector.json
{
    "EventSelectors": [
        {
            "IncludeManagementEvents": false,
            "DataResources": [
                {
                    "Values": [
                        "arn:aws:s3:::az-s3-trail-test/"
                    ],
                    "Type": "AWS::S3::Object"
                }
            ],
            "ReadWriteType": "All"
        }
    ],
    "TrailARN": "arn:aws:cloudtrail:ap-northeast-1:**********:trail/s3-trail"
}
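The event selector is small, but two details are easy to get wrong: management events are switched off so that only object-level calls are recorded, and the ARN ends with a trailing slash, which selects every object in the bucket. Here is a minimal sketch that builds the same JSON; the function name `s3_data_event_selector` is hypothetical and exists only for illustration.

```python
import json


def s3_data_event_selector(bucket: str, read_write_type: str = "All") -> list:
    """Build a CloudTrail event selector for S3 data events on one bucket.

    Hypothetical helper for illustration only.
    """
    if read_write_type not in {"All", "ReadOnly", "WriteOnly"}:
        raise ValueError(f"unsupported ReadWriteType: {read_write_type}")
    return [
        {
            "ReadWriteType": read_write_type,
            # Only object-level (data) events are of interest here.
            "IncludeManagementEvents": False,
            "DataResources": [
                {
                    "Type": "AWS::S3::Object",
                    # The trailing slash covers all objects in the bucket.
                    "Values": [f"arn:aws:s3:::{bucket}/"],
                }
            ],
        }
    ]


# Print the selector; redirect the output to event_selector.json to use it
# with `aws cloudtrail put-event-selectors`.
print(json.dumps(s3_data_event_selector("az-s3-trail-test"), indent=2))
```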
- Start logging.
$ aws cloudtrail start-logging --name arn:aws:cloudtrail:ap-northeast-1:**********:trail/s3-trail
Accessing S3 from EMR
- Read a file on S3 from PySpark.
$ perl -le 'print for 1..100000000' > number.txt
$ aws s3 cp number.txt s3://az-s3-trail-test/
$ pyspark
>>> rdd = sc.textFile("s3://az-s3-trail-test/number.txt")
>>> rdd.count()
>>> exit()
- Repeat with a file that has twice as many lines.
$ perl -le 'print for 1..200000000' > number2x.txt
$ aws s3 cp number2x.txt s3://az-s3-trail-test/
$ pyspark
>>> rdd = sc.textFile("s3://az-s3-trail-test/number2x.txt")
>>> rdd.count()
>>> exit()
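To get a feel for how large these two test files are, their byte sizes can be computed without generating them. The sketch below also estimates how many input splits a file would be divided into for a given split size. This is a rough illustration under stated assumptions: the 64 MiB split size is a hypothetical value, not something taken from the cluster used here, and the plain ceiling division is a simplification of how Hadoop actually computes splits. Spark reads a large text file as several splits, each handled by its own task, which is one plausible reason a single object can be fetched with more than one request.

```python
import math


def sequence_file_size(n: int) -> int:
    """Bytes in a file containing 1..n, one decimal number per line ('\\n')."""
    total = 0
    digits = 1
    start = 1
    while start <= n:
        # All numbers in [start, end] have the same digit count.
        end = min(n, start * 10 - 1)
        total += (end - start + 1) * (digits + 1)  # +1 for the newline
        start *= 10
        digits += 1
    return total


def estimated_splits(size_bytes: int, split_bytes: int) -> int:
    """Simplified split estimate: plain ceiling division."""
    return math.ceil(size_bytes / split_bytes)


ASSUMED_SPLIT_BYTES = 64 * 1024 * 1024  # assumption, not from the source

for name, n in [("number.txt", 100_000_000), ("number2x.txt", 200_000_000)]:
    size = sequence_file_size(n)
    splits = estimated_splits(size, ASSUMED_SPLIT_BYTES)
    print(f"{name}: {size} bytes, about {splits} splits")
```

Doubling the line count more than doubles the file size, because every added number has ten digits.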
Checking the number of API calls issued when EMR accesses S3
- Count the API calls recorded in the trail log.
$ aws s3 ls --recursive s3://az-s3-trail-log
$ aws s3 cp s3://az-s3-trail-log/AWSLogs/*********/CloudTrail/ap-northeast-1/2018/02/11/*********_CloudTrail_ap-northeast-1_20180211T1030Z_8UxnbO7KHiDJe42P.json.gz ./
$ gunzip *********_CloudTrail_ap-northeast-1_20180211T1030Z_8UxnbO7KHiDJe42P.json.gz
$ cat *********_CloudTrail_ap-northeast-1_20180211T1030Z_8UxnbO7KHiDJe42P.json|jq -r '.Records[]|select(.userAgent|contains("ElasticMapReduce"))|@text "\(.userAgent)\t\(.eventName)"'|head
[ElasticMapReduce/1.0.0 emrfs/s3n {}, aws-sdk-java/1.11.129 Linux/4.9.70-25.242.amzn1.x86_64 OpenJDK_64-Bit_Server_VM/25.151-b12/1.8.0_151 scala/2.11.8] HeadBucket
[ElasticMapReduce/1.0.0 emrfs/s3n {}, aws-sdk-java/1.11.129 Linux/4.9.70-25.242.amzn1.x86_64 OpenJDK_64-Bit_Server_VM/25.151-b12/1.8.0_151 scala/2.11.8] HeadObject
[ElasticMapReduce/1.0.0 emrfs/s3n {}, aws-sdk-java/1.11.129 Linux/4.9.70-25.242.amzn1.x86_64 OpenJDK_64-Bit_Server_VM/25.151-b12/1.8.0_151 scala/2.11.8] HeadBucket
[ElasticMapReduce/1.0.0 emrfs/s3n {}, aws-sdk-java/1.11.129 Linux/4.9.70-25.242.amzn1.x86_64 OpenJDK_64-Bit_Server_VM/25.151-b12/1.8.0_151 scala/2.11.8] HeadObject
[ElasticMapReduce/1.0.0 emrfs/s3n {}, aws-sdk-java/1.11.129 Linux/4.9.70-25.242.amzn1.x86_64 OpenJDK_64-Bit_Server_VM/25.151-b12/1.8.0_151 scala/2.11.8] GetObject
[ElasticMapReduce/1.0.0 emrfs/s3n {}, aws-sdk-java/1.11.129 Linux/4.9.70-25.242.amzn1.x86_64 OpenJDK_64-Bit_Server_VM/25.151-b12/1.8.0_151 scala/2.11.8] GetObject
[ElasticMapReduce/1.0.0 emrfs/s3n {}, aws-sdk-java/1.11.129 Linux/4.9.70-25.242.amzn1.x86_64 OpenJDK_64-Bit_Server_VM/25.151-b12/1.8.0_151 scala/2.11.8] HeadBucket
[ElasticMapReduce/1.0.0 emrfs/s3n {}, aws-sdk-java/1.11.129 Linux/4.9.70-25.242.amzn1.x86_64 OpenJDK_64-Bit_Server_VM/25.151-b12/1.8.0_151 scala/2.11.8] HeadBucket
[ElasticMapReduce/1.0.0 emrfs/s3n {}, aws-sdk-java/1.11.129 Linux/4.9.70-25.242.amzn1.x86_64 OpenJDK_64-Bit_Server_VM/25.151-b12/1.8.0_151 scala/2.11.8] HeadObject
[ElasticMapReduce/1.0.0 emrfs/s3n {}, aws-sdk-java/1.11.129 Linux/4.9.70-25.242.amzn1.x86_64 OpenJDK_64-Bit_Server_VM/25.151-b12/1.8.0_151 scala/2.11.8] HeadBucket
$ cat *********_CloudTrail_ap-northeast-1_20180211T1030Z_8UxnbO7KHiDJe42P.json|jq -r '.Records[]|select(.userAgent|contains("ElasticMapReduce"))|@text "\(.userAgent)\t\(.eventName)"'|perl -lane 'print $F[$#F]'|sort|uniq -c
  27 GetObject
  13 HeadBucket
  12 HeadObject
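The jq, perl, sort and uniq pipeline above keeps only the records whose user agent contains "ElasticMapReduce" and tallies them by event name. The same tally can be sketched in Python, which also reads the `.json.gz` file directly without a separate `gunzip` step. The function names and the sample records below are made up for illustration; only the `Records`, `userAgent` and `eventName` fields are taken from the log format used above.

```python
import gzip
import json
from collections import Counter


def load_records(path: str) -> list:
    """Read the Records array from a CloudTrail log file (.json or .json.gz)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        return json.load(f)["Records"]


def count_emr_events(records, agent_marker: str = "ElasticMapReduce") -> Counter:
    """Tally eventName over records whose userAgent contains agent_marker."""
    return Counter(
        r["eventName"]
        for r in records
        if agent_marker in (r.get("userAgent") or "")
    )


# Made-up records, only to show the shape of the input and the result.
sample = [
    {"userAgent": "[ElasticMapReduce/1.0.0 emrfs/s3n {}]", "eventName": "HeadObject"},
    {"userAgent": "[ElasticMapReduce/1.0.0 emrfs/s3n {}]", "eventName": "GetObject"},
    {"userAgent": "[ElasticMapReduce/1.0.0 emrfs/s3n {}]", "eventName": "GetObject"},
    {"userAgent": "[aws-cli/1.14.9]", "eventName": "PutObject"},
]

for event_name, count in sorted(count_emr_events(sample).items()):
    print(f"{count:7d} {event_name}")
```

For a real log, pass the result of `load_records("<downloaded file>.json.gz")` to `count_emr_events` instead of `sample`.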
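- With the line count doubled, more API calls are issued: GetObject goes from 27 to 57 and HeadObject from 12 to 17, while HeadBucket stays at 13.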
$ cat *.json|jq -r '.Records[]|select(.userAgent|contains("ElasticMapReduce"))|@text "\(.userAgent)\t\(.eventName)"'|perl -lane 'print $F[$#F]'|sort|uniq -c
  57 GetObject
  13 HeadBucket
  17 HeadObject
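The two tallies can be compared per event name. The sketch below uses the counts printed above as its input; the function name `event_deltas` is hypothetical. The source does not say which log files the second `cat *.json` covered, so the deltas are simple differences between the two printed tallies.

```python
from collections import Counter

# Counts copied from the two outputs above.
first_run = Counter({"GetObject": 27, "HeadBucket": 13, "HeadObject": 12})
doubled_run = Counter({"GetObject": 57, "HeadBucket": 13, "HeadObject": 17})


def event_deltas(before: Counter, after: Counter) -> dict:
    """Return after - before for every event name seen in either tally."""
    names = sorted(set(before) | set(after))
    return {name: after[name] - before[name] for name in names}


for name, delta in event_deltas(first_run, doubled_run).items():
    print(f"{name}: {first_run[name]} -> {doubled_run[name]} ({delta:+d})")
```

Only the object-level calls grow with the file size. The bucket-level HeadBucket count is unchanged.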