ablog

Notes from a clumsy, restless engineer

Trying out "Build a Data Lake Foundation with AWS Glue and Amazon S3"

I tried out "Build a Data Lake Foundation with AWS Glue and Amazon S3 | Amazon Web Services Blog".
My notes follow.

Checking the data source

% aws s3 ls --human-readable s3://aws-bigdata-blog/artifacts/glue-data-lake/data/
2017-10-24 06:24:27    0 Bytes
2017-10-24 06:24:42   91.3 MiB green_tripdata_2017-01.csv

Error when running the job

Running the "nytaxi-csv-parquet" job failed with "AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied". I checked the policies attached to the "AWSGlueServiceRole-Default" IAM role I had created and found that it had no permission to access s3://nytaxi-parquet, so I attached AmazonS3FullAccess and reran the job, which then succeeded.

Traceback (most recent call last):
File "script_2018-09-12-02-46-38.py", line 40, in <module>
datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options =
{
    "path": "s3://nytaxi-parquet"
}
, format = "parquet", transformation_ctx = "datasink4")
File "/mnt/yarn/usercache/root/appcache/application_1536719177297_0001/container_1536719177297_0001_01_000001/PyGlue.zip/awsglue/dynamicframe.py", line 574, in from_options
File "/mnt/yarn/usercache/root/appcache/application_1536719177297_0001/container_1536719177297_0001_01_000001/PyGlue.zip/awsglue/context.py", line 191, in write_dynamic_frame_from_options
File "/mnt/yarn/usercache/root/appcache/application_1536719177297_0001/container_1536719177297_0001_01_000001/PyGlue.zip/awsglue/context.py", line 214, in write_from_options
File "/mnt/yarn/usercache/root/appcache/application_1536719177297_0001/container_1536719177297_0001_01_000001/PyGlue.zip/awsglue/data_sink.py", line 32, in write
File "/mnt/yarn/usercache/root/appcache/application_1536719177297_0001/container_1536719177297_0001_01_000001/PyGlue.zip/awsglue/data_sink.py", line 28, in writeFrame
File "/mnt/yarn/usercache/root/appcache/application_1536719177297_0001/container_1536719177297_0001_01_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/mnt/yarn/usercache/root/appcache/application_1536719177297_0001/container_1536719177297_0001_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/mnt/yarn/usercache/root/appcache/application_1536719177297_0001/container_1536719177297_0001_01_000001/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o120.pyWriteDynamicFrame.
: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied★; Request ID: 929E2CD0A8C1BDA6), S3 Extended Request ID: KoO7fU9QDKYG7KnvHxoSns810SeuyuZMZUtmEua6r/DGyMLEGQXHz78G8YDLEmECvUSDJAeNud4=
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1588)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1258)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1030)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:742)
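
AmazonS3FullAccess resolved the Access Denied error above, but it grants access to every bucket. As a rough sketch only, a policy scoped to the output bucket might look like the following. The function name `build_bucket_policy` is hypothetical, the action list is an assumption that has not been verified against this Glue job, and it assumes the bucket already exists (creating it from the job would additionally need s3:CreateBucket).

```python
import json


def build_bucket_policy(bucket_name):
    """Return an IAM policy document (as a dict) scoped to a single bucket.

    Assumption: these actions cover what the job needs; not verified against Glue.
    """
    bucket_arn = "arn:aws:s3:::" + bucket_name
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level actions (listing keys, resolving the bucket location).
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                "Resource": bucket_arn,
            },
            {
                # Object-level actions (read, write, and delete of objects).
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": bucket_arn + "/*",
            },
        ],
    }


if __name__ == "__main__":
    # Print the policy as JSON so it can be pasted into the IAM console.
    print(json.dumps(build_bucket_policy("nytaxi-parquet"), indent=2))
```

The bucket ARN and the object ARN (with the trailing `/*`) are kept in separate statements because list operations are authorized on the bucket while get, put, and delete are authorized on the objects.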

Using Athena to check the object-level API calls that Glue issued to S3, as recorded by CloudTrail

  • Query
select eventtime, eventname, requestparameters 
from cloudtrail_logs_cloudtrail_200000000000_do_not_delete 
where eventsource = 's3.amazonaws.com' 
and useragent = '[ElasticMapReduce/1.0.0 emrfs/s3n {}, aws-internal/3]' 
and awsregion = 'ap-northeast-1' 
and errorcode is null
  • a11c2139-e621-4e0c-802d-27adb07479b3.csv
eventtime eventname requestparameters
2018-09-12T03:19:12Z PutObject {"bucketName":"nytaxi-parquet","key":"_temporary/0_$folder$"}
2018-09-12T03:19:11Z PutObject {"bucketName":"nytaxi-parquet","key":"_temporary_$folder$"}
2018-09-12T03:19:11Z ListObjects {"list-type":"2","bucketName":"nytaxi-parquet","max-keys":"1","fetch-owner":"false","prefix":"_temporary/","delimiter":"/"}
2018-09-12T03:19:12Z ListObjects {"list-type":"2","bucketName":"nytaxi-parquet","max-keys":"1","fetch-owner":"false","prefix":"_temporary/0/","delimiter":"/"}
2018-09-12T03:19:16Z HeadBucket {"bucketName":"nytaxi-parquet"}
2018-09-12T03:19:16Z ListObjects {"list-type":"2","bucketName":"nytaxi-parquet","max-keys":"1","fetch-owner":"false","prefix":"_temporary/0/_temporary/attempt_20180912031916_0000_m_000000_0/part-00000-eebf4723-bbaf-42e9-aa03-1d20539eba2d-c000.snappy.parquet/","delimiter":"/"}
2018-09-12T03:19:17Z ListObjects {"list-type":"2","bucketName":"nytaxi-parquet","max-keys":"1","fetch-owner":"false","prefix":"_temporary/0/_temporary/attempt_20180912031917_0000_m_000001_0/part-00001-eebf4723-bbaf-42e9-aa03-1d20539eba2d-c000.snappy.parquet/","delimiter":"/"}
2018-09-12T03:19:27Z CopyObject {"x-amz-copy-source":"/nytaxi-parquet/_temporary/0/_temporary/attempt_20180912031917_0000_m_000001_0/part-00001-eebf4723-bbaf-42e9-aa03-1d20539eba2d-c000.snappy.parquet","bucketName":"nytaxi-parquet","key":"part-00001-eebf4723-bbaf-42e9-aa03-1d20539eba2d-c000.snappy.parquet"}
2018-09-12T03:19:27Z DeleteObject {"bucketName":"nytaxi-parquet","key":"_temporary/0/_temporary/attempt_20180912031917_0000_m_000001_0/part-00001-eebf4723-bbaf-42e9-aa03-1d20539eba2d-c000.snappy.parquet"}
2018-09-12T03:19:27Z PutObject {"bucketName":"nytaxi-parquet","key":"_temporary/0/_temporary/attempt_20180912031917_0000_m_000001_0/part-00001-eebf4723-bbaf-42e9-aa03-1d20539eba2d-c000.snappy.parquet"}
2018-09-12T03:19:27Z ListObjects {"list-type":"2","bucketName":"nytaxi-parquet","max-keys":"1","fetch-owner":"false","prefix":"_temporary/0/_temporary/attempt_20180912031917_0000_m_000001_0/","delimiter":"/"}
2018-09-12T03:19:27Z ListObjects {"list-type":"2","bucketName":"nytaxi-parquet","max-keys":"1","fetch-owner":"false","prefix":"_temporary/0/_temporary/attempt_20180912031917_0000_m_000001_0/","delimiter":"/"}
2018-09-12T03:19:27Z ListObjects {"list-type":"2","bucketName":"nytaxi-parquet","max-keys":"1000","fetch-owner":"false","prefix":"_temporary/0/_temporary/attempt_20180912031917_0000_m_000001_0/","delimiter":"/"}
2018-09-12T03:19:27Z ListObjects {"list-type":"2","bucketName":"nytaxi-parquet","max-keys":"1","fetch-owner":"false","prefix":"part-00001-eebf4723-bbaf-42e9-aa03-1d20539eba2d-c000.snappy.parquet/","delimiter":"/"}
2018-09-12T03:19:27Z ListObjects {"list-type":"2","bucketName":"nytaxi-parquet","max-keys":"1","fetch-owner":"false","prefix":"part-00001-eebf4723-bbaf-42e9-aa03-1d20539eba2d-c000.snappy.parquet/","delimiter":"/"}
2018-09-12T03:19:27Z HeadObject {"bucketName":"nytaxi-parquet","key":"_temporary/0/_temporary/attempt_20180912031917_0000_m_000001_0/part-00001-eebf4723-bbaf-42e9-aa03-1d20539eba2d-c000.snappy.parquet"}
2018-09-12T03:19:27Z HeadObject {"bucketName":"nytaxi-parquet","key":"_temporary/0/_temporary/attempt_20180912031917_0000_m_000001_0/part-00001-eebf4723-bbaf-42e9-aa03-1d20539eba2d-c000.snappy.parquet"}
2018-09-12T03:19:33Z PutObject {"bucketName":"nytaxi-parquet","key":"_temporary/0/_temporary/attempt_20180912031916_0000_m_000000_0/part-00000-eebf4723-bbaf-42e9-aa03-1d20539eba2d-c000.snappy.parquet"}
2018-09-12T03:19:33Z DeleteObject {"bucketName":"nytaxi-parquet","key":"_temporary/0_$folder$"}
2018-09-12T03:19:33Z DeleteObject {"bucketName":"nytaxi-parquet","key":"_temporary_$folder$"}
2018-09-12T03:19:33Z DeleteObject {"bucketName":"nytaxi-parquet","key":"_temporary/0/_temporary/attempt_20180912031916_0000_m_000000_0/part-00000-eebf4723-bbaf-42e9-aa03-1d20539eba2d-c000.snappy.parquet"}
2018-09-12T03:19:33Z ListObjects {"list-type":"2","bucketName":"nytaxi-parquet","max-keys":"1","fetch-owner":"false","prefix":"_temporary/0/_temporary/attempt_20180912031916_0000_m_000000_0/","delimiter":"/"}
2018-09-12T03:19:33Z ListObjects {"list-type":"2","bucketName":"nytaxi-parquet","max-keys":"1","fetch-owner":"false","prefix":"_temporary/0/_temporary/attempt_20180912031916_0000_m_000000_0/","delimiter":"/"}
2018-09-12T03:19:33Z ListObjects {"list-type":"2","bucketName":"nytaxi-parquet","max-keys":"1000","fetch-owner":"false","prefix":"_temporary/0/_temporary/attempt_20180912031916_0000_m_000000_0/","delimiter":"/"}
2018-09-12T03:19:33Z ListObjects {"list-type":"2","bucketName":"nytaxi-parquet","max-keys":"1","fetch-owner":"false","prefix":"part-00000-eebf4723-bbaf-42e9-aa03-1d20539eba2d-c000.snappy.parquet/","delimiter":"/"}
2018-09-12T03:19:33Z ListObjects {"list-type":"2","bucketName":"nytaxi-parquet","max-keys":"1","fetch-owner":"false","prefix":"part-00000-eebf4723-bbaf-42e9-aa03-1d20539eba2d-c000.snappy.parquet/","delimiter":"/"}
2018-09-12T03:19:33Z HeadObject {"bucketName":"nytaxi-parquet","key":"_temporary/0/_temporary/attempt_20180912031916_0000_m_000000_0/part-00000-eebf4723-bbaf-42e9-aa03-1d20539eba2d-c000.snappy.parquet"}
2018-09-12T03:19:33Z HeadObject {"bucketName":"nytaxi-parquet","key":"_temporary/0/_temporary/attempt_20180912031916_0000_m_000000_0/part-00000-eebf4723-bbaf-42e9-aa03-1d20539eba2d-c000.snappy.parquet"}
2018-09-12T03:19:33Z CopyObject {"x-amz-copy-source":"/nytaxi-parquet/_temporary/0/_temporary/attempt_20180912031916_0000_m_000000_0/part-00000-eebf4723-bbaf-42e9-aa03-1d20539eba2d-c000.snappy.parquet","bucketName":"nytaxi-parquet","key":"part-00000-eebf4723-bbaf-42e9-aa03-1d20539eba2d-c000.snappy.parquet"}
2018-09-12T03:19:33Z HeadObject {"bucketName":"nytaxi-parquet","key":"_temporary_$folder$"}
2018-09-12T03:19:33Z ListObjects {"list-type":"2","bucketName":"nytaxi-parquet","max-keys":"1000","fetch-owner":"false","prefix":"_temporary/"}
2018-09-12T03:19:33Z ListObjects {"list-type":"2","bucketName":"nytaxi-parquet","max-keys":"1","fetch-owner":"false","prefix":"_temporary-eebf4723-bbaf-42e9-aa03-1d20539eba2d/","delimiter":"/"}
2018-09-12T03:48:03Z HeadBucket {"bucketName":"nytaxi-parquet"}
2018-09-12T03:48:03Z ListObjects {"list-type":"2","bucketName":"nytaxi-parquet","max-keys":"1","fetch-owner":"false","prefix":"_temporary/","delimiter":"/"}
2018-09-12T03:19:11Z CreateBucket {"CreateBucketConfiguration":{"LocationConstraint":"ap-northeast-1","xmlns":"http://s3.amazonaws.com/doc/2006-03-01/"},"bucketName":"nytaxi-parquet"}
2018-09-12T03:54:08Z HeadBucket {"bucketName":"nytaxi-parquet"}
2018-09-12T03:54:08Z ListObjects {"list-type":"2","bucketName":"nytaxi-parquet","max-keys":"1","fetch-owner":"false","prefix":"_temporary/","delimiter":"/"}
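
The rows above are not in chronological order because the query has no ORDER BY (for example, the CreateBucket call at 03:19:11 appears near the end). The sketch below sorts rows by time and counts calls per API. It assumes rows in the space-separated form shown above; the CSV file that Athena actually writes is comma-separated and quoted, so it would need `csv.reader` instead. The names `parse_rows`, `summarize`, and `SAMPLE` are hypothetical, and `SAMPLE` contains only a few rows copied from the result.

```python
import json
from collections import Counter


def parse_rows(lines):
    """Split each 'eventtime eventname requestparameters' line into a tuple."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("eventtime"):
            continue  # skip blank lines and the header row
        # Split on the first two spaces only; the remainder is the JSON payload.
        eventtime, eventname, params = line.split(" ", 2)
        rows.append((eventtime, eventname, json.loads(params)))
    return rows


def summarize(rows):
    """Return the rows sorted by event time and a count per event name."""
    # ISO 8601 UTC timestamps in the same format sort correctly as strings.
    ordered = sorted(rows, key=lambda row: row[0])
    counts = Counter(name for _, name, _ in rows)
    return ordered, counts


# A few rows copied from the query result above.
SAMPLE = """\
eventtime eventname requestparameters
2018-09-12T03:19:12Z PutObject {"bucketName":"nytaxi-parquet","key":"_temporary/0_$folder$"}
2018-09-12T03:19:11Z PutObject {"bucketName":"nytaxi-parquet","key":"_temporary_$folder$"}
2018-09-12T03:19:16Z HeadBucket {"bucketName":"nytaxi-parquet"}
2018-09-12T03:19:11Z CreateBucket {"CreateBucketConfiguration":{"LocationConstraint":"ap-northeast-1","xmlns":"http://s3.amazonaws.com/doc/2006-03-01/"},"bucketName":"nytaxi-parquet"}
"""

if __name__ == "__main__":
    ordered, counts = summarize(parse_rows(SAMPLE.splitlines()))
    for eventtime, eventname, params in ordered:
        print(eventtime, eventname, params.get("key", ""))
    print(dict(counts))
```

Adding `order by eventtime` to the Athena query would give the same ordering without any post-processing.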