I tried out "Build a Data Lake Foundation with AWS Glue and Amazon S3 | Amazon Web Services Blog".
Notes follow.
Checking the data source
% aws s3 ls --human-readable s3://aws-bigdata-blog/artifacts/glue-data-lake/data/
2017-10-24 06:24:27    0 Bytes
2017-10-24 06:24:42   91.3 MiB green_tripdata_2017-01.csv
Error when running the job
Running the "nytaxi-csv-parquet" job failed with "AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied". I checked the policies attached to the "AWSGlueServiceRole-Default" IAM role I had created and found that it had no permission to access s3://nytaxi-parquet. After attaching AmazonS3FullAccess and rerunning the job, it succeeded.
Traceback (most recent call last):
  File "script_2018-09-12-02-46-38.py", line 40, in <module>
    datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = { "path": "s3://nytaxi-parquet" } , format = "parquet", transformation_ctx = "datasink4")
  File "/mnt/yarn/usercache/root/appcache/application_1536719177297_0001/container_1536719177297_0001_01_000001/PyGlue.zip/awsglue/dynamicframe.py", line 574, in from_options
  File "/mnt/yarn/usercache/root/appcache/application_1536719177297_0001/container_1536719177297_0001_01_000001/PyGlue.zip/awsglue/context.py", line 191, in write_dynamic_frame_from_options
  File "/mnt/yarn/usercache/root/appcache/application_1536719177297_0001/container_1536719177297_0001_01_000001/PyGlue.zip/awsglue/context.py", line 214, in write_from_options
  File "/mnt/yarn/usercache/root/appcache/application_1536719177297_0001/container_1536719177297_0001_01_000001/PyGlue.zip/awsglue/data_sink.py", line 32, in write
  File "/mnt/yarn/usercache/root/appcache/application_1536719177297_0001/container_1536719177297_0001_01_000001/PyGlue.zip/awsglue/data_sink.py", line 28, in writeFrame
  File "/mnt/yarn/usercache/root/appcache/application_1536719177297_0001/container_1536719177297_0001_01_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/mnt/yarn/usercache/root/appcache/application_1536719177297_0001/container_1536719177297_0001_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/mnt/yarn/usercache/root/appcache/application_1536719177297_0001/container_1536719177297_0001_01_000001/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o120.pyWriteDynamicFrame.
: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: 929E2CD0A8C1BDA6), S3 Extended Request ID: KoO7fU9QDKYG7KnvHxoSns810SeuyuZMZUtmEua6r/DGyMLEGQXHz78G8YDLEmECvUSDJAeNud4=
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1588)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1258)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1030)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:742)
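AmazonS3FullAccess cleared the error, but it grants access to every bucket in the account. As a narrower alternative, the sketch below builds a policy document scoped to the output bucket only. The helper name `build_output_bucket_policy` and the action list are assumptions for illustration and are not part of the original walkthrough; the actions are the ones S3 generally requires for listing, reading, writing, and deleting objects, and this policy was not tested against the job.

```python
import json


def build_output_bucket_policy(bucket):
    """Return an IAM policy document (as a dict) scoped to one S3 bucket.

    Hypothetical helper: the action list is an assumption, not taken
    from the original walkthrough.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level action: listing keys under a prefix.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # Object-level actions: read, write, and delete objects.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


if __name__ == "__main__":
    # Print the JSON that could be pasted into an inline policy.
    print(json.dumps(build_output_bucket_policy("nytaxi-parquet"), indent=2))
```

The bucket ARN and the `/*` object ARN are kept in separate statements because `s3:ListBucket` applies to the bucket itself while the other actions apply to objects inside it.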
Using CloudTrail and Athena to check the object-level API calls that Glue issued to S3
- Query
select eventtime, eventname, requestparameters
from cloudtrail_logs_cloudtrail_200000000000_do_not_delete
where eventsource = 's3.amazonaws.com'
and useragent = '[ElasticMapReduce/1.0.0 emrfs/s3n {}, aws-internal/3]'
and awsregion = 'ap-northeast-1'
and errorcode is null
- a11c2139-e621-4e0c-802d-27adb07479b3.csv
eventtime | eventname | requestparameters |
---|---|---|
| | | {"bucketName":"nytaxi-parquet","key":"_temporary/0_$folder$"} |
2018-09-12T03:19:11Z | PutObject | {"bucketName":"nytaxi-parquet","key":"_temporary_$folder$"} |
2018-09-12T03:19:11Z | ListObjects | {"list-type":"2","bucketName":"nytaxi-parquet","max-keys":"1","fetch-owner":"false","prefix":"_temporary/","delimiter":"/"} |
2018-09-12T03:19:12Z | ListObjects | {"list-type":"2","bucketName":"nytaxi-parquet","max-keys":"1","fetch-owner":"false","prefix":"_temporary/0/","delimiter":"/"} |
2018-09-12T03:19:16Z | HeadBucket | {"bucketName":"nytaxi-parquet"} |
2018-09-12T03:19:16Z | ListObjects | {"list-type":"2","bucketName":"nytaxi-parquet","max-keys":"1","fetch-owner":"false","prefix":"_temporary/0/_temporary/attempt_20180912031916_0000_m_000000_0/part-00000-eebf4723-bbaf-42e9-aa03-1d20539eba2d-c000.snappy.parquet/","delimiter":"/"} |
2018-09-12T03:19:17Z | ListObjects | {"list-type":"2","bucketName":"nytaxi-parquet","max-keys":"1","fetch-owner":"false","prefix":"_temporary/0/_temporary/attempt_20180912031917_0000_m_000001_0/part-00001-eebf4723-bbaf-42e9-aa03-1d20539eba2d-c000.snappy.parquet/","delimiter":"/"} |
2018-09-12T03:19:27Z | CopyObject | {"x-amz-copy-source":"/nytaxi-parquet/_temporary/0/_temporary/attempt_20180912031917_0000_m_000001_0/part-00001-eebf4723-bbaf-42e9-aa03-1d20539eba2d-c000.snappy.parquet","bucketName":"nytaxi-parquet","key":"part-00001-eebf4723-bbaf-42e9-aa03-1d20539eba2d-c000.snappy.parquet"} |
2018-09-12T03:19:27Z | DeleteObject | {"bucketName":"nytaxi-parquet","key":"_temporary/0/_temporary/attempt_20180912031917_0000_m_000001_0/part-00001-eebf4723-bbaf-42e9-aa03-1d20539eba2d-c000.snappy.parquet"} |
2018-09-12T03:19:27Z | PutObject | {"bucketName":"nytaxi-parquet","key":"_temporary/0/_temporary/attempt_20180912031917_0000_m_000001_0/part-00001-eebf4723-bbaf-42e9-aa03-1d20539eba2d-c000.snappy.parquet"} |
2018-09-12T03:19:27Z | ListObjects | {"list-type":"2","bucketName":"nytaxi-parquet","max-keys":"1","fetch-owner":"false","prefix":"_temporary/0/_temporary/attempt_20180912031917_0000_m_000001_0/","delimiter":"/"} |
2018-09-12T03:19:27Z | ListObjects | {"list-type":"2","bucketName":"nytaxi-parquet","max-keys":"1","fetch-owner":"false","prefix":"_temporary/0/_temporary/attempt_20180912031917_0000_m_000001_0/","delimiter":"/"} |
2018-09-12T03:19:27Z | ListObjects | {"list-type":"2","bucketName":"nytaxi-parquet","max-keys":"1000","fetch-owner":"false","prefix":"_temporary/0/_temporary/attempt_20180912031917_0000_m_000001_0/","delimiter":"/"} |
2018-09-12T03:19:27Z | ListObjects | {"list-type":"2","bucketName":"nytaxi-parquet","max-keys":"1","fetch-owner":"false","prefix":"part-00001-eebf4723-bbaf-42e9-aa03-1d20539eba2d-c000.snappy.parquet/","delimiter":"/"} |
2018-09-12T03:19:27Z | ListObjects | {"list-type":"2","bucketName":"nytaxi-parquet","max-keys":"1","fetch-owner":"false","prefix":"part-00001-eebf4723-bbaf-42e9-aa03-1d20539eba2d-c000.snappy.parquet/","delimiter":"/"} |
2018-09-12T03:19:27Z | HeadObject | {"bucketName":"nytaxi-parquet","key":"_temporary/0/_temporary/attempt_20180912031917_0000_m_000001_0/part-00001-eebf4723-bbaf-42e9-aa03-1d20539eba2d-c000.snappy.parquet"} |
2018-09-12T03:19:27Z | HeadObject | {"bucketName":"nytaxi-parquet","key":"_temporary/0/_temporary/attempt_20180912031917_0000_m_000001_0/part-00001-eebf4723-bbaf-42e9-aa03-1d20539eba2d-c000.snappy.parquet"} |
2018-09-12T03:19:33Z | PutObject | {"bucketName":"nytaxi-parquet","key":"_temporary/0/_temporary/attempt_20180912031916_0000_m_000000_0/part-00000-eebf4723-bbaf-42e9-aa03-1d20539eba2d-c000.snappy.parquet"} |
2018-09-12T03:19:33Z | DeleteObject | {"bucketName":"nytaxi-parquet","key":"_temporary/0_$folder$"} |
2018-09-12T03:19:33Z | DeleteObject | {"bucketName":"nytaxi-parquet","key":"_temporary_$folder$"} |
2018-09-12T03:19:33Z | DeleteObject | {"bucketName":"nytaxi-parquet","key":"_temporary/0/_temporary/attempt_20180912031916_0000_m_000000_0/part-00000-eebf4723-bbaf-42e9-aa03-1d20539eba2d-c000.snappy.parquet"} |
2018-09-12T03:19:33Z | ListObjects | {"list-type":"2","bucketName":"nytaxi-parquet","max-keys":"1","fetch-owner":"false","prefix":"_temporary/0/_temporary/attempt_20180912031916_0000_m_000000_0/","delimiter":"/"} |
2018-09-12T03:19:33Z | ListObjects | {"list-type":"2","bucketName":"nytaxi-parquet","max-keys":"1","fetch-owner":"false","prefix":"_temporary/0/_temporary/attempt_20180912031916_0000_m_000000_0/","delimiter":"/"} |
2018-09-12T03:19:33Z | ListObjects | {"list-type":"2","bucketName":"nytaxi-parquet","max-keys":"1000","fetch-owner":"false","prefix":"_temporary/0/_temporary/attempt_20180912031916_0000_m_000000_0/","delimiter":"/"} |
2018-09-12T03:19:33Z | ListObjects | {"list-type":"2","bucketName":"nytaxi-parquet","max-keys":"1","fetch-owner":"false","prefix":"part-00000-eebf4723-bbaf-42e9-aa03-1d20539eba2d-c000.snappy.parquet/","delimiter":"/"} |
2018-09-12T03:19:33Z | ListObjects | {"list-type":"2","bucketName":"nytaxi-parquet","max-keys":"1","fetch-owner":"false","prefix":"part-00000-eebf4723-bbaf-42e9-aa03-1d20539eba2d-c000.snappy.parquet/","delimiter":"/"} |
2018-09-12T03:19:33Z | HeadObject | {"bucketName":"nytaxi-parquet","key":"_temporary/0/_temporary/attempt_20180912031916_0000_m_000000_0/part-00000-eebf4723-bbaf-42e9-aa03-1d20539eba2d-c000.snappy.parquet"} |
2018-09-12T03:19:33Z | HeadObject | {"bucketName":"nytaxi-parquet","key":"_temporary/0/_temporary/attempt_20180912031916_0000_m_000000_0/part-00000-eebf4723-bbaf-42e9-aa03-1d20539eba2d-c000.snappy.parquet"} |
2018-09-12T03:19:33Z | CopyObject | {"x-amz-copy-source":"/nytaxi-parquet/_temporary/0/_temporary/attempt_20180912031916_0000_m_000000_0/part-00000-eebf4723-bbaf-42e9-aa03-1d20539eba2d-c000.snappy.parquet","bucketName":"nytaxi-parquet","key":"part-00000-eebf4723-bbaf-42e9-aa03-1d20539eba2d-c000.snappy.parquet"} |
2018-09-12T03:19:33Z | HeadObject | {"bucketName":"nytaxi-parquet","key":"_temporary_$folder$"} |
2018-09-12T03:19:33Z | ListObjects | {"list-type":"2","bucketName":"nytaxi-parquet","max-keys":"1000","fetch-owner":"false","prefix":"_temporary/"} |
2018-09-12T03:19:33Z | ListObjects | {"list-type":"2","bucketName":"nytaxi-parquet","max-keys":"1","fetch-owner":"false","prefix":"_temporary-eebf4723-bbaf-42e9-aa03-1d20539eba2d/","delimiter":"/"} |
2018-09-12T03:48:03Z | HeadBucket | {"bucketName":"nytaxi-parquet"} |
2018-09-12T03:48:03Z | ListObjects | {"list-type":"2","bucketName":"nytaxi-parquet","max-keys":"1","fetch-owner":"false","prefix":"_temporary/","delimiter":"/"} |
2018-09-12T03:19:11Z | CreateBucket | {"CreateBucketConfiguration":{"LocationConstraint":"ap-northeast-1","xmlns":"http://s3.amazonaws.com/doc/2006-03-01/"},"bucketName":"nytaxi-parquet"} |
2018-09-12T03:54:08Z | HeadBucket | {"bucketName":"nytaxi-parquet"} |
2018-09-12T03:54:08Z | ListObjects | {"list-type":"2","bucketName":"nytaxi-parquet","max-keys":"1","fetch-owner":"false","prefix":"_temporary/","delimiter":"/"} |
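To read the result above as a list of required permissions, the event names can be tallied and mapped to IAM actions. The following is a minimal sketch: `EVENT_TO_ACTIONS` and `summarize_events` are hypothetical names, the mapping reflects the permission that normally authorizes each S3 API (for example, HeadObject is authorized by s3:GetObject, and ListObjects and HeadBucket by s3:ListBucket), and the sample input is a short excerpt of the event names, not the full result.

```python
from collections import Counter

# Assumed mapping from CloudTrail event names to the IAM actions that
# normally authorize them (hypothetical table for illustration).
EVENT_TO_ACTIONS = {
    "PutObject": {"s3:PutObject"},
    "DeleteObject": {"s3:DeleteObject"},
    "HeadObject": {"s3:GetObject"},
    "CopyObject": {"s3:GetObject", "s3:PutObject"},
    "ListObjects": {"s3:ListBucket"},
    "HeadBucket": {"s3:ListBucket"},
    "CreateBucket": {"s3:CreateBucket"},
}


def summarize_events(event_names):
    """Count events by name and collect the IAM actions they imply."""
    counts = Counter(event_names)
    actions = set()
    for name in counts:
        # Unknown event names contribute no actions rather than failing.
        actions |= EVENT_TO_ACTIONS.get(name, set())
    return counts, sorted(actions)


if __name__ == "__main__":
    # A few event names excerpted from the query result.
    sample = ["PutObject", "ListObjects", "ListObjects", "HeadBucket",
              "CopyObject", "DeleteObject", "HeadObject"]
    counts, actions = summarize_events(sample)
    print(counts.most_common())
    print(actions)
```

In the table, each part file is written with PutObject under `_temporary/`, copied to its final key with CopyObject, and then removed from `_temporary/` with DeleteObject, which is why write, read, and delete permissions on the output bucket all appear in the summary.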