2018-12-02

Migrated from Hatena Diary to Hatena Blog

ablog has moved from Hatena Diary to Hatena Blog.

2018-11-28

Authenticating at the ALB with Google sign-in

AWS

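Issuing an OAuth client ID on Google Identity Platform

Open Google Cloud Platform and sign in with a Google account.
Create a project with any name.

Click [Create credentials] - [OAuth client ID].

Click [Configure consent screen].

Fill in the consent screen settings.
- Application name: any
- Authorized domains: the load balancer's DNS name and amazonaws.com (type each one and press Enter)
- [Application Homepage] link: the load balancer's DNS name (as a placeholder for now)
- [Application Privacy Policy] link: the load balancer's DNS name (as a placeholder for now)
- [Application Terms of Service] link: the load balancer's DNS name (as a placeholder for now)

Create the OAuth client ID.
- Application type: Web application
- Authorized redirect URIs: https://<load balancer DNS name>/oauth2/idpresponse

Copy the client ID and client secret for later use.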
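
As a small illustration of the redirect URI registered above, the following sketch builds it from a load balancer DNS name. The function name and the sample DNS name are hypothetical; only the /oauth2/idpresponse path comes from the steps above.

```python
from urllib.parse import urlunsplit

# Callback path used in the redirect URI registered above.
ALB_OIDC_CALLBACK_PATH = "/oauth2/idpresponse"


def alb_redirect_uri(alb_dns_name: str) -> str:
    """Return the redirect URI to register for the given load balancer DNS name."""
    host = alb_dns_name.strip()
    if not host or "/" in host:
        raise ValueError("expected a bare DNS name without scheme or path")
    # scheme, host, path, query, fragment
    return urlunsplit(("https", host, ALB_OIDC_CALLBACK_PATH, "", ""))


# Hypothetical DNS name, for illustration only.
print(alb_redirect_uri("example-alb-123456.ap-northeast-1.elb.amazonaws.com"))
```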

Checking the Google OpenID Provider endpoints

$ curl https://accounts.google.com/.well-known/openid-configuration
{
 "issuer": "https://accounts.google.com", ★
 "authorization_endpoint": "https://accounts.google.com/o/oauth2/v2/auth",★
 "token_endpoint": "https://oauth2.googleapis.com/token",★
 "userinfo_endpoint": "https://openidconnect.googleapis.com/v1/userinfo",★
 "revocation_endpoint": "https://oauth2.googleapis.com/revoke",
 "jwks_uri": "https://www.googleapis.com/oauth2/v3/certs",
 "response_types_supported": [
  "code",
  "token",
  "id_token",
  "code token",
  "code id_token",
  "token id_token",
  "code token id_token",
  "none"
 ],
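
The four values marked with ★ are the ones needed later in the load balancer settings. A minimal sketch of pulling them out of the discovery document follows; the function name is hypothetical and the sample string is an abbreviated copy of the output above, so no network access is involved.

```python
import json

# Keys marked with a star in the discovery document above.
ALB_RELEVANT_KEYS = (
    "issuer",
    "authorization_endpoint",
    "token_endpoint",
    "userinfo_endpoint",
)


def pick_alb_endpoints(discovery_json: str) -> dict:
    """Extract the endpoints needed for the ALB from an OpenID discovery document."""
    doc = json.loads(discovery_json)
    missing = [key for key in ALB_RELEVANT_KEYS if key not in doc]
    if missing:
        raise KeyError(f"discovery document lacks: {missing}")
    return {key: doc[key] for key in ALB_RELEVANT_KEYS}


# Abbreviated copy of the curl output shown above.
SAMPLE_DISCOVERY = """
{
 "issuer": "https://accounts.google.com",
 "authorization_endpoint": "https://accounts.google.com/o/oauth2/v2/auth",
 "token_endpoint": "https://oauth2.googleapis.com/token",
 "userinfo_endpoint": "https://openidconnect.googleapis.com/v1/userinfo",
 "revocation_endpoint": "https://oauth2.googleapis.com/revoke"
}
"""

for name, url in pick_alb_endpoints(SAMPLE_DISCOVERY).items():
    print(f"{name}: {url}")
```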

AWS

EC2

Create an EC2 instance, then install and start Apache.

$ sudo yum -y install httpd
$ sudo service httpd start

Create a self-signed certificate.

$ openssl genrsa -out server.key 2048
$ openssl req -new -key server.key -out server.csr # Several prompts appear; press Enter at each one without entering anything
$ openssl x509 -in server.csr -days 365000 -req -signkey server.key > server.crt

Configure the security group attached to the EC2 instance so that it can be reached over HTTP.
Access "http://<EC2 public DNS>" and confirm that the page is displayed.

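Target group

Click [EC2] - [Target Groups] - [Create target group] and create a target group.
- Target group name: any
- VPC: select the same VPC as the EC2 instance created above

Select the target group you created, open the [Targets] tab, and click [Edit].
- Select the EC2 instance created above and click [Add to registered]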

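Load balancer

Click [EC2] - [Load Balancers] - [Create Load Balancer], then choose "Application Load Balancer" under [Select load balancer type].
- Name: any
- Load balancer protocol: HTTPS
- Load balancer port: 443
- VPC: select the same VPC as the EC2 instance created above
- Availability Zones: select all

Configure Security Settings
- Certificate type: Upload a certificate to IAM
- Certificate name: any
- Private key: copy and paste the contents of server.key created on the EC2 instance
- Certificate body: copy and paste the contents of server.crt created on the EC2 instance

Select [Add action] - [Authenticate].

Copy and paste the Google OpenID Provider endpoints, the client ID, and the client secret.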
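
For reference, the same settings can be expressed as listener actions in the shape used by the Elastic Load Balancing v2 API. This is only a sketch under assumptions: the function name, the endpoints dictionary layout (keyed by the discovery document field names), and all sample values are hypothetical, and the code only builds the data structure without calling AWS.

```python
from typing import Dict, List


def build_listener_actions(
    endpoints: Dict[str, str],
    client_id: str,
    client_secret: str,
    target_group_arn: str,
) -> List[dict]:
    """Build an authenticate-oidc action followed by a forward action."""
    return [
        {
            "Type": "authenticate-oidc",
            "Order": 1,
            "AuthenticateOidcConfig": {
                "Issuer": endpoints["issuer"],
                "AuthorizationEndpoint": endpoints["authorization_endpoint"],
                "TokenEndpoint": endpoints["token_endpoint"],
                "UserInfoEndpoint": endpoints["userinfo_endpoint"],
                "ClientId": client_id,
                "ClientSecret": client_secret,
            },
        },
        # Requests that pass authentication are forwarded to the target group.
        {"Type": "forward", "Order": 2, "TargetGroupArn": target_group_arn},
    ]


# Endpoint values taken from the discovery document; the rest are placeholders.
google_endpoints = {
    "issuer": "https://accounts.google.com",
    "authorization_endpoint": "https://accounts.google.com/o/oauth2/v2/auth",
    "token_endpoint": "https://oauth2.googleapis.com/token",
    "userinfo_endpoint": "https://openidconnect.googleapis.com/v1/userinfo",
}
actions = build_listener_actions(
    google_endpoints, "CLIENT_ID", "CLIENT_SECRET", "TARGET_GROUP_ARN"
)
```

A list like this could be passed as DefaultActions to the boto3 elbv2 create_listener call, although that path was not exercised here.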

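Select the same security group as the EC2 instance.

Configure Routing
- Target group: Existing target group
- Name: select the target group created above

Follow the wizard to finish creating the load balancer.

Test

Copy the load balancer's DNS name.

Accessing "https://<load balancer DNS name>/" succeeds after authenticating with Google.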


2018-11-27

Logging in to EC2 and running a shell from the AWS Management Console without an ssh connection

AWS

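Attach an IAM role that has the IAM policy "AmazonSSMManagedInstanceCore" to the EC2 instance.
Log in to the AWS Management Console and click Systems Manager.

Click [Start session].

Select the instance and click [Start session].

You can now run shell commands from the command line.

Notes

To connect to an EC2 instance in a private subnet, create the following VPC endpoints. Also check the security group settings of both the VPC endpoints and the EC2 instance. If it still does not work, see "Cannot connect to EC2 in a Private Subnet with Session Manager - ablog".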
- com.amazonaws.region.ssm
- com.amazonaws.region.ec2messages
- com.amazonaws.region.ssmmessages
- com.amazonaws.region.s3
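
In the list above, "region" stands for the actual region code. A minimal sketch that expands the four service names for a given region follows; the function and constant names are hypothetical.

```python
from typing import List

# Service suffixes from the list above.
SESSION_MANAGER_SERVICES = ("ssm", "ec2messages", "ssmmessages", "s3")


def endpoint_service_names(region: str) -> List[str]:
    """Return the VPC endpoint service names for the given region code."""
    region = region.strip()
    if not region:
        raise ValueError("region must not be empty")
    return [f"com.amazonaws.{region}.{service}" for service in SESSION_MANAGER_SERVICES]


for name in endpoint_service_names("ap-northeast-1"):
    print(name)
```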


2018-11-25

Enabling S3 Select in Hive on EMR to reduce I/O

AWS

I confirmed that enabling S3 Select in Hive on EMR reduces the amount of I/O and shortens execution time*1.

Results

Baseline (S3 Select not used)

hive> select count(tax_region) from sh10.json_sales★ where tax_region = 'US';
Query ID = hadoop_20181125201846_ceb61407-d775-4399-a4ff-b123de4794ea
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1543070548885_0006)

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED     64         64        0        0       0       0
Reducer 2 ...... container     SUCCEEDED      1          1        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 177.90 s
----------------------------------------------------------------------------------------------
OK
145998
Time taken: 181.039 seconds★, Fetched: 1 row(s)

S3 Select enabled

hive> SET s3select.filter=true;
hive> select count(tax_region) from sh10.json_sales_s3select★ where tax_region = 'US';
Query ID = hadoop_20181125203338_a4b89db5-5f2e-46e2-b1a8-c86965d74225
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1543070548885_0006)

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED     64         64        0        0       0       0
Reducer 2 ...... container     SUCCEEDED      1          1        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 54.13 s
----------------------------------------------------------------------------------------------
OK
145998
Time taken: 54.658 seconds★, Fetched: 1 row(s)
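
To put the two "Time taken" values side by side, here is a small arithmetic sketch. The function names are hypothetical; the two timings are copied from the outputs above.

```python
def speedup(baseline_seconds: float, improved_seconds: float) -> float:
    """How many times faster the improved run is than the baseline."""
    return baseline_seconds / improved_seconds


def reduction_percent(baseline_seconds: float, improved_seconds: float) -> float:
    """Percentage of the baseline run time that was saved."""
    return (1.0 - improved_seconds / baseline_seconds) * 100.0


BASELINE_SECONDS = 181.039   # query against sh10.json_sales
S3SELECT_SECONDS = 54.658    # query against sh10.json_sales_s3select

print(f"speedup: {speedup(BASELINE_SECONDS, S3SELECT_SECONDS):.2f}x")
print(f"time saved: {reduction_percent(BASELINE_SECONDS, S3SELECT_SECONDS):.1f}%")
```

With these two measurements the query ran roughly 3.3 times faster, that is, about 70% less elapsed time. This is a single run of each query, so the figures are indicative only.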

Setup

Start the hive shell.

$ hive

Define an external table over the JSON data in S3.

CREATE EXTERNAL TABLE IF NOT EXISTS sh10.json_sales(
  prod_id int,
  cust_id int,
  time_id string,
  channel_id int,
  promo_id int,
  quantity_sold double,
  seller int,
  fulfillment_center int,
  courier_org int,
  tax_country string,
  tax_region string,
  amount_sold double
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://sb20181126/data/json/sh10/sales/';

Define another external table over the same JSON data in S3 (with S3 Select enabled).

CREATE EXTERNAL TABLE IF NOT EXISTS sh10.json_sales_s3select(
  prod_id int,
  cust_id int,
  time_id string,
  channel_id int,
  promo_id int,
  quantity_sold double,
  seller int,
  fulfillment_center int,
  courier_org int,
  tax_country string,
  tax_region string,
  amount_sold double
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS
INPUTFORMAT
  'com.amazonaws.emr.s3select.hive.S3SelectableTextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://sb20181126/data/json/sh10/sales/'
TBLPROPERTIES (
  "s3select.format" = "json"
);
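
The S3 Select variant differs from the first table in the SerDe, the STORED AS clause, and the table property. As a sketch of that pattern, the following helper renders such a statement from a column list. The helper name and the sample table, columns, and bucket are hypothetical; the class names are the ones used in the statement above.

```python
from typing import List, Tuple

HCATALOG_JSON_SERDE = "org.apache.hive.hcatalog.data.JsonSerDe"
S3SELECT_INPUT_FORMAT = "com.amazonaws.emr.s3select.hive.S3SelectableTextInputFormat"
TEXT_OUTPUT_FORMAT = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"


def s3select_json_ddl(table: str, columns: List[Tuple[str, str]], location: str) -> str:
    """Render a CREATE EXTERNAL TABLE statement for a JSON table read via S3 Select."""
    column_lines = ",\n".join(f"  {name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table}(\n"
        f"{column_lines}\n"
        ")\n"
        f"ROW FORMAT SERDE '{HCATALOG_JSON_SERDE}'\n"
        "STORED AS\n"
        "INPUTFORMAT\n"
        f"  '{S3SELECT_INPUT_FORMAT}'\n"
        "OUTPUTFORMAT\n"
        f"  '{TEXT_OUTPUT_FORMAT}'\n"
        f"LOCATION '{location}'\n"
        "TBLPROPERTIES (\n"
        '  "s3select.format" = "json"\n'
        ");"
    )


# Hypothetical table and bucket, for illustration only.
print(s3select_json_ddl(
    "demo.events_s3select",
    [("event_id", "int"), ("region", "string")],
    "s3://example-bucket/events/",
))
```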

Notes

Even if the table definition supports S3 Select, S3 Select has no effect unless s3select.filter is enabled.

hive> SET s3select.filter=false;
hive> select count(tax_region) from sh10.json_sales_s3select★ where tax_region = 'US';
Query ID = hadoop_20181125203003_e28e36fc-8fd4-46ea-966f-0c65bfdc9024
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1543070548885_0006)

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED     64         64        0        0       0       0
Reducer 2 ...... container     SUCCEEDED      1          1        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 171.59 s
----------------------------------------------------------------------------------------------
OK
145998
Time taken: 172.17 seconds★, Fetched: 1 row(s)

References

Specifying S3 Select in Your Code
To use S3 select in your Hive table, create the table by specifying com.amazonaws.emr.s3select.hive.S3SelectableTextInputFormat as the INPUTFORMAT class name, and specify a value for the s3select.format property using the TBLPROPERTIES clause.
By default, S3 Select is disabled when you run queries. Enable S3 Select by setting s3select.filter to true in your Hive session as shown below. The examples below demonstrate how to specify S3 Select when creating a table from underlying CSV and JSON files and then querying the table using a simple select statement.

Example CREATE TABLE Statement for CSV-Based Table
CREATE TABLE mys3selecttable (
col1 string,
col2 int,
col3 boolean
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS
INPUTFORMAT
  'com.amazonaws.emr.s3select.hive.S3SelectableTextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://path/to/mycsvfile/'
TBLPROPERTIES (
  "s3select.format" = "csv",
  "s3select.headerInfo" = "ignore"
);
Example CREATE TABLE Statement for JSON-Based Table
CREATE TABLE mys3selecttable (
col1 string,
col2 int,
col3 boolean
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS
INPUTFORMAT
  'com.amazonaws.emr.s3select.hive.S3SelectableTextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://path/to/json/'
TBLPROPERTIES (
  "s3select.format" = "json"
);
Example SELECT TABLE Statement
SET s3select.filter=true;
SELECT * FROM mys3selecttable WHERE col2 > 10;
Using S3 Select with Hive to Improve Performance - Amazon EMR

*1: Unsurprisingly, the benefit appears when the query specifies a selective filter condition that can be pushed down.

2018-11-24

Running a Bootstrap Action when creating an EMR cluster with CloudFormation

AWS

Notes on running a Bootstrap Action when creating an EMR cluster with CloudFormation.

master.yaml

---
AWSTemplateFormatVersion: '2010-09-09'
Description: Main Template For Workshop

Parameters:

(omitted)

  CFnS3Bucket:
    Description: Specify an Amazon S3 template URL
    Type: String
    Default: cfnBucket20181124

(omitted)

  EMRStack:
    Type: AWS::CloudFormation::Stack
    Properties:
      TemplateURL: !Sub https://s3-ap-northeast-1.amazonaws.com/${CFnS3Bucket}/emr.yaml
      Parameters:

(omitted)

        CFnS3Bucket: 
          Ref: CFnS3Bucket

emr.yaml

AWSTemplateFormatVersion: '2010-09-09'
Description: Stack to create EMR Cluster.
Parameters:

(omitted)

  CFnS3Bucket:
    Type: String
Resources:
  cluster:
    Type: AWS::EMR::Cluster
    Properties:

(omitted)

      BootstrapActions:
        - Name: BootstrapAction
          ScriptBootstrapAction:
            Path: !Sub s3://${CFnS3Bucket}/emrBootstrapAction.sh
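
CloudFormation templates can also be written in JSON, where the !Sub short form becomes Fn::Sub. As a sketch, the following builds the same BootstrapActions fragment as a Python structure and prints it as JSON. The function name is hypothetical; the parameter and script names are the ones used in emr.yaml above.

```python
import json
from typing import List


def bootstrap_actions_fragment(
    bucket_parameter: str,
    script_name: str,
    action_name: str = "BootstrapAction",
) -> List[dict]:
    """Build the BootstrapActions property with the script path resolved via Fn::Sub."""
    # Produces e.g. "s3://${CFnS3Bucket}/emrBootstrapAction.sh"
    path_template = "s3://${" + bucket_parameter + "}/" + script_name
    return [
        {
            "Name": action_name,
            "ScriptBootstrapAction": {"Path": {"Fn::Sub": path_template}},
        }
    ]


fragment = bootstrap_actions_fragment("CFnS3Bucket", "emrBootstrapAction.sh")
print(json.dumps({"BootstrapActions": fragment}, indent=2))
```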

emrBootstrapAction.sh

#!/usr/bin/env bash

sudo wget http://www.congiu.net/hive-json-serde/1.3.6-SNAPSHOT/cdh5/json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar -P /usr/lib/presto/plugin/hive-hadoop2/

References

Amazon EMR Cluster ScriptBootstrapActionConfig - AWS CloudFormation

2018-11-24

Enabling Presto S3 Select Pushdown when creating an EMR cluster with CloudFormation

AWS

Notes on enabling Presto S3 Select Pushdown when creating an EMR cluster (EMR release 5.18.0 or later) with CloudFormation.

AWSTemplateFormatVersion: '2010-09-09'
Description: Stack to create EMR Cluster.
Parameters:
  InstanceType:
    Type: String

(omitted)

Resources:
  cluster:
    Type: AWS::EMR::Cluster
    Properties:
      Configurations:
        - Classification: presto-connector-hive
          ConfigurationProperties:
            hive.s3select-pushdown.enabled: true
            hive.s3select-pushdown.max-connections: 500

After the EMR cluster is created, log in to the master node over ssh and check the settings.

$ cat /etc/presto/conf/catalog/hive.properties
(omitted)
hive.s3select-pushdown.max-connections = 500
hive.s3select-pushdown.enabled = true
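
The same check can be scripted. The following is a minimal sketch that parses "key = value" lines such as those in hive.properties and reports whether S3 Select Pushdown is enabled. The function names are hypothetical and the sample text is a copy of the two lines above rather than a read of the real file.

```python
from typing import Dict


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple 'key = value' lines, skipping blanks, comments, and other lines."""
    props: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def s3select_pushdown_enabled(props: Dict[str, str]) -> bool:
    """True when hive.s3select-pushdown.enabled is set to true."""
    return props.get("hive.s3select-pushdown.enabled", "false").lower() == "true"


SAMPLE_PROPERTIES = """
hive.s3select-pushdown.max-connections = 500
hive.s3select-pushdown.enabled = true
"""

parsed = parse_properties(SAMPLE_PROPERTIES)
print(s3select_pushdown_enabled(parsed), parsed["hive.s3select-pushdown.max-connections"])
```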

References

With EMR release 5.18.0, you can use S3 Select with Hive and Presto. S3 Select allows applications to retrieve only a subset of the data contained in an object stored in Amazon S3. This reduces the amount of data that must be transferred to and processed by the EMR cluster when running Hive and Presto queries, which improves performance. For details on these features, see the S3 Select with Hive and S3 Select with Presto pages.
Amazon EMR release 5.18.0 adds support for Flink 1.6.0, Zeppelin 0.8.0, and S3 Select with Hive and Presto

Enabling S3 Select Pushdown With Presto
To enable S3 Select Pushdown for Presto on Amazon EMR, use the presto-connector-hive configuration classification to set hive.s3select-pushdown.enabled to true as shown in the example below. For more information, see Configuring Applications. The hive.s3select-pushdown.max-connections value must also be set. For most applications, the default setting of 500 should be adequate. For more information, see Understanding and tuning hive.s3select-pushdown.max-connections below.
[
    {
        "classification": "presto-connector-hive",
        "properties": {
            "hive.s3select-pushdown.enabled": "true",
            "hive.s3select-pushdown.max-connections": "500"
        }
    }
]
Understanding and tuning hive.s3select-pushdown.max-connections
By default, Presto uses EMRFS as its file system. The setting fs.s3.maxConnections in the emrfs-site configuration classification specifies the maximum allowable client connections to Amazon S3 through EMRFS for Presto. By default, this is 500. S3 Select Pushdown bypasses EMRFS when accessing Amazon S3 for predicate operations. In this case, the value of hive.s3select-pushdown.max-connections determines the maximum number of client connections allowed for those operations from worker nodes. However, any requests to Amazon S3 that Presto initiates that are not pushed down—for example, GET operations—continue to be governed by the value of fs.s3.maxConnections.
If your application experiences the error "Timeout waiting for connection from pool," increase the value of both hive.s3select-pushdown.max-connections and fs.s3.maxConnections.
Using S3 Select Pushdown with Presto to Improve Performance - Amazon EMR
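
The quoted guidance says to raise both connection limits together when the pool times out. A minimal sketch of a configuration list that keeps the two values in step follows. The function name is hypothetical; the classification and property names are the ones quoted above, and the lowercase key style follows the quoted JSON example.

```python
import json
from typing import List


def presto_s3select_configurations(max_connections: int = 500) -> List[dict]:
    """Build configuration entries that enable pushdown and align both connection limits."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    value = str(max_connections)
    return [
        {
            "classification": "presto-connector-hive",
            "properties": {
                "hive.s3select-pushdown.enabled": "true",
                "hive.s3select-pushdown.max-connections": value,
            },
        },
        {
            # Governs requests to Amazon S3 that are not pushed down.
            "classification": "emrfs-site",
            "properties": {"fs.s3.maxConnections": value},
        },
    ]


print(json.dumps(presto_s3select_configurations(1000), indent=4))
```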

Configurations is a property of the AWS::EMR::Cluster resource that specifies the software configuration of an Amazon EMR cluster. For example configurations, see "Configuring Applications" in the Amazon EMR Release Guide.
Syntax

JSON
{
  "Classification" : String,
  "ConfigurationProperties" : { String: String, ... },
  "Configurations" : [ Configuration, ... ]
}
YAML
Classification: String
ConfigurationProperties:
  String: String
Configurations:
  - Configuration
Amazon EMR Cluster Configuration - AWS CloudFormation

https://github.com/awslabs/aws-cloudformation-templates/tree/master/aws/services/EMR

2018-11-24

Cross-account bucket-to-bucket copy with s3 cp fails with "An error occurred (AccessDenied) when calling the UploadPartCopy operation"

AWS

Symptom

A cross-account bucket-to-bucket copy with s3 cp fails with "Access Denied". Copying from the source bucket to EC2 and copying from EC2 to the destination bucket both succeed.

$ aws s3 cp --recursive s3://cp-from/ s3://cp-to/
copy failed: s3://cp-from/test.txt to s3://cp-to/test.txt An error occurred (AccessDenied) when calling the UploadPartCopy operation: Access Denied

Solution

In your own account (the destination), turn OFF the S3 bucket setting "Block public and cross-account access if bucket has public policies (Recommended)".
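
For scripting the same change, my understanding is that this console setting corresponds to the RestrictPublicBuckets flag of the S3 public access block configuration; treat that mapping as an assumption. The sketch below only prepares the configuration and does not call AWS. The function name and sample values are hypothetical.

```python
from typing import Dict

# The four flags of an S3 public access block configuration.
PUBLIC_ACCESS_BLOCK_KEYS = (
    "BlockPublicAcls",
    "IgnorePublicAcls",
    "BlockPublicPolicy",
    "RestrictPublicBuckets",
)


def allow_cross_account_access(current: Dict[str, bool]) -> Dict[str, bool]:
    """Return a copy of the configuration with only RestrictPublicBuckets turned off."""
    unknown = set(current) - set(PUBLIC_ACCESS_BLOCK_KEYS)
    if unknown:
        raise KeyError(f"unexpected keys: {sorted(unknown)}")
    # Flags missing from the input are treated as off.
    updated = {key: bool(current.get(key, False)) for key in PUBLIC_ACCESS_BLOCK_KEYS}
    updated["RestrictPublicBuckets"] = False
    return updated


before = {key: True for key in PUBLIC_ACCESS_BLOCK_KEYS}
after = allow_cross_account_access(before)
print(after)
```

A dictionary like this could be passed as PublicAccessBlockConfiguration to the boto3 S3 client's put_public_access_block call for the destination bucket.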

Assumptions

The source bucket belongs to another account and the destination bucket belongs to your own account.