Hive on EMR で S3 Select を有効化すると、I/O量が削減され、実行時間が短縮することを確認した*1。
検証結果
- 通常
hive> select count(tax_region) from sh10.json_sales★ where tax_region = 'US'; Query ID = hadoop_20181125201846_ceb61407-d775-4399-a4ff-b123de4794ea Total jobs = 1 Launching Job 1 out of 1 Status: Running (Executing on YARN cluster with App id application_1543070548885_0006) ---------------------------------------------------------------------------------------------- VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED ---------------------------------------------------------------------------------------------- Map 1 .......... container SUCCEEDED 64 64 0 0 0 0 Reducer 2 ...... container SUCCEEDED 1 1 0 0 0 0 ---------------------------------------------------------------------------------------------- VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 177.90 s ---------------------------------------------------------------------------------------------- OK 145998 Time taken: 181.039 seconds★, Fetched: 1 row(s)
- S3 Select 有効
hive> SET s3select.filter=true; hive> select count(tax_region) from sh10.json_sales_s3select★ where tax_region = 'US'; Query ID = hadoop_20181125203338_a4b89db5-5f2e-46e2-b1a8-c86965d74225 Total jobs = 1 Launching Job 1 out of 1 Status: Running (Executing on YARN cluster with App id application_1543070548885_0006) ---------------------------------------------------------------------------------------------- VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED ---------------------------------------------------------------------------------------------- Map 1 .......... container SUCCEEDED 64 64 0 0 0 0 Reducer 2 ...... container SUCCEEDED 1 1 0 0 0 0 ---------------------------------------------------------------------------------------------- VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 54.13 s ---------------------------------------------------------------------------------------------- OK 145998 Time taken: 54.658 seconds★, Fetched: 1 row(s)
準備
- hive シェルを起動する。
$ hive
- S3のJSONデータに対して外部テーブルを定義する。
CREATE EXTERNAL TABLE IF NOT EXISTS sh10.json_sales( prod_id int, cust_id int, time_id string, channel_id int, promo_id int, quantity_sold double, seller int, fulfillment_center int, courier_org int, tax_country string, tax_region string, amount_sold double ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' LOCATION 's3://sb20181126/data/json/sh10/sales/';
- 同じS3のJSONデータに対して外部テーブルを定義する(S3 Select有効化)。
CREATE EXTERNAL TABLE IF NOT EXISTS sh10.json_sales_s3select( prod_id int, cust_id int, time_id string, channel_id int, promo_id int, quantity_sold double, seller int, fulfillment_center int, courier_org int, tax_country string, tax_region string, amount_sold double ) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' STORED AS INPUTFORMAT 'com.amazonaws.emr.s3select.hive.S3SelectableTextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 's3://sb20181126/data/json/sh10/sales/' TBLPROPERTIES ( "s3select.format" = "json" );
補足
- テーブル定義を S3 Select 対応しても、s3select.filter を有効にしないと S3 Select は効かない。
hive> SET s3select.filter=false; hive> select count(tax_region) from sh10.json_sales_s3select★ where tax_region = 'US'; Query ID = hadoop_20181125203003_e28e36fc-8fd4-46ea-966f-0c65bfdc9024 Total jobs = 1 Launching Job 1 out of 1 Status: Running (Executing on YARN cluster with App id application_1543070548885_0006) ---------------------------------------------------------------------------------------------- VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED ---------------------------------------------------------------------------------------------- Map 1 .......... container SUCCEEDED 64 64 0 0 0 0 Reducer 2 ...... container SUCCEEDED 1 1 0 0 0 0 ---------------------------------------------------------------------------------------------- VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 171.59 s ---------------------------------------------------------------------------------------------- OK 145998 Time taken: 172.17 seconds★, Fetched: 1 row(s)
参考
Specifying S3 Select in Your Code
To use S3 select in your Hive table, create the table by specifying com.amazonaws.emr.s3select.hive.S3SelectableTextInputFormat as the INPUTFORMAT class name, and specify a value for the s3select.format property using the TBLPROPERTIES clause.By default, S3 Select is disabled when you run queries. Enable S3 Select by setting s3select.filter to true in your Hive session as shown below. The examples below demonstrate how to specify S3 Select when creating a table from underlying CSV and JSON files and then querying the table using a simple select statement.
- Example CREATE TABLE Statement for CSV-Based Table
CREATE TABLE mys3selecttable ( col1 string, col2 int, col3 boolean ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS INPUTFORMAT 'com.amazonaws.emr.s3select.hive.S3SelectableTextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 's3://path/to/mycsvfile/' TBLPROPERTIES ( "s3select.format" = "csv", "s3select.headerInfo" = "ignore" );
- Example CREATE TABLE Statement for JSON-Based Table
CREATE TABLE mys3selecttable ( col1 string, col2 int, col3 boolean ) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' STORED AS INPUTFORMAT 'com.amazonaws.emr.s3select.hive.S3SelectableTextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 's3://path/to/json/' TBLPROPERTIES ( "s3select.format" = "json" );
- Example SELECT TABLE Statement
SET s3select.filter=true; SELECT * FROM mys3selecttable WHERE col2 > 10;Using S3 Select with Hive to Improve Performance - Amazon EMR
*1:当り前の話ですが、クエリで絞り込みの効くフィルタ条件が指定されて push down されるケースで効果が出ます