ablog

不器用で落着きのない技術者のメモ

Hive on EMR で S3 Select を有効化してI/O量を削減する

AWS

Hive on EMR で S3 Select を有効化すると、I/O量が削減され、実行時間が短縮することを確認した*1。

検証結果

通常

hive> select count(tax_region) from sh10.json_sales★ where tax_region = 'US';
Query ID = hadoop_20181125201846_ceb61407-d775-4399-a4ff-b123de4794ea
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1543070548885_0006)

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED     64         64        0        0       0       0
Reducer 2 ...... container     SUCCEEDED      1          1        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 177.90 s
----------------------------------------------------------------------------------------------
OK
145998
Time taken: 181.039 seconds★, Fetched: 1 row(s)

S3 Select 有効

hive> SET s3select.filter=true;
hive> select count(tax_region) from sh10.json_sales_s3select★ where tax_region = 'US';
Query ID = hadoop_20181125203338_a4b89db5-5f2e-46e2-b1a8-c86965d74225
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1543070548885_0006)

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED     64         64        0        0       0       0
Reducer 2 ...... container     SUCCEEDED      1          1        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 54.13 s
----------------------------------------------------------------------------------------------
OK
145998
Time taken: 54.658 seconds★, Fetched: 1 row(s)

準備

hive シェルを起動する。

$ hive

S3のJSONデータに対して外部テーブルを定義する。

CREATE EXTERNAL TABLE IF NOT EXISTS sh10.json_sales(
  prod_id int,
  cust_id int,
  time_id string,
  channel_id int,
  promo_id int,
  quantity_sold double,
  seller int,
  fulfillment_center int,
  courier_org int,
  tax_country string,
  tax_region string,
  amount_sold double
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://sb20181126/data/json/sh10/sales/';

同じS3のJSONデータに対して外部テーブルを定義する（S3 Select有効化）。

CREATE EXTERNAL TABLE IF NOT EXISTS sh10.json_sales_s3select(
  prod_id int,
  cust_id int,
  time_id string,
  channel_id int,
  promo_id int,
  quantity_sold double,
  seller int,
  fulfillment_center int,
  courier_org int,
  tax_country string,
  tax_region string,
  amount_sold double
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS
INPUTFORMAT
  'com.amazonaws.emr.s3select.hive.S3SelectableTextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://sb20181126/data/json/sh10/sales/'
TBLPROPERTIES (
  "s3select.format" = "json"
);

補足

テーブル定義を S3 Select 対応しても、s3select.filter を有効にしないと S3 Select は効かない。

hive> SET s3select.filter=false;
hive> select count(tax_region) from sh10.json_sales_s3select★ where tax_region = 'US';
Query ID = hadoop_20181125203003_e28e36fc-8fd4-46ea-966f-0c65bfdc9024
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1543070548885_0006)

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED     64         64        0        0       0       0
Reducer 2 ...... container     SUCCEEDED      1          1        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 171.59 s
----------------------------------------------------------------------------------------------
OK
145998
Time taken: 172.17 seconds★, Fetched: 1 row(s)

参考

Specifying S3 Select in Your Code
To use S3 select in your Hive table, create the table by specifying com.amazonaws.emr.s3select.hive.S3SelectableTextInputFormat as the INPUTFORMAT class name, and specify a value for the s3select.format property using the TBLPROPERTIES clause.
By default, S3 Select is disabled when you run queries. Enable S3 Select by setting s3select.filter to true in your Hive session as shown below. The examples below demonstrate how to specify S3 Select when creating a table from underlying CSV and JSON files and then querying the table using a simple select statement.

Example CREATE TABLE Statement for CSV-Based Table
CREATE TABLE mys3selecttable (
col1 string,
col2 int,
col3 boolean
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS
INPUTFORMAT
  'com.amazonaws.emr.s3select.hive.S3SelectableTextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://path/to/mycsvfile/'
TBLPROPERTIES (
  "s3select.format" = "csv",
  "s3select.headerInfo" = "ignore"
);
Example CREATE TABLE Statement for JSON-Based Table
CREATE TABLE mys3selecttable (
col1 string,
col2 int,
col3 boolean
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS
INPUTFORMAT
  'com.amazonaws.emr.s3select.hive.S3SelectableTextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://path/to/json/'
TBLPROPERTIES (
  "s3select.format" = "json"
);
Example SELECT TABLE Statement
SET s3select.filter=true;
SELECT * FROM mys3selecttable WHERE col2 > 10;
Using S3 Select with Hive to Improve Performance - Amazon EMR

*1:当り前の話ですが、クエリで絞り込みの効くフィルタ条件が指定されて push down されるケースで効果が出ます