参考
$ aws s3 cp s3://amazon-reviews-pds/parquet/product_category=Apparel/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet ./
$ java -jar parquet-tools-1.6.0.jar meta ~/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
file: file:/Users/yohei-a/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
creator: parquet-mr version 1.8.2 (build c6522788629e590a53eb79874b95f6c3ff11f16c)
extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"marketplace","type":"string","nullable":true,"metadata":{}},{"name":"customer_id","type":"string","nullable":true,"metadata":{}},{"name":"review_id","type":"string","nullable":true,"metadata":{}},{"name":"product_id","type":"string","nullable":true,"metadata":{}},{"name":"product_parent","type":"string","nullable":true,"metadata":{}},{"name":"product_title","type":"string","nullable":true,"metadata":{}},{"name":"star_rating","type":"integer","nullable":true,"metadata":{}},{"name":"helpful_votes","type":"integer","nullable":true,"metadata":{}},{"name":"total_votes","type":"integer","nullable":true,"metadata":{}},{"name":"vine","type":"string","nullable":true,"metadata":{}},{"name":"verified_purchase","type":"string","nullable":true,"metadata":{}},{"name":"review_headline","type":"string","nullable":true,"metadata":{}},{"name":"review_body","type":"string","nullable":true,"metadata":{}},{"name":"review_date","type":"date","nullable":true,"metadata":{}},{"name":"year","type":"integer","nullable":true,"metadata":{}}]}
file schema: spark_schema
--------------------------------------------------------------------------------
marketplace: OPTIONAL BINARY O:UTF8 R:0 D:1
customer_id: OPTIONAL BINARY O:UTF8 R:0 D:1
review_id: OPTIONAL BINARY O:UTF8 R:0 D:1
product_id: OPTIONAL BINARY O:UTF8 R:0 D:1
product_parent: OPTIONAL BINARY O:UTF8 R:0 D:1
product_title: OPTIONAL BINARY O:UTF8 R:0 D:1
star_rating: OPTIONAL INT32 R:0 D:1
helpful_votes: OPTIONAL INT32 R:0 D:1
total_votes: OPTIONAL INT32 R:0 D:1
vine: OPTIONAL BINARY O:UTF8 R:0 D:1
verified_purchase: OPTIONAL BINARY O:UTF8 R:0 D:1
review_headline: OPTIONAL BINARY O:UTF8 R:0 D:1
review_body: OPTIONAL BINARY O:UTF8 R:0 D:1 ★
review_date: OPTIONAL INT32 O:DATE R:0 D:1
year: OPTIONAL INT32 R:0 D:1 ★
row group 1: RC:589900 TS:192293412 OFFSET:4
--------------------------------------------------------------------------------
marketplace: BINARY SNAPPY DO:0 FPO:4 SZ:231/221/0.96 VC:589900 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY
customer_id: BINARY SNAPPY DO:0 FPO:235 SZ:4581545/6957912/1.52 VC:589900 ENC:RLE,PLAIN,BIT_PACKED
review_id: BINARY SNAPPY DO:0 FPO:4581780 SZ:8637455/10463253/1.21 VC:589900 ENC:RLE,PLAIN,BIT_PACKED
product_id: BINARY SNAPPY DO:0 FPO:13219235 SZ:5034158/8259079/1.64 VC:589900 ENC:RLE,PLAIN,BIT_PACKED
product_parent: BINARY SNAPPY DO:0 FPO:18253393 SZ:5187236/7484331/1.44 VC:589900 ENC:RLE,PLAIN,BIT_PACKED,PLAIN_DICTIONARY
product_title: BINARY SNAPPY DO:0 FPO:23440629 SZ:21585566/34792104/1.61 VC:589900 ENC:RLE,PLAIN,BIT_PACKED,PLAIN_DICTIONARY
star_rating: INT32 SNAPPY DO:0 FPO:45026195 SZ:221797/221763/1.00 VC:589900 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY
helpful_votes: INT32 SNAPPY DO:0 FPO:45247992 SZ:276866/558457/2.02 VC:589900 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY
total_votes: INT32 SNAPPY DO:0 FPO:45524858 SZ:306153/590359/1.93 VC:589900 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY
vine: BINARY SNAPPY DO:0 FPO:45831011 SZ:962/1025/1.07 VC:589900 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY
verified_purchase: BINARY SNAPPY DO:0 FPO:45831973 SZ:62135/74471/1.20 VC:589900 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY
review_headline: BINARY SNAPPY DO:0 FPO:45894108 SZ:8309372/14205327/1.71 VC:589900 ENC:RLE,PLAIN,BIT_PACKED,PLAIN_DICTIONARY
review_body: BINARY SNAPPY DO:0 FPO:54203480 SZ:66263112/107760057/1.63 VC:589900 ENC:RLE,PLAIN,BIT_PACKED
review_date: INT32 SNAPPY DO:0 FPO:120466592 SZ:70391/738861/10.50 VC:589900 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY
year: INT32 SNAPPY DO:0 FPO:120536983 SZ:9111/186192/20.44 VC:589900 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY
準備手順
データを S3 から HDFS にコピーする
$ ssh -i ~/mykey.pem hadoop@ec2-**-***-***-233.compute-1.amazonaws.com
$ sudo su - hdfs
$ hadoop fs -mkdir /amazon-reviews-pds
$ nohup s3-dist-cp --src s3://amazon-reviews-pds/ --dest /amazon-reviews-pds &
$ hadoop fs -ls -R -h /amazon-reviews-pds/
外部テーブルを定義する。
$ hive
> create database if not exists parquet;
> show databases;
> use parquet;
> CREATE EXTERNAL TABLE parquet.amazon_reviews_parquet(
marketplace string,
customer_id string,
review_id string,
product_id string,
product_parent string,
product_title string,
star_rating int,
helpful_votes int,
total_votes int,
vine string,
verified_purchase string,
review_headline string,
review_body string,
review_date bigint,
year int)
PARTITIONED BY (product_category string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'hdfs:///amazon-reviews-pds/parquet';
> MSCK REPAIR TABLE parquet.amazon_reviews_parquet;
> quit;
$ presto-cli
presto> use hive.parquet;
presto:parquet> select count(*) from amazon_reviews_parquet;
presto:parquet> select count(year) from amazon_reviews_parquet;
presto:parquet> select count(review_body) from amazon_reviews_parquet;
性能情報取得ツールのインストール
- マスターノードとコアノードに以下をインストールする。
$ sudo yum -y install htop sysstat dstat iotop ltrace strace perf blktrace gnuplot
- perf-map-agent をインストールする。
$ sudo yum -y install cmake
$ git clone --depth=1 https://github.com/jrudolph/perf-map-agent
$ cd perf-map-agent
$ cmake .
$ make
$ git clone https://github.com/brendangregg/FlameGraph
$ chmod +x FlameGraph/*.pl
$ vi ~/.bashrc
export FLAMEGRAPH_DIR=~/FlameGraph