ablog

Notes from a clumsy, restless engineer

Viewing Parquet file metadata and contents with Parquet-tools

What is Apache Parquet?

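Apache Parquet is a columnar data format developed by Twitter and Cloudera, based on the "record shredding and assembly algorithm" described in the Dremel paper that Google published in 2010. It is now an Apache project.
For details, see this article by Hayashida of Retty.
Parquet is a binary format, so its files cannot be read directly as text; Parquet-tools lets you inspect both the metadata and the data.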

Environment

Installation steps

Install Maven
$ brew install maven32
Install thrift
$ brew install thrift
Install parquet-tools
$ git clone https://github.com/Parquet/parquet-mr.git 
  • To avoid the "Failure to find com.twitter:parquet-hadoop:jar:1.6.0rc3-SNAPSHOT"*1 error, edit pom.xml as follows.
    • Change "1.6.0rc3-SNAPSHOT" to "1.6.0".
$ cd parquet-mr/parquet-tools/ 
$ vi pom.xml
    <groupId>com.twitter</groupId>
    <artifactId>parquet</artifactId>
    <relativePath>../pom.xml</relativePath>
    <!-- <version>1.6.0rc3-SNAPSHOT</version> -->
    <version>1.6.0</version>
  • Build it.
$ mvn clean package -Plocal 
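
The manual pom.xml edit above can also be scripted. The following is a minimal sketch, assuming the version string appears in pom.xml exactly as "1.6.0rc3-SNAPSHOT"; the function name fix_parent_version is hypothetical and is not part of parquet-mr. Note that it replaces every occurrence of that string in the file, not only the parent version.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the snapshot version in a pom.xml to the
# release version. Assumes the literal string "1.6.0rc3-SNAPSHOT" is the
# only thing that needs to change.
fix_parent_version() {
    pom="$1"
    # Write to a temporary file and move it into place, because in-place
    # editing with "sed -i" behaves differently on macOS and GNU sed.
    sed 's/1\.6\.0rc3-SNAPSHOT/1.6.0/g' "$pom" > "$pom.tmp" && mv "$pom.tmp" "$pom"
}
```

After sourcing the function, it would be run as `fix_parent_version pom.xml` from the parquet-mr/parquet-tools/ directory, before the mvn build.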

Trying it out

$ java -jar parquet-tools-1.6.0.jar meta sample.snappy.parquet
file:                    file:/Users/yoheia/sample.snappy.parquet
creator:                 parquet-mr
extra:                   org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"request_timestamp","type":"string","nullable":true,"metadata":{}},{"name":"elb_name","type":"string","nullable":true,"metadata":{}},{"name":"request_ip","type":"string","nullable":true,"metadata":{}},{"name":"request_port","type":"integer","nullable":true,"metadata":{}},{"name":"backend_ip","type":"string","nullable":true,"metadata":{}},{"name":"backend_port","type":"integer","nullable":true,"metadata":{}},{"name":"request_processing_time","type":"double","nullable":true,"metadata":{}},{"name":"backend_processing_time","type":"double","nullable":true,"metadata":{}},{"name":"client_response_time","type":"double","nullable":true,"metadata":{}},{"name":"elb_response_code","type":"string","nullable":true,"metadata":{}},{"name":"backend_response_code","type":"string","nullable":true,"metadata":{}},{"name":"received_bytes","type":"long","nullable":true,"metadata":{}},{"name":"sent_bytes","type":"long","nullable":true,"metadata":{}},{"name":"request_verb","type":"string","nullable":true,"metadata":{}},{"name":"url","type":"string","nullable":true,"metadata":{}},{"name":"protocol","type":"string","nullable":true,"metadata":{}},{"name":"user_agent","type":"string","nullable":true,"metadata":{}},{"name":"ssl_cipher","type":"string","nullable":true,"metadata":{}},{"name":"ssl_protocol","type":"string","nullable":true,"metadata":{}}]}

file schema:             spark_schema
--------------------------------------------------------------------------------
request_timestamp:       OPTIONAL BINARY O:UTF8 R:0 D:1
elb_name:                OPTIONAL BINARY O:UTF8 R:0 D:1
request_ip:              OPTIONAL BINARY O:UTF8 R:0 D:1
request_port:            OPTIONAL INT32 R:0 D:1
backend_ip:              OPTIONAL BINARY O:UTF8 R:0 D:1
backend_port:            OPTIONAL INT32 R:0 D:1
request_processing_time: OPTIONAL DOUBLE R:0 D:1
backend_processing_time: OPTIONAL DOUBLE R:0 D:1
client_response_time:    OPTIONAL DOUBLE R:0 D:1
elb_response_code:       OPTIONAL BINARY O:UTF8 R:0 D:1
backend_response_code:   OPTIONAL BINARY O:UTF8 R:0 D:1
received_bytes:          OPTIONAL INT64 R:0 D:1
sent_bytes:              OPTIONAL INT64 R:0 D:1
request_verb:            OPTIONAL BINARY O:UTF8 R:0 D:1
url:                     OPTIONAL BINARY O:UTF8 R:0 D:1
protocol:                OPTIONAL BINARY O:UTF8 R:0 D:1
user_agent:              OPTIONAL BINARY O:UTF8 R:0 D:1
ssl_cipher:              OPTIONAL BINARY O:UTF8 R:0 D:1
ssl_protocol:            OPTIONAL BINARY O:UTF8 R:0 D:1

...
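
The "extra:" line in the meta output above carries the Spark schema as one long JSON string, which is hard to read by eye. As a rough sketch, the column names can be pulled out with standard tools. The function name spark_field_names is hypothetical, and the pattern assumes the compact JSON layout shown above ("name":"..." with no spaces); a real JSON parser would be more robust.

```shell
#!/bin/sh
# Hypothetical helper: read "parquet-tools meta" output on stdin and print
# the column names found in the Spark schema JSON on the "extra:" line.
# Assumes compact JSON, i.e. "name":"..." without whitespace.
spark_field_names() {
    grep '^extra:' |
        grep -o '"name":"[^"]*"' |
        cut -d '"' -f 4
}
```

It could be used as `java -jar parquet-tools-1.6.0.jar meta sample.snappy.parquet | spark_field_names`, which prints one column name per line.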
$ java -jar parquet-tools-1.6.0.jar schema sample.snappy.parquet
message spark_schema {
  optional binary request_timestamp (UTF8);
  optional binary elb_name (UTF8);
  optional binary request_ip (UTF8);
  optional int32 request_port;
  optional binary backend_ip (UTF8);
  optional int32 backend_port;
  optional double request_processing_time;
  optional double backend_processing_time;
  optional double client_response_time;
  optional binary elb_response_code (UTF8);
  optional binary backend_response_code (UTF8);
  optional int64 received_bytes;
  optional int64 sent_bytes;
  optional binary request_verb (UTF8);
  optional binary url (UTF8);
  optional binary protocol (UTF8);
  optional binary user_agent (UTF8);
  optional binary ssl_cipher (UTF8);
  optional binary ssl_protocol (UTF8);
}

...
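
For wide tables, a quick summary of the schema output above can be handy. The sketch below counts columns per physical type; the function name count_column_types is hypothetical, and it assumes a flat schema like the one shown (in a nested schema, group fields would simply be counted under "group").

```shell
#!/bin/sh
# Hypothetical helper: read "parquet-tools schema" output on stdin and
# count the columns of each physical type (binary, int32, double, ...).
# A field line looks like: "optional binary elb_name (UTF8);"
count_column_types() {
    awk '$1 == "optional" || $1 == "required" || $1 == "repeated" { n[$2]++ }
         END { for (t in n) print t, n[t] }' | sort
}
```

It could be used as `java -jar parquet-tools-1.6.0.jar schema sample.snappy.parquet | count_column_types`.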
  • View the data (first record only).
$ java -jar parquet-tools-1.6.0.jar head -n 1 sample.snappy.parquet
request_timestamp = 2015-01-09T03:10:16.796840Z
elb_name = elb_demo_008
request_ip = ...
request_port = 6266
backend_ip = 172.30.55.212
backend_port = 80
request_processing_time = 6.05E-4
backend_processing_time = 0.001817
client_response_time = 3.56E-4
elb_response_code = 200
backend_response_code = 200
received_bytes = 0
sent_bytes = 2452
request_verb = GET
url = http://www.example.com/articles/407
protocol = HTTP/1.1
user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36"
ssl_cipher = -
ssl_protocol = -
  • View the data (all records).
$ java -jar parquet-tools-1.6.0.jar cat sample.parquet 
(Output omitted.)
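
Because head and cat print each record as "key = value" lines, a single column can be extracted with a small filter. This is a minimal sketch that assumes the output format shown above; the function name field_value is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read "parquet-tools head" or "cat" output on stdin
# and print the value of one column, one line per record.
# Splits each line at the first " = " so values containing " = " survive.
field_value() {
    key="$1"
    awk -v k="$key" '
        {
            sep = index($0, " = ")
            if (sep > 0 && substr($0, 1, sep - 1) == k) {
                print substr($0, sep + 3)
            }
        }'
}
```

For example, `java -jar parquet-tools-1.6.0.jar head -n 1 sample.snappy.parquet | field_value url` would print only the url value of the first record.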