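What is Apache Parquet?
A columnar data format developed by Twitter and Cloudera, based on the "record shredding and assembly algorithm" described in the Dremel paper that Google published in 2010. It is now an Apache project.
For details, see this article by Hayashida-san of Retty.
Parquet is a binary format, so its metadata and data are inspected with parquet-tools.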
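As a small illustration of the binary layout: an ordinary (non-encrypted) Parquet file begins and ends with the 4-byte magic number PAR1, so a rough sanity check is possible with standard tools even before parquet-tools is built. The helper name is_parquet below is hypothetical, chosen only for this sketch, and is not part of parquet-tools.

```shell
# Hypothetical helper (not part of parquet-tools): succeeds when the file
# both starts and ends with the 4-byte Parquet magic number "PAR1".
is_parquet() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "PAR1" ] &&
    [ "$(tail -c 4 "$1" 2>/dev/null)" = "PAR1" ]
}
```

It could be used as: is_parquet sample.snappy.parquet && echo "looks like Parquet". This only checks the magic bytes; it does not validate the footer or the data pages.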
Installation steps
Install Maven
$ brew install maven32
Install thrift
$ brew install thrift
Install parquet-tools
$ git clone https://github.com/Parquet/parquet-mr.git
- To avoid the "Failure to find com.twitter:parquet-hadoop:jar:1.6.0rc3-SNAPSHOT"*1 error, edit pom.xml as follows.
- Change "<version>1.6.0rc3-SNAPSHOT</version>" to "<version>1.6.0</version>".
$ cd parquet-mr/parquet-tools/
$ vi pom.xml
<groupId>com.twitter</groupId>
<artifactId>parquet</artifactId>
<relativePath>../pom.xml</relativePath>
<!-- <version>1.6.0rc3-SNAPSHOT</version> -->
<version>1.6.0</version>
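The same version change can also be scripted instead of made by hand in vi. The following is a minimal sketch, assuming the snapshot version string appears in pom.xml exactly as quoted above. The helper name fix_pom_version and the temporary file suffix .new are arbitrary choices for this example. It writes to a temporary file rather than using sed's in-place flag, because that flag behaves differently on macOS and GNU sed.

```shell
# Hypothetical helper: rewrite the snapshot version in the given pom.xml
# to the released 1.6.0, via a temporary file next to the original.
fix_pom_version() {
  sed 's#<version>1\.6\.0rc3-SNAPSHOT</version>#<version>1.6.0</version>#' "$1" > "$1.new" &&
    mv "$1.new" "$1"
}
```

From parquet-mr/parquet-tools/ it would be called as fix_pom_version pom.xml. Unlike the manual edit, it replaces the line outright rather than leaving the old version in a comment.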
- Build.
$ mvn clean package -Plocal
Trying it out
- Check the metadata.
$ java -jar parquet-tools-1.6.0.jar meta sample.snappy.parquet
file: file:/Users/yoheia/sample.snappy.parquet
creator: parquet-mr
extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"request_timestamp","type":"string","nullable":true,"metadata":{}},{"name":"elb_name","type":"string","nullable":true,"metadata":{}},{"name":"request_ip","type":"string","nullable":true,"metadata":{}},{"name":"request_port","type":"integer","nullable":true,"metadata":{}},{"name":"backend_ip","type":"string","nullable":true,"metadata":{}},{"name":"backend_port","type":"integer","nullable":true,"metadata":{}},{"name":"request_processing_time","type":"double","nullable":true,"metadata":{}},{"name":"backend_processing_time","type":"double","nullable":true,"metadata":{}},{"name":"client_response_time","type":"double","nullable":true,"metadata":{}},{"name":"elb_response_code","type":"string","nullable":true,"metadata":{}},{"name":"backend_response_code","type":"string","nullable":true,"metadata":{}},{"name":"received_bytes","type":"long","nullable":true,"metadata":{}},{"name":"sent_bytes","type":"long","nullable":true,"metadata":{}},{"name":"request_verb","type":"string","nullable":true,"metadata":{}},{"name":"url","type":"string","nullable":true,"metadata":{}},{"name":"protocol","type":"string","nullable":true,"metadata":{}},{"name":"user_agent","type":"string","nullable":true,"metadata":{}},{"name":"ssl_cipher","type":"string","nullable":true,"metadata":{}},{"name":"ssl_protocol","type":"string","nullable":true,"metadata":{}}]}
file schema: spark_schema
--------------------------------------------------------------------------------
request_timestamp: OPTIONAL BINARY O:UTF8 R:0 D:1
elb_name: OPTIONAL BINARY O:UTF8 R:0 D:1
request_ip: OPTIONAL BINARY O:UTF8 R:0 D:1
request_port: OPTIONAL INT32 R:0 D:1
backend_ip: OPTIONAL BINARY O:UTF8 R:0 D:1
backend_port: OPTIONAL INT32 R:0 D:1
request_processing_time: OPTIONAL DOUBLE R:0 D:1
backend_processing_time: OPTIONAL DOUBLE R:0 D:1
client_response_time: OPTIONAL DOUBLE R:0 D:1
elb_response_code: OPTIONAL BINARY O:UTF8 R:0 D:1
backend_response_code: OPTIONAL BINARY O:UTF8 R:0 D:1
received_bytes: OPTIONAL INT64 R:0 D:1
sent_bytes: OPTIONAL INT64 R:0 D:1
request_verb: OPTIONAL BINARY O:UTF8 R:0 D:1
url: OPTIONAL BINARY O:UTF8 R:0 D:1
protocol: OPTIONAL BINARY O:UTF8 R:0 D:1
user_agent: OPTIONAL BINARY O:UTF8 R:0 D:1
ssl_cipher: OPTIONAL BINARY O:UTF8 R:0 D:1
ssl_protocol: OPTIONAL BINARY O:UTF8 R:0 D:1
...
- Check the schema.
$ java -jar parquet-tools-1.6.0.jar schema sample.snappy.parquet
message spark_schema {
  optional binary request_timestamp (UTF8);
  optional binary elb_name (UTF8);
  optional binary request_ip (UTF8);
  optional int32 request_port;
  optional binary backend_ip (UTF8);
  optional int32 backend_port;
  optional double request_processing_time;
  optional double backend_processing_time;
  optional double client_response_time;
  optional binary elb_response_code (UTF8);
  optional binary backend_response_code (UTF8);
  optional int64 received_bytes;
  optional int64 sent_bytes;
  optional binary request_verb (UTF8);
  optional binary url (UTF8);
  optional binary protocol (UTF8);
  optional binary user_agent (UTF8);
  optional binary ssl_cipher (UTF8);
  optional binary ssl_protocol (UTF8);
}
...
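When only the column names are needed, the schema output can be filtered with awk. This is a rough sketch that assumes a flat schema like the one above, with one field per line in the form "repetition type name ...;". The function name schema_columns is hypothetical, and nested groups are not handled specially.

```shell
# Hypothetical helper: read `parquet-tools schema` output on stdin and
# print the third word of each field line (the column name), minus any ";".
schema_columns() {
  awk '/^ *(optional|required|repeated) / {
    name = $3
    sub(/;$/, "", name)
    print name
  }'
}
```

It would be used in a pipeline such as: java -jar parquet-tools-1.6.0.jar schema sample.snappy.parquet | schema_columns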
- View the data (first rows only).
$ java -jar parquet-tools-1.6.0.jar head -n 1 sample.snappy.parquet
request_timestamp = 2015-01-09T03:10:16.796840Z
elb_name = elb_demo_008
request_ip = ...
request_port = 6266
backend_ip = 172.30.55.212
backend_port = 80
request_processing_time = 6.05E-4
backend_processing_time = 0.001817
client_response_time = 3.56E-4
elb_response_code = 200
backend_response_code = 200
received_bytes = 0
sent_bytes = 2452
request_verb = GET
url = http://www.example.com/articles/407
protocol = HTTP/1.1
user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36"
ssl_cipher = -
ssl_protocol = -
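Because head prints each field as a "name = value" line, a single column can be pulled out of the output with a small filter. The sketch below assumes that line format. The function name field_values is hypothetical. It matches the field name only at the start of a line, so asking for url does not also match other fields.

```shell
# Hypothetical helper: read "name = value" lines on stdin and print the
# value part of every line whose name equals the first argument.
field_values() {
  awk -v key="$1" 'index($0, key " = ") == 1 {
    # Skip the name plus the 3-character " = " separator.
    print substr($0, length(key) + 4)
  }'
}
```

For example: java -jar parquet-tools-1.6.0.jar head -n 1 sample.snappy.parquet | field_values url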
- View the data (all rows).
$ java -jar parquet-tools-1.6.0.jar cat sample.parquet
(Output omitted)