ablog

Notes from a clumsy, restless engineer

About Parquet types

Type
  • The file format's primitive types are deliberately kept to a minimal set.

Types
The types supported by the file format are intended to be as minimal as possible, with a focus on how the types affect on-disk storage. For example, 16-bit ints are not explicitly supported in the storage format since they are covered by 32-bit ints with an efficient encoding. This reduces the complexity of implementing readers and writers for the format. The types are:

  • BOOLEAN: 1 bit boolean
  • INT32: 32 bit signed ints
  • INT64: 64 bit signed ints
  • INT96: 96 bit signed ints
  • FLOAT: IEEE 32-bit floating point values
  • DOUBLE: IEEE 64-bit floating point values
  • BYTE_ARRAY: arbitrarily long byte arrays
  • FIXED_LEN_BYTE_ARRAY: fixed length byte arrays
GitHub - apache/parquet-format: Apache Parquet
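
To make the "16-bit ints are covered by 32-bit ints" point concrete, here is a minimal sketch that lists the primitive types above and maps a signed integer width to the primitive type that would carry it. `PhysicalType` and `physical_type_for_int` are hypothetical names made up for this note, not the API of any Parquet library.

```python
from enum import Enum


class PhysicalType(Enum):
    """Primitive types listed in the parquet-format README."""
    BOOLEAN = "BOOLEAN"
    INT32 = "INT32"
    INT64 = "INT64"
    INT96 = "INT96"
    FLOAT = "FLOAT"
    DOUBLE = "DOUBLE"
    BYTE_ARRAY = "BYTE_ARRAY"
    FIXED_LEN_BYTE_ARRAY = "FIXED_LEN_BYTE_ARRAY"


def physical_type_for_int(bit_width: int) -> PhysicalType:
    """Pick the primitive type that carries a signed int of the given width.

    Hypothetical helper: narrow ints (8, 16, 32 bits) all ride on INT32,
    and only 64-bit ints need INT64.
    """
    if bit_width in (8, 16, 32):
        return PhysicalType.INT32
    if bit_width == 64:
        return PhysicalType.INT64
    raise ValueError(f"unsupported integer width: {bit_width}")
```

A reader or writer therefore only has to handle eight primitive types, however many integer widths the application layer exposes.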
Logical Types
  • Logical Types define how the primitive types are decoded and interpreted.

Logical types are used to extend the types that parquet can be used to store, by specifying how the primitive types should be interpreted. This keeps the set of primitive types to a minimum and reuses parquet's efficient encodings. For example, strings are stored as byte arrays (binary) with a UTF8 annotation. These annotations define how to further decode and interpret the data. Annotations are stored as LogicalType fields in the file metadata and are documented in LogicalTypes.md.

GitHub - apache/parquet-format: Apache Parquet
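
The string example in the quote can be sketched roughly as follows: the same stored bytes are returned as text when the column carries a UTF8 annotation, and as raw binary when it does not. `AnnotatedColumn` and `interpret` are hypothetical stand-ins for illustration only. Real readers take the annotation from the LogicalType field in the file metadata.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class AnnotatedColumn:
    """Hypothetical stand-in for a column's schema entry."""
    physical_type: str                 # e.g. "BYTE_ARRAY"
    annotation: Optional[str] = None   # e.g. "UTF8", or None for raw bytes


def interpret(column: AnnotatedColumn, raw: bytes) -> Union[str, bytes]:
    """Decode a stored value according to the column's annotation."""
    if column.physical_type == "BYTE_ARRAY" and column.annotation == "UTF8":
        return raw.decode("utf-8")
    # Without a known annotation, hand back the primitive value as-is.
    return raw


stored = "Parquet".encode("utf-8")
as_string = interpret(AnnotatedColumn("BYTE_ARRAY", "UTF8"), stored)
as_binary = interpret(AnnotatedColumn("BYTE_ARRAY"), stored)
```

The bytes on disk are identical in both cases. Only the interpretation changes, which is why the primitive type set can stay small.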
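  • Parquet keeps minimum and maximum values as statistics at the Column Chunk, Column Index, and Data Page levels. How values are compared depends on the type. If a column has an unknown logical type, its statistics may not be used for pruning.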

Sort Order
Parquet stores min/max statistics at several levels (such as Column Chunk, Column Index and Data Page). Comparison for values of a type obeys the following rules:

  1. Each logical type has a specified comparison order. If a column is annotated with an unknown logical type, statistics may not be used for pruning data. The sort order for logical types is documented in the LogicalTypes.md page.
  2. For primitive types, the following rules apply:
  • BOOLEAN - false, true
  • INT32, INT64 - Signed comparison.
  • FLOAT, DOUBLE - Signed comparison with special handling of NaNs and signed zeros. The details are documented in the Thrift definition in the ColumnOrder union. They are summarized here but the Thrift definition is considered authoritative:
    • NaNs should not be written to min or max statistics fields.
    • If the computed max value is zero (whether negative or positive), +0.0 should be written into the max statistics field.
    • If the computed min value is zero (whether negative or positive), -0.0 should be written into the min statistics field.
    • For backwards compatibility when reading files:
      • If the min is a NaN, it should be ignored.
      • If the max is a NaN, it should be ignored.
      • If the min is +0, the row group may contain -0 values as well.
      • If the max is -0, the row group may contain +0 values as well.
      • When looking for NaN values, min and max should be ignored.
  • BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY - Lexicographic unsigned byte-wise comparison.
GitHub - apache/parquet-format: Apache Parquet
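
The FLOAT/DOUBLE rules above are easier to follow in code. The sketch below is one possible reading of them, not the reference implementation: `float_min_max` plays the writer (skip NaNs, normalize zeros) and `may_contain` plays the reader deciding whether a chunk can be pruned. Both function names are hypothetical, and the reader is deliberately conservative in that it refuses to prune if either statistic is NaN.

```python
import math
from typing import Iterable, Optional, Tuple


def float_min_max(values: Iterable[float]) -> Optional[Tuple[float, float]]:
    """Writer side: compute (min, max) statistics for FLOAT/DOUBLE values.

    NaNs are skipped so they never land in the statistics. A zero min is
    written as -0.0 and a zero max as +0.0, whatever sign was seen.
    Returns None when there is no non-NaN value to summarize.
    """
    non_nan = [v for v in values if not math.isnan(v)]
    if not non_nan:
        return None
    lo, hi = min(non_nan), max(non_nan)
    if lo == 0.0:
        lo = -0.0
    if hi == 0.0:
        hi = 0.0
    return lo, hi


def may_contain(stats: Optional[Tuple[float, float]], target: float) -> bool:
    """Reader side: True unless the statistics rule `target` out."""
    if stats is None or math.isnan(target):
        return True  # min/max say nothing about NaN values
    lo, hi = stats
    if math.isnan(lo) or math.isnan(hi):
        return True  # conservative: ignore NaN statistics from older files
    # -0.0 == 0.0 under IEEE comparison, so signed zeros are never pruned apart.
    return lo <= target <= hi
```

For BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY no extra helper is needed in Python, since comparing two `bytes` objects is already an unsigned, lexicographic, byte-wise comparison.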