2018-12-08

Hive で Parquet にクエリすると ParquetDecodingException: Can not read value at 0 in block -1 in file

AWS

事象

Hive で Parquet にクエリすると "java.io.IOException: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file" と怒られる。

$ hive

Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j2.properties Async: true
hive> select count(*) from sales;
FAILED: SemanticException [Error 10001]: Line 1:21 Table not found 'sales'
hive> select count(*) from  sh10_option.sales;
Query ID = hadoop_20181208023457_ca9f628a-2f32-493a-8900-f22f51ff774b
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1543787350286_0018)

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1            container       RUNNING     28          0        0       28      46       0
Reducer 2        container        INITED      1          0        0        1       0       0
----------------------------------------------------------------------------------------------
VERTICES: 00/02  [>>--------------------------] 0%    ELAPSED TIME: 139.23 s
----------------------------------------------------------------------------------------------
Status: Failed
Vertex failed, vertexName=Map 1, vertexId=vertex_1543787350286_0018_1_00, diagnostics=[Task failed, taskId=task_1543787350286_0018_1_00_000008, diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( failure ) : attempt_1543787350286_0018_1_00_000008_0:java.lang.RuntimeException:java.lang.RuntimeException: java.io.IOException: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file s3://bigdata-handson-123456789012/data/parquet_pyspark/sh10/sales/year=2012/month=6/part-00048-86118253-0ee3-45f9-8a9b-c0df410712e0.c000.snappy.parquet
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:211)
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168)
        at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370)
        at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
        at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
        at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
        at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
        at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: java.io.IOException: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file s3://bigdata-handson-123456789012/data/parquet_pyspark/sh10/sales/year=2012/month=6/part-00048-86118253-0ee3-45f9-8a9b-c0df410712e0.c000.snappy.parquet
        at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:206)
        at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.<init>(TezGroupedSplitsInputFormat.java:145)
        at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat.getRecordReader(TezGroupedSplitsInputFormat.java:111)
        at org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:157)
        at org.apache.tez.mapreduce.lib.MRReaderMapred.setSplit(MRReaderMapred.java:83)
        at org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:694)
        at org.apache.tez.mapreduce.input.MRInput.initFromEvent(MRInput.java:653)
        at org.apache.tez.mapreduce.input.MRInputLegacy.checkAndAwaitRecordReaderInitialization(MRInputLegacy.java:145)
        at org.apache.tez.mapreduce.input.MRInputLegacy.init(MRInputLegacy.java:109)
        at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.getMRInput(MapRecordProcessor.java:525)
        at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:171)
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:184)
        ... 14 more

原因

Hive では decimal 型は INT 32 で定義されるが、Spark 1.4 以降では精度によって INT32 か INT64 か変わるため。INT 64 になると上記のエラーになる。

This issue is caused because of different parquet conventions used in Hive and Spark. In Hive, the decimal datatype is represented as fixed bytes (INT 32). In Spark 1.4 or later the default convention is to use the Standard Parquet representation for decimal data type. As per the Standard Parquet representation based on the precision of the column datatype, the underlying representation changes.
eg: DECIMAL can be used to annotate the following types: int32: for 1 <= precision <= 9 int64: for 1 <= precision <= 18; precision < 10 will produce a warning
Hence this issue happens only with the usage of datatypes which have different representations in the different Parquet conventions. If the datatype is DECIMAL (10,3), both the conventions represent it as INT32, hence we won't face an issue. If you are not aware of the internal representation of the datatypes it is safe to use the same convention used for writing while reading. With Hive, you do not have the flexibility to choose the Parquet convention. But with Spark, you do.
java - parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file - Stack Overflow

解決策

PySpark のコードで sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true") を指定するかもしくは、/etc/spark/conf/spark-defaults.conf で以下の通り設定する。

spark.sql.parquet.writeLegacyFormat     true

設定を確認する

$ pyspark
>>> from pyspark.context import SparkContext
>>> from pyspark.sql import SQLContext
>>> from pyspark.sql.functions import year, month, date_format
>>> sc._jsc.hadoopConfiguration().set("mapreduce.fileoutputcommitter.algorithm.version", "2")
>>> sc._conf.getAll()
[(u'spark.eventLog.enabled', u'true'), (u'spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version.emr_internal_use_only.EmrFileSystem', u'2'), （中略）(u'spark.sql.parquet.writeLegacyFormat', u'true'), （中略）(u'spark.blacklist.decommissioning.enabled', u'true')]

環境

emr-5.19.0
Spark 2.3.2

参考

2018-12-08

QuickSight で Athena にクエリしようとすると Access Denied と怒られる

AWS

事象

Athenaから直接クエリを実行できるが、QuickSightから同じテーブルにクエリすると "Access Denied" と怒られる。

[Simba][AthenaJDBC](100071) An error has been thrown from the AWS Athena client. Access Denied

原因

QuickSight でユーザー作成時にS3へのアクセス許可設定をしていない。

解決策

以下の通り設定する。当然の設定だが、この事象が発生していると相談を受けた場合の逆引き用にメモ。

アプリケーションバーでユーザー名を選択してから、[Manage QuickSight] を選択します。

[Account settings] を選択します。

[Account permissions] (アカウント権限) で、[Edit AWS Permissions] (AWS アクセス許可の編集) を選択します。

(Amazon QuickSight Enterprise Edition アカウントのみ) AWS サインインページで、AWS または IAM の認証情報を入力します。

[Edit QuickSight read-only access to AWS resources] ページで、[Enable autodiscovery of your data and users in your AWS Redshift, RDS, and IAM services] を選択します。これにより、Amazon QuickSight は、AWS アカウントに関連付けられたこれらのタイプのリソースを自動検出することができます。または、このセクションを展開し、Amazon QuickSight で使用するリソースの個々のオプションを選択します。

Amazon S3 バケットが 1 つ以上ある場合は、[Amazon S3 (all buckets) ( (すべてのバケット))] チェックボックスをオンにして、それらのバケットに対する Amazon QuickSight アクセス権限を編集します。[Choose Amazon S3 buckets ( バケットを選択)] で、Amazon QuickSight に対して使用可能にするバケットを選択してから、[Select buckets (バケットを選択)] を選択します。

Amazon Athena データベースがある場合は、[Athena] を選択して、Amazon QuickSight がそれらのデータベースにアクセスできるようにします。

[Apply] を選択します。

Amazon QuickSight の AWS リソースへのアクセス権限の管理 - Amazon QuickSight

2018-12-06

Spark DataFrame の repartition メソッドは何をするものか

Spark

Spark DataFrame の repartition(パーティション数, カラム名) とすると指定したカラムで指定したパーティション数にパーティショニングする。パーティション数を省略するとデフォルト値（Spark 2.3.2 では 200）になる。

repartition(numPartitions, *cols)
Returns a new DataFrame partitioned by the given partitioning expressions. The resulting DataFrame is hash partitioned.
numPartitions can be an int to specify the target number of partitions or a Column. If it is a Column, it will be used as the first partitioning column. If not specified, the default number of partitions is used.
pyspark.sql module — PySpark 2.3.2 documentation

Other Configuration Options
The following options can also be used to tune the performance of query execution. It is possible that these options will be deprecated in future release as more optimizations are performed automatically.

Property Name Default Meaning

spark.sql.shuffle.partitions 200 Configures the number of partitions to use when shuffling data for joins or aggregations.

Spark SQL and DataFrames - Spark 2.3.2 Documentation

Property Name	Default	Meaning
spark.sql.shuffle.partitions	200	Configures the number of partitions to use when shuffling data for joins or aggregations.

以下、実行例。
PySpark を起動して、S3 の JSON を読んで DataFrame を生成する。

$ pyspark

（中略）

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.2
      /_/

Using Python version 2.7.14 (default, May  2 2018 18:31:34)
SparkSession available as 'spark'.
>>> from pyspark.context import SparkContext
>>> from pyspark.sql import SQLContext
>>> from pyspark.sql.functions import year, month, date_format
>>>
>>> your_backet_name = "data-bucket"
>>> dataset = "sh10"
>>> in_path = "s3://{your_backet_name}/data/json/{dataset}/sales/*.gz".format(your_backet_name=your_backet_name, dataset=dataset)
>>> out_path = "s3://{your_backet_name}/data/parquet_pyspark/{dataset}/sales/".format(your_backet_name=your_backet_name, dataset=dataset)
>>>
>>> sc._jsc.hadoopConfiguration().set("mapreduce.fileoutputcommitter.algorithm.version", "2")
>>> sqlContext = SQLContext(sc)
>>> sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
>>>
>>> df = sqlContext.read.json(in_path)

Dataframe の内容を表示する。

>>> df.show()
+-----------+----------+-----------+--------+------------------+-------+--------+-------------+------+-----------+----------+-------------------+
|amount_sold|channel_id|courier_org| cust_id|fulfillment_center|prod_id|promo_id|quantity_sold|seller|tax_country|tax_region|            time_id|
+-----------+----------+-----------+--------+------------------+-------+--------+-------------+------+-----------+----------+-------------------+
|       82.0|         3|       1190|11635215|             13260|     62|     256|         44.0| 10691|         NQ|        HA|1995-01-01 00:00:00|
|       66.0|         3|       1459| 7887314|             14912|    102|     120|         84.0| 10944|         CI|        MU|1995-01-01 00:00:00|
|       16.0|         5|       1124| 7245615|             13368|     66|     136|         16.0| 10337|         YH|          |1995-01-01 00:00:00|
|       26.0|         2|       1319| 5363281|             14091|     59|     445|         93.0| 10863|         LH|        CU|1995-01-01 00:00:00|
|       91.0|         2|       1201|14565805|             13145|    126|     422|         21.0| 10568|         PH|        QW|1995-01-01 00:00:00|
|       72.0|         9|       1497|13073124|             14572|     43|     495|         57.0| 10778|         EW|        JB|1995-01-01 00:00:00|
|       83.0|         4|       1419|13223847|             13466|     15|     185|          6.0| 10153|         DW|          |1995-01-01 00:00:00|
|       28.0|         5|       1083| 8831083|             13148|     72|      91|         75.0| 10038|         NP|          |1995-01-01 00:00:00|
|       79.0|         2|       1364| 1122827|             14594|     76|     307|         83.0| 10896|         HR|          |1995-01-01 00:00:00|
|       36.0|         4|       1408|13016476|             14703|     65|      84|         70.0| 10116|         SB|        RV|1995-01-01 00:00:00|
|       65.0|         3|       1183| 6557310|             14384|     89|     362|         65.0| 10765|         CD|        OC|1995-01-01 00:00:00|
|        9.0|         9|       1023|11633931|             13586|     72|     316|         48.0| 10572|         JN|        LU|1995-01-01 00:00:00|
|        1.0|         5|       1515| 7868183|             14906|     97|     244|         98.0| 10605|         HE|        WO|1995-01-01 00:00:00|
|       93.0|         4|       1243| 4513870|             14918|     66|     170|         83.0| 10336|         XG|        US|1995-01-01 00:00:00|
|       13.0|         4|       1456| 9731389|             14339|     21|      90|         74.0| 10326|         TR|        NO|1995-01-01 00:00:00|
|        8.0|         9|       1541|12388703|             13591|    125|     124|         45.0| 10733|         AU|        AA|1995-01-01 00:00:00|
|       39.0|         9|       1221| 4713346|             14190|    111|     400|         43.0| 10979|         PD|          |1995-01-01 00:00:00|
|       48.0|         5|       1554| 4589564|             13727|    121|     331|         70.0| 10626|         AF|        PV|1995-01-01 00:00:00|
|        4.0|         4|       1048| 1051232|             14155|     48|     511|         36.0| 10224|         TF|          |1995-01-01 00:00:00|
|        7.0|         4|       1031|14342970|             14068|    130|     507|         29.0| 10128|         TU|          |1995-01-01 00:00:00|
+-----------+----------+-----------+--------+------------------+-------+--------+-------------+------+-----------+----------+-------------------+
only showing top 20 rows

DataFrame に year カラムを追加して表示する。

>>> yearAddedDf = df.withColumn("year", year(df.time_id))
>>> yearAddedDf.show()
+-----------+----------+-----------+--------+------------------+-------+--------+-------------+------+-----------+----------+-------------------+----+
|amount_sold|channel_id|courier_org| cust_id|fulfillment_center|prod_id|promo_id|quantity_sold|seller|tax_country|tax_region|            time_id|year|
+-----------+----------+-----------+--------+------------------+-------+--------+-------------+------+-----------+----------+-------------------+----+
|       82.0|         3|       1190|11635215|             13260|     62|     256|         44.0| 10691|         NQ|        HA|1995-01-01 00:00:00|1995|
|       66.0|         3|       1459| 7887314|             14912|    102|     120|         84.0| 10944|         CI|        MU|1995-01-01 00:00:00|1995|
|       16.0|         5|       1124| 7245615|             13368|     66|     136|         16.0| 10337|         YH|          |1995-01-01 00:00:00|1995|
|       26.0|         2|       1319| 5363281|             14091|     59|     445|         93.0| 10863|         LH|        CU|1995-01-01 00:00:00|1995|
|       91.0|         2|       1201|14565805|             13145|    126|     422|         21.0| 10568|         PH|        QW|1995-01-01 00:00:00|1995|
|       72.0|         9|       1497|13073124|             14572|     43|     495|         57.0| 10778|         EW|        JB|1995-01-01 00:00:00|1995|
|       83.0|         4|       1419|13223847|             13466|     15|     185|          6.0| 10153|         DW|          |1995-01-01 00:00:00|1995|
|       28.0|         5|       1083| 8831083|             13148|     72|      91|         75.0| 10038|         NP|          |1995-01-01 00:00:00|1995|
|       79.0|         2|       1364| 1122827|             14594|     76|     307|         83.0| 10896|         HR|          |1995-01-01 00:00:00|1995|
|       36.0|         4|       1408|13016476|             14703|     65|      84|         70.0| 10116|         SB|        RV|1995-01-01 00:00:00|1995|
|       65.0|         3|       1183| 6557310|             14384|     89|     362|         65.0| 10765|         CD|        OC|1995-01-01 00:00:00|1995|
|        9.0|         9|       1023|11633931|             13586|     72|     316|         48.0| 10572|         JN|        LU|1995-01-01 00:00:00|1995|
|        1.0|         5|       1515| 7868183|             14906|     97|     244|         98.0| 10605|         HE|        WO|1995-01-01 00:00:00|1995|
|       93.0|         4|       1243| 4513870|             14918|     66|     170|         83.0| 10336|         XG|        US|1995-01-01 00:00:00|1995|
|       13.0|         4|       1456| 9731389|             14339|     21|      90|         74.0| 10326|         TR|        NO|1995-01-01 00:00:00|1995|
|        8.0|         9|       1541|12388703|             13591|    125|     124|         45.0| 10733|         AU|        AA|1995-01-01 00:00:00|1995|
|       39.0|         9|       1221| 4713346|             14190|    111|     400|         43.0| 10979|         PD|          |1995-01-01 00:00:00|1995|
|       48.0|         5|       1554| 4589564|             13727|    121|     331|         70.0| 10626|         AF|        PV|1995-01-01 00:00:00|1995|
|        4.0|         4|       1048| 1051232|             14155|     48|     511|         36.0| 10224|         TF|          |1995-01-01 00:00:00|1995|
|        7.0|         4|       1031|14342970|             14068|    130|     507|         29.0| 10128|         TU|          |1995-01-01 00:00:00|1995|
+-----------+----------+-----------+--------+------------------+-------+--------+-------------+------+-----------+----------+-------------------+----+
only showing top 20 rows

DataFrame に month を追加して表示する。

>>> monthAddedDf = yearAddedDf.withColumn("month", month(yearAddedDf.time_id))
>>> monthAddedDf.show()
+-----------+----------+-----------+--------+------------------+-------+--------+-------------+------+-----------+----------+-------------------+----+-----+
|amount_sold|channel_id|courier_org| cust_id|fulfillment_center|prod_id|promo_id|quantity_sold|seller|tax_country|tax_region|            time_id|year|month|
+-----------+----------+-----------+--------+------------------+-------+--------+-------------+------+-----------+----------+-------------------+----+-----+
|       82.0|         3|       1190|11635215|             13260|     62|     256|         44.0| 10691|         NQ|        HA|1995-01-01 00:00:00|1995|    1|
|       66.0|         3|       1459| 7887314|             14912|    102|     120|         84.0| 10944|         CI|        MU|1995-01-01 00:00:00|1995|    1|
|       16.0|         5|       1124| 7245615|             13368|     66|     136|         16.0| 10337|         YH|          |1995-01-01 00:00:00|1995|    1|
|       26.0|         2|       1319| 5363281|             14091|     59|     445|         93.0| 10863|         LH|        CU|1995-01-01 00:00:00|1995|    1|
|       91.0|         2|       1201|14565805|             13145|    126|     422|         21.0| 10568|         PH|        QW|1995-01-01 00:00:00|1995|    1|
|       72.0|         9|       1497|13073124|             14572|     43|     495|         57.0| 10778|         EW|        JB|1995-01-01 00:00:00|1995|    1|
|       83.0|         4|       1419|13223847|             13466|     15|     185|          6.0| 10153|         DW|          |1995-01-01 00:00:00|1995|    1|
|       28.0|         5|       1083| 8831083|             13148|     72|      91|         75.0| 10038|         NP|          |1995-01-01 00:00:00|1995|    1|
|       79.0|         2|       1364| 1122827|             14594|     76|     307|         83.0| 10896|         HR|          |1995-01-01 00:00:00|1995|    1|
|       36.0|         4|       1408|13016476|             14703|     65|      84|         70.0| 10116|         SB|        RV|1995-01-01 00:00:00|1995|    1|
|       65.0|         3|       1183| 6557310|             14384|     89|     362|         65.0| 10765|         CD|        OC|1995-01-01 00:00:00|1995|    1|
|        9.0|         9|       1023|11633931|             13586|     72|     316|         48.0| 10572|         JN|        LU|1995-01-01 00:00:00|1995|    1|
|        1.0|         5|       1515| 7868183|             14906|     97|     244|         98.0| 10605|         HE|        WO|1995-01-01 00:00:00|1995|    1|
|       93.0|         4|       1243| 4513870|             14918|     66|     170|         83.0| 10336|         XG|        US|1995-01-01 00:00:00|1995|    1|
|       13.0|         4|       1456| 9731389|             14339|     21|      90|         74.0| 10326|         TR|        NO|1995-01-01 00:00:00|1995|    1|
|        8.0|         9|       1541|12388703|             13591|    125|     124|         45.0| 10733|         AU|        AA|1995-01-01 00:00:00|1995|    1|
|       39.0|         9|       1221| 4713346|             14190|    111|     400|         43.0| 10979|         PD|          |1995-01-01 00:00:00|1995|    1|
|       48.0|         5|       1554| 4589564|             13727|    121|     331|         70.0| 10626|         AF|        PV|1995-01-01 00:00:00|1995|    1|
|        4.0|         4|       1048| 1051232|             14155|     48|     511|         36.0| 10224|         TF|          |1995-01-01 00:00:00|1995|    1|
|        7.0|         4|       1031|14342970|             14068|    130|     507|         29.0| 10128|         TU|          |1995-01-01 00:00:00|1995|    1|
+-----------+----------+-----------+--------+------------------+-------+--------+-------------+------+-----------+----------+-------------------+----+-----+
only showing top 20 rows

DataFrame に yyyyMM を追加して表示する。

>>> yyyymmAddedDf = monthAddedDf.withColumn("yyyymm", date_format(monthAddedDf.time_id, 'yyyyMM'))
>>> yyyymmAddedDf.show()
+-----------+----------+-----------+--------+------------------+-------+--------+-------------+------+-----------+----------+-------------------+----+-----+------+
|amount_sold|channel_id|courier_org| cust_id|fulfillment_center|prod_id|promo_id|quantity_sold|seller|tax_country|tax_region|            time_id|year|month|yyyymm|
+-----------+----------+-----------+--------+------------------+-------+--------+-------------+------+-----------+----------+-------------------+----+-----+------+
|       82.0|         3|       1190|11635215|             13260|     62|     256|         44.0| 10691|         NQ|        HA|1995-01-01 00:00:00|1995|    1|199501|
|       66.0|         3|       1459| 7887314|             14912|    102|     120|         84.0| 10944|         CI|        MU|1995-01-01 00:00:00|1995|    1|199501|
|       16.0|         5|       1124| 7245615|             13368|     66|     136|         16.0| 10337|         YH|          |1995-01-01 00:00:00|1995|    1|199501|
|       26.0|         2|       1319| 5363281|             14091|     59|     445|         93.0| 10863|         LH|        CU|1995-01-01 00:00:00|1995|    1|199501|
|       91.0|         2|       1201|14565805|             13145|    126|     422|         21.0| 10568|         PH|        QW|1995-01-01 00:00:00|1995|    1|199501|
|       72.0|         9|       1497|13073124|             14572|     43|     495|         57.0| 10778|         EW|        JB|1995-01-01 00:00:00|1995|    1|199501|
|       83.0|         4|       1419|13223847|             13466|     15|     185|          6.0| 10153|         DW|          |1995-01-01 00:00:00|1995|    1|199501|
|       28.0|         5|       1083| 8831083|             13148|     72|      91|         75.0| 10038|         NP|          |1995-01-01 00:00:00|1995|    1|199501|
|       79.0|         2|       1364| 1122827|             14594|     76|     307|         83.0| 10896|         HR|          |1995-01-01 00:00:00|1995|    1|199501|
|       36.0|         4|       1408|13016476|             14703|     65|      84|         70.0| 10116|         SB|        RV|1995-01-01 00:00:00|1995|    1|199501|
|       65.0|         3|       1183| 6557310|             14384|     89|     362|         65.0| 10765|         CD|        OC|1995-01-01 00:00:00|1995|    1|199501|
|        9.0|         9|       1023|11633931|             13586|     72|     316|         48.0| 10572|         JN|        LU|1995-01-01 00:00:00|1995|    1|199501|
|        1.0|         5|       1515| 7868183|             14906|     97|     244|         98.0| 10605|         HE|        WO|1995-01-01 00:00:00|1995|    1|199501|
|       93.0|         4|       1243| 4513870|             14918|     66|     170|         83.0| 10336|         XG|        US|1995-01-01 00:00:00|1995|    1|199501|
|       13.0|         4|       1456| 9731389|             14339|     21|      90|         74.0| 10326|         TR|        NO|1995-01-01 00:00:00|1995|    1|199501|
|        8.0|         9|       1541|12388703|             13591|    125|     124|         45.0| 10733|         AU|        AA|1995-01-01 00:00:00|1995|    1|199501|
|       39.0|         9|       1221| 4713346|             14190|    111|     400|         43.0| 10979|         PD|          |1995-01-01 00:00:00|1995|    1|199501|
|       48.0|         5|       1554| 4589564|             13727|    121|     331|         70.0| 10626|         AF|        PV|1995-01-01 00:00:00|1995|    1|199501|
|        4.0|         4|       1048| 1051232|             14155|     48|     511|         36.0| 10224|         TF|          |1995-01-01 00:00:00|1995|    1|199501|
|        7.0|         4|       1031|14342970|             14068|    130|     507|         29.0| 10128|         TU|          |1995-01-01 00:00:00|1995|    1|199501|
+-----------+----------+-----------+--------+------------------+-------+--------+-------------+------+-----------+----------+-------------------+----+-----+------+
only showing top 20 rows

yyyymm 列でパーティショニングして表示する。

>>> repartitionedDf = yyyymmAddedDf.repartition("yyyymm")
>>> repartitionedDf.show()
+-----------+----------+-----------+--------+------------------+-------+--------+-------------+------+-----------+----------+-------------------+----+-----+------+
|amount_sold|channel_id|courier_org| cust_id|fulfillment_center|prod_id|promo_id|quantity_sold|seller|tax_country|tax_region|            time_id|year|month|yyyymm|
+-----------+----------+-----------+--------+------------------+-------+--------+-------------+------+-----------+----------+-------------------+----+-----+------+
|       16.0|         3|       1209|10459355|             14599|     87|     532|         28.0| 10706|         RQ|        XO|2000-02-01 00:00:00|2000|    2|200002|
|       77.0|         2|       1412| 5872809|             13794|     17|     286|         10.0| 10566|         LE|        DX|2000-02-01 00:00:00|2000|    2|200002|
|       85.0|         2|       1133| 1736310|             14520|    143|     232|         70.0| 10234|         GK|        RL|2000-02-01 00:00:00|2000|    2|200002|
|       16.0|         4|       1061| 2921916|             13141|    146|     498|         53.0| 10542|         UN|        JT|2000-02-01 00:00:00|2000|    2|200002|
|       74.0|         3|       1169| 7134487|             13887|     55|     198|         67.0| 10557|         XV|          |2000-02-01 00:00:00|2000|    2|200002|
|        1.0|         9|       1465| 9105076|             14640|     48|     363|          9.0| 10349|         LU|        QC|2000-02-01 00:00:00|2000|    2|200002|
|       73.0|         9|       1408| 5426572|             13072|     74|     339|         66.0| 10914|         BQ|        BD|2000-02-01 00:00:00|2000|    2|200002|
|       35.0|         4|       1013|11750382|             14504|    144|      89|         12.0| 10220|         FC|        UY|2000-02-01 00:00:00|2000|    2|200002|
|        3.0|         4|       1467|14292100|             14318|     70|     526|         77.0| 10458|         EW|        HL|2000-02-01 00:00:00|2000|    2|200002|
|       96.0|         4|       1255|14304714|             13334|     77|     460|         34.0| 10409|         EU|          |2000-02-01 00:00:00|2000|    2|200002|
|       74.0|         5|       1373|12033050|             14882|     74|     476|          3.0| 10191|         BC|        US|2000-02-01 00:00:00|2000|    2|200002|
|       81.0|         9|       1363| 5426406|             14751|     84|     195|         14.0| 10506|         RK|        EL|2000-02-01 00:00:00|2000|    2|200002|
|       32.0|         4|       1548| 3004427|             13914|     66|     266|         21.0| 10631|         DD|        TY|2000-02-01 00:00:00|2000|    2|200002|
|       79.0|         3|       1508| 6564626|             14302|     39|     520|         52.0| 10762|         RZ|        CN|2000-02-01 00:00:00|2000|    2|200002|
|       64.0|         3|       1478| 6779800|             13550|    110|     309|         38.0| 10794|         KY|        EN|2000-02-01 00:00:00|2000|    2|200002|
|       67.0|         5|       1542|14438640|             14058|     46|     514|         19.0| 10945|         AY|          |2000-02-01 00:00:00|2000|    2|200002|
|       38.0|         2|       1092|13135799|             13862|     31|     295|         72.0| 10349|         RD|          |2000-02-01 00:00:00|2000|    2|200002|
|       80.0|         5|       1178|  458983|             13180|     99|     526|         64.0| 10931|         IY|        QG|2000-02-01 00:00:00|2000|    2|200002|
|       68.0|         2|       1511| 7818776|             13687|    147|     506|         67.0| 10979|         UY|          |2000-02-01 00:00:00|2000|    2|200002|
|       71.0|         3|       1514|10769032|             14082|    101|     122|         36.0| 10187|         YE|          |2000-02-01 00:00:00|2000|    2|200002|
+-----------+----------+-----------+--------+------------------+-------+--------+-------------+------+-----------+----------+-------------------+----+-----+------+
only showing top 20 rows

パーティショニングのために追加した yyyymm 列を削除する。

>>> droppedDf = repartitionedDf.drop("yyyymm")
>>> droppedDf.show()
+-----------+----------+-----------+--------+------------------+-------+--------+-------------+------+-----------+----------+-------------------+----+-----+
|amount_sold|channel_id|courier_org| cust_id|fulfillment_center|prod_id|promo_id|quantity_sold|seller|tax_country|tax_region|            time_id|year|month|
+-----------+----------+-----------+--------+------------------+-------+--------+-------------+------+-----------+----------+-------------------+----+-----+
|       64.0|         3|       1238|  240661|             14588|    115|      79|         59.0| 10109|         IC|        QY|2000-02-01 00:00:00|2000|    2|
|       35.0|         5|       1305| 5279056|             13013|    131|      61|         11.0| 10523|         UG|        HU|2000-02-01 00:00:00|2000|    2|
|       18.0|         9|       1149|  562588|             14627|     87|     515|         91.0| 10146|         CE|        CY|2000-02-01 00:00:00|2000|    2|
|       50.0|         4|       1387|12492889|             13649|     73|     289|         67.0| 10447|         ZS|          |2000-02-01 00:00:00|2000|    2|
|       20.0|         4|       1457| 4023446|             14150|    125|     173|         94.0| 10694|         GE|        DB|2000-02-01 00:00:00|2000|    2|
|       24.0|         4|       1212|12139139|             13012|    140|     280|         67.0| 10111|         VV|        DJ|2000-02-01 00:00:00|2000|    2|
|       66.0|         3|       1311| 2633171|             14815|    106|     242|         21.0| 10388|         MJ|        BD|2000-02-01 00:00:00|2000|    2|
|       73.0|         5|       1511|10583767|             14488|    109|     531|         89.0| 10392|         PP|          |2000-02-01 00:00:00|2000|    2|
|       21.0|         3|       1363|  877617|             14501|     29|     311|         28.0| 10962|         GF|        FX|2000-02-01 00:00:00|2000|    2|
|       62.0|         4|       1549| 4429647|             13447|     33|     355|         52.0| 10699|         VC|        BF|2000-02-01 00:00:00|2000|    2|
|       16.0|         3|       1164|11654304|             13910|    125|     274|         25.0| 10337|         JI|        BI|2000-02-01 00:00:00|2000|    2|
|        5.0|         3|       1167| 4927333|             13123|     51|     508|         82.0| 10219|         FB|          |2000-02-01 00:00:00|2000|    2|
|        7.0|         3|       1001| 8434437|             14778|     18|     407|         19.0| 10192|         WD|        OW|2000-02-01 00:00:00|2000|    2|
|       53.0|         3|       1038|14058980|             14742|    111|     160|         92.0| 10527|         BM|        XN|2000-02-01 00:00:00|2000|    2|
|       55.0|         4|       1169| 9633844|             14862|     90|     322|          4.0| 10712|         HJ|        XC|2000-02-01 00:00:00|2000|    2|
|       80.0|         2|       1380| 1516950|             14065|     56|     166|          5.0| 10301|         BF|          |2000-02-01 00:00:00|2000|    2|
|       90.0|         9|       1039| 9921850|             14812|     46|     361|         23.0| 10062|         CY|          |2000-02-01 00:00:00|2000|    2|
|       40.0|         5|       1150| 7263915|             13059|     48|     116|         10.0| 10455|         CL|        OS|2000-02-01 00:00:00|2000|    2|
|       93.0|         4|       1131| 4353455|             13106|    110|     415|         83.0| 10533|         FJ|        AJ|2000-02-01 00:00:00|2000|    2|
|       97.0|         4|       1016|12261020|             14187|     27|     281|          1.0| 10495|         UO|        XY|2000-02-01 00:00:00|2000|    2|
+-----------+----------+-----------+--------+------------------+-------+--------+-------------+------+-----------+----------+-------------------+----+-----+
only showing top 20 rows

パーティション数を確認する。

>>> droppedDf.rdd.getNumPartitions()
200

参考

withColumn(colName, col)
Returns a new DataFrame by adding a column or replacing the existing column that has the same name.
The column expression must be an expression over this DataFrame; attempting to add a column from some other dataframe will raise an error.

Parameters:

colName – string, name of the new column.

col – a Column expression for the new column.
>>> df.withColumn('age2', df.age + 2).collect()
[Row(age=2, name='Alice', age2=4), Row(age=5, name='Bob', age2=7)]
pyspark.sql module — PySpark 2.3.2 documentation

repartition(numPartitions, *cols)
Returns a new DataFrame partitioned by the given partitioning expressions. The resulting DataFrame is hash partitioned.
numPartitions can be an int to specify the target number of partitions or a Column. If it is a Column, it will be used as the first partitioning column. If not specified, the default number of partitions is used.
Changed in version 1.6: Added optional arguments to specify the partitioning columns. Also made numPartitions optional if partitioning columns are specified.
>>> df.repartition(10).rdd.getNumPartitions()
10
>>> data = df.union(df).repartition("age")
>>> data.show()
+---+-----+
|age| name|
+---+-----+
|  5|  Bob|
|  5|  Bob|
|  2|Alice|
|  2|Alice|
+---+-----+
>>> data = data.repartition(7, "age")
>>> data.show()
+---+-----+
|age| name|
+---+-----+
|  2|Alice|
|  5|  Bob|
|  2|Alice|
|  5|  Bob|
+---+-----+
>>> data.rdd.getNumPartitions()
7
>>> data = data.repartition("name", "age")
>>> data.show()
+---+-----+
|age| name|
+---+-----+
|  5|  Bob|
|  5|  Bob|
|  2|Alice|
|  2|Alice|
+---+-----+
pyspark.sql module — PySpark 2.3.2 documentation

2018-12-06

CloudFront で Distribution の Origin Domain Name を変更して切り替るまでのタイムラグ

AWS

CloudFront で Distribution の Origin Domain Name を変更して切り替るまでのタイムラグは TTL でコントロールできる。

AWSマネジメントコンソールで CloudFront を選択し、[Distribution]-[Origins and Origin Groups]を選択し、[Edit]で編集する。

f:id:yohei-a:20181205174617p:plain

Origin Domain Name で参照先を切り替える。

f:id:yohei-a:20181205174608p:plain
[Distribution]-[Behaviors]で Behavior を選択して、Edit で
f:id:yohei-a:20181205174634p:plain

TTL を編集する。

f:id:yohei-a:20181205174641p:plain

参考

最小 TTL
オブジェクトが更新されたかどうかを調べるために CloudFront がオリジンに別のリクエストを送るまでに、オブジェクトを CloudFront キャッシュに保持する最小期間 (秒) を指定します。[Minimum TTL] のデフォルト値は 0 (秒) です。

重要

キャッシュ動作のためにすべてのヘッダーをオリジンに転送するように CloudFront を設定した場合、CloudFront は関連付けられたオブジェクトをキャッシュしません。その代わり、CloudFront はそのオブジェクトに関するすべてのリクエストをオリジンに転送します。その設定では、[Minimum TTL] の値を 0 にする必要があります。
（中略）
最大 TTL
オブジェクトが更新されたかどうかを CloudFront がオリジンに照会するまでに、オブジェクトを CloudFront キャッシュに保持する最大期間 (秒) を指定します。[Maximum TTL] に指定する値は、オリジンが Cache-Control max-age、Cache-Control s-maxage、Expires などの HTTP ヘッダーをオブジェクトに追加するときにのみ適用されます。詳細については、「コンテンツがエッジキャッシュに保持される期間の管理 (有効期限)」を参照してください。
（中略）
デフォルト TTL
オブジェクトが更新されたかどうかを調べるために CloudFront がオリジンに別のリクエストを送るまでオブジェクトを CloudFront キャッシュに保持するデフォルト期間 (秒) を指定します。[Default TTL] に指定する値は、オリジンが Cache-Control max-age、Cache-Control s-maxage、Expires などの HTTP ヘッダーをオブジェクトに追加しないときにのみ適用されます。詳細については、「コンテンツがエッジキャッシュに保持される期間の管理 (有効期限)」を参照してください。
（中略）
[Default TTL] のデフォルト値は 86,400 (秒)、つまり 1 日です。[Minimum TTL] の値を 86,400 (秒) より大きい値に変更する場合は、[Default TTL] のデフォルト値を [Minimum TTL] の値に変更します。
ディストリビューションを作成または更新する場合に指定する値 - Amazon CloudFront

2018-12-05

macOS Sierra で DNS リゾルバキャッシュを確認する方法

/var/log/system.log を tail して、

% tail -f /var/log/system.log

以下のコマンドで出力する。

% sudo killall -INFO mDNSResponder

参考

sierra - mDNSResponder not logging - Ask Different

% man mDNSResponder

（中略）

     By default, only log level Error is logged.

     A SIGUSR1 signal toggles additional logging, with Warning and Notice enabled by default:

           % sudo killall -USR1 mDNSResponder

     Once this logging is enabled, users can additionally use syslog(1) to change the log filter for the process. For example, to enable log levels Emer-
     gency - Debug:

           % sudo syslog -c mDNSResponder -d

     A SIGUSR2 signal toggles packet logging:

           % sudo killall -USR2 mDNSResponder

     A SIGINFO signal will dump a snapshot summary of the internal state to /var/log/system.log:

           % sudo killall -INFO mDNSResponder

2018-12-05

DNS の TTL はキャッシュサーバとリゾルバキャッシュで使われる

AWS

DNSのレコードはDNSのキャッシュサーバでもキャッシュされるし、クライアントOSのDNSリゾルバでもキャッシュされ、キャッシュ期間は TTL で決まる。従って、DNS変更時に浸透するまでのタイムラグを短くしたい場合は、事前に TTL を短くしておき十分に浸透した上で、切り替えるとタイムラグが小さくなる。後はOSではなくアプリケーションレイヤーでキャッシュするものもある。「DNS浸透いうな」と言いますが、よい表現が見つからず一旦「浸透」と書きました。

参考

Windows 2000／XPにおけるDNSリゾルバ・キャッシュは、次のようにして機能している。
（中略）
3. キャッシュに保持される時間には限度がある。DNSサーバから返された結果のレコード（SOAリソース・レコード）には必ずTTL（Time To Live。生存時間）が指定されているので、その時間になるまではDNSサーバに対する再クエリーは行われない（キャッシュされる）。ただしこの有効時間は（デフォルトでは）最大で1日（8万6400秒）に制限されており、1日以上経過すれば、元のDNSレコードのTTLにかかわらず、キャッシュから破棄される。否定応答の場合は、最大300秒キャッシュされ、その後キャッシュから破棄される。
Windowsの名前解決のトラブルシューティング（DNSリゾルバーキャッシュ編）：Tech TIPS - ＠IT

「DNS浸透いうな」と言うけれど… (#ssmjp 2018/07) from Yoshikazu GOTO

なぜ「DNSの浸透」は問題視されるのか:Geekなぺーじ

2018-12-04

Presto でParquet にクエリするとjava.lang.UnsupportedOperationException: com.facebook.presto.spi.type.LongDecimalType

事象

S3のParquetファイルにHiveカタログで外部表を定義して、Presto でクエリすると、"java.lang.UnsupportedOperationException: com.facebook.presto.spi.type.LongDecimalType" という例外が発生する。

$ hive
CREATE EXTERNAL TABLE IF NOT EXISTS sh10.sales(
  prod_id DECIMAL(38,0),
  cust_id DECIMAL(38,0),
  time_id TIMESTAMP,
  channel_id DECIMAL(38,0),
  promo_id DECIMAL(38,0),
  quantity_sold DECIMAL(38,2),
  seller INT,
  fulfillment_center INT,
  courier_org INT,
  tax_country VARCHAR(3),
  tax_region VARCHAR(3),
  amount_sold DECIMAL(38,2)
)
PARTITIONED BY (year int, month int)
STORED AS PARQUET
LOCATION 's3://.../data/parquet_pyspark/sh10/sales/'
  tblproperties ("parquet.compress"="SNAPPY")
;
hive> MSCK REPAIR TABLE sh10.sales;
hive> exit;

$ presto-cli
presto> use hive.sh10;
presto:sh10> select * from sales where prod_id is not null limit 10;
Query failed (#query_id): com.facebook.presto.spi.type.LongDecimalType

/var/log/presto/server.log

2018-12-04T03:22:56.740Z        ERROR   remote-task-callback-73 com.facebook.presto.execution.StageStateMachine Stage 20181204_032254_00002_emvnj.1 failed
java.lang.UnsupportedOperationException: com.facebook.presto.spi.type.LongDecimalType
        at com.facebook.presto.spi.type.AbstractType.writeLong(AbstractType.java:111)
        at com.facebook.presto.hive.parquet.reader.ParquetIntColumnReader.readValue(ParquetIntColumnReader.java:32)
        at com.facebook.presto.hive.parquet.reader.ParquetPrimitiveColumnReader.lambda$readValues$1(ParquetPrimitiveColumnReader.java:184)
        at com.facebook.presto.hive.parquet.reader.ParquetPrimitiveColumnReader.processValues(ParquetPrimitiveColumnReader.java:204)
        at com.facebook.presto.hive.parquet.reader.ParquetPrimitiveColumnReader.readValues(ParquetPrimitiveColumnReader.java:183)
        at com.facebook.presto.hive.parquet.reader.ParquetPrimitiveColumnReader.readPrimitive(ParquetPrimitiveColumnReader.java:171)
        at com.facebook.presto.hive.parquet.reader.ParquetReader.readPrimitive(ParquetReader.java:209)
        at com.facebook.presto.hive.parquet.reader.ParquetReader.readColumnChunk(ParquetReader.java:259)
        at com.facebook.presto.hive.parquet.reader.ParquetReader.readBlock(ParquetReader.java:242)
        at com.facebook.presto.hive.parquet.ParquetPageSource$ParquetBlockLoader.load(ParquetPageSource.java:244)
        at com.facebook.presto.hive.parquet.ParquetPageSource$ParquetBlockLoader.load(ParquetPageSource.java:222)
        at com.facebook.presto.spi.block.LazyBlock.assureLoaded(LazyBlock.java:269)
        at com.facebook.presto.spi.block.LazyBlock.getLoadedBlock(LazyBlock.java:260)
        at com.facebook.presto.operator.project.DictionaryAwarePageProjection$DictionaryAwarePageProjectionWork.<init>(DictionaryAwarePageProjection.java:97)
        at com.facebook.presto.operator.project.DictionaryAwarePageProjection.project(DictionaryAwarePageProjection.java:75)
        at com.facebook.presto.operator.project.PageProcessor$PositionsPageProcessorIterator.processBatch(PageProcessor.java:276)
        at com.facebook.presto.operator.project.PageProcessor$PositionsPageProcessorIterator.computeNext(PageProcessor.java:182)
        at com.facebook.presto.operator.project.PageProcessor$PositionsPageProcessorIterator.computeNext(PageProcessor.java:129)
        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141)
        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:136)
        at com.facebook.presto.operator.project.PageProcessorOutput.hasNext(PageProcessorOutput.java:49)
        at com.facebook.presto.operator.project.MergingPageOutput.getOutput(MergingPageOutput.java:110)
        at com.facebook.presto.operator.ScanFilterAndProjectOperator.processPageSource(ScanFilterAndProjectOperator.java:287)
        at com.facebook.presto.operator.ScanFilterAndProjectOperator.getOutput(ScanFilterAndProjectOperator.java:226)
        at com.facebook.presto.operator.Driver.processInternal(Driver.java:379)
        at com.facebook.presto.operator.Driver.lambda$processFor$8(Driver.java:283)
        at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:675)
        at com.facebook.presto.operator.Driver.processFor(Driver.java:276)
        at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1053)
        at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:162)
        at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:456)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

解決策

Hiveカタログで外部表を定義する際に DECIMALの精度は 17 以下にする。

$ hive 
hive> drop table sh10.sales;
hive> CREATE EXTERNAL TABLE IF NOT EXISTS sh10.sales(
  prod_id DECIMAL(17,0),
  cust_id DECIMAL(17,0),
  time_id TIMESTAMP,
  channel_id DECIMAL(17,0),
  promo_id DECIMAL(17,0),
  quantity_sold DECIMAL(17,2),
  seller INT,
  fulfillment_center INT,
  courier_org INT,
  tax_country VARCHAR(3),
  tax_region VARCHAR(3),
  amount_sold DECIMAL(17,2)
)
PARTITIONED BY (year int, month int)
STORED AS PARQUET
LOCATION 's3://.../data/parquet_pyspark/sh10/sales/'
  tblproperties ("parquet.compress"="SNAPPY")
;
hive> MSCK REPAIR TABLE sh10_option.sales;

環境

emr-5.19.0
Hive 2.3.3
Presto 0.212

参考

thanks @nezihyigitbasi - so Decimals(>17,x) weren't supported in 0.164
https://github.com/prestodb/presto/issues/8484

presto/AbstractType.java at master · prestodb/presto · GitHub

    @Override
    public void writeLong(BlockBuilder blockBuilder, long value)
    {
        throw new UnsupportedOperationException(getClass().getName());
    }

presto/LongDecimalType.java at master · prestodb/presto · GitHub

ackage com.facebook.presto.spi.type;

import com.facebook.presto.spi.ConnectorSession;
import com.facebook.presto.spi.block.Block;
import com.facebook.presto.spi.block.BlockBuilder;
import com.facebook.presto.spi.block.BlockBuilderStatus;
import com.facebook.presto.spi.block.FixedWidthBlockBuilder;
import com.facebook.presto.spi.block.PageBuilderStatus;
import io.airlift.slice.Slice;

import static com.facebook.presto.spi.type.Decimals.MAX_PRECISION;

presto/Decimals.java at master · prestodb/presto · GitHub

public final class Decimals
{
    private Decimals() {}

    public static final int MAX_PRECISION = 38;
    public static final int MAX_SHORT_PRECISION = 18;

ablog

不器用で落着きのない技術者のメモ