I tried generating TPC-DS data with @maropu's TPCDS data generator for Apache Spark.
Setup
Create an EC2 instance
$ sudo yum -y install git java-1.8.0-openjdk-devel.x86_64
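As an optional sanity check before building, confirm the toolchain installed above is on the PATH (the exact version strings will differ by environment):

```shell
# Confirm git and the JDK installed above are available.
git --version
if command -v java >/dev/null; then java -version 2>&1 | head -n 1; fi
```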
Install Spark
- Clone the code from GitHub
$ git clone https://github.com/apache/spark.git
- Build Spark
$ cd spark && ./build/mvn clean package -DskipTests
$ export SPARK_HOME=`pwd`
Install the TPCDS data generator
- Clone the code from GitHub
$ cd ..
$ git clone https://github.com/maropu/spark-tpcds-datagen.git
- Build it
$ cd spark-tpcds-datagen && ./build/mvn clean package -DskipTests
Generate test data
Example 1
$ mkdir -p /tmp/spark-tpcds-data/10
$ nohup ./bin/dsdgen --scale-factor 10 --output-location /tmp/spark-tpcds-data/10 &
- Check the progress.
$ tail -f nohup.out
Using `spark-submit` from path: /home/ec2-user/spark
20/05/06 08:27:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/05/06 08:27:44 INFO SparkContext: Running Spark version 3.1.0-SNAPSHOT
20/05/06 08:27:44 INFO ResourceUtils: ==============================================================
20/05/06 08:27:44 INFO ResourceUtils: No custom resources configured for spark.driver.
20/05/06 08:27:44 INFO ResourceUtils: ==============================================================
20/05/06 08:27:44 INFO SparkContext: Submitted application: org.apache.spark.sql.execution.benchmark.TPCDSDatagen
(snip)
20/05/06 12:28:50 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
20/05/06 12:28:50 INFO SparkContext: Successfully stopped SparkContext
20/05/06 12:28:50 INFO ShutdownHookManager: Shutdown hook called
20/05/06 12:28:50 INFO ShutdownHookManager: Deleting directory /tmp/spark-608db487-493f-4e17-b7d2-22f4802d3c97
20/05/06 12:28:50 INFO ShutdownHookManager: Deleting directory /tmp/spark-912f353f-a193-45a1-b3ec-7e2ebf0d05a6
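As a rule of thumb, the TPC-DS scale factor roughly corresponds to the raw data volume in GB, so --scale-factor 10 above produces on the order of 10 GB (actual on-disk size depends on the output format and compression). The trivial sketch below just spells out that rule of thumb:

```shell
# Scale factor N corresponds to roughly N GB of raw data.
for sf in 1 10 100 500; do
  echo "scale factor ${sf}: roughly ${sf} GB of raw data"
done
```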
Example 2
$ mkdir -p /data/spark-tpcds-data/500
$ nohup ./bin/dsdgen --scale-factor 500 --partition-tables --cluster-by-partition-columns --output-location /data/spark-tpcds-data/500/ &
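With --partition-tables, the large fact tables should be written out in Hive-style partition directories (for example, store_sales partitioned by ss_sold_date_sk), and --cluster-by-partition-columns clusters rows by those columns so each partition ends up with fewer output files. The snippet below only mocks that directory layout for illustration; the partition values are invented, not captured from a real run:

```shell
# Illustration only: mimic the Hive-style layout expected under the
# output location (partition values here are made up).
base=$(mktemp -d)
mkdir -p "${base}/store_sales/ss_sold_date_sk=2450816" \
         "${base}/store_sales/ss_sold_date_sk=2450817"
# One subdirectory per partition value of the table.
find "${base}/store_sales" -mindepth 1 -maxdepth 1 -type d | sort
```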
Run the benchmark
- Run from the Spark directory (the jar path below is relative to it).
$ ./bin/spark-submit \
  --class org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark \
  sql/core/target/spark-sql_2.12-3.1.0-SNAPSHOT-tests.jar \
  --data-location /tmp/spark-tpcds-data
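A full run over all queries can take hours at larger scale factors. The benchmark's argument parser also appears to accept a --query-filter option for running a subset of queries; both the flag name and the comma-separated query-id format are assumptions to verify against your Spark checkout. The sketch below only assembles the command as a string so it can be reviewed before launching:

```shell
# Assemble (without executing) a benchmark run restricted to two queries.
# QUERIES and the --query-filter flag are assumptions, not verified output.
QUERIES="q3,q34"
CMD="./bin/spark-submit \
  --class org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark \
  sql/core/target/spark-sql_2.12-3.1.0-SNAPSHOT-tests.jar \
  --data-location /tmp/spark-tpcds-data \
  --query-filter ${QUERIES}"
echo "${CMD}"
# Review the printed command, then run it with: eval "${CMD}"
```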
Environment
$ cat /etc/system-release
Amazon Linux release 2 (Karoo)
$ uname -r
4.14.173-137.229.amzn2.x86_64