列指向データベースのページのデータ構造

行指向データベースは行単位でページ(Oracle Database でいうデータブロック)にデータを格納しているのに対して、列指向データベースは列ごとにページに格納している。クエリ実行時に結果セットを返す際に列別にバラバラのページに格納されているデータをどうやってタプル(レコード)に復元している*1のかと思ったがやはり行IDのようなものを持っているようだ。
行ID は C-Store では pid、MonetDB では BAT(Binary Association Tables) の oid と呼ばれている。
The Design and Implementation of Modern Column-Oriented Database Systems

NSM(N-ary Storage Model): 行方向でブロック（ページ）にデータを格納する方式
DSM(Decomposition Storage Model): 列方向でブロック（ページ）にデータを格納する方式

Introduction to Modern Analytical DB from Takeshi Yamamuro

C-Store の pid や MonetDB の oid のような ID 以外に Join Index という手法もある。

Introduction to Modern Analytical DB from Takeshi Yamamuro

具体的な実装は C-Store のソースコードで調べることができる。

Introduction to Modern Analytical DB from Takeshi Yamamuro

さらに詳しくは諸橋さんの以下のエントリで紹介されている

VLDB 2009 Tutorial on Column-Stores *1 が表示されたとき、笑ってしまい「わたしの教科書です、教科書」と言ってしまった appengine ja night #21 *2。
Google BigQueryなどの仕組みを知りたいときの列指向データベースの説明に - wmo6hash::blog

この大作スライド参照。

VLDB 2009 Tutorial on Column-Stores from Daniel Abadi

https://www.cs.duke.edu/courses/fall01/cps216/lectures/10-physical.pdf　もシンプルで良い資料。

NSM = N-ary Storage Model
DSM = Decomposition Storage Model

補足

C-Store はあのマイケル・ストーンブレーカー教授が産んだ列指向データベースで、商業的にも成功している Vertica(HP社) のルーツです。

VLDB2015 会議報告 from Yuto Hayamizu

ストーンブレーカー教授と一緒にこの論文を書いている Daniel Abadi さんは上の VLDB 2009 Tutorial on Column-Stores を書いているうちの一人でもあり、分かりやすい資料を書かれているので要チェックです。

MonetDB については以下参照

あと、カバリングインデックス(インデックスオンリースキャン)と違うのは以下の点だと思います。

カバリングインデックスは複数列のデータがページに格納されるが、C-Store のような列指向データベースでは列ごとにページに格納され、クエリ実行時に行IDのようなものでタプル(レコード)に復元される。
列指向に最適化している
- 列指向に最適化した統計
- クエリーオプティマイザ
- データ構造
- SIMD

Introduction to Modern Analytical DB from Takeshi Yamamuro

MonetDB/X100 - A DBMS In The CPU Cache

We argue that the poor performance of
MySQL and the commercial RDBMS
“X” is related to the Volcano iterator pipelining model [6]. While the Volcano approach is elegant and efficient for I/O bound queries, in computation-intensive tasks, its tuple-at-a-time execution makes runtime interpretation overhead dominate execution. That is, the time spent by the RDBMS on actual computations is dwarfed by the time spent interpreting
the expressions in a query tree, and looking up the needed attribute values in pages storing tuples in the N-ary storage model (NSM). As a result, quite a high number of CPU instructions are needed for each simple computational operation occurring in a query. Additionally, the instructions per cycle (IPC) efficiency achieved tends to be low, because the tuple-at-a-time model hides from the CPU most parallel execution opportunities, namely those provided by performing computations for different tuples in an overlapping fashion.
MonetDB [2], developed at CWI 1, is an alternative approach to building a query execution system. Its MIL algebra [3] minimizes the interpretation overhead by providing execution primitives that process data in a column-at-a-time fashion. Since MonetDB uses vertically decomposed storage model (DSM) [5], where columns are simple arrays, and each primitive performs only one fixed operation on the input data, the primitive implementation boils down to a sequential loop over aligned input arrays. Modern compilers can apply loop pipelining to such code, allowing MIL primitives to achieve high IPC on super-scalar CPUs. However, the column-at-a-time execution model introduces the extra cost of intermediate result materialization, which is only sustainable inside main memory, thus limiting the scalability of the system.
Figure 1 presents the architecture of X100, a new execution engine for MonetDB that combines the benefits of low-overhead column-wise query execution with the absence of intermediate result materialization in the Volcano iterator model. One of its distinguishing features is vectorized execution: instead of single tuples, entire vectors of values are passed through the pipeline. Another is in-cache processing: all the execution is happening within the CPU cache, using main memory only as a buffer for I/O and large intermediate data structures. These techniques make X100 execute efficiently on modern CPUs and allow it to achieve raw performance comparable to C code.

ビッグデータ分析のデータベース Netezza と Matrix は兄弟？！そして、Amazon Redshiftも？？？
Matrix』と『Netezza』の開発者はアメリカ人のBarry Zane (バリー・ゼイン) 氏。アメリカトップクラスのカーネギーメロン大学を卒業後、Prime Computer社へ入社。1983年にApplix社に移り、CTOに就任、13年間在籍しました。(Applix社は2007年にCognos社により買収。)
その後、2000年にNetezza社を立ち上げ、PostgeSQLをベースに開発したデータベースアプライアンス製品である『Netezza』を開発しました。『Netezza』は、「エンタープライズ向けDWHアプライアンスとは」を再考して開発され、革新的なアーキテクチャ、サーバー、データベース、ストレージをひとつのマシンに統合することで、大量データの分析を可能にしました。
Barry Zane氏: Netezzaを作り, ParAccel MPP (Matrix)を作り, 今はグラフ分析エンジンを手がけるビッグデータ分析向けデータベースのスーパーアーキテクト | Insight Technology, Inc.

Amazon Redshift and the Case for Simpler Data Warehouses

6. Related Work
While Amazon Redshift, when launched, was the first widely available data warehouse-as-a-service, its core database technology (parser, optimizer, engine, storage organization, MPP architecture) was derived from technology licensed from ParAccel. ParAccel belongs to a group of column-oriented DBMS products that appeared from the middle through the end of the 2000s: Vertica, Ingres VectorWise, Infobright, Kickfire, and many others [1]. These systems had several similarities in their design philosophy and list of features, with many of them influenced by two pioneering modern column-store systems: CStore [8] and MonetDB/X100 [3].
Redshift’s compression techniques are similar to those used by Vertica, and their performance tradeoffs are well understood [2]. Redshift foregoes traditional indexes (or projections in CStore/Vertica) and instead focuses on sequential scan speed through compiled code execution and column-block skipping based on value-ranges stored in memory. Infobright’s Knowledge Grid and Netezza’s Zone Maps also rely on block skipping which as a technique was first discussed in [5]. Code compilation techniques in query execution have received renewed attention in academia [6][9] and are also used in other systems such as Microsoft’s Hekaton [4].
http://dl.acm.org/ft_gateway.cfm?ftid=1586891&id=2742795

参考

*1:別々のページに格納されているカラムのデータを結合してレコード（タプル）に何をキーに復元するのか？

ablog

不器用で落着きのない技術者のメモ

列指向データベースのページのデータ構造

補足

参考