SOSP21 で best paper 賞を受賞した S3 の論文を読んでみる

2024/10/02 追記: サマリースライドへのリンクをはっておく。
www.slideshare.net

SOSP21 で best paper 賞を受賞した S3 の論文 "Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3" を読んでみる。

Abstract
This paper reports our experience applying lightweight formal methods to validate the correctness of ShardStore, a new key-value storage node implementation for the Amazon S3 cloud object storage service. By “lightweight formal methods" we mean a pragmatic approach to verifying the correctness of a production storage node that is under ongoing feature development by a full-time engineering team. We do not aim to achieve full formal verification, but instead emphasize automation, usability, and the ability to continually ensure correctness as both software and its specification evolve over time. Our approach decomposes correctness into independent properties, each checked by the most appropriate tool, and develops executable reference models as specifications to be checked against the implementation. Our work has prevented 16 issues from reaching production, including subtle crash consistency and concurrency problems, and has been extended by non-formal-methods experts to check new features and properties as ShardStore has evolved.

この論文では、S3 の新しい KVS ストレージノード実装 ShardStore を開発する過程で軽量な形式手法を適用した経験談について書いている。「軽量な形式手法」はエンジニアリングチームがこの先も開発し続ける本番環境が正しいことを検証できる実用的なアプローチを意味する。完全なる形式手法でははなく、自動化、使いやすさ、進化し続けるソフトウェア自体と仕様の正しさを保証し続けることができることを目指した。独立したプロパティに分解し、それぞれが最適なツールでチェックされ、実装をチェックできるリファレンスモデルとしての仕様を開発するアプローチをとった。この取組によって、16の課題をリリースする前に発見することができ、形式手法の専門家以外も ShardStore の新機能やプロパティをチェックできるようになった。

1 Introduction
Amazon S3 is a cloud object storage service that offers customers elastic storage with extremely high durability and availability. At the core of S3 are storage node servers that persist object data on hard disks. These storage nodes are key-value stores that hold shards of object data, replicated by the control plane across multiple nodes for durability. S3 is building a new key-value storage node called ShardStore that is being gradually deployed within our current service.

S3 のコアはオブジェクトデータをハードディスクに永続化するストレージノードサーバである。ストレージノードはオブジェクトデータの断片を保持するキーバリューストアである。耐久性のためにコントロールプレーンによって、複数のノードにレプリケーションされる。S3 は ShardStore と呼ばれる新しいキーバリューストアを構築しており、サービスに徐々にデプロイされている。

Production storage systems such as ShardStore are notoriously difficult to get right [25]. To achieve high performance, ShardStore combines a soft-updates crash consistency protocol [16], extensive concurrency, append-only IO and garbage collection to support diverse storage media, and other complicating factors. Its current implementation reflects that complexity, comprising over 40,000 lines of Rust code. The implementation is fast moving and frequently changing, and even the specification is not set in stone. It is developed and operated by a team of engineers whose changes are continuously deployed worldwide, and currently stores hundreds of petabytes of customer data as part of a gradual rollout.

ShardStore のような本番のストレージシステムは正しく理解することが難しいことで有名である。高い性能を発揮するために、ShardStore は soft-updates crash consistency protocol、高い並行性、追記のみの IO、ガベージコレクションを組み合わせて多様なストレージメディアやその他の複雑な要素をサポートしている。現在の実装はその複雑さを反映して、4万行の Rust コードから構成されている。実装は頻繁に変わり、仕様は固定されたものではない。これは、変更が世界中に継続的に展開されているエンジニアのチームによって開発および運用されており、現在、段階的な展開の一環として数百ペタバイトの顧客データを保存している。

This paper reports our experiences applying lightweight formal methods [24] such as property-based testing and stateless model checking to validate ShardStore’s correctness. We sought lightweight methods that were automated and usable by non-formal-methods experts to validate new features on their own; our goal was to mainstream formal methods into our daily engineering practice. We sought formal methods because we wanted to validate deep properties of the implementation - functional correctness of API-level calls, crash consistency of on-disk data structures, and concurrent correctness of API calls and background maintenance tasks - not just general properties like memory safety. In return for being lightweight and easy to apply, we were willing to accept weaker correctness guarantees than full formal verification, which is currently impractical for storage systems at this scale and level of complexity. We were inspired by recent efforts on verified storage systems [8, 18, 49] and hope that our experience provides encouragement for future work.

この論文では、ShardStore の正しさを検証するための property-based testing や stateless model checking のような軽量な形式手法を適用した我々の経験について書いている。我々は形式手法の専門家以外が自分たちで新機能を検証するための自動化され、使いやすい軽量な形式手法を探していた。我々のゴールは形式手法を日々のエンジニアリングの実践で形式手法を主流にするとこだった。実装の深い特性（APIレベル呼び出しの機能的な正しさ、ディスク上のデータ構造のクラッシュ時の一貫性、API呼出しの同時並行の正しさ、バックグラウンドのメンテナンスタスクなど、メモリ安全性のような一般的な特性ではなく）を検証したかったので、形式手法を求めていた。軽量で適用が容易である見返りとして、完全な形式的検証よりも弱い正確性の保証を受け入れる用意がありました。これは、現在のこの規模と複雑さのレベルのストレージシステムでは実用的ではありません。

We settled on an approach with three elements.
（中略）
In summary, this paper makes three main contributions:

A lightweight approach to specifying and validating a production-scale storage system in the face of frequent changes and new features.

A decomposition of storage system correctness that allows applying a diverse suite of formal methods tools to best check each property.

Our experience integrating lightweight formal methods into the practice of a production engineering team and handing over the validation artifacts to be maintained by them rather than formal methods experts.

私たちは3つの要素からなるアプローチに落ち着きました。
要約すると、この論文は主に3つの貢献をします。

頻繁な変更や新機能に直面して、本番規模のストレージシステムの仕様の記述および検証するための軽量なアプローチ。
ストレージシステムの正しさを分解し、各特性のチェックにベストな形式手法ツールのスイートを適用する。
軽量の形式手法をプロダクションのエンジニアリングチームの実践に統合し、形式手法の専門家ではなく、彼ら自身が維持する検証アーティファクトを引き継いだ経験。

2 ShardStore
ShardStore is a key-value store currently being deployed within the Amazon S3 cloud object storage service. This section provides background on ShardStore’s design and its crash consistency protocol.
Context. The ShardStore key-value store is used by S3 as a storage node. Each storage node stores shards of customer objects, which are replicated across multiple nodes for durability, and so storage nodes need not replicate their stored data internally. ShardStore is API-compatible with our existing storage node software, and so requests can be served by either ShardStore or our existing key-value stores.
（以下略）

ShardStore は現在 S3 にデプロイされているキーバリューストアである。このセクションでは ShardSore の設計とクラッシュ一貫性プロトコルについて説明する。
コンテキスト：ShardStore のキーバリューストアは S3 によって、ストレージノードとして使われている。各ストレージノードは顧客のオブジェクトの断片を保存しており、耐久性のために複数ノードにレプリケーションされているので、ストレージノード内でデータを二重化する必要はない。ShardStore と現行ストレージノードとAPIの互換性があり、ShardSotore でも現行キューバリューストアのどちらもリクエストに応えることができる。

2.1 Design Overview
ShardStore presents a key-value store interface to the rest of the S3 system, where keys are shard identifiers and values are shards of customer object data. Customer requests are mapped to shard identifiers by S3’s metadata subsystem [53]. ShardStore’s key-value store comprises a log-structured merge tree (LSM tree) [41] but with shard data stored out- side the tree to reduce write amplification, similar to Wisc Key [31]. Figure 1 shows an overview of how this LSM tree stores object data on disk. The LSM tree maps each shard identifier to a list of (pointers to) chunks, each of which is stored within an extent. Extents are contiguous regions of physical storage on a disk; a typical disk has tens of thou- sands of extents. ShardStore requires that writes within each extent are sequential, tracked by a write pointer defining the next valid write position, and so data on an extent cannot be immediately overwritten. Each extent has a reset operation to return the write pointer to the beginning of the extent and allow overwrites.

ShardStore は、S3システムの残りの部分にキーバリューストアインターフェイスを提供します。キーはシャード識別子であり、値は顧客オブジェクトデータのシャードです。顧客のリクエストは、S3のメタデータサブシステムによってシャード識別子にマッピングされます[53]　。ShardStoreのKey-Valueストアは、ログ構造のマージツリー（LSMツリー）[41]で構成されますが、Wisc Key [31]と同様に、書き込み増幅を減らすために、ツリーの外側にシャードデータが格納されます。図1は、このLSMツリーがオブジェクトデータをディスクに格納する方法の概要を示しています。LSMツリーは、各シャード識別子をチャンク（へのポインター）のリストにマップします。各チャンクはエクステント内に格納されます。エクステントは、ディスク上の物理ストレージの連続した領域です。典型的なディスクには、数万のエクステントがあります。ShardStoreでは、各エクステント内の書き込みがシーケンシャルであり、次の有効な書き込み位置を定義する書き込みポインターによって追跡される必要があるため、エクステントのデータをすぐに上書きすることはできません。各エクステントには、書き込みポインタをエクステントの先頭に戻し、上書きを許可するリセット操作があります。

Figure 1. ShardStore’s on-disk layout. Each extent offers append-only writes. A log-structured merge tree (LSM tree) maps shards to their data, stored as chunks on the extents. The LSM tree itself is also stored as chunks on disk (on extent 17). In (a), extent 18 has an unreferenced chunk left by a deleted shard. To make that space available for reuse, chunk reclamation evacuates live chunks from extent 18 to elsewhere on disk, resulting in the state in (b).

ShardStore’s のディスク上のレイアウト。各エクステントは、追加のみの書き込みを提供します。ログ構造のマージツリー（LSMツリー）は、シャードをデータにマップし、エクステントにチャンクとして格納します。LSM ツリー自体はディスク上にチャンクとして保存される（エクステント17）。（a）では、エクステント18に、削除されたシャードによって残された参照されていないチャンクがあります。そのスペースを再利用できるようにするために、チャンクの再利用により、ライブチャンクがエクステント18からディスク上の他の場所に排出され、（b）の状態になります。

Rather than centralizing all shard data in a single shared log on disk, ShardStore spreads shard data across extents. This approach gives us flexibility in placing each shard’s data on disk to optimize for expected heat and access patterns (e.g., to minimize seek latency). However, the lack of a single log makes crash consistency more complex, as § 2.2 describes.

ShardStoreは、すべてのシャードデータをディスク上の単一の共有ログに一元化するのではなく、シャードデータを複数のエクステントに分散します。このアプローチにより、各シャードのデータをディスクに柔軟に配置して、予想される熱とアクセスのパターンを最適化できます（たとえば、シークの待ち時間を最小限に抑えることができます）。ただし、sect; 2.2で説明しているように、ログを一元化しないと、クラッシュ時の一貫性がより複雑になります。

Chunk storage and chunk reclamation. All persistent data is stored in chunks, including the backing storage for the LSM tree itself, as Fig. 1 shows. A chunk store abstraction arranges the mapping of chunks onto extents. The chunk store offers PUT(data) → locator and GET(locator) → data interfaces, where locators are opaque chunk identifiers and used as pointers. A single shard comprises one or more chunks depending on its size.

チャンクストレージとチャンク回収。図1に示すように、LSMツリー自体のバッキングストレージを含め、すべての永続データはチャンクに格納されます。チャンクストアの抽象化により、チャンクのエクステントへのマッピングが調整されます。チャンクストアは、PUT（データ）→ロケーターおよびGET（ロケーター）→データインターフェイスを提供します。ロケーターは不透明なチャンク識別子であり、ポインターとして使用されます。単一のシャードは、そのサイズに応じて1つ以上のチャンクで構成されます。

Because extents are append-only, deleting a shard cannot immediately recover the free space occupied by that shard’s chunks. For example, in Fig. 1a, extent 18 has a hole left behind by a chunk whose corresponding shard has recently been deleted. To recover and reuse free space, the chunk store has a reclamation background task that performs garbage collection. Reclamation selects an extent and scans it to find all chunks it stores. For each chunk, reclamation performs a reverse lookup in the index (the LSM tree); chunks that are still referenced in the index are evacuated to a new extent and their pointers updated in the index as Fig. 1b shows, while unreferenced chunks are simply dropped. Once the entire extent has been scanned, its write pointer is reset and it is available for reuse. Resetting an extent’s write pointer makes all data on that extent unreadable even if not yet physically overwritten (ShardStore forbids reads beyond an extent’s write pointer), and so the chunk store must enforce a crash-consistent ordering for chunk evacuations, index updates, and extent resets.

エクステントは追記専用であるため、シャードを削除しても、そのシャードのチャンクが占めていた空き領域をすぐに回復することはできません。たとえば、図1aでは、エクステント18には、対応するシャードが最近削除されたチャンクによって残された穴があります。空き領域を回復して再利用するために、チャンクストアにはガベージコレクションを実行する再生バックグラウンドタスクがあります。回収処理はエクステントを選択し、それをスキャンして、格納されているすべてのチャンクを見つけます。チャンクごとに、回収処理はインデックス（LSMツリー）で逆ルックアップを実行します。図1bに示すように、インデックスでまだ参照されているチャンクは新しい範囲で退避され、ポインタがインデックスで更新されますが、参照されていないチャンクは単に削除されます。エクステント全体がスキャンされると、その書き込みポインタがリセットされ、再利用できるようになります。エクステントの書き込みポインターをリセットすると、物理的に上書きされていなくても、そのエクステント上のすべてのデータが読み取れなくなります（ShardStoreは、エクステントの書き込みポインターを超える読み取りを禁止します）。そのため、チャンクストアは、チャンクの退避、インデックスの更新、およびエクステントに対してクラッシュ整合性のある順序を適用する必要があります。エクステントをリセットします。

As noted above, the LSM tree itself is also stored as chunks written to extents. Maintenance operations like LSM compaction can render these chunks unused. The free space those chunks occupy is reclaimed in the same way as above, except that the reverse lookup is into the LSM tree’s meta- data structure (stored on disk in a reserved metadata extent) that records locators of chunks currently in use by the tree.

上記のように、LSMツリー自体もエクステントに書き込まれたチャンクとして格納されます。LSM compaction などのメンテナンス操作により、これらのチャンクが未使用になる可能性があります。これらのチャンクが占める空き領域は、逆引き参照が、ツリーで現在使用されているチャンクのロケーターを記録するLSMツリーのメタデータ構造（予約されたメタデータエクステントでディスクに格納される）を調べることを除いて、上記と同じ方法で再利用されます。

RPC interface. ShardStore runs on storage hosts with multiple HDDs. Each disk is an isolated failure domain and runs an independent key-value store. Clients interact with ShardStore through a shared RPC interface that steers requests to target disks based on shard IDs. The RPC interface provides the usual request-plane calls (put, get, delete) and controlplane operations for migration and repair.

ShardStore は複数のHDDを持つストレージホストで実行されている。各ディスクは分離された障害ドメインであり、独立した Key-Value ストアを実行している。クライアントは、シャードIDに基づいてターゲットディスクへの要求を操作する共有RPCインターフェイスを介してシャードストアとやりとりをする。RPC インターフェイスは、通常の要求プレーン呼び出し（put、get、delete）と移行と修復のためのコントロールプレーン操作を提供します。

Append-only IO. To support both zoned and conventional disks, ShardStore provides its own implementation of the extent append operation in terms of the write system call. It does this by tracking in memory a soft write pointer for each extent, internally translating extent appends to write system calls accordingly, and persisting the soft write pointer for each extent in a superblock flushed on a regular cadence.

ゾーンディスクと従来のディスクの両方をサポートするために、write システムコールの観点では、ShardStore はエクステント追加操作の独自実装を提供している。これは、メモリ内で各エクステントの soft wrriter pointer を追跡し、それに応じてエクステント追加を write システムコールに内部変換し、各エクステントの soft write pointer を通常のケイデンスでフラッシュされるスーパーブロックに永続化することによって行われている。

2.2 Crash Consistency

ShardStore uses a crash consistency approach inspired by soft updates [16]. A soft updates implementation orchestrates the order in which writes are sent to disk to ensure that any crash state of the disk is consistent. Soft updates avoid the cost of redirecting writes through a write-ahead log and allow flexibility in physical placement of data on disk.

ShardSotre は soft updates インスパイアなクラッシュ整合性アプローチを採用している。soft updates の実装は、書き込みがディスクに送信される順序を調整して、ディスクのクラッシュ状態が一貫していることを保証する。soft updates は write-ahead log を通じて書き込みをリダイレクトするコストを回避し、ディスク上のデータの物理的な配置に柔軟性をもたらす。

Correctly implementing soft updates requires global reasoning about all possible orderings of writebacks to disk. To reduce this complexity, ShardStore’s implementation specifies crash-consistent orderings declaratively, using a Dependency type to construct dependency graphs at run time that dictate valid write orderings. ShardStore’s extent append operation, which is the only way to write to disk, has the type signature:
fn append(&self, ..., dep: Dependency) -> Dependency
both taking as input and returning a dependency. The contract for the append API is that the append will not be issued to disk until the input dependency has been persisted. Shard Store’s IO scheduler ensures that writebacks respect these dependencies. The dependency returned from append can be passed back into subsequent append operations, or first combined with other dependencies (e.g., dep1.and(dep2)) to construct more complex dependency graphs. Dependencies also have an is_persistent operation that clients can use to poll the persistence of an operation.

soft updates を正しく実装するには、ディスクへの write back のすべての可能な順序についてグローバルな推論が必要になる。この複雑さを軽減するために、ShardStore の実装では、クラッシュ整合性のある順序を宣言的に指定し、依存関係タイプを使用して、実行時に有効な書き込み順序を指示する依存関係グラフを作成します。ディスクに書き込む唯一の方法である ShardStore のエクステント追加操作には、型署名があります。

fn append(&self, ..., dep: Dependency) -> Dependency

両方を入力として受け取り、依存関係を返します。入力の依存関係が永続化されるまで append がディスクに発行されないのが、append API の規約です。ShardStore の IO スケジューラは、writeback がこれらの依存関係を尊重することを保証します。 appendから返された依存関係は、後続のappend操作に戻すか、さらに複雑な依存関係グラフを構成するために最初に他の依存関係と組み合わせることができます。依存関係には、クライアントが操作が永続化されたかをポーリングするために使用できる is_persistent 操作もあります。

Figure 2. Dependency graph for three put operations (a) and the corresponding on-disk data (b) after all three puts persist successfully. Each put is only durable once both the shard data and the index entry that points to it are durable. Writing to each extent also requires updating the soft write pointers stored in the superblock.

(a) は3つの put 操作の依存関係グラフ、(b) は 3 つの put の永続化が成功した後のディスク上のデータである。各エクステントに書き込むには、スーパーブロックに格納されている soft updates pointer を更新する必要があります。

Example. Figure 2a shows dependency graphs for three put operations to ShardStore, which after completion result in the on-disk state in Fig. 2b. Each put’s graph follows the same pattern involving three writes:
(a) the shard data is chunked and written to an extent. (b) the index entry for the put is flushed in the LSM tree. (c) the metadata for the LSM tree is updated to point to
the new on-disk index data.
In addition to these writes, every time ShardStore appends to an extent it also updates the corresponding soft write pointer in the superblock (extent 0).

例. 図2a は ShardSotre への 3 つの put 操作の依存関係グラフ、図2b は書込み完了後のディスク上での状態を表している。それぞれの put のグラフは3回の書込みを含む同じパターンに従う:
（a）シャードデータはチャンク化され、エクステントをに書き込まれます。
（b）put のインデックスエントリが LSM ツリーにフラッシュされます。
（c）新しいディスク上のインデックスデータを指すためにLSMツリーのメタデータが更新されます。
これらの書き込みに加えて、ShardStoreがエクステントを追加するたびに、スーパーブロック（エクステント0）内の対応する soft writes pointer を更新します。

The dependency graphs for these puts are constructed dynamically at run time. The chunk store allocated the shard data chunks for puts #1 and #2 to the same extent on disk, so their writebacks can be coalesced into one IO by the sched- uler, and thus their soft write pointer updates are combined into the same superblock update. The shard data chunk for put #3 is allocated to a different extent and so requires a separate soft write pointer update. All three puts arrive close enough together in time to participate in the same LSM-tree flush, which writes a new chunk of LSM tree data (on extent 12) and then updates the LSM tree metadata (on chunk 9) to point to that new chunk.

これらの put の依存関係グラフは、実行時に動的に作成されます。チャンクストアは、put＃1と＃2のシャードデータチャンクをディスク上の同じエクステントに割り当てたため、IO スケジューラーはそれらの write back を1つの IO に統合できるため、soft update pointer の更新は同じスーパーブロックの更新に結合されます。put＃3のシャードデータチャンクは異なるエクステントに割り当てられるため、別に soft writtes pointer の更新が必要です。3つの出力はすべて、同じLSMツリーフラッシュに参加するのに十分な時間で接近して到着します。これにより、LSMツリーデータの新しいチャンク（エクステント12）が書き込まれ、LSMツリーメタデータ（チャンク9）が更新されてその新しいチャンクがポイントされます。

3 Validating a Storage System
A production storage system combines several difficult to implement complexities [25]: intricate on-disk data structures, concurrent accesses and mutations to them, and the need to maintain consistency across crashes. The scale of a realistic implementation mirrors this complexity: ShardStore
is over 40,000 lines of code and changes frequently.
Facing these challenges early in ShardStore’s design process, we took inspiration from recent successes in storage system verification [8, 18, 49] and resolved to apply formal methods to increase our confidence. We chose formal methods because they would allow us to validate deep properties of ShardStore’s implementation that are difficult to test with off-the-shelf tools at S3’s scale and complexity - functional correctness of API-level calls, crash consistency of on-disk data structures, and correctness of concurrent executions including API calls and maintenance tasks like garbage collection. Given the complexity of the system and the rapid rate of change, we needed our results to outlast the involvement of formal methods experts, and so we sought a lightweight approach that could be automated and developed by the engineering team itself. This section gives an overview of our approach to validating ShardStore, including the properties we focused on and how we distilled reference model specifications to check against the implementation. § 4-7 detail how we check that the implementation conforms to these specifications. § 8 summarizes our experiences with this approach, which has prevented subtle issues from reaching production.
（以下略）

ストレージシステムの妥当性検証
本番ストレージシステムは、実装が難しいいくつかの複雑さが組み合わさっている。複雑なディスク上のデータ構造、それらへの同時アクセスと変更、およびクラッシュ間で一貫性を維持する必要性。現実的な実装の規模は、この複雑さを反映している。ShardStore は4万行を超えるコードであり、頻繁に変更される。

ShardStoreの設計プロセスの早い段階でこれらの課題に直面し、ストレージシステムの検証における最近の成功からインスピレーションを得て[8、18、49]、自信を高めるために形式手法を適用することを決意した。形式手法を選択したのは、S3の規模と複雑さで既製のツールでテストするのが難しいShardStore の実装の深い特性を検証できるためです。つまり、APIレベルの呼び出しの機能の正確さ、ディスク上のデータ構造のクラッシュの一貫性、API呼び出しやガベージコレクションなどのメンテナンスタスクを含む同時並行の正確性など。システムの複雑さと急速な変化の速度を考えると、形式手法の専門家の関与よりも持続する結果が必要だったため、エンジニアリングチーム自体が自動化および開発できる軽量のアプローチを模索した。このセクションでは、ShardStore を検証するためのアプローチの概要を説明する。これには、焦点を当てた特性や、実装をチェックするため参照モデルの仕様を抽出する方法が含まれる。セクション 4-7は、実装がこれらの仕様に準拠していることを確認する方法を詳しく説明しています。セクション 8 は、微妙な問題が本番環境に到達するのを防いだこのアプローチの経験をまとめたものです。

3.1 Correctness Properties
Correctness for a production storage system is multifaceted: it should be durable in the absence of crashes, consistent in the presence of crashes and concurrency, highly available during normal operation, and meet our performance goals. Our validation effort considers availability and per formance properties out of scope, as S3 has successfully established other methods for validating these properties, including integration tests, load testing in pre-production environments, and staggered deployments with monitoring in production [29]. We chose to focus on durability and con sistency properties, which are hard to establish by traditional testing alone and more difficult to recover from if violated.

本番ストレージシステムの正しさは多面的です。クラッシュがなくても耐久性があり、クラッシュと同時実行性が存在する場合でも一貫性があり、通常の運用中に高い可用性を発揮し、パフォーマンス目標を達成する必要があります。S3は、結合テスト、準本番環境での負荷テスト、実稼働での監視を伴う時間差デプロイメントなどの特性を検証するための方法を正常に確立しているため、検証では可用性とパフォーマンスの特性をスコープ外とする[29]。従来のテストだけでは確立することが難しく、違反した場合に回復するのが難しい耐久性と一貫性の特性に焦点を当てることを選択しました。

In our approach, we ask developers to write an executable reference model (§3.2) to specify the expected state of the system after each API call. Our durability property then is that the model and implementation remain in equivalent states after each API call. Since the system is a key-value store, we define equivalence as having the same key-value mapping. This invariant establishes durability because it ensures that the implementation only loses or changes data when the specification allows (e.g., by a delete operation). However, this property is too strong in the face of two types of non-determinism: crashes may cause data loss that the reference model does not allow, and concurrency allows operations to overlap, meaning concurrent operations may be in flight when we check equivalence after each API call.

私たちのアプローチでは、開発者に各API呼び出し後に期待するシステムの状態を定義する実行可能な参照モデル（§3.2）を書いてもらった。耐久性の特性は、各API呼び出し後、モデルと実装が同じ状態になることです。システムは Key-Value ストアであるため、同等性を同じKey-Value マッピングを持つものと定義します。この不変条件は、仕様で許可されている場合にのみ実装がデータを失うか変更することを保証するため、耐久性を確立します（例. delete 操作などによって）。しかしながら、この特性は2種類の非決定論に直面すると強すぎる：クラッシュにより、参照モデルでは許可されていないデータ損失が発生する可能性があり、同時実行により操作が重複する可能性があります。つまり、各API呼び出しの後に同等性を確認すると、同時操作が実行されている可能性があります。

To address these limitations, we found it useful to decompose the durability property into three parts and reason about each separately:
1. For sequential crash-free executions, we check for equivalence directly (§4).
2. For sequential crashing executions, we extend the reference model to define which data can be lost after a crash, and check a corresponding weaker equivalence that establishes both durability and consistency (§5).
3. For concurrent crash-free executions, we write separate reference models and check linearizability (§6).

これらの制限に対処するために、耐久性特性を3つのパートに分解し、それぞれについて個別に理由を説明することが有用であることがわかりました。
1. クラッシュのない順次実行の場合、同等性を直接チェックします
2. 順次クラッシュ実行の場合、参照モデルを拡張して、クラッシュ後に失われる可能性のあるデータを定義し、耐久性と一貫性の両方を確立する、対応する弱い同等性を確認します。
3. クラッシュのない同時実行のために、個別の参照モデルを作成し、線形化可能性を確認します

(We do not currently check properties of concurrent crashing executions because we have not found an effective automated approach.) This decomposition aids the specification effort by allowing us to separate concerns - the initial reference models are simple and easy to audit, and then we extend them separately to handle more complex properties. Decomposition also helps us scale the validation effort by using the most appropriate tools for each type of execution, including property-based testing [9] and stateless model checking [5, 35].

（効果的な自動化されたアプローチが見つからないため、現在、同時クラッシュ実行のプロパティをチェックしていません。）この分解は、関心の分離を可能にすることで仕様の取り組みを支援します。最初の参照モデルはシンプルで監査が簡単で、次にそれらを個別に拡張して、より複雑なプロパティを処理します。分解は、プロパティベースのテスト[9]やステートレスモデル検査[5、35]など、実行の種類ごとに最適なツールを使用して、検証作業を拡張するのにも役立ちます。

4 Conformance Checking
Once we have a reference model specifying the expected behavior of the system, we need to check that the implementation conforms to the reference model according to the correctness properties above. This section details how we implement this check using property-based testing, and how we ensure good coverage of the possible behaviors of the implementation.
（以下略）

4 適合性チェック
システムの期待する動作を規定する参照モデルができたら、上記の正確性特性に従って、実装が参照モデルに準拠していることを確認する必要があります。このセクションでは、プロパティベースのテストを使用してこのチェックを実装する方法と、実装で発生する可能性のある動作を適切にカバーする方法について詳しく説明します。

5 Checking Crash Consistency
Crash consistency can be a source of bugs in storage systems [42]. ShardStore uses a crash consistency protocol based on soft updates (§ 2.2), which introduces implementation complexity. For ShardStore, reasoning about crash consistency was a primary motivation for introducing formal methods during development, and so it was a focus of our efforts. We validate crash consistency by using the user-space dependencies discussed in § 2.2. Each mutating ShardStore operation returns a Dependency object that can be polled to determine whether it has been persisted. As a specification, we define two crash consistency properties in terms of these dependencies:
1. persistence: if a dependency says an operation has persisted before a crash, it should be readable after a crash(unless superseded by a later persisted operation)
2. forward progress: after a non-crashing shutdown, every operation’s dependency should indicate it is persistent

5 クラッシュの一貫性のチェック
クラッシュの一貫性は、ストレージシステムのバグの原因となる可能性があります[42]。 ShardStoreは、ソフトアップデート（§ 2.2）に基づくクラッシュ整合性プロトコルを使用します。これにより、実装が複雑になります。ShardStoreの場合、クラッシュの一貫性についての推論は、形式手法を導入する主な動機でした。開発中だったので、それが私たちの努力の焦点でした。§ 2.2 で説明されているユーザースペースの依存関係を使用して、クラッシュの一貫性を検証します。変更する各ShardStore操作は、永続化されているかどうかを判断するためにポーリングできるDependencyオブジェクトを返します。仕様として、これらの依存関係の観点から2つのクラッシュ整合性特性を定義します：
1. 永続性：依存関係により、クラッシュ前に操作が永続化されたことが示されている場合、クラッシュ後に読み取り可能である必要があります（後で永続化された操作に置き換えられた場合を除く）
2. 前進：クラッシュしないシャットダウン後、すべての操作の依存関係は、それが永続的であることを示す必要があります

6 Checking Concurrent Executions
Our validation approach thus far deals only with sequential correctness, as the conformance checking approach in § 4 tests only deterministic single-threaded executions. In practice, a production storage system like ShardStore is highly concurrent, with each disk servicing several concurrent requests and background maintenance tasks (e.g., LSM tree compaction, buffer cache flushing, etc). Rust’s type system guarantees data-race freedom within the safe fragment of the language [52], but cannot make guarantees about higher level race conditions (e.g., atomicity violations), which are difficult to test and debug as they introduce non-determinism into the execution.
To extend our approach to check concurrent properties, we hand-wrote harnesses for key properties and validated them using stateless model checking [17], which explores concurrent interleavings of a program. We use this approach both to check concurrent executions of the storage system and to validate some ShardStore-specific concurrency primitives. Stateless model checkers can both validate concurrency properties (e.g., consistency) and test for deadlocks (by finding interleavings that end with all threads blocked).
（以下略）

§4の適合性チェックアプローチは決定論的なシングルスレッド実行のみをテストするため、これまでの検証アプローチはシーケンシャルな正しさのみを扱います。実際には、ShardStoreのような本番ストレージシステムは並行度が高く、各ディスクが複数の同時並行リクエストとバックグラウンドのメンテナンスタスク（例. LSM tree compaction, buffer cache flushing, etc）を処理します。Rustの型システムは、言語の安全なフラグメント内でのデータ競合の自由を保証します[52]が、実行に非決定論を導入するため、テストとデバッグが困難な高レベルの競合状態（例. atomicity violations）については保証できません。並行特性をチェックするアプローチを拡張するために、キーとなる複数の特性を束ねる処理を手書きし、プログラムの並行インターリーブを調査するステートレスモデルチェック[17]を使ってそれらを検証しました。このアプローチを使用して、ストレージシステムの同時並行実行をチェックし、ShardStore固有の同時実行プリミティブを検証します。ステートレスモデルチェッカーは、同時並行実行特性（整合性など）の検証とデッドロックのテスト（すべてのスレッドがブロックされた状態で終了するインターリーブを見つけることによる）の両方を実行できます。

7 Other Properties
The validation approach described so far covers the properties in sect; 3.1. However, while developing ShardStore, we identified two more localized classes of issues that our existing approach would not detect, and adopted specific reasoning techniques to check for them.
（以下略）

これまでに説明した検証アプローチは、 sect; 3.1のプロパティを対象としています。ただし、ShardStoreの開発中に、既存のアプローチでは検出できない、さらに2つのローカライズされたクラスの問題を特定し、それらをチェックするために特定の推論手法を採用しました。

8 Experience
This section reports on our experience validating ShardStore, including the effort to perform validation and the level of adoption by the engineering team.
（以下略）

このセクションでは、検証を実行するための取り組みやエンジニアリングチームによる適応レベルなど、ShardStoreの検証の経験について報告します。

9 Related Work

TBD

10 Conclusion
ShardStore is a new storage backend for Amazon S3 for which we decided early in the design process to involve formal methods. Our experience with lightweight formal methods has been positive, with a number of issues prevented from reaching production and substantial adoption by the ShardStore engineering team. We are excited to further improve our results by applying stronger verification techniques and expanding the scope of our effort to validate more of S3.

S3 の新しいストレージバックエンド ShardStore の開発初期から形式手法の導入を決めていた。軽量な形式手法はポジティブであった、数々の課題が本番にリリースされるのを防ぎ、ShardStore のエンジニアリングチームが実質的に適応することができた。より強い検証技法を適用し、提供範囲を S3 のより広い領域に拡大してさらに改善していくことに興奮している。

参考

S3 の新しい KVS 実装 ShardStore (≈ 40K lines in Rust) を開発する過程で軽量な形式手法を適用した経験談．ソフトウェアの進化に応じて，エンジニア自らが仕様を記述し，継続的に正しさを検証できるように，使いやすさや自動化を求めた．https://t.co/ZwB57nuHYQ (SOSP 2021 Best Paper)
— Youki Shiraishi (@y_sira) November 4, 2021

muratbuffalo.blogspot.com
www.amazon.science

swet.dena.com
ja.wikipedia.org

Property based testingとは
歴史的な背景として2000年にJohn HughesとKoen Clasessenによって
開発されたQuickCheckがProperty based testingとしてHaskellエコシステムに実装されました。
QuickCheckはプロパティ（特定の入力が与えられると出力として期待される特性）
を与えることで、
テストデータをランダムに生成して、失敗するケースを見つけるHaskellで実装されたフレームワークです。
それによってテスト対象のシステムがプロパティに従っているかどうかをチェックします。
QuickCheckは単体テストだけではなく、統合テストやサンプルベースのテスト等幅広く使われています。
このテスト方法はProperty based testingとして知られるようになりました。
2020年現在は、HaskellのQuickCheckだけでなく、様々な言語でProperty based testingを実装したフレームワークが開発されています。
Property based testing を試してみよう #property-based-testing - Qiita

dev.classmethod.jp

Log Structured Merge Trees

It’s nearly a decade since Google released its ‘Big Table’ paper. One of the many cool aspects of that paper was the file organisation it uses. The approach is more generally known as the Log Structured Merge Tree, after this 1996 paper, although the algorithm described there differs quite significantly from most real-world implementations.
LSM is now used in a number of products as the main file organisation strategy. HBase, Cassandra, LevelDB, SQLite, even MongoDB 3.0 comes with an optional LSM engine, after it’s acquisition of Wired Tiger.
What makes LSM trees interesting is their departure from binary tree style file organisations that have dominated the space for decades. LSM seems almost counter intuitive when you first look at it, only making sense when you closely consider how files work in modern, memory heavy systems.

Log Structured Merge Trees - ben stopford

ablog

不器用で落着きのない技術者のメモ

SOSP21 で best paper 賞を受賞した S3 の論文を読んでみる

2.2 Crash Consistency

参考

Log Structured Merge Trees