DRBD Block Replication

Level: Expert | Module: Monitoring & High Availability | 9 min read | Lesson 65 of 66

Overview

What you'll learn: Block-level replication concepts and how they differ from file-level replication; DRBD resource configuration, including disk and network settings; primary/secondary role management; split-brain detection and recovery; key drbdadm commands; and performance tuning, including the choice among replication protocols A, B, and C.
Prerequisites: Backup with Bacula and rsnapshot (Lesson 64), LVM basics, networking fundamentals
Estimated reading time: 22 minutes

Introduction

Traditional backups protect against data loss but introduce a recovery gap — the time between the last backup and the failure event. In mission-critical environments where even minutes of data loss are unacceptable, you need continuous data protection. DRBD (Distributed Replicated Block Device) provides exactly this by replicating block devices in real time between two or more servers over a network connection.

Think of DRBD as a network RAID-1 mirror. Just as a local RAID-1 array mirrors writes across two physical disks in the same machine, DRBD mirrors writes across two disks on separate machines connected by a network. When the primary server writes a block to its local disk, DRBD simultaneously transmits that block to the secondary server, keeping both copies synchronized at the block level.

DRBD operates in the Linux kernel as a virtual block device driver. Applications and filesystems interact with the DRBD device (/dev/drbd0) exactly as they would with any other block device. The replication is transparent — no application changes are required. This makes DRBD an ideal foundation for high-availability clusters, where it is commonly paired with Pacemaker and Corosync to provide automatic failover.

Block-Level vs. File-Level Replication

Understanding why block-level replication matters requires comparing it with file-level alternatives:

File-level replication (e.g., rsync, lsyncd) works on whole files. When one record changes in a 10 GB database file, the tool must at least read and checksum the entire file to find the difference, and some tools retransfer all of it. File-level tools also struggle with open files and cannot guarantee that a file being written is copied in a consistent state.
Block-level replication (DRBD) operates below the filesystem. When an application writes even a single byte, only the affected disk blocks (typically 4 KB each) are replicated. This is far more efficient for databases, virtual machine images, and any workload with large files that change incrementally.

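Block-level replication also preserves write ordering. During normal replication, writes reach the replica in the same order they were issued on the primary, so the secondary holds a crash-consistent copy of the data. The exception is a resynchronization: until it finishes, the sync target is marked Inconsistent and cannot safely be promoted.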
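
To make the block-granularity claim concrete, here is a small arithmetic sketch. It assumes a 4 KiB block size (a common filesystem default, not a DRBD constant) and counts only the data blocks covered by a write; a real filesystem may also update journal or metadata blocks. The function name blocks_touched is made up for this example.

```shell
#!/bin/sh
# Count the blocks covered by a write of LENGTH bytes starting at byte OFFSET.
# BLOCK_SIZE is an assumption (a common filesystem block size).
BLOCK_SIZE=4096

blocks_touched() {
    # $1 = byte offset of the write, $2 = length in bytes (must be >= 1)
    first=$(( $1 / BLOCK_SIZE ))
    last=$(( ($1 + $2 - 1) / BLOCK_SIZE ))
    echo $(( last - first + 1 ))
}

# A 1-byte change in the middle of a 10 GiB file covers a single block.
blocks_touched 5368709120 1      # prints 1
# A 200-byte record that straddles a block boundary covers two blocks.
blocks_touched 4000 200          # prints 2
```

In both cases the replicated payload is a few kilobytes, regardless of how large the containing file is.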

DRBD Replication Protocols

DRBD offers three replication protocols that define when a write operation is considered complete. Choosing the right protocol is a critical trade-off between data safety and write latency:

Protocol A (asynchronous): A write is considered complete as soon as it reaches the local disk and the network send buffer. The primary does not wait for the secondary to acknowledge receipt. This offers the best write performance but risks data loss if the primary fails before the secondary receives the data. Suitable for long-distance replication where latency makes synchronous replication impractical.
Protocol B (semi-synchronous / memory-synchronous): A write is considered complete when it reaches the local disk and the secondary has received the data in memory (but not yet flushed it to disk). No writes are lost if only the primary fails, but the most recent writes can be lost if both nodes lose power at the same time and the primary's storage is destroyed.
Protocol C (synchronous): A write is considered complete only when it has been written to both the local disk and the remote disk. This provides the strongest data guarantee — zero data loss on primary failure — but adds network round-trip latency to every write operation. This is the recommended protocol for HA clusters on a local network.
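
The latency trade-off between the three protocols can be sketched with a deliberately simplified model. It assumes the local write and the network transfer run in parallel and ignores queuing, TCP buffering, and CPU time. The function name write_latency_us and all figures are hypothetical, chosen only to illustrate the ordering A <= B <= C.

```shell
#!/bin/sh
# Simplified model of when a write completes under each DRBD protocol.
# All arguments are in microseconds; local and remote work overlap in time.
max() { if [ "$1" -gt "$2" ]; then echo "$1"; else echo "$2"; fi; }

write_latency_us() {
    # $1 = protocol (A|B|C), $2 = local disk write, $3 = network round trip,
    # $4 = remote disk write
    case "$1" in
        A) echo "$2" ;;                   # local disk only
        B) max "$2" "$3" ;;               # also wait until the peer has received the data
        C) max "$2" $(( $3 + $4 )) ;;     # also wait for the peer's disk write
        *) echo "unknown protocol: $1" >&2; return 1 ;;
    esac
}

# Hypothetical figures: 100 us local write, 200 us round trip, 100 us remote write.
for proto in A B C; do
    echo "protocol $proto: $(write_latency_us "$proto" 100 200 100) us"
done
```

With these numbers the model gives 100, 200, and 300 microseconds. On a long-distance link the round-trip term dominates, which is why Protocol A is the usual choice there.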

Installing and Configuring DRBD

We will set up a DRBD resource between two Ubuntu servers to replicate a partition in real time.

# On BOTH nodes — install DRBD
$ sudo apt update
$ sudo apt install -y drbd-utils

# Verify kernel module is available
$ sudo modprobe drbd
$ lsmod | grep drbd
drbd                  409600  0

# Prepare a partition or LVM volume for DRBD on BOTH nodes
# Using LVM (recommended):
$ sudo lvcreate -n drbd-data -L 50G vg0

# --- DRBD resource configuration ---
# Create the resource file on BOTH nodes (must be identical)
$ sudo tee /etc/drbd.d/r0.res > /dev/null <<'EOF'
resource r0 {
  protocol C;

  net {
    # Hash algorithm used by online verification (drbdadm verify)
    verify-alg sha256;
    # Timeout and keepalive settings
    timeout 60;         # 6 seconds (unit: 0.1s)
    connect-int 10;     # reconnect interval in seconds
    ping-int 10;        # keepalive ping interval in seconds
    ping-timeout 5;     # ping response timeout, 0.5 seconds (unit: 0.1s)
  }

  disk {
    # Resync rate limit (adjust based on network capacity)
    resync-rate 100M;
    # Detach the backing device on a local I/O error and continue diskless
    on-io-error detach;
  }

  on node1 {
    device    /dev/drbd0;
    disk      /dev/vg0/drbd-data;
    address   192.168.1.100:7789;
    meta-disk internal;
  }

  on node2 {
    device    /dev/drbd0;
    disk      /dev/vg0/drbd-data;
    address   192.168.1.101:7789;
    meta-disk internal;
  }
}
EOF

# Initialize the DRBD metadata on BOTH nodes
$ sudo drbdadm create-md r0
initializing activity log
initializing bitmap (1600 KB) to all zero
Writing meta data...
New drbd meta data block successfully created.

# Start the DRBD resource on BOTH nodes
$ sudo drbdadm up r0

# On the DESIGNATED PRIMARY node — force initial sync
$ sudo drbdadm primary --force r0

# Monitor the initial synchronization
# (/proc/drbd reports resource state with the 8.4 module; on DRBD 9 use: drbdadm status r0)
$ watch cat /proc/drbd
version: 8.4.11 (api:1/proto:86-101)
 0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent
    [====>...............] sync'ed: 28.4% (36224/50000)M
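
For scripting, the progress figure can be pulled out of that text instead of being watched by eye. The sketch below assumes the DRBD 8.4 /proc/drbd layout shown above. The helper names sync_percent and wait_for_sync and the 5-second polling interval are illustrative choices, not part of DRBD.

```shell
#!/bin/sh
# Print the resync progress percentage found in /proc/drbd-style text on stdin.
# Prints nothing when no resync is running. Assumes the DRBD 8.4 text layout.
sync_percent() {
    sed -n "s/.*sync'ed:[[:space:]]*\([0-9.]*\)%.*/\1/p"
}

# Poll until the progress line disappears (8.4 module only).
# With several resources resyncing, one percentage per resource is printed.
wait_for_sync() {
    while pct=$(sync_percent < /proc/drbd) && [ -n "$pct" ]; do
        echo "resync at ${pct}%"
        sleep 5
    done
}
```

Calling wait_for_sync on either node blocks until the resync finishes, which is convenient before creating the filesystem or handing the resource to a cluster manager.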

Day-to-Day DRBD Operations

Once the initial synchronization is complete, the DRBD resource is ready for use. The primary node can create a filesystem on the device and mount it.

# Create a filesystem on the DRBD device (PRIMARY only)
$ sudo mkfs.ext4 /dev/drbd0
$ sudo mkdir -p /data
$ sudo mount /dev/drbd0 /data

# Check DRBD status
$ sudo drbdadm status r0
r0 role:Primary
  disk:UpToDate
  node2 role:Secondary
    peer-disk:UpToDate

# Detailed connection info
$ sudo drbdsetup status r0 --verbose --statistics

# --- Manual failover procedure ---
# On the PRIMARY node — demote and unmount
$ sudo umount /data
$ sudo drbdadm secondary r0

# On the NEW PRIMARY node — promote and mount
$ sudo drbdadm primary r0
$ sudo mount /dev/drbd0 /data

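# Verify data integrity with online verification (requires verify-alg in the net section)
$ sudo drbdadm verify r0
# Check results: out-of-sync blocks are reported in the kernel log and in the statistics
$ sudo drbdsetup status r0 --verbose --statistics
# The out-of-sync count should be 0. Verification only marks mismatched blocks;
# to resynchronize them, disconnect and reconnect the resource.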

# Temporarily disconnect for maintenance
$ sudo drbdadm disconnect r0
# Reconnect when maintenance is complete
$ sudo drbdadm connect r0
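
Before a planned switchover it is worth confirming that both copies are UpToDate. The following is a minimal sketch of such a guard. It assumes the drbdadm status layout shown above, and the function name all_up_to_date is invented for this example.

```shell
#!/bin/sh
# Succeed only if `drbdadm status` output on stdin shows a local disk and at
# least one peer-disk, and every one of them is UpToDate.
all_up_to_date() {
    awk '
        {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^disk:/)      have_local = 1
                if ($i ~ /^peer-disk:/) have_peer = 1
                if ($i ~ /^(peer-)?disk:/ && $i !~ /:UpToDate$/) bad = 1
            }
        }
        END { if (have_local && have_peer && !bad) exit 0; exit 1 }
    '
}

# Example guard on the current primary:
#   sudo drbdadm status r0 | all_up_to_date \
#       && sudo umount /data && sudo drbdadm secondary r0
```

Requiring a peer-disk field means the check also fails when the peer is disconnected, which is the case where demoting the primary would leave no up-to-date copy in service.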

Split-Brain Detection and Recovery

Split-brain is the most dangerous failure scenario in any replicated system. It occurs when both nodes take the primary role while disconnected from each other, typically because the replication link fails while both nodes keep running. Each node accepts writes independently, creating divergent datasets that cannot be automatically merged.

DRBD detects split-brain upon reconnection and refuses to synchronize until an administrator resolves the conflict. Automated split-brain recovery policies can be configured, but they inherently discard data from one side.

# Configure automatic split-brain recovery policies
# Add to the net { } section of the resource definition:
net {
  # Neither node is primary: if one node made no changes, resync it from the other
  after-sb-0pri discard-zero-changes;
  # One node is primary: discard the secondary's changes
  after-sb-1pri discard-secondary;
  # Both nodes are primary: drop the connection (manual intervention required)
  after-sb-2pri disconnect;
}

# --- Manual split-brain recovery ---
# On the VICTIM node (whose data you want to DISCARD; unmount /data first if mounted):
$ sudo drbdadm disconnect r0
$ sudo drbdadm secondary r0
$ sudo drbdadm connect --discard-my-data r0

# On the SURVIVING node (whose data you want to KEEP), if it is StandAlone:
$ sudo drbdadm connect r0

# The victim node will resync from the survivor
$ sudo drbdadm status r0
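
Detection can be scripted by looking for the two usual symptoms: DRBD's split-brain message in the kernel log, and a connection left in the StandAlone state. This is a rough sketch. The helper names are invented, and the exact log wording varies between DRBD versions, so the pattern matches only the common "Split-Brain detected" prefix.

```shell
#!/bin/sh
# Exit 0 if the log text on stdin contains DRBD's split-brain message.
has_split_brain() {
    grep -q "Split-Brain detected"
}

# Exit 0 if status text on stdin shows a StandAlone connection
# (cs: in /proc/drbd on 8.4, connection: in drbdadm status).
is_standalone() {
    grep -Eq "(cs|connection):StandAlone"
}

# On a node (reading the kernel log may require root):
#   sudo dmesg | has_split_brain && echo "split-brain: manual recovery needed"
#   sudo drbdadm status r0 | is_standalone && echo "r0 is StandAlone"
```

A StandAlone connection alone does not prove split-brain, since an administrator may have disconnected the resource on purpose, so treat it as a prompt to check the log.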

Performance Tuning

DRBD performance depends on network bandwidth, latency, and disk I/O speed. Several parameters can be tuned to optimize throughput.

# Key performance parameters in the resource definition:

resource r0 {
  net {
    # Receiver-side buffers and the maximum number of writes between two barriers
    max-buffers 8000;
    max-epoch-size 8000;

    # TCP send/receive buffer size (0 = auto-tune)
    sndbuf-size 0;
    rcvbuf-size 0;
  }

  disk {
    # Fixed resync rate for background sync (used as-is only when c-plan-ahead is 0)
    resync-rate 200M;

    # Dynamic resync-rate controller (checksum-based resync would be csums-alg)
    c-plan-ahead 20;     # unit: 0.1s; 0 disables the controller
    c-max-rate 500M;     # upper bound for the resync rate
    c-fill-target 24M;

    # Skip write barriers and flushes to reduce write latency
    disk-barrier no;
    disk-flushes no;     # Only if battery-backed cache
  }
}

# Monitor DRBD performance metrics
$ sudo drbdsetup status r0 --verbose --statistics
# Key metrics to watch:
#   - send/receive throughput (KB/s)
#   - pending/unacked writes
#   - out-of-sync blocks (should be 0 in steady state)

# Benchmark write performance
$ sudo dd if=/dev/zero of=/data/testfile bs=1M count=1024 oflag=direct
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 5.23 s, 205 MB/s
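
A measured throughput figure like the one above can feed a starting value for resync-rate. The DRBD User's Guide suggests roughly 30% of the available replication bandwidth as a rule of thumb. The sketch below applies that to the slower of network and disk. The function name suggest_resync_rate and the sample figures are illustrative only.

```shell
#!/bin/sh
# Suggest a resync-rate as about 30% of the slower of network and disk throughput.
# Inputs and output are in MB/s; integer arithmetic rounds down.
suggest_resync_rate() {
    # $1 = sustained network throughput, $2 = sustained disk write throughput
    if [ "$1" -lt "$2" ]; then bottleneck=$1; else bottleneck=$2; fi
    echo $(( bottleneck * 30 / 100 ))
}

# Gigabit Ethernet (about 110 MB/s) with storage that sustains 400 MB/s:
echo "resync-rate $(suggest_resync_rate 110 400)M;"   # prints: resync-rate 33M;
```

Treat the result as a first guess and adjust it after watching application latency during a real resync.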

Key Takeaways

DRBD provides real-time block-level replication between servers, functioning as a network RAID-1 mirror that is transparent to applications and filesystems above it.
Protocol C (synchronous) guarantees zero data loss on primary failure and is recommended for HA clusters on local networks; Protocols A and B trade data safety for lower write latency.
Split-brain is the most critical failure scenario; it occurs when both nodes accept writes independently, and recovery always involves discarding one side’s changes.
The drbdadm command is the primary administration tool — use it for promotion, demotion, connection management, status checks, and online verification.
Performance tuning involves matching DRBD buffer sizes, resync rates, and I/O barriers to your network bandwidth, latency, and storage hardware capabilities.

What’s Next

In the next lesson, we will explore Pacemaker HA Clustering — learning how to build on DRBD with Corosync and Pacemaker to create a fully automated high-availability cluster with resource management, fencing, and automatic failover.
