High Availability: Clustering Setup Example
This section provides an example of how to set up a cluster for the Bacula Director service, creating an active/passive cluster that can automatically fail over from one node to the other. To keep the example focused on the clustering aspects, we assume that configuration synchronization is handled externally or independently, so no explicit synchronization technique is configured here. Likewise, we assume the use of an external Catalog that provides its own high-availability layer and data replication.
Pacemaker is used as the main software to achieve this. This setup aligns with the architecture described earlier as “Pacemaker with Shared Network Storage” when combined with an “External Catalog”.
Note
This section is not intended to be a comprehensive guide to high availability with Pacemaker. Instead, it is designed to help readers consolidate the main concepts and understand how the different layers interact when building a clustered Bacula service.
Pacemaker’s internal configuration format is XML, which is well suited for machines but difficult for humans. To address this, several higher-level configuration tools are available that provide more user-friendly interfaces. These include crm shell, the High Availability Web Console (HAWK), the Linux Cluster Management Console (LCMC), and the Pacemaker/Corosync Configuration System (PCS). This guide focuses on PCS.
PCS provides both a command-line tool (pcs) and the PCS Web UI for managing the
full lifecycle of cluster components, including Pacemaker, Corosync, QDevice, SBD, and Booth.
More information about PCS and related tools commonly used with the Pacemaker/Corosync
stack is available at: https://clusterlabs.org/projects/.
The example configurations presented in this section are based on Ubuntu Server 24.04. However, all referenced tools are available on most major Linux distributions.
Architecture Overview
Nodes (physical or VMs on Ubuntu 24.04):
dir.01 (192.168.1.180): Pacemaker / bacula-dir / bweb
dir.02 (192.168.1.181): Pacemaker / bacula-dir / bweb
Virtual IP (shared by dir.01 and dir.02): 192.168.1.190
The Catalog is remotely connected and implements its own HA layer.
The configuration is mounted from a shared remote resource using NFS and it is externally protected and replicated.
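As an illustration, the shared configuration could be mounted from NFS with an /etc/fstab entry similar to the one below. The NFS server address and export path are hypothetical, and /opt/bacula/etc is assumed to be the Director's configuration directory:
# Both nodes: hypothetical /etc/fstab entry for the shared configuration
192.168.1.200:/export/bacula-etc  /opt/bacula/etc  nfs  defaults,_netdev  0  0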
The following ports are required for the HA services:
Corosync uses UDP 5404 and UDP 5405
PCSD uses TCP 2224
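If a host firewall is active on the nodes, these ports must be allowed. For example, with ufw (assuming ufw is the firewall in use):
# Both nodes: open the cluster ports (only needed if a firewall is active)
ufw allow 5404/udp
ufw allow 5405/udp
ufw allow 2224/tcp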
High-level Steps
Prepare the environment and install Bacula.
Set up Pacemaker on each node.
Set up the cluster.
Set up the shared resources.
Address STONITH considerations.
Perform tests and checks.
Integrate Pacemaker with DRBD.
Setup
1. Preparation and installation of Bacula
Update the system on both nodes:
apt update
apt -y upgrade
Then install Bacula Director on both nodes. To install Bacula Director with BIM, follow the instructions provided here.
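After installation, you can check that the Director configuration parses correctly on each node. The command below assumes the default /opt/bacula installation layout:
# Both nodes: syntax-check the Director configuration
/opt/bacula/bin/bacula-dir -t -c /opt/bacula/etc/bacula-dir.conf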
2. Setup Pacemaker
Install the required packages on both nodes using the system package manager:
apt install pcs pacemaker
Because the Bacula Director and BWeb services will be managed by Pacemaker, they must be disabled on both nodes:
systemctl disable bacula-dir
systemctl disable bweb
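If either service is currently running, stop it as well, since Pacemaker will control the service lifecycle from now on:
systemctl stop bacula-dir
systemctl stop bweb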
3. Setup Cluster
Set a password for the hacluster user (created automatically when the packages are installed), which Pacemaker uses for authentication, on both nodes:
passwd hacluster
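Make sure the pcsd daemon is enabled and running on both nodes, as it handles the node authentication performed in the next step:
systemctl enable --now pcsd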
Authenticate the cluster nodes using the hacluster user:
:~# pcs host auth 192.168.1.181
Username: hacluster
Password:
192.168.1.181: Authorized
:~# pcs host auth 192.168.1.180
Username: hacluster
Password:
192.168.1.180: Authorized
Next, create the cluster. Assign a name to the cluster and use the --force option because
the cluster services were not stopped beforehand.
# One node: Setup the cluster
:~# pcs cluster setup baculadir 192.168.1.180 192.168.1.181 --force
No addresses specified for host '192.168.1.180', using '192.168.1.180'
No addresses specified for host '192.168.1.181', using '192.168.1.181'
Warning: 192.168.1.180: The host seems to be in a cluster already as the following services are found to be running: 'corosync', 'pacemaker'. If the host is not part of a cluster, stop the services and retry
Warning: 192.168.1.180: The host seems to be in a cluster already as cluster configuration files have been found on the host. If the host is not part of a cluster, run 'pcs cluster destroy' on host '192.168.1.180' to remove those configuration files
Warning: 192.168.1.181: The host seems to be in a cluster already as the following services are found to be running: 'corosync', 'pacemaker'. If the host is not part of a cluster, stop the services and retry
Warning: 192.168.1.181: The host seems to be in a cluster already as cluster configuration files have been found on the host. If the host is not part of a cluster, run 'pcs cluster destroy' on host '192.168.1.181' to remove those configuration files
Destroying cluster on hosts: '192.168.1.180', '192.168.1.181'...
192.168.1.180: Successfully destroyed cluster
192.168.1.181: Successfully destroyed cluster
Requesting remove 'pcsd settings' from '192.168.1.180', '192.168.1.181'
192.168.1.180: successful removal of the file 'pcsd settings'
192.168.1.181: successful removal of the file 'pcsd settings'
Sending 'corosync authkey', 'pacemaker authkey' to '192.168.1.180', '192.168.1.181'
192.168.1.180: successful distribution of the file 'corosync authkey'
192.168.1.180: successful distribution of the file 'pacemaker authkey'
192.168.1.181: successful distribution of the file 'corosync authkey'
192.168.1.181: successful distribution of the file 'pacemaker authkey'
Sending 'corosync.conf' to '192.168.1.180', '192.168.1.181'
192.168.1.181: successful distribution of the file 'corosync.conf'
192.168.1.180: successful distribution of the file 'corosync.conf'
Cluster has been successfully set up.
Once the cluster has been created, start it and enable it on both nodes:
# Both nodes: start the service
:~# pcs cluster start
Starting Cluster...
:~# pcs cluster enable
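You can then confirm that both nodes have joined the cluster:
# Either node: check the cluster membership
:~# pcs cluster status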
4. Setup Shared Resources
Now create the resources that Pacemaker will manage:
# One node
:~# pcs resource create VirtualIP ocf:heartbeat:IPaddr2 ip=192.168.1.190
:~# pcs resource status
* VirtualIP (ocf:heartbeat:IPaddr2): Stopped
:~# pcs resource create bacula-dir systemd:bacula-dir
:~# pcs resource create bweb systemd:bweb
:~# pcs resource status
* VirtualIP (ocf:heartbeat:IPaddr2): Stopped
* bacula-dir (systemd:bacula-dir): Stopped
* bweb (systemd:bweb): Stopped
Next, define the startup order for the resources: the Bacula Director must start first, followed by BWeb and the virtual IP. To enforce this, add ordering constraints and then group the resources together so they always run on the same node:
:~# pcs constraint order start bacula-dir then VirtualIP
Adding bacula-dir VirtualIP (kind: Mandatory) (Options: first-action=start then-action=start)
:~# pcs constraint order start bacula-dir then bweb
Adding bacula-dir bweb (kind: Mandatory) (Options: first-action=start then-action=start)
:~# pcs resource group add BaculaDir bacula-dir bweb VirtualIP
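The resulting ordering constraints and resource group can be reviewed at any time with:
:~# pcs constraint
:~# pcs resource config BaculaDir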
5. STONITH Considerations
STONITH (Shoot The Other Node In The Head, sometimes expanded as Shoot The Offending Node In The Head) is highly recommended to prevent split-brain situations in two-node clusters. Even in clusters with three or more nodes, configuring STONITH and fencing remains useful and is generally recommended.
The exact procedure for enabling STONITH and fencing depends on the hardware or virtualization platform used in your environment.
For instance, if you are using VMware hosts, a fencing agent can be configured as follows:
crm configure primitive fence_vmware stonith:fence_vmware_rest \
    params \
        ipaddr="<VSPHERE IP ADDRESS>" \
        action=reboot \
        login="<LOGIN>" \
        passwd="<PASSWORD>" \
        ssl=1 ssl_insecure=1 \
        pcmk_reboot_timeout=900 \
        power_timeout=60 \
    op monitor \
        interval=3600 \
        timeout=120
Once configured, the fencing agent appears in the cluster status output. You can also test it manually using a command such as:
sudo crm node fence <NODE>
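Since this guide otherwise uses PCS, a roughly equivalent fencing resource could also be created with pcs stonith create. The sketch below is only an illustration: parameter names differ between fence-agents versions, so verify them first with pcs stonith describe fence_vmware_rest.
# Hypothetical PCS equivalent; check parameter names with 'pcs stonith describe fence_vmware_rest'
pcs stonith create fence_vmware fence_vmware_rest \
    ip="<VSPHERE IP ADDRESS>" username="<LOGIN>" password="<PASSWORD>" \
    ssl=1 ssl_insecure=1 pcmk_reboot_timeout=900 \
    op monitor interval=3600 timeout=120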
Available fencing agents vary widely and depend entirely on the underlying hypervisor or physical infrastructure. For this reason - and for the sake of simplicity - STONITH is disabled in the remainder of this example.
6. Tests and checks
STONITH must either be configured or explicitly disabled; otherwise it prevents the cluster resources from starting, as the status output shows:
:~# pcs status
Cluster name: baculadir
WARNINGS:
No stonith devices and stonith-enabled is not false
Cluster Summary:
....
As explained in the previous section, we disable STONITH for this example:
# Disable stonith
:~# pcs property set stonith-enabled=false
With STONITH disabled, the resources can start normally. Verify the cluster status again:
:~# pcs status
Cluster name: baculadir
Cluster Summary:
* Stack: corosync (Pacemaker is running)
* Current DC: 192.168.1.181 (version 2.1.6-6fdc9deea29) - partition with quorum
* Last updated: Fri Nov 14 16:03:11 2025 on 192.168.1.181
* Last change: Fri Nov 14 16:02:05 2025 by root via cibadmin on 192.168.1.181
* 2 nodes configured
* 3 resource instances configured
Node List:
* Online: [ 192.168.1.180 192.168.1.181 ]
Full List of Resources:
* Resource Group: BaculaDir:
* bacula-dir (systemd:bacula-dir): Started 192.168.1.180
* VirtualIP (ocf:heartbeat:IPaddr2): Started 192.168.1.180
* bweb (systemd:bweb): Started 192.168.1.180
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
At this point, the cluster is being served by node 192.168.1.180. To test failover behavior, manually stop the cluster on that node and observe how node 192.168.1.181 takes over:
:~# pcs cluster stop 192.168.1.180
192.168.1.180: Stopping Cluster (pacemaker)...
192.168.1.180: Stopping Cluster (corosync)...
Checking the cluster status again shows that the original node is offline and the remaining node has taken control of the services:
:~# pcs status
Cluster name: baculadir
Cluster Summary:
* Stack: corosync (Pacemaker is running)
* Current DC: 192.168.1.181 (version 2.1.6-6fdc9deea29) - partition with quorum
* Last updated: Thu Nov 20 10:55:57 2025 on 192.168.1.181
* Last change: Wed Nov 19 07:35:24 2025 by root via cibadmin on 192.168.1.181
* 2 nodes configured
* 3 resource instances configured
Node List:
* Online: [ 192.168.1.181 ]
* OFFLINE: [ 192.168.1.180 ]
Full List of Resources:
* Resource Group: BaculaDir:
* bacula-dir (systemd:bacula-dir): Started 192.168.1.181
* VirtualIP (ocf:heartbeat:IPaddr2): Started 192.168.1.181
* bweb (systemd:bweb): Started 192.168.1.181
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
If you start node 192.168.1.180 again using pcs cluster start 192.168.1.180, the cluster
will recover the previous state. These manual operations are primarily intended for testing
or troubleshooting. In real-world scenarios, failures typically occur due to network loss
or hardware issues, and the cluster reacts in the same way.
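Another way to exercise failover, without stopping the cluster stack on a node, is to put the node into standby mode and bring it back afterwards:
# Move resources away from a node, then allow them to run on it again
:~# pcs node standby 192.168.1.180
:~# pcs node unstandby 192.168.1.180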
7. Integrate Pacemaker with DRBD
Pacemaker is frequently used in combination with a block-based replication layer, most commonly DRBD, as described earlier in the architecture sections.
If you plan to use Pacemaker together with DRBD, continue with the following section: HA: Drbd Setup.
See also
Go back to: High Availability.