
fence_kdump with RRP/multiple link support

While working with a customer last year I got interested in making fence_kdump work better in pacemaker clusters with multiple heartbeat links (rings), sometimes also called RRP clusters. The motivation was to allow fence_kdump to accept messages not only from the primary IP of the crashed node (ring0) but also from any additional IPs (ring1, ring2, …). Recently the changes for this were accepted and merged in PR 374. Below you will find example configurations showing how to take advantage of this new feature.

TL;DR: check the configuration example below.

How does fence_kdump work?

Before starting, a quick refresher on how fence_kdump works in pacemaker clusters. The whole mechanism consists of two parts:

  1. the fence_kdump agent, which is run during fencing on a healthy node and waits for a special “message” from the node that has crashed and should be fenced.
  2. the fence_kdump_send binary, which is part of the kdump image and sends the “message” at regular intervals while the kdump image is running and collecting the vmcore of the crashed system.

The idea is that as soon as the cluster node running fence_kdump receives the “message” from the correct host, it can assume that node to be fenced (= not running the cluster, most probably saving a vmcore).
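
For illustration, the sending side boils down to a single command that kdump runs from inside its initramfs, generated from /etc/kdump.conf. A minimal sketch of an equivalent manual invocation, assuming the defaults (UDP port 7410, 10 second send interval), with node02 sending the “message” to node01:

node02 # fence_kdump_send -p 7410 -i 10 10.0.0.11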

In the cluster configuration fence_kdump is usually configured as a “Level 1” stonith device for all nodes, and each node has its own “Level 2” stonith device that is used when fence_kdump times out waiting for the “message”. An example configuration is shown below (from CentOS 8.3).

node01 # grep fence_kdump_nodes /etc/kdump.conf
fence_kdump_nodes 10.0.0.12
node02 # grep fence_kdump_nodes /etc/kdump.conf
fence_kdump_nodes 10.0.0.11
# pcs stonith create kdump fence_kdump timeout=60
# pcs stonith level add 1 node01 kdump
# pcs stonith level add 1 node02 kdump
# pcs stonith level add 2 node01 ipmilan-node01
# pcs stonith level add 2 node02 ipmilan-node02
# pcs stonith config
 Resource: kdump (class=stonith type=fence_kdump)
  Attributes: timeout=60
  Operations: monitor interval=60s (kdump-monitor-interval-60s)
 Resource: ipmilan-node01 (class=stonith type=fence_ipmilan)
  Attributes: user=test password=test ip=172.30.30.11
  Operations: monitor interval=60s (ipmilan-node01-monitor-interval-60s)
 Resource: ipmilan-node02 (class=stonith type=fence_ipmilan)
  Attributes: user=test password=test ip=172.30.30.12
  Operations: monitor interval=60s (ipmilan-node02-monitor-interval-60s)
 Target: node01
   Level 1 - kdump
   Level 2 - ipmilan-node01
 Target: node02
   Level 1 - kdump
   Level 2 - ipmilan-node02
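
One practical note not shown above: fence_kdump listens for the “message” on UDP port 7410 by default, so this port must be open on all cluster nodes. A sketch assuming firewalld (as on CentOS 8):

# firewall-cmd --permanent --add-port=7410/udp
# firewall-cmd --reload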

In case the cluster needs to fence a node when the system has NOT crashed (so the kdump image was not started to collect a vmcore, and therefore there is no fence_kdump_send sending the “message”), the best-case fencing proceeds as follows:

  1. wait for the “Level 1” fence_kdump to time out - 60s
  2. execute the “Level 2” stonith device to fence the node - ~5s (just for illustration, this varies wildly by hardware)

The total time for fencing the node would be ~65 seconds.

In case the cluster needs to fence a node because it crashed, we can expect fencing to take less than 60 seconds, assuming that we receive the “message”.

Before the RRP patch: introduction

Before the change in PR 374 was introduced, the fence_kdump agent would accept the “message” only from one IP, which is fine for clusters with a single heartbeat link (ring0). However, if the cluster uses RRP (or multiple heartbeat links) the situation gets complicated:

  • fence_kdump can still accept the “message” only from a single IP
  • if the “message” is received from another IP, it will be ignored (“discarded” in the terminology of the fence_kdump agent)

Example of node01 (10.0.0.11) waiting for the message from node02 (10.0.0.12) and later receiving it:

node01 # fence_kdump -n 10.0.0.12
[debug]: waiting for message from '10.0.0.12'
[debug]: received valid message from '10.0.0.12'

Example of node01 (10.0.0.11) waiting for the message from node02 (10.0.0.12) but receiving the message from IP 192.168.1.12 (ring1) instead. NOTE: without the -v option there is no output when a discarded message is received.

node01 # fence_kdump -v -n 10.0.0.12
...
[debug]: waiting for message from '10.0.0.12'
[debug]: discard message from '192.168.1.12'

So if there is an issue on the interface with the ring0 addresses (10.0.0.11, 10.0.0.12), fence_kdump will not accept the “message” coming from the crashed node via the interface with the ring1 addresses (192.168.1.11, 192.168.1.12): as shown above, the message arrives but is discarded, and fence_kdump will wait until it times out (default 60 seconds). This (unnecessarily) delays the cluster from attempting fencing with the stonith device in the next level.

Is there any solution/workaround to this without changing fence_kdump? Yes, there is: using another fence_kdump agent in the next stonith level that waits for the “message” from the other IP. The problem here, however, is time. How long should we wait for the “message” to arrive from the ring0 interface before trying to wait for it from the ring1 interface? We don’t want to wait too long, as the node might not have crashed and may require a real power fencing device, so more waiting can make cluster recovery longer.

Before the RRP patch: workaround

To accommodate this approach in a pacemaker cluster, several fence_kdump devices can be created and organized into STONITH levels so that all node IPs are listened for (ring0 and ring1 in this example).

node01 # grep fence_kdump_nodes /etc/kdump.conf
fence_kdump_nodes 10.0.0.12 192.168.1.12
node02 # grep fence_kdump_nodes /etc/kdump.conf
fence_kdump_nodes 10.0.0.11 192.168.1.11
# pcs stonith create kdump-node01-ring0 fence_kdump timeout=60 nodename=10.0.0.11
# pcs stonith create kdump-node02-ring0 fence_kdump timeout=60 nodename=10.0.0.12
# pcs stonith create kdump-node01-ring1 fence_kdump timeout=15 nodename=192.168.1.11
# pcs stonith create kdump-node02-ring1 fence_kdump timeout=15 nodename=192.168.1.12
# pcs stonith level add 1 node01 kdump-node01-ring0
# pcs stonith level add 1 node02 kdump-node02-ring0
# pcs stonith level add 2 node01 kdump-node01-ring1
# pcs stonith level add 2 node02 kdump-node02-ring1
# pcs stonith level add 3 node01 ipmilan-node01
# pcs stonith level add 3 node02 ipmilan-node02
# pcs stonith config
 Resource: kdump-node01-ring0 (class=stonith type=fence_kdump)
  Attributes: timeout=60 nodename=10.0.0.11
  Operations: monitor interval=60s (kdump-node01-ring0-monitor-interval-60s)
 Resource: kdump-node01-ring1 (class=stonith type=fence_kdump)
  Attributes: timeout=15 nodename=192.168.1.11
  Operations: monitor interval=60s (kdump-node01-ring1-monitor-interval-60s)
 Resource: kdump-node02-ring0 (class=stonith type=fence_kdump)
  Attributes: timeout=60 nodename=10.0.0.12
  Operations: monitor interval=60s (kdump-node02-ring0-monitor-interval-60s)
 Resource: kdump-node02-ring1 (class=stonith type=fence_kdump)
  Attributes: timeout=15 nodename=192.168.1.12
  Operations: monitor interval=60s (kdump-node02-ring1-monitor-interval-60s)
 Resource: ipmilan-node01 (class=stonith type=fence_ipmilan)
  Attributes: user=test password=test ip=172.30.30.11
  Operations: monitor interval=60s (ipmilan-node01-monitor-interval-60s)
 Resource: ipmilan-node02 (class=stonith type=fence_ipmilan)
  Attributes: user=test password=test ip=172.30.30.12
  Operations: monitor interval=60s (ipmilan-node02-monitor-interval-60s)
 Target: node01
   Level 1 - kdump-node01-ring0
   Level 2 - kdump-node01-ring1
   Level 3 - ipmilan-node01
 Target: node02
   Level 1 - kdump-node02-ring0
   Level 2 - kdump-node02-ring1
   Level 3 - ipmilan-node02

As you may notice, the above configuration got quite a bit longer since we need to specify the IP of each node. By default pacemaker provides the single IP to fence_kdump automatically, which makes it easier to configure.

In addition to the complexity, the above example also shows that our waiting time before engaging the power fencing has increased when the system has not crashed (with the default 60s timeout on both levels it would effectively double). While we can decrease the timeout, it is important to note that it needs to be long enough for the kdump image to boot on the crashed system, start saving the vmcore, and bring up all the required networking. The timeout should also be longer than the interval at which fence_kdump_send sends the “message”, which defaults to 10 seconds.
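
For example, the send interval (and other fence_kdump_send options) can be tuned with the fence_kdump_args directive in /etc/kdump.conf. A sketch with an illustrative 5 second interval (the shipped defaults, as far as I know, are -p 7410 -f auto -c 0 -i 10):

node01 # grep fence_kdump /etc/kdump.conf
fence_kdump_args -p 7410 -f auto -c 0 -i 5
fence_kdump_nodes 10.0.0.12 192.168.1.12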

In case the cluster needs to fence a node when the system has NOT crashed (so the kdump image was not started to collect a vmcore, and therefore there is no fence_kdump_send sending the “message”), the best-case fencing proceeds as follows:

  1. wait for the “Level 1” fence_kdump to time out - 60s
  2. wait for the “Level 2” fence_kdump to time out - 15s
  3. execute the “Level 3” stonith device to fence the node - ~5s (just for illustration, this varies wildly by hardware)

The total time for fencing the node would be ~80 seconds.

In case the cluster needs to fence a node because it crashed, we can expect fencing to take:

  • less than 60 seconds, assuming that we receive the “message” through ring0
  • 60-75 seconds, assuming that we receive the “message” through ring1 - this is the case that the RRP patch tries to improve

After the RRP patch: introduction

The patch in PR 374 introduced the ability to specify more than one IP from which fence_kdump expects the “message”. This effectively means that during the same timeout window we can wait for the “message” from ANY of the specified IPs to make fence_kdump report success.

Example of node01 (10.0.0.11) waiting for the message from node02 on both addresses (10.0.0.12, 192.168.1.12) and later receiving it from the first one (10.0.0.12):

node01 # fence_kdump -n 10.0.0.12,192.168.1.12
[debug]: waiting for message from '10.0.0.12'
[debug]: waiting for message from '192.168.1.12'
[debug]: received valid message from '10.0.0.12'

Example of node01 (10.0.0.11) waiting for the message from node02 on both addresses (10.0.0.12, 192.168.1.12) and later receiving it from the second one (192.168.1.12):

node01 # fence_kdump -n 10.0.0.12,192.168.1.12
[debug]: waiting for message from '10.0.0.12'
[debug]: waiting for message from '192.168.1.12'
[debug]: received valid message from '192.168.1.12'

After the RRP patch: configuration example

To take advantage of the new fence_kdump ability to listen for the “message” from multiple IPs, we currently need to create a stonith device for each node, specifying the nodename parameter with that node’s addresses. We no longer need a separate device for each IP as shown in the workaround earlier.

How to determine whether you are using fence_kdump with the RRP patch? Check the help of the fence_kdump command. The nodename parameter should show multiple NODEs.

# fence_kdump -h|grep nodename
  -n, --nodename=NODE[,NODE...] List of names or IP addresses of node to be fenced

The old version would show only a single NODE there.
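
Alternatively, you can check which version of the fence agents package is installed and compare it against the release that contains PR 374 (package name here assumed from CentOS/RHEL packaging):

# rpm -q fence-agents-kdump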

Add the node IPs/hostnames for both ring0 and ring1 into /etc/kdump.conf and rebuild the kdump image (usually by restarting the kdump service).

node01 # grep fence_kdump_nodes /etc/kdump.conf
fence_kdump_nodes 10.0.0.12 192.168.1.12
node01 # systemctl restart kdump

node02 # grep fence_kdump_nodes /etc/kdump.conf
fence_kdump_nodes 10.0.0.11 192.168.1.11
node02 # systemctl restart kdump
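
To double-check that the rebuilt kdump image actually contains fence_kdump_send, you can list its contents. A sketch assuming the usual kdump initramfs path on CentOS 8 and the lsinitrd tool from dracut:

node01 # lsinitrd /boot/initramfs-$(uname -r)kdump.img | grep fence_kdump_send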

Create the stonith devices and add them to the appropriate levels. Make sure to specify nodename for the fence_kdump stonith devices.

# pcs stonith create kdump-node01-rrp fence_kdump timeout=60 nodename=10.0.0.11,192.168.1.11
# pcs stonith create kdump-node02-rrp fence_kdump timeout=60 nodename=10.0.0.12,192.168.1.12
# pcs stonith level add 1 node01 kdump-node01-rrp
# pcs stonith level add 1 node02 kdump-node02-rrp
# pcs stonith level add 2 node01 ipmilan-node01
# pcs stonith level add 2 node02 ipmilan-node02

Verify the configuration (NOTE: the fence_ipmilan configuration is just an example here, use your own power fencing stonith device instead).

# pcs stonith config
 Resource: kdump-node01-rrp (class=stonith type=fence_kdump)
  Attributes: timeout=60 nodename=10.0.0.11,192.168.1.11
  Operations: monitor interval=60s (kdump-node01-rrp-monitor-interval-60s)
 Resource: kdump-node02-rrp (class=stonith type=fence_kdump)
  Attributes: timeout=60 nodename=10.0.0.12,192.168.1.12
  Operations: monitor interval=60s (kdump-node02-rrp-monitor-interval-60s)
 Resource: ipmilan-node01 (class=stonith type=fence_ipmilan)
  Attributes: user=test password=test ip=172.30.30.11
  Operations: monitor interval=60s (ipmilan-node01-monitor-interval-60s)
 Resource: ipmilan-node02 (class=stonith type=fence_ipmilan)
  Attributes: user=test password=test ip=172.30.30.12
  Operations: monitor interval=60s (ipmilan-node02-monitor-interval-60s)
 Target: node01
   Level 1 - kdump-node01-rrp
   Level 2 - ipmilan-node01
 Target: node02
   Level 1 - kdump-node02-rrp
   Level 2 - ipmilan-node02

To test whether the new configuration works, you can try the following scenario (a command sketch follows the list):

  • break the ring0 communication between the nodes (for example by disconnecting cables or blocking all ring0 traffic on the firewall)
  • crash one of the nodes and wait for the other node to report success in fencing it with fence_kdump
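
A rough sketch of such a test on the example cluster (the iptables rules and the sysrq crash trigger are generic tooling, not part of fence_kdump; adapt them to your environment - the last command WILL crash node02):

node02 # iptables -I INPUT -s 10.0.0.11 -j DROP
node02 # iptables -I OUTPUT -d 10.0.0.11 -j DROP
node02 # echo 1 > /proc/sys/kernel/sysrq
node02 # echo c > /proc/sysrq-trigger

Then watch the fencing progress on the surviving node (for example with journalctl -u pacemaker -f) and verify that fence_kdump reports success while node02 is saving the vmcore.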

In case the cluster needs to fence a node when the system has NOT crashed (so the kdump image was not started to collect a vmcore, and therefore there is no fence_kdump_send sending the “message”), the best-case fencing proceeds as follows:

  1. wait for the “Level 1” fence_kdump to time out - 60s
  2. execute the “Level 2” stonith device to fence the node - ~5s (just for illustration, this varies wildly by hardware)

The total time for fencing the node would be ~65 seconds.

In case the cluster needs to fence a node because it crashed, we can expect fencing to take less than 60 seconds, assuming that we receive the “message” from any of the specified IPs.

Future plans

As you may have noticed, the new fence_kdump usage requires specifying the nodename attribute, which is not needed in a non-RRP (single heartbeat link) cluster where fence_kdump receives the IP/hostname of the node to fence from pacemaker. A possible future improvement is to implement the same for the RRP scenario, relieving the user of initial configuration that could be “automagically” detected.

Another possible improvement could be autoconfiguration of all the needed IPs in /etc/kdump.conf, so that the “message” is sent to all IPs of the other nodes and not just the first one.
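
As a rough illustration, the addresses such autoconfiguration would need are already present in the corosync configuration. A hypothetical sketch for node01 (assuming the ringX_addr entries in /etc/corosync/corosync.conf contain IP addresses; the grep filters out node01’s own addresses):

node01 # awk '/ring[0-9]_addr:/ {print $2}' /etc/corosync/corosync.conf | grep -v -e 10.0.0.11 -e 192.168.1.11
10.0.0.12
192.168.1.12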

While I have no immediate plans to implement the above features, I might have a look at them.
