fence_kdump with RRP/multiple link support
While working with a customer last year I became interested in making fence_kdump work better in pacemaker clusters with multiple heartbeat links (rings), sometimes also called RRP clusters. The motivation was to allow fence_kdump to accept messages not only from the primary IP of the crashed node (ring0) but also from any additional IPs (ring1, ring2, …). Recently the changes for this were accepted and merged in PR 374. Below you will find example configurations showing how to take advantage of this new feature.
TL;DR: check the configuration example.
- How does fence_kdump work?
- Before the RRP patch: introduction
- Before the RRP patch: workaround
- After the RRP patch: introduction
- After the RRP patch: configuration example
- Future plans
How does fence_kdump work?
Before starting, a quick refresher on how fence_kdump works in pacemaker clusters. The whole mechanism consists of two parts:
- fence_kdump - agent that is run during fencing on a healthy node and waits for a special “message” from the node that has crashed and should be fenced.
- fence_kdump_send - binary that is part of the kdump image and that sends the “message” at regular intervals while the kdump image is running and collecting the vmcore of the crashed system.
The idea is that as soon as the cluster node running fence_kdump receives the “message” from the correct host, it can assume that node to be fenced (= not running the cluster, most probably saving a vmcore).
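Both halves can also be exercised by hand on two live nodes. The invocation below is a sketch: port 7410 is the default UDP port, and the -p/-i/-c options for fence_kdump_send (port, send interval, message count) are taken from the fence-agents sources at the time of writing - verify them against -h output on your version.

```shell
# On node01: wait (verbosely) for the "message" from node02's IP
node01 # fence_kdump -v -n 10.0.0.12

# On node02: roughly what the kdump initramfs runs after a crash -
# send the "message" to node01 every 10 seconds, indefinitely (-c 0)
node02 # fence_kdump_send -p 7410 -i 10 -c 0 10.0.0.11
```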
In the cluster configuration, fence_kdump is usually configured as a “Level 1” stonith device for all nodes, and each node has its own “Level 2” stonith device that is used when fence_kdump times out waiting for the “message”. An example configuration is shown below (from CentOS 8.3).
node01 # grep fence_kdump_nodes /etc/kdump.conf
fence_kdump_nodes 10.0.0.12
node02 # grep fence_kdump_nodes /etc/kdump.conf
fence_kdump_nodes 10.0.0.11
# pcs stonith create kdump fence_kdump timeout=60
# pcs stonith level add 1 kdump node01
# pcs stonith level add 1 kdump node02
# pcs stonith level add 2 kdump ipmilan-node01
# pcs stonith level add 2 kdump ipmilan-node02
# pcs stonith config
Resource: kdump (class=stonith type=fence_kdump)
Attributes: timeout=60
Operations: monitor interval=60s (kdump-monitor-interval-60s)
Resource: ipmilan-node01 (class=stonith type=fence_ipmilan)
Attributes: user=test password=test ip=172.30.30.11
Operations: monitor interval=60s (ipmilan-node01-monitor-interval-60s)
Resource: ipmilan-node02 (class=stonith type=fence_ipmilan)
Attributes: user=test password=test ip=172.30.30.12
Operations: monitor interval=60s (ipmilan-node02-monitor-interval-60s)
Target: node01
Level 1 - kdump
Level 2 - ipmilan-node01
Target: node02
Level 1 - kdump
Level 2 - ipmilan-node02
In case the cluster needs to fence a node whose system has NOT crashed (so the kdump image was not started to collect a vmcore, and therefore there is no fence_kdump_send sending the “message”), the best-case times for fencing to complete are:
- wait for the “Level 1” fence_kdump to time out - 60s
- execute the “Level 2” stonith device to fence the node - ~5s (just for illustration, this varies wildly by hardware)

Total time for fencing the node would be ~65 seconds.
In case the cluster needs to fence a node because of a crash, we can expect fencing to take less than 60 seconds, assuming that we receive the “message”.
Before the RRP patch: introduction
Before the change in PR 374 was introduced, the fence_kdump agent would accept the “message” only from one IP, which is fine for clusters with a single heartbeat link (ring0). However, if the cluster uses RRP (or multiple heartbeat links) the situation gets complicated:

- fence_kdump can still accept the “message” only from a single IP
- if the “message” is received from another IP, it is ignored (discarded in the terminology of the fence_kdump agent)
Example of node01 (10.0.0.11) waiting for the message from node02 (10.0.0.12) and later receiving it.
node01 # fence_kdump -n 10.0.0.12
[debug]: waiting for message from '10.0.0.12'
[debug]: received valid message from '10.0.0.12'
Example of node01 (10.0.0.11) waiting for the message from node02 (10.0.0.12) but receiving a message from IP 192.168.1.12 (ring1) instead. NOTE: Without the -v option there is no output when a discarded message is received.
node01 # fence_kdump -v -n 10.0.0.12
...
[debug]: waiting for message from '10.0.0.12'
[debug]: discard message from '192.168.1.12'
So if there is an issue on the interface with the ring0 addresses (10.0.0.11, 10.0.0.12), fence_kdump will not notice the “message” coming from the crashed node via the interface with the ring1 addresses (192.168.1.11, 192.168.1.12) and will wait until it times out (60 seconds by default). This unnecessarily delays the cluster from attempting fencing with the stonith device in the next level.
Is there any solution or workaround to this without changing fence_kdump? Yes, there is: use another fence_kdump agent in the next stonith level that waits for the “message” from the other IP. The problem here, however, is time. How long should we wait for the “message” to arrive from the ring0 interface before trying to wait for it from the ring1 interface? We don’t want to wait too long, as the node might not have crashed and might require a real power fencing device, so more waiting can make cluster recovery longer.
Before the RRP patch: workaround
To accommodate this approach in a pacemaker cluster, several fence_kdump devices can be created and organized into stonith levels to listen for all node IPs (ring0 and ring1 in the example here).
node01 # grep fence_kdump_nodes /etc/kdump.conf
fence_kdump_nodes 10.0.0.12 192.168.1.12
node02 # grep fence_kdump_nodes /etc/kdump.conf
fence_kdump_nodes 10.0.0.11 192.168.1.11
# pcs stonith create kdump-node01-ring0 fence_kdump timeout=60 nodename=10.0.0.11
# pcs stonith create kdump-node02-ring0 fence_kdump timeout=60 nodename=10.0.0.12
# pcs stonith create kdump-node01-ring1 fence_kdump timeout=15 nodename=192.168.1.11
# pcs stonith create kdump-node02-ring1 fence_kdump timeout=15 nodename=192.168.1.12
# pcs stonith level add 1 kdump kdump-node01-ring0
# pcs stonith level add 1 kdump kdump-node02-ring0
# pcs stonith level add 2 kdump kdump-node01-ring1
# pcs stonith level add 2 kdump kdump-node02-ring1
# pcs stonith level add 3 kdump ipmilan-node01
# pcs stonith level add 3 kdump ipmilan-node02
# pcs stonith config
Resource: kdump-node01-ring0 (class=stonith type=fence_kdump)
Attributes: timeout=60 nodename=10.0.0.11
Operations: monitor interval=60s (kdump-node01-ring0-monitor-interval-60s)
Resource: kdump-node01-ring1 (class=stonith type=fence_kdump)
Attributes: timeout=15 nodename=192.168.1.11
Operations: monitor interval=60s (kdump-node01-ring1-monitor-interval-60s)
Resource: kdump-node02-ring0 (class=stonith type=fence_kdump)
Attributes: timeout=60 nodename=10.0.0.12
Operations: monitor interval=60s (kdump-node02-ring0-monitor-interval-60s)
Resource: kdump-node02-ring1 (class=stonith type=fence_kdump)
Attributes: timeout=15 nodename=192.168.1.12
Operations: monitor interval=60s (kdump-node02-ring1-monitor-interval-60s)
Resource: ipmilan-node01 (class=stonith type=fence_ipmilan)
Attributes: user=test password=test ip=172.30.30.11
Operations: monitor interval=60s (ipmilan-node01-monitor-interval-60s)
Resource: ipmilan-node02 (class=stonith type=fence_ipmilan)
Attributes: user=test password=test ip=172.30.30.12
Operations: monitor interval=60s (ipmilan-node02-monitor-interval-60s)
Target: node01
Level 1 - kdump-node01-ring0
Level 2 - kdump-node01-ring1
Level 3 - ipmilan-node01
Target: node02
Level 1 - kdump-node02-ring0
Level 2 - kdump-node02-ring1
Level 3 - ipmilan-node02
As you may notice, the above configuration got quite a bit longer, since we need to specify the IP of each node. By default pacemaker provides the single IP to fence_kdump automatically, which makes it easier to configure.
In addition to the complexity, the above example also shows that our waiting time before engaging power fencing has effectively doubled if the system has not crashed. While we can decrease the timeout, it is important to note that it needs to be long enough for the kdump image to boot, start saving the vmcore, and bring up all the required networking. The timeout should also be longer than the interval at which fence_kdump_send sends the “message”, which defaults to 10 seconds.
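That send interval (and the port) can be tuned via the fence_kdump_args directive in /etc/kdump.conf, which passes extra command-line options to fence_kdump_send inside the kdump image. A sketch with the stock defaults - the exact option set is distribution-specific, so check the comments in your kdump.conf:

```shell
node01 # grep ^fence_kdump /etc/kdump.conf
fence_kdump_args -p 7410 -f auto -c 0 -i 10
fence_kdump_nodes 10.0.0.12
node01 # systemctl restart kdump    # rebuild the kdump image
```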
In case the cluster needs to fence a node whose system has NOT crashed (so the kdump image was not started to collect a vmcore, and therefore there is no fence_kdump_send sending the “message”), the best-case times for fencing to complete are:
- wait for the “Level 1” fence_kdump to time out - 60s
- wait for the “Level 2” fence_kdump to time out - 15s
- execute the “Level 3” stonith device to fence the node - ~5s (just for illustration, this varies wildly by hardware)

Total time for fencing the node would be ~80 seconds.
In case the cluster needs to fence a node because of a crash, we can expect fencing to take:

- less than 60 seconds, assuming that we receive the “message” through ring0
- 60-75 seconds, assuming that we receive the “message” through ring1 - this is the case the RRP patch is trying to improve
After the RRP patch: introduction
The patch in PR 374 introduced the ability to specify more than one IP from which fence_kdump expects the “message”. This effectively means that during the same timeout window we can wait for the “message” from ANY of the specified IPs to make fence_kdump report success.
Example of node01 (10.0.0.11) waiting for the message from node02 on both addresses (10.0.0.12, 192.168.1.12) and later receiving it from the first one (10.0.0.12):
node01 # fence_kdump -n 10.0.0.12,192.168.1.12
[debug]: waiting for message from '10.0.0.12'
[debug]: waiting for message from '192.168.1.12'
[debug]: received valid message from '10.0.0.12'
Example of node01 (10.0.0.11) waiting for the message from node02 on both addresses (10.0.0.12, 192.168.1.12) and later receiving it from the second one (192.168.1.12):
node01 # fence_kdump -n 10.0.0.12,192.168.1.12
[debug]: waiting for message from '10.0.0.12'
[debug]: waiting for message from '192.168.1.12'
[debug]: received valid message from '192.168.1.12'
After the RRP patch: configuration example
To take advantage of the new fence_kdump ability to listen for the “message” from multiple IPs, we currently need to create a stonith device for each node, specifying the nodename parameter with those addresses. We no longer need a separate device for each IP, as shown in the workaround earlier.
How do I determine if I’m using fence_kdump with the RRP patch? Check the help of the fence_kdump command. The nodename parameter should show multiple NODEs.
# fence_kdump -h|grep nodename
-n, --nodename=NODE[,NODE...]  List of names or IP addresses of node to be fenced
An old version would show only a single NODE there.
Add the node IPs/hostnames for both ring0 and ring1 into /etc/kdump.conf and rebuild the kdump image (usually by restarting the kdump service).
node01 # grep fence_kdump_nodes /etc/kdump.conf
fence_kdump_nodes 10.0.0.12 192.168.1.12
node01 # systemctl restart kdump
node02 # grep fence_kdump_nodes /etc/kdump.conf
fence_kdump_nodes 10.0.0.11 192.168.1.11
node02 # systemctl restart kdump
Create the stonith devices and add them to the appropriate levels. Make sure to specify nodename for the fence_kdump stonith devices.
# pcs stonith create kdump-node01-rrp fence_kdump timeout=60 nodename=10.0.0.11,192.168.1.11
# pcs stonith create kdump-node02-rrp fence_kdump timeout=60 nodename=10.0.0.12,192.168.1.12
# pcs stonith level add 1 kdump kdump-node01-ring0
# pcs stonith level add 1 kdump kdump-node02-ring0
# pcs stonith level add 2 kdump ipmilan-node01
# pcs stonith level add 2 kdump ipmilan-node02
Verify the configuration (NOTE: the fence_ipmilan configuration is just an example here; use your own power fencing stonith device instead).
# pcs stonith config
Resource: kdump-node01-rrp (class=stonith type=fence_kdump)
Attributes: timeout=60 nodename=10.0.0.11,192.168.1.11
Operations: monitor interval=60s (kdump-node01-rrp-monitor-interval-60s)
Resource: kdump-node02-rrp (class=stonith type=fence_kdump)
Attributes: timeout=60 nodename=10.0.0.12,192.168.1.12
Operations: monitor interval=60s (kdump-node02-rrp-monitor-interval-60s)
Resource: ipmilan-node01 (class=stonith type=fence_ipmilan)
Attributes: user=test password=test ip=172.30.30.11
Operations: monitor interval=60s (ipmilan-node01-monitor-interval-60s)
Resource: ipmilan-node02 (class=stonith type=fence_ipmilan)
Attributes: user=test password=test ip=172.30.30.12
Operations: monitor interval=60s (ipmilan-node02-monitor-interval-60s)
Target: node01
Level 1 - kdump-node01-rrp
Level 2 - ipmilan-node01
Target: node02
Level 1 - kdump-node02-rrp
Level 2 - ipmilan-node02
To test if the new configuration works, you can try the following scenario:

- break the ring0 communication between the nodes (for example by disconnecting cables or blocking all ring0 traffic on the firewall)
- crash one of the nodes and wait for the other node to report success in fencing the node with fence_kdump
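As a sketch of such a test - assuming iptables is in use and ring0 lives on the 10.0.0.0/24 network as in the examples above; adapt to your firewall tooling:

```shell
# On node01: drop all ring0 traffic from node02 to simulate a broken link
node01 # iptables -A INPUT -s 10.0.0.12 -j DROP

# On node02: force a kernel crash (sysrq must be enabled)
node02 # echo 1 > /proc/sys/kernel/sysrq
node02 # echo c > /proc/sysrq-trigger

# On node01: watch the fencing result; fence_kdump should succeed via ring1
node01 # pcs status
```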
In case the cluster needs to fence a node whose system has NOT crashed (so the kdump image was not started to collect a vmcore, and therefore there is no fence_kdump_send sending the “message”), the best-case times for fencing to complete are:
- wait for the “Level 1” fence_kdump to time out - 60s
- execute the “Level 2” stonith device to fence the node - ~5s (just for illustration, this varies wildly by hardware)

Total time for fencing the node would be ~65 seconds.
In case the cluster needs to fence a node because of a crash, we can expect fencing to take less than 60 seconds, assuming that we received the “message” from any of the specified IPs.
Future plans
As you may have noticed, the new fence_kdump requires the nodename attribute to be specified, which is not needed in a non-RRP (single heartbeat) cluster, where fence_kdump receives the IP/hostname of the node to fence from pacemaker. A possible future improvement would be the same behavior for the RRP scenario, relieving the user of initial configuration that could be “automagically” detected.
Another possible improvement could be autoconfiguration of /etc/kdump.conf with all the needed IPs, so that the “message” is sent to all IPs of the other nodes and not just the first one.
While I have no immediate plans to implement the above features, I might have a look at them.