NOTE: Fencing (from the verb to fence, meaning to enclose with a fence)
Fencing is the process of isolating a node of a computer cluster, or of protecting shared resources, when a node appears to be malfunctioning.
As the number of nodes in a cluster grows, so does the probability that one of them will fail at some point. A failing node may still hold control over shared resources, and if it is misbehaving those resources must be reclaimed to protect the rest of the cluster. Normally one of the following actions is taken: power the node off, or cut its access to the shared resources, so that data integrity is preserved.
DRBD is a perfect example of why fencing is needed. Most people set up DRBD replication on a dedicated network, so in most cases DRBD replication runs on a network that is separate and isolated from the Proxmox cluster network.
In our hypothetical setup the Proxmox cluster communicates on eth0 and DRBD communicates on eth1. Node A is running HA VM 101 and everything is running along fine until someone bumps a cable, unplugging eth0 on Node A. The Proxmox cluster can no longer reach Node A over eth0, so that node is assumed dead. However, it is not really dead: VM 101 is still running and still replicating data via DRBD on eth1.
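As a hedged illustration of such a setup, a Debian-style /etc/network/interfaces fragment for one node might look like the following. The addresses, netmasks, and interface roles are assumptions for this hypothetical example, not taken from any real configuration:

```
# Hypothetical example: eth0 carries Proxmox cluster traffic,
# eth1 is the dedicated, isolated DRBD replication link.
auto eth0
iface eth0 inet static
    address 192.168.1.11
    netmask 255.255.255.0
    gateway 192.168.1.1

auto eth1
iface eth1 inet static
    address 10.0.0.11
    netmask 255.255.255.0
```

The point of the second interface is that DRBD traffic keeps flowing on eth1 even when eth0 is unplugged, which is exactly the failure scenario described above.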
With fencing, here is what happens:
* Node A is turned off (fenced).
* VM 101 is started on Node B.
* When Node A comes back up, DRBD reconnects, life moves on, and everyone is happy.
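For turning a node off to be automatic, a fence device has to be configured. A minimal hedged sketch of what this can look like in /etc/pve/cluster.conf on a cman-based Proxmox cluster, assuming nodes with IPMI management boards (the fence_ipmilan agent is real; the cluster name, node name, address, and credentials are placeholders for illustration):

```
<?xml version="1.0"?>
<cluster name="hypothetical" config_version="2">
  <cman keyfile="/var/lib/pve-cluster/corosync.authkey"/>
  <fencedevices>
    <fencedevice agent="fence_ipmilan" name="ipmi-a" lanplus="1"
                 ipaddr="10.0.2.1" login="admin" passwd="secret"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="nodeA" votes="1" nodeid="1">
      <fence>
        <method name="1">
          <device name="ipmi-a"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
</cluster>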
Without fencing, here is what happens:
* VM 101 keeps running on Node A, replicating data via DRBD.
* VM 101 is started on Node B.
* Now your VM 101 disk is fubar with corruption beyond belief, because you have two VMs running at the same time writing to the same disk. Ouch. Instead of HA you now have a disaster on your hands; time to get the backups.
The last thing any of us ever wants is to see the HA system mess up and actually CAUSE a problem rather than prevent one. That is why fencing is required. It is imperative that the node you "think" is dead is actually dead. Just because you cannot ping it does not make it dead. Just because it is not responding to the cluster does not mean it cannot cause problems.
What if cman crashes on Node A? Or you mess up the network config for the cluster? Or some other unforeseen odd event happens?
The only 100% positive method to ensure that the VM is not still running, thus making it safe to start it elsewhere, is to fence the node that is no longer responding to the rest of the cluster.
What is a quorum?
A quorum is a designation given to a group of nodes in a cluster which are still allowed to operate on shared storage. It comes up when there is a failure in the cluster which breaks the nodes up into groups that can communicate within their group and with the shared storage, but not between groups.

How does OCFS2's cluster service define a quorum?
The quorum decision is made by a single node, based on the number of other nodes that are considered alive by heartbeating and the number of other nodes that are reachable via the network. A node has quorum when:
* it sees an odd number of heartbeating nodes and has network connectivity to more than half of them; OR,
* it sees an even number of heartbeating nodes, has network connectivity to at least half of them, *and* has connectivity to the heartbeating node with the lowest node number.
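The two rules above can be sketched as a small shell function. This is only an illustration of the decision logic, not OCFS2 code; the function name and arguments are assumptions made for this sketch:

```shell
#!/bin/sh
# Sketch of the OCFS2 quorum rule described above (hypothetical helper).
#   $1 = number of heartbeating nodes this node sees
#   $2 = number of those nodes reachable via the network
#   $3 = 1 if the lowest-numbered heartbeating node is reachable, else 0
has_quorum() {
    heartbeating=$1
    reachable=$2
    lowest_reachable=$3
    if [ $((heartbeating % 2)) -eq 1 ]; then
        # Odd count: need connectivity to more than half of them.
        [ "$reachable" -gt $((heartbeating / 2)) ]
    else
        # Even count: need at least half, plus the lowest-numbered node.
        [ $((reachable * 2)) -ge "$heartbeating" ] && \
            [ "$lowest_reachable" -eq 1 ]
    fi
}

# Example: 5 heartbeating nodes, 3 reachable over the network.
has_quorum 5 3 1 && echo "quorum" || echo "no quorum"
```

Note the tie-breaker in the even case: when exactly half the nodes are reachable (e.g. a two-node cluster split down the middle), only the group containing the lowest-numbered node keeps quorum, so at most one side survives.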
What is fencing?
Fencing is the act of forcefully removing a node from a cluster. A node with OCFS2 mounted will fence itself when it realizes that it does not have quorum in a degraded cluster. It does this so that other nodes will not get stuck trying to access its resources. Currently OCFS2 will panic the machine when it realizes it has to fence itself off from the cluster. As described above, it will do this when it sees more nodes heartbeating than it has connectivity to and fails the quorum test.

Due to user reports of nodes hanging during fencing, OCFS2 1.2.5 no longer uses "panic" for fencing. Instead, by default, it uses "machine restart". This should not only prevent nodes from hanging during fencing but also allow nodes to quickly restart and rejoin the cluster. While this change is internal in nature, we are documenting it so that users are aware they will no longer see the familiar panic stack trace during fencing. Instead they will see the message "*** ocfs2 is very sorry to be fencing this system by restarting ***", and probably only as part of the messages captured on the netdump/netconsole server. If the user wishes to use panic to fence (perhaps to see the familiar oops stack trace, or on the advice of customer support to diagnose frequent reboots), this can be done by issuing the following command after the O2CB cluster is online.
# echo 1 > /proc/fs/ocfs2_nodemanager/fence_method
Please note that this change is local to a node.
See Proxmox HA Cluster