This FAQ entry explains the meaning of the configuration values for the group-management-service element in domain.xml, and the implications of changing them.
The group-management-service element provides attributes whose values determine the health monitoring and discovery protocol behavior in Shoal GMS.
The default values in the GlassFish domain.xml were arrived at through our testing, both functional and with the system under load, with an 8-instance cluster. Recently, customers have reported that with these values they were able to run a large GMS group with Sailfin without issue, with group membership event notifications reported correctly. These domain.xml values are therefore also a good default for clusters with a large number of instances.
The following attributes are present in the group-management-service element:
fd stands for Failure Detection
fd-protocol-max-tries is the maximum number of missed heartbeats that the GMS service provider's HealthMonitor waits for before marking an instance as suspected to have failed. In addition to counting missed heartbeats, GMS also tries to make a peer-to-peer connection with the suspected member, and only if that also fails is the member marked as suspected failed.
fd-protocol-timeout-in-millis is the interval, in milliseconds, between the heartbeat messages that an instance sends out to report its Alive state. As a result, it is also the number of milliseconds that the max-tries logic in the GMS service provider's Master Node waits before counting another missed heartbeat.
Lowering the value of fd-protocol-max-tries results in failure suspicion after fewer missed heartbeats, and raising it results in suspicion after more. More below on consequences of different settings in the Impact of Changing Values section.
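To make the interaction between the two fd attributes concrete, here is a minimal sketch of the missed-heartbeat counting described above. It is not Shoal's actual implementation. The function names, the use of integer division to count elapsed intervals, and the ">=" comparison against the max tries value are all assumptions made for illustration.

```python
def missed_heartbeats(last_heartbeat_ms: int, now_ms: int, timeout_ms: int) -> int:
    """Count full heartbeat intervals elapsed since the last heartbeat was seen.

    timeout_ms plays the role of fd-protocol-timeout-in-millis.
    """
    if timeout_ms <= 0:
        raise ValueError("timeout_ms must be positive")
    return max(0, (now_ms - last_heartbeat_ms) // timeout_ms)


def is_suspect(last_heartbeat_ms: int, now_ms: int, timeout_ms: int,
               max_tries: int, p2p_connect_ok: bool) -> bool:
    """A member is suspected only if enough heartbeats were missed AND the
    peer-to-peer connection attempt also failed.

    max_tries plays the role of fd-protocol-max-tries (hypothetical logic).
    """
    missed = missed_heartbeats(last_heartbeat_ms, now_ms, timeout_ms)
    return missed >= max_tries and not p2p_connect_ok
```

With a 2000 ms interval and 3 tries (illustrative values), a member last heard from at time 0 is not yet a suspect at 5999 ms. At 6000 ms it becomes one only if the peer-to-peer connection attempt fails too.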
merge-protocol-max-interval-in-millis and merge-protocol-min-interval-in-millis are no-op attributes that have no effect on GMS behavior. These attributes remained in the v2 release due to an oversight. In the upcoming v2.1.1 release, we are planning to deprecate or remove these attributes and to introduce more meaningful, descriptive attribute names.
ping-protocol-timeout-in-millis is the initial discovery timeout. This is the amount of time an instance's GMS module waits during instance startup to discover the master member of the group, which GMS calls the master node discovery protocol. The wait happens on a background thread, so appserver startup does not wait for the timeout. The instance's GMS module sends a master node query to the multicast group address and waits until a response is received or the timeout occurs. If the wait times out, that is, the instance does not receive a master node response from another instance within this time, it takes this as the absence of a master and assumes the master role itself, sending a master node announcement to the group. This instance subsequently responds to all future master node query messages from other members with a master node response. In the appserver, since the DAS joins a cluster as soon as the cluster is created, the DAS becomes the master member of the group ahead of time, allowing cluster members to discover the master quickly without having to time out. More below on the impact of changing this setting.
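The wait-or-become-master decision described above can be sketched as follows. This is a simplified illustration, not Shoal code. The in-process queue stands in for master node responses arriving from the multicast group, and the function name and the member names "das" and "instance1" are hypothetical.

```python
import queue


def discover_master(responses: "queue.Queue[str]", ping_timeout_ms: int,
                    self_name: str) -> str:
    """Return the name of the group's master member.

    Waits up to ping_timeout_ms (the role of ping-protocol-timeout-in-millis)
    for a master node response. On timeout, this member assumes the master
    role itself. A real implementation would first multicast a master node
    query and, on timeout, send a master node announcement to the group.
    """
    try:
        return responses.get(timeout=ping_timeout_ms / 1000.0)
    except queue.Empty:
        # No master answered within the timeout: assume the master role.
        return self_name


# A master ("das") already exists and has responded: discovery returns at once.
with_master = queue.Queue()
with_master.put("das")
print(discover_master(with_master, 50, "instance1"))

# Nobody responds: after the timeout, the caller becomes the master.
print(discover_master(queue.Queue(), 50, "instance1"))
```

The sketch also shows why an existing master shortens startup: the first call returns as soon as the response is available, while the second has to sit through the full timeout.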
vs-protocol-timeout-in-millis is the timeout of the Verify Suspect protocol used by the HealthMonitor. Once a member is marked suspect based on missed heartbeats and a failed peer-to-peer connection check, the verify suspect protocol kicks in. It waits for the specified timeout to check whether any further health state message is received in that time and whether a peer-to-peer connection can be made with the suspect member. If neither happens (i.e., the health state update is missing and the peer-to-peer connection attempt fails), the suspected member is marked as confirmed failed and a failure notification is sent out.
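The outcome of the verify suspect step can be summarized as a small state transition. The sketch below is an illustration under assumptions rather than the HealthMonitor's real code. The state names and the function are hypothetical, and it assumes that a cleared suspect simply returns to the alive state.

```python
from enum import Enum


class MemberState(Enum):
    ALIVE = "alive"
    SUSPECTED = "suspected"
    FAILED = "failed"


def after_verify_suspect(current: MemberState,
                         health_message_seen: bool,
                         p2p_connect_ok: bool) -> MemberState:
    """State of a member once the verify suspect wait has elapsed.

    health_message_seen: a health state message arrived during the wait.
    p2p_connect_ok: a peer-to-peer connection to the member succeeded.
    """
    if current is not MemberState.SUSPECTED:
        # The protocol only applies to members already marked suspect.
        return current
    if health_message_seen or p2p_connect_ok:
        # Any sign of life clears the suspicion (assumed behavior).
        return MemberState.ALIVE
    # Both checks failed: confirmed failure, and a notification would be sent.
    return MemberState.FAILED
```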
How much is gained from tuning the above values depends on how quickly and reliably the deployment environment needs failures to be detected.
Setting fd-protocol-timeout-in-millis (and/or fd-protocol-max-tries) lower or higher has an impact that you should consider.
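As a rough way to reason about this trade-off, the time from a member's last heartbeat to a confirmed failure can be approximated from the three failure detection attributes. The formula below is a back-of-the-envelope assumption, not a documented guarantee. It ignores the time spent on peer-to-peer connection attempts and any network delay, and the attribute values used are illustrative rather than necessarily the shipped defaults.

```python
def approx_detection_time_ms(fd_max_tries: int, fd_timeout_ms: int,
                             vs_timeout_ms: int) -> int:
    """Approximate time from last heartbeat to confirmed failure.

    Suspicion needs fd_max_tries missed heartbeats, one per fd_timeout_ms,
    and confirmation then waits a further vs_timeout_ms.
    """
    return fd_max_tries * fd_timeout_ms + vs_timeout_ms


# Illustrative settings: 3 tries, 2000 ms heartbeat interval, 1500 ms verify wait.
baseline = approx_detection_time_ms(3, 2000, 1500)   # 7500 ms

# Halving the heartbeat interval detects failures sooner ...
faster = approx_detection_time_ms(3, 1000, 1500)     # 4500 ms

# ... while adding tries makes detection slower.
slower = approx_detection_time_ms(5, 2000, 1500)     # 11500 ms

print(baseline, faster, slower)
```

Lower values shorten the estimate, at the cost of more frequent heartbeat traffic and less tolerance for a briefly unresponsive instance.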
The retries, missed heartbeat intervals, peer-to-peer connection based failure detection, watchdog based failure reporting, and verify suspect protocols are all needed to ensure that failure detection is robust and reliable in GlassFish/Sailfin. Most of these protocols (except for the watchdog protocol) are employed as standard in many group communication solutions such as JGroups, Coherence, GridGain, GigaSpaces, etc. Our goal is to have parity with those solutions, and with the additional watchdog capability we are augmenting the failure detection functionality.
For any further questions, please send email to users at shoal dot dev dot java dot net