CLUSTER COUNT-FAILURE-REPORTS
CLUSTER COUNT-FAILURE-REPORTS node-id
- Available since: 3.0.0
- Time complexity: O(N) where N is the number of failure reports
- ACL categories: @admin, @slow, @dangerous
The command returns the number of failure reports for the specified node.
Failure reports are the mechanism Redict Cluster uses to promote a PFAIL state, which means that a node is not reachable, to a FAIL state, which means that the majority of masters in the cluster agreed within a window of time that the node is not reachable.
A few more details:
- A node flags another node with PFAIL when the node is not reachable for a time greater than the configured node timeout, which is a fundamental configuration parameter of a Redict Cluster.
- Nodes in PFAIL state are provided in gossip sections of heartbeat packets.
- Every time a node processes gossip packets from other nodes, it creates (and refreshes the TTL if needed) failure reports, remembering that a given node said another given node is in PFAIL condition.
- Each failure report has a time to live of two times the node timeout time.
- If at a given time a node has another node flagged with PFAIL, and at the same time has collected failure reports about this node from the majority of master nodes (counting itself if it is a master), then it elevates the failure state of the node from PFAIL to FAIL, and broadcasts a message forcing all the nodes that can be reached to flag the node as FAIL.
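The bookkeeping described in the list above can be sketched in a few lines of Python. This is only an illustrative model, not the Redict implementation: the class FailureReports, the function should_mark_fail, and all parameter names are hypothetical, time is passed in explicitly as seconds, and the majority is assumed to be computed as half the number of masters plus one.

```python
from dataclasses import dataclass, field


@dataclass
class FailureReports:
    """Hypothetical model of the reports one node keeps about one suspect node."""
    node_timeout: float                          # cluster node timeout, in seconds
    reports: dict = field(default_factory=dict)  # reporter id -> time of last report

    def add(self, reporter_id: str, now: float) -> None:
        # Create the report, or refresh its TTL if it already exists.
        self.reports[reporter_id] = now

    def count(self, now: float) -> int:
        # Drop reports older than two times the node timeout, count the rest.
        ttl = 2 * self.node_timeout
        self.reports = {r: t for r, t in self.reports.items() if now - t <= ttl}
        return len(self.reports)


def should_mark_fail(reports, now, num_masters, i_am_master, i_flag_pfail):
    """Return True when PFAIL would be elevated to FAIL in this model."""
    if not i_flag_pfail:
        return False
    # The local node's own opinion is not stored as a report; it is added
    # here only if the local node is a master.
    votes = reports.count(now) + (1 if i_am_master else 0)
    majority = num_masters // 2 + 1              # assumed majority rule
    return votes >= majority
```

For example, with a node timeout of 15 seconds and 5 masters, reports from two other masters plus the local master's own PFAIL flag reach the majority of 3. Once one of those reports is older than 30 seconds it expires, and the majority is no longer reached.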
This command returns the number of failure reports for the specified node that have not yet expired (that is, reports received within two times the node timeout). The count does not include what the node receiving the command itself believes about the node ID passed as argument; it only includes the failure reports that node received from other nodes.
This command is mainly useful for debugging, when the failure detector of Redict Cluster is not operating as we believe it should.
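As a debugging aid, the count can be compared with the majority of masters. The helper below is a sketch under assumptions: failure_report_status is a hypothetical name, the client is any object exposing a redis-py style execute_command method (using such a client against Redict is assumed to work, not confirmed here), and the caller supplies the number of masters and whether the queried node is itself a master that flags the target as PFAIL.

```python
def failure_report_status(client, node_id, num_masters, asker_votes=False):
    """Compare the failure report count for node_id with the master majority.

    asker_votes should be True when the queried node is a master that itself
    flags node_id as PFAIL, since its own opinion is not part of the count.
    """
    reply = client.execute_command("CLUSTER", "COUNT-FAILURE-REPORTS", node_id)
    reports = int(reply)
    majority = num_masters // 2 + 1              # assumed majority rule
    votes = reports + (1 if asker_votes else 0)
    return {"reports": reports, "majority": majority, "reached": votes >= majority}
```

Running this against several nodes for the same node ID shows which nodes are still short of the majority, which can help explain why a PFAIL state is not being promoted to FAIL.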