= Network Isolation (Split Brain)
:idprefix:
:idseparator: -

A _split brain_ is a condition that occurs when two different brokers are serving the same messages at the same time.
When this happens, instead of client applications all sharing the _same_ broker as they ought, they may become divided between the two split brain brokers.
This is problematic because it can lead to:

* *Duplicate messages*, e.g. when multiple consumers on the same JMS queue split between both brokers and receive the same message(s)
* *Missed messages*, e.g. when multiple consumers on the same JMS topic split between both brokers and producers are only sending messages to one broker

Split brain most commonly happens when a pair of brokers in an HA *replication* configuration lose the replication connection linking them together.
When this connection is lost the backup assumes that the primary has died and therefore activates.
At this point there are two brokers on the network which are isolated from each other, and since the backup has a copy of all the messages from the primary they are each serving the same messages.

[IMPORTANT]
.What about shared store configurations?
====
While it is technically possible for split brain to happen with a pair of brokers in an HA _shared store_ configuration, it would require a failure in the file-locking mechanism of the storage device which the brokers are sharing.

One of the benefits of using a shared store is that the storage device itself acts as an arbiter to ensure consistency and mitigate split brain.
====

Recovering from a split brain may be as simple as stopping the broker which activated by mistake.
However, this solution is only viable *if* no client application connected to it and performed messaging operations.
The longer client applications are allowed to interact with split brain brokers the more difficult it will be to understand and remediate the resulting problems.

There are several different configurations you can choose from that will help mitigate split brain.

== Pluggable Lock Manager

A pluggable lock manager configuration requires a 3rd party to establish a shared lock between primary and backup brokers.
The shared lock ensures that either the primary or backup is active at any given point in time, similar to how the file lock functions in the shared storage use-case.

The _plugin_ decides what 3rd party implementation is used.
It could be something as simple as a shared file on a network file system that supports locking (e.g. NFS) or it could be something more complex like https://etcd.io/[etcd].

The broker ships with a xref:ha.adoc#apache-zookeeper-integration[reference plugin implementation] based on https://zookeeper.apache.org/[Apache ZooKeeper] - a common implementation used for this kind of task.

The main benefit of a pluggable lock manager is that it releases the broker from the responsibility of establishing a reliable vote.
This means that a _single_ HA pair of brokers can be reliably protected against split-brain.

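As an illustration, the sketch below shows how a ZooKeeper-based lock manager might be plugged into the `ha-policy` of a primary broker's `broker.xml`.
The class name, element layout and `connect-string` value are assumptions that vary between broker versions, so treat this only as a sketch and take the authoritative configuration from the xref:ha.adoc#apache-zookeeper-integration[Apache ZooKeeper integration] section.

[,xml]
----
<!-- sketch only: the class name and properties below are version-dependent assumptions -->
<ha-policy>
   <replication>
      <primary>
         <manager>
            <class-name>org.apache.activemq.artemis.lockmanager.zookeeper.CuratorDistributedLockManager</class-name>
            <properties>
               <!-- hypothetical ZooKeeper ensemble acting as the 3rd party arbiter -->
               <property key="connect-string" value="zk1:2181,zk2:2181,zk3:2181"/>
            </properties>
         </manager>
      </primary>
   </replication>
</ha-policy>
----

The backup broker would typically carry a matching `<manager>` element so that both sides of the HA pair contend for the same shared lock.
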
== Quorum Voting

Quorum voting is a process by which one node in a cluster can determine whether another node in the cluster is active without directly communicating with that node.
The broker initiating the vote can then take action based on the result (e.g. shutting itself down to avoid split-brain).

Quorum voting requires the participation of the other _active_ brokers in the cluster.
Of course this requires that there are, in fact, other active brokers in the cluster, which means quorum voting won't work with a single HA pair of brokers.
Furthermore, it also won't work with just two HA pairs of brokers because that's still not enough for a legitimate quorum.
There must be at least three HA pairs to establish a proper quorum with quorum voting.

=== Voting Mechanics

When the replication connection between a primary and backup is lost the backup and/or the primary may initiate a vote.

[IMPORTANT]
====
For a vote to pass, a _majority_ of affirmative responses is required.
For example, in a 3 node cluster a vote will pass with 2 affirmatives.
For a 4 node cluster this would be 3 affirmatives and so on.
====

==== Backup Voting

By default, if a backup loses its replication connection to its primary it will activate automatically.
However, it can be configured via the `vote-on-replication-failure` property to initiate a quorum vote in order to decide whether to activate or not.
If this is done then the backup will keep voting until it either receives a vote allowing it to start or it detects that the primary is still active.
In the latter case it will then restart as a backup.

See the section on xref:ha.adoc#replication-configuration[Replication Configuration] for more details on configuration.

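As a sketch of what that might look like in `broker.xml`, assuming the `<backup>` element layout used by recent versions (older versions use `<slave>`), the property named above could be set as follows; verify the exact placement against the xref:ha.adoc#replication-configuration[Replication Configuration] reference:

[,xml]
----
<!-- sketch only: element names are assumptions and vary between versions (e.g. <slave> instead of <backup>) -->
<ha-policy>
   <replication>
      <backup>
         <!-- ask the other active brokers before activating instead of activating automatically -->
         <vote-on-replication-failure>true</vote-on-replication-failure>
      </backup>
   </replication>
</ha-policy>
----
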
==== Primary Voting

By default, if the primary server loses its replication connection to the backup then it will just carry on and wait for a backup to reconnect and start replicating again.
However, this may mean that the primary remains active even though the backup has activated, so this behavior is configurable via the `vote-on-replication-failure` property.

See the section on xref:ha.adoc#replication-configuration[Replication Configuration] for more details on configuration.

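A corresponding sketch for the primary side, again assuming the `<primary>` element layout of recent versions (older versions use `<master>`); with this set, the primary will call a vote when replication is lost instead of simply carrying on:

[,xml]
----
<!-- sketch only: element names are assumptions and vary between versions (e.g. <master> instead of <primary>) -->
<ha-policy>
   <replication>
      <primary>
         <!-- vote on loss of the replication connection instead of simply remaining active -->
         <vote-on-replication-failure>true</vote-on-replication-failure>
      </primary>
   </replication>
</ha-policy>
----
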
== Pinging the network

You may configure one or more addresses in `broker.xml` that will be pinged throughout the life of the server.
The server will stop itself if it can't ping one or more of the addresses in the list.

If you execute the `create` command using the `--ping` argument you will create a default XML that is ready to be used with network checks:

[,console]
----
$ ./artemis create /myDir/myServer --ping 10.0.0.1
----

This XML will be added to your `broker.xml`:

[,xml]
----
<!--
  You can verify the network health of a particular NIC by specifying the <network-check-NIC> element.
  <network-check-NIC>theNicName</network-check-NIC>
-->

<!--
  Use this to use an HTTP server to validate the network
  <network-check-URL-list>http://www.apache.org</network-check-URL-list> -->

<network-check-period>10000</network-check-period>
<network-check-timeout>1000</network-check-timeout>

<!-- this is a comma separated list, no spaces, just DNS or IPs
     it should accept IPV6

     Warning: Make sure you understand your network topology as this is meant to check if your network is up.
     Using IPs that could eventually disappear or be partially visible may defeat the purpose.
     You can use a list of multiple IPs, any successful ping will make the server OK to continue running -->
<network-check-list>10.0.0.1</network-check-list>

<!-- use this to customize the ping used for ipv4 addresses -->
<network-check-ping-command>ping -c 1 -t %d %s</network-check-ping-command>

<!-- use this to customize the ping used for ipv6 addresses -->
<network-check-ping6-command>ping6 -c 1 %2$s</network-check-ping6-command>
----

Once you lose connectivity towards `10.0.0.1` in the given example, the broker will log something like this:

----
09:49:24,562 WARN [org.apache.activemq.artemis.core.server.NetworkHealthCheck] Ping Address /10.0.0.1 wasn't reacheable
09:49:36,577 INFO [org.apache.activemq.artemis.core.server.NetworkHealthCheck] Network is unhealthy, stopping service ActiveMQServerImpl::serverUUID=04fd5dd8-b18c-11e6-9efe-6a0001921ad0
09:49:36,625 INFO [org.apache.activemq.artemis.core.server] AMQ221002: Apache ActiveMQ Artemis Message Broker version 1.6.0 [04fd5dd8-b18c-11e6-9efe-6a0001921ad0] stopped, uptime 14.787 seconds
09:50:00,653 WARN [org.apache.activemq.artemis.core.server.NetworkHealthCheck] ping: sendto: No route to host
09:50:10,656 WARN [org.apache.activemq.artemis.core.server.NetworkHealthCheck] Host is down: java.net.ConnectException: Host is down
  at java.net.Inet6AddressImpl.isReachable0(Native Method) [rt.jar:1.8.0_73]
  at java.net.Inet6AddressImpl.isReachable(Inet6AddressImpl.java:77) [rt.jar:1.8.0_73]
  at java.net.InetAddress.isReachable(InetAddress.java:502) [rt.jar:1.8.0_73]
  at org.apache.activemq.artemis.core.server.NetworkHealthCheck.check(NetworkHealthCheck.java:295) [artemis-commons-1.6.0-SNAPSHOT.jar:1.6.0-SNAPSHOT]
  at org.apache.activemq.artemis.core.server.NetworkHealthCheck.check(NetworkHealthCheck.java:276) [artemis-commons-1.6.0-SNAPSHOT.jar:1.6.0-SNAPSHOT]
  at org.apache.activemq.artemis.core.server.NetworkHealthCheck.run(NetworkHealthCheck.java:244) [artemis-commons-1.6.0-SNAPSHOT.jar:1.6.0-SNAPSHOT]
  at org.apache.activemq.artemis.core.server.ActiveMQScheduledComponent$2.run(ActiveMQScheduledComponent.java:189) [artemis-commons-1.6.0-SNAPSHOT.jar:1.6.0-SNAPSHOT]
  at org.apache.activemq.artemis.core.server.ActiveMQScheduledComponent$3.run(ActiveMQScheduledComponent.java:199) [artemis-commons-1.6.0-SNAPSHOT.jar:1.6.0-SNAPSHOT]
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_73]
  at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [rt.jar:1.8.0_73]
  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [rt.jar:1.8.0_73]
  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [rt.jar:1.8.0_73]
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_73]
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_73]
  at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_73]
----

Once you reestablish your network connections towards the configured check-list:

----
09:53:23,461 INFO [org.apache.activemq.artemis.core.server.NetworkHealthCheck] Network is healthy, starting service ActiveMQServerImpl::
09:53:23,462 INFO [org.apache.activemq.artemis.core.server] AMQ221000: primary Message Broker is starting with configuration Broker Configuration (clustered=false,journalDirectory=./data/journal,bindingsDirectory=./data/bindings,largeMessagesDirectory=./data/large-messages,pagingDirectory=./data/paging)
09:53:23,462 INFO [org.apache.activemq.artemis.core.server] AMQ221013: Using NIO Journal
09:53:23,462 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-server]. Adding protocol support for: CORE
09:53:23,463 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-amqp-protocol]. Adding protocol support for: AMQP
09:53:23,463 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-hornetq-protocol]. Adding protocol support for: HORNETQ
09:53:23,463 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-mqtt-protocol]. Adding protocol support for: MQTT
09:53:23,464 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-openwire-protocol]. Adding protocol support for: OPENWIRE
09:53:23,464 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-stomp-protocol]. Adding protocol support for: STOMP
09:53:23,541 INFO [org.apache.activemq.artemis.core.server] AMQ221003: Deploying queue jms.queue.DLQ
09:53:23,541 INFO [org.apache.activemq.artemis.core.server] AMQ221003: Deploying queue jms.queue.ExpiryQueue
09:53:23,549 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at 0.0.0.0:61616 for protocols [CORE,MQTT,AMQP,STOMP,HORNETQ,OPENWIRE]
09:53:23,550 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at 0.0.0.0:5445 for protocols [HORNETQ,STOMP]
09:53:23,554 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at 0.0.0.0:5672 for protocols [AMQP]
09:53:23,555 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at 0.0.0.0:1883 for protocols [MQTT]
09:53:23,556 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at 0.0.0.0:61613 for protocols [STOMP]
09:53:23,556 INFO [org.apache.activemq.artemis.core.server] AMQ221007: Server is now active
09:53:23,556 INFO [org.apache.activemq.artemis.core.server] AMQ221001: Apache ActiveMQ Artemis Message Broker version 1.6.0 [0.0.0.0, nodeID=04fd5dd8-b18c-11e6-9efe-6a0001921ad0]
----

[IMPORTANT]
====
Make sure you understand your network topology as this is meant to validate your network.
Using IPs that could eventually disappear or be partially visible may defeat the purpose.
You can use a list of multiple IPs.
Any successful ping will allow the server to continue running.
====