= Network Isolation (Split Brain)
:idprefix:
:idseparator: -

A _split brain_ is a condition that occurs when two different brokers are serving the same messages at the same time.
When this happens, instead of all client applications sharing the _same_ broker as they ought, they may become divided between the two split-brain brokers.
This is problematic because it can lead to:

* *Duplicate messages* e.g. when multiple consumers on the same JMS queue split between both brokers and receive the same message(s)
* *Missed messages* e.g. when multiple consumers on the same JMS topic split between both brokers and producers are only sending messages to one broker

Split brain most commonly happens when a pair of brokers in an HA *replication* configuration lose the replication connection linking them together.
When this connection is lost the backup assumes that the primary has died and therefore activates.
At this point there are two brokers on the network which are isolated from each other, and since the backup has a copy of all the messages from the primary they are each serving the same messages.

[IMPORTANT]
.What about shared store configurations?
====
While it is technically possible for split brain to happen with a pair of brokers in an HA _shared store_ configuration it would require a failure in the file-locking mechanism of the storage device which the brokers are sharing.
One of the benefits of using a shared store is that the storage device itself acts as an arbiter to ensure consistency and mitigate split brain.
====

Recovering from a split brain may be as simple as stopping the broker which activated by mistake.
However, this solution is only viable *if* no client application connected to it and performed messaging operations.
The longer client applications are allowed to interact with split-brain brokers the more difficult it will be to understand and remediate the resulting problems.

There are several different configurations you can choose from that will help mitigate split brain.

== Pluggable Lock Manager

A pluggable lock manager configuration requires a 3rd party to establish a shared lock between primary and backup brokers.
The shared lock ensures that either the primary or the backup is active at any given point in time, similar to how the file lock functions in the shared storage use-case.

The _plugin_ decides what 3rd party implementation is used.
It could be something as simple as a shared file on a network file system that supports locking (e.g. NFS) or it could be something more complex like https://etcd.io/[etcd].
The broker ships with a xref:ha.adoc#apache-zookeeper-integration[reference plugin implementation] based on https://zookeeper.apache.org/[Apache ZooKeeper] - a common implementation used for this kind of task.

The main benefit of a pluggable lock manager is that it relieves the broker of the responsibility of establishing a reliable vote.
This means that a _single_ HA pair of brokers can be reliably protected against split-brain.

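For illustration, a rough sketch of what a primary broker's `ha-policy` might look like when using the ZooKeeper-based reference plugin is shown below.
The class name and the `connect-string` value are assumptions made for the sake of the example; the xref:ha.adoc#apache-zookeeper-integration[HA chapter] documents the authoritative configuration.

[,xml]
----
<ha-policy>
   <replication>
      <primary>
         <manager>
            <!-- assumed class name of the ZooKeeper reference plugin; see the HA chapter -->
            <class-name>org.apache.activemq.artemis.lockmanager.zookeeper.CuratorDistributedLockManager</class-name>
            <properties>
               <!-- hypothetical ZooKeeper ensemble -->
               <property key="connect-string" value="zk1:2181,zk2:2181,zk3:2181"/>
            </properties>
         </manager>
      </primary>
   </replication>
</ha-policy>
----

The backup broker would typically declare the same `<manager>` in its corresponding `<backup>` policy so that both sides coordinate through the same 3rd party.
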
== Quorum Voting

Quorum voting is a process by which one node in a cluster can determine whether another node in the cluster is active without directly communicating with that node.
Then the broker initiating the vote can take action based on the result (e.g. shutting itself down to avoid split-brain).

Quorum voting requires the participation of the other _active_ brokers in the cluster.
Of course this requires that there are, in fact, other active brokers in the cluster, which means quorum voting won't work with a single HA pair of brokers.
Furthermore, it won't work with just two HA pairs of brokers either because that's still not enough for a legitimate quorum.
There must be at least three HA pairs to establish a proper quorum with quorum voting.

=== Voting Mechanics

When the replication connection between a primary and backup is lost the backup and/or the primary may initiate a vote.

[IMPORTANT]
====
For a vote to pass a _majority_ of affirmative responses is required.

For example, in a 3 node cluster a vote will pass with 2 affirmatives.
For a 4 node cluster this would be 3 affirmatives and so on.
====

==== Backup Voting

By default, if a backup loses its replication connection to its primary it will activate automatically.
However, it can be configured via the `vote-on-replication-failure` property to initiate a quorum vote in order to decide whether to activate or not.
If this is done then the backup will keep voting until it either receives a vote allowing it to start or it detects that the primary is still active.
In the latter case it will then restart as a backup.

See the section on xref:ha.adoc#replication-configuration[Replication Configuration] for more details on configuration.

==== Primary Voting

By default, if the primary broker loses its replication connection to the backup then it will just carry on and wait for a backup to reconnect and start replicating again.
However, this may mean that the primary remains active even though the backup has activated, so this behavior is configurable via the `vote-on-replication-failure` property.

See the section on xref:ha.adoc#replication-configuration[Replication Configuration] for more details on configuration.
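
For illustration only, a sketch of what enabling this might look like in the primary's policy is shown below.
The element names follow the primary/backup terminology used in this chapter, and the values are assumptions; consult the xref:ha.adoc#replication-configuration[Replication Configuration] section for the authoritative list of properties and their defaults.

[,xml]
----
<ha-policy>
   <replication>
      <primary>
         <!-- assumption: initiate a quorum vote instead of carrying on when the replication connection is lost -->
         <vote-on-replication-failure>true</vote-on-replication-failure>
         <!-- assumption: optional override for the number of brokers considered in the vote -->
         <quorum-size>3</quorum-size>
      </primary>
   </replication>
</ha-policy>
----
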
== Pinging the Network

You may configure one or more addresses in `broker.xml` that will be pinged throughout the life of the server.
The server will stop itself if it can't reach any of the addresses in the list.

If you execute the `create` command using the `--ping` argument you will create a default XML that is ready to be used with network checks:

[,console]
----
$ ./artemis create /myDir/myServer --ping 10.0.0.1
----

This XML will be added to your `broker.xml`:

[,xml]
----
<!--
You can verify the network health of a particular NIC by specifying the <network-check-NIC> element.
<network-check-NIC>theNicName</network-check-NIC>
-->
<!--
Use this to use an HTTP server to validate the network
<network-check-URL-list>http://www.apache.org</network-check-URL-list> -->
<network-check-period>10000</network-check-period>
<network-check-timeout>1000</network-check-timeout>
<!-- this is a comma separated list, no spaces, just DNS or IPs
it should accept IPV6
Warning: Make sure you understand your network topology as this is meant to check if your network is up.
Using IPs that could eventually disappear or be partially visible may defeat the purpose.
You can use a list of multiple IPs, any successful ping will make the server OK to continue running -->
<network-check-list>10.0.0.1</network-check-list>
<!-- use this to customize the ping used for ipv4 addresses -->
<network-check-ping-command>ping -c 1 -t %d %s</network-check-ping-command>
<!-- use this to customize the ping used for ipv6 addresses -->
<network-check-ping6-command>ping6 -c 1 %2$s</network-check-ping6-command>
----
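
As the comments above indicate, the network can also be validated against one or more HTTP URLs instead of (or in addition to) ICMP pings.
For example, a minimal sketch reusing only the elements already shown above, with `http://www.apache.org` standing in for whatever endpoint makes sense in your environment:

[,xml]
----
<!-- check every 10000 milliseconds, giving each individual check 1000 milliseconds to succeed -->
<network-check-period>10000</network-check-period>
<network-check-timeout>1000</network-check-timeout>

<!-- the server keeps running as long as any listed address or URL is reachable -->
<network-check-list>10.0.0.1</network-check-list>
<network-check-URL-list>http://www.apache.org</network-check-URL-list>
----
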
Once you lose connectivity to `10.0.0.1` (the address configured in `network-check-list` above) the broker will log something like this:

----
09:49:24,562 WARN [org.apache.activemq.artemis.core.server.NetworkHealthCheck] Ping Address /10.0.0.1 wasn't reacheable
09:49:36,577 INFO [org.apache.activemq.artemis.core.server.NetworkHealthCheck] Network is unhealthy, stopping service ActiveMQServerImpl::serverUUID=04fd5dd8-b18c-11e6-9efe-6a0001921ad0
09:49:36,625 INFO [org.apache.activemq.artemis.core.server] AMQ221002: Apache ActiveMQ Artemis Message Broker version 1.6.0 [04fd5dd8-b18c-11e6-9efe-6a0001921ad0] stopped, uptime 14.787 seconds
09:50:00,653 WARN [org.apache.activemq.artemis.core.server.NetworkHealthCheck] ping: sendto: No route to host
09:50:10,656 WARN [org.apache.activemq.artemis.core.server.NetworkHealthCheck] Host is down: java.net.ConnectException: Host is down
at java.net.Inet6AddressImpl.isReachable0(Native Method) [rt.jar:1.8.0_73]
at java.net.Inet6AddressImpl.isReachable(Inet6AddressImpl.java:77) [rt.jar:1.8.0_73]
at java.net.InetAddress.isReachable(InetAddress.java:502) [rt.jar:1.8.0_73]
at org.apache.activemq.artemis.core.server.NetworkHealthCheck.check(NetworkHealthCheck.java:295) [artemis-commons-1.6.0-SNAPSHOT.jar:1.6.0-SNAPSHOT]
at org.apache.activemq.artemis.core.server.NetworkHealthCheck.check(NetworkHealthCheck.java:276) [artemis-commons-1.6.0-SNAPSHOT.jar:1.6.0-SNAPSHOT]
at org.apache.activemq.artemis.core.server.NetworkHealthCheck.run(NetworkHealthCheck.java:244) [artemis-commons-1.6.0-SNAPSHOT.jar:1.6.0-SNAPSHOT]
at org.apache.activemq.artemis.core.server.ActiveMQScheduledComponent$2.run(ActiveMQScheduledComponent.java:189) [artemis-commons-1.6.0-SNAPSHOT.jar:1.6.0-SNAPSHOT]
at org.apache.activemq.artemis.core.server.ActiveMQScheduledComponent$3.run(ActiveMQScheduledComponent.java:199) [artemis-commons-1.6.0-SNAPSHOT.jar:1.6.0-SNAPSHOT]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_73]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [rt.jar:1.8.0_73]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [rt.jar:1.8.0_73]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [rt.jar:1.8.0_73]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_73]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_73]
at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_73]
----

Once you reestablish connectivity to the addresses in the configured check list, the broker will restart and log something like this:

----
09:53:23,461 INFO [org.apache.activemq.artemis.core.server.NetworkHealthCheck] Network is healthy, starting service ActiveMQServerImpl::
09:53:23,462 INFO [org.apache.activemq.artemis.core.server] AMQ221000: primary Message Broker is starting with configuration Broker Configuration (clustered=false,journalDirectory=./data/journal,bindingsDirectory=./data/bindings,largeMessagesDirectory=./data/large-messages,pagingDirectory=./data/paging)
09:53:23,462 INFO [org.apache.activemq.artemis.core.server] AMQ221013: Using NIO Journal
09:53:23,462 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-server]. Adding protocol support for: CORE
09:53:23,463 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-amqp-protocol]. Adding protocol support for: AMQP
09:53:23,463 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-hornetq-protocol]. Adding protocol support for: HORNETQ
09:53:23,463 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-mqtt-protocol]. Adding protocol support for: MQTT
09:53:23,464 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-openwire-protocol]. Adding protocol support for: OPENWIRE
09:53:23,464 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-stomp-protocol]. Adding protocol support for: STOMP
09:53:23,541 INFO [org.apache.activemq.artemis.core.server] AMQ221003: Deploying queue jms.queue.DLQ
09:53:23,541 INFO [org.apache.activemq.artemis.core.server] AMQ221003: Deploying queue jms.queue.ExpiryQueue
09:53:23,549 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at 0.0.0.0:61616 for protocols [CORE,MQTT,AMQP,STOMP,HORNETQ,OPENWIRE]
09:53:23,550 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at 0.0.0.0:5445 for protocols [HORNETQ,STOMP]
09:53:23,554 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at 0.0.0.0:5672 for protocols [AMQP]
09:53:23,555 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at 0.0.0.0:1883 for protocols [MQTT]
09:53:23,556 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at 0.0.0.0:61613 for protocols [STOMP]
09:53:23,556 INFO [org.apache.activemq.artemis.core.server] AMQ221007: Server is now active
09:53:23,556 INFO [org.apache.activemq.artemis.core.server] AMQ221001: Apache ActiveMQ Artemis Message Broker version 1.6.0 [0.0.0.0, nodeID=04fd5dd8-b18c-11e6-9efe-6a0001921ad0]
----

[IMPORTANT]
====
Make sure you understand your network topology as this is meant to validate your network.
Using IPs that could eventually disappear or be partially visible may defeat the purpose.

You can use a list of multiple IPs.
Any successful ping will allow the server to continue running.
====