Clarify remote clusters' use of transport layer (#60268)
Today there are a few places in the transport layer docs where we talk about communication between nodes _within a cluster_. We also use the transport layer for remote cluster connections, and these statements also apply there, but this is not clear from today's docs. This commit generalises these statements to make it clear that they apply to remote cluster connections too. It also adds a link from the docs on configuring TCP retries to the (deeply-buried) docs on preserving long-lived connections.
This commit is contained in:
parent
0778274b72
commit
cf0cab614d
|
@ -10,10 +10,10 @@ to a limited number of nodes in that remote cluster. There are two modes for
|
||||||
remote cluster connections: <<sniff-mode,sniff mode>> and
|
remote cluster connections: <<sniff-mode,sniff mode>> and
|
||||||
<<proxy-mode,proxy mode>>.
|
<<proxy-mode,proxy mode>>.
|
||||||
|
|
||||||
All the communication required between different clusters
|
Communication with a remote cluster uses the <<modules-transport,transport
|
||||||
goes through the <<modules-transport,transport layer>>. Remote cluster
|
layer>> to establish a number of <<long-lived-connections,long-lived>> TCP
|
||||||
connections consist of uni-directional connections from the coordinating
|
connections from the coordinating nodes of the local cluster to the chosen
|
||||||
node to the remote remote connections.
|
nodes in the remote cluster.
|
||||||
|
|
||||||
[discrete]
|
[discrete]
|
||||||
[[sniff-mode]]
|
[[sniff-mode]]
|
||||||
|
|
|
@ -1,19 +1,13 @@
|
||||||
[[modules-transport]]
|
[[modules-transport]]
|
||||||
=== Transport
|
=== Transport
|
||||||
|
|
||||||
The transport networking layer is used for internal communication between nodes
|
REST clients send requests to your {es} cluster over <<modules-http,HTTP>>, but
|
||||||
within the cluster. Each call that goes from one node to the other uses
|
the node that receives a client request cannot always handle it alone and must
|
||||||
the transport layer (for example, when an HTTP GET request is processed
|
normally pass it on to other nodes for further processing. It does this using
|
||||||
by one node, and should actually be processed by another node that holds
|
the transport networking layer. The transport layer is used for all internal
|
||||||
the data). The transport module is also used for the `TransportClient` in the
|
communication between nodes within a cluster, all communication with the nodes
|
||||||
{es} Java API.
|
of a <<modules-remote-clusters,remote cluster>>, and also by the
|
||||||
|
`TransportClient` in the {es} Java API.
|
||||||
The transport mechanism is completely asynchronous in nature, meaning
|
|
||||||
that there is no blocking thread waiting for a response. The benefit of
|
|
||||||
using asynchronous communication is first solving the
|
|
||||||
http://en.wikipedia.org/wiki/C10k_problem[C10k problem], as well as
|
|
||||||
being the ideal solution for scatter (broadcast) / gather operations such
|
|
||||||
as search in Elasticsearch.
|
|
||||||
|
|
||||||
[[transport-settings]]
|
[[transport-settings]]
|
||||||
==== Transport settings
|
==== Transport settings
|
||||||
|
@ -138,17 +132,18 @@ configured, and defaults otherwise to `transport.tcp.reuse_address`.
|
||||||
[[long-lived-connections]]
|
[[long-lived-connections]]
|
||||||
===== Long-lived idle connections
|
===== Long-lived idle connections
|
||||||
|
|
||||||
Elasticsearch opens a number of long-lived TCP connections between each pair of
|
A transport connection between two nodes is made up of a number of long-lived
|
||||||
nodes in the cluster, and some of these connections may be idle for an extended
|
TCP connections, some of which may be idle for an extended period of time.
|
||||||
period of time. Nonetheless, Elasticsearch requires these connections to remain
|
Nonetheless, Elasticsearch requires these connections to remain open, and it
|
||||||
open, and it can disrupt the operation of the cluster if any inter-node
|
can disrupt the operation of your cluster if any inter-node connections are
|
||||||
connections are closed by an external influence such as a firewall. It is
|
closed by an external influence such as a firewall. It is important to
|
||||||
important to configure your network to preserve long-lived idle connections
|
configure your network to preserve long-lived idle connections between
|
||||||
between Elasticsearch nodes, for instance by leaving `tcp.keep_alive` enabled
|
Elasticsearch nodes, for instance by leaving `tcp.keep_alive` enabled and
|
||||||
and ensuring that the keepalive interval is shorter than any timeout that might
|
ensuring that the keepalive interval is shorter than any timeout that might
|
||||||
cause idle connections to be closed, or by setting `transport.ping_schedule` if
|
cause idle connections to be closed, or by setting `transport.ping_schedule` if
|
||||||
keepalives cannot be configured.
|
keepalives cannot be configured. Devices which drop connections when they reach
|
||||||
|
a certain age are a common source of problems to Elasticsearch clusters, and
|
||||||
|
must not be used.
|
||||||
|
|
||||||
[[request-compression]]
|
[[request-compression]]
|
||||||
===== Request compression
|
===== Request compression
|
||||||
|
|
|
@ -2,8 +2,9 @@
|
||||||
=== TCP retransmission timeout
|
=== TCP retransmission timeout
|
||||||
|
|
||||||
Each pair of nodes in a cluster communicates via a number of TCP connections
|
Each pair of nodes in a cluster communicates via a number of TCP connections
|
||||||
which remain open until one of the nodes shuts down or communication between
|
which <<long-lived-connections,remain open>> until one of the nodes shuts down
|
||||||
the nodes is disrupted by a failure in the underlying infrastructure.
|
or communication between the nodes is disrupted by a failure in the underlying
|
||||||
|
infrastructure.
|
||||||
|
|
||||||
TCP provides reliable communication over occasionally-unreliable networks by
|
TCP provides reliable communication over occasionally-unreliable networks by
|
||||||
hiding temporary network disruptions from the communicating applications. Your
|
hiding temporary network disruptions from the communicating applications. Your
|
||||||
|
@ -36,14 +37,23 @@ To set this value permanently, update the `net.ipv4.tcp_retries2` setting in
|
||||||
`/etc/sysctl.conf`. To verify after rebooting, run
|
`/etc/sysctl.conf`. To verify after rebooting, run
|
||||||
`sysctl net.ipv4.tcp_retries2`.
|
`sysctl net.ipv4.tcp_retries2`.
|
||||||
|
|
||||||
{es} also implements its own health checks with timeouts that are much shorter
|
|
||||||
than the default retransmission timeout on Linux. However these health checks
|
|
||||||
must allow for application-level effects such as garbage collection pauses. We
|
|
||||||
do not recommend reducing any timeouts related to these application-level
|
|
||||||
health checks.
|
|
||||||
|
|
||||||
IMPORTANT: This setting applies to all TCP connections and will affect the
|
IMPORTANT: This setting applies to all TCP connections and will affect the
|
||||||
reliability of communication with systems outside your cluster too. If your
|
reliability of communication with systems outside your cluster too. If your
|
||||||
cluster communicates with external systems over an unreliable network then you
|
cluster communicates with external systems over an unreliable network then you
|
||||||
may need to select a higher value for `net.ipv4.tcp_retries2`. For this reason,
|
may need to select a higher value for `net.ipv4.tcp_retries2`. For this reason,
|
||||||
{es} does not adjust this setting automatically.
|
{es} does not adjust this setting automatically.
|
||||||
|
|
||||||
|
==== Related configuration
|
||||||
|
|
||||||
|
{es} also implements its own internal health checks with timeouts that are much
|
||||||
|
shorter than the default retransmission timeout on Linux. Since these are
|
||||||
|
application-level health checks their timeouts must allow for application-level
|
||||||
|
effects such as garbage collection pauses. You should not reduce any timeouts
|
||||||
|
related to these application-level health checks.
|
||||||
|
|
||||||
|
You must also ensure your network infrastructure does not interfere with the
|
||||||
|
long-lived connections between nodes, <<long-lived-connections,even if those
|
||||||
|
connections appear to be idle>>. Devices which drop connections when they reach
|
||||||
|
a certain age are a common source of problems to Elasticsearch clusters, and
|
||||||
|
must not be used.
|
||||||
|
|
||||||
|
|