Using a ThreadLocal for the audit user information works in most cases,
but it can fail when dispatching messages to consumers because threads
are taken out of a pool to do the dispatching and those threads may not
be associated with the proper credentials. This commit fixes that
problem with the following changes:
- Passes the Subject explicitly when logging audit info during dispatch
- Relocates security audit logging from the SecurityManager
implementation(s) to the SecurityStore implementation
- Associates the Subject with the connection properly with the new
security caching
- Fixed an issue where I needed to set connection to null after closing it
- Added more tests on QpidDispatchPeerTest (tests i would have done manually, and reproduced a few issues along the way)
- Adding a paragraph about addressing and distinct queue names
- Renaming match on peers, senders and receivers as "address-match"
- Changing qpid dispatch test to use a single listener
- Fixing reconnect attemps message
It add additional required fixes:
- Fixed uncommitted deleted tx records
- Fixed JDBC authorization on test
- Using property-based version for commons-dbcp2
- stopping thread pool after activation to allow JDBC lease locks to release the lock
- centralize JDBC network timeout configuration and save repeating it
- adding dbcp2 as the default pooled DataSource to be used
Replaces direct jdbc connections with dbcp2 datasource. Adds
configuration options to use alternative datasources and to alter the
parameters. While adding slight overhead, this vastly improves the
management and pooling capabilities with db connections.
This reverts commit dbb3a90fe6.
The org.apache.activemq.artemis.core.server.Queue#getRate method is for
slow-consumer detection and is designed for internal use only.
Furthermore, it's too opaque to be trusted by a remote user as it only
returns the number of message added to the queue since *the last time
it was called*. The problem here is that the user calling it doesn't
know when it was invoked last. Therefore, they could be getting the
rate of messages added for the last 5 minutes or the last 5
milliseconds. This can lead to inconsistent and misleading results.
There are three main ways for users to track rates of message
production and consumption:
1. Use a metrics plugin. This is the most feature-rich and flexible
way to track broker metrics, although it requires tools (e.g.
Prometheus) to store the metrics and display them (e.g. Grafana).
2. Invoke the getMessageCount() and getMessagesAdded() management
methods and store the returned values along with the time they were
retrieved. A time-series database is a great tool for this job. This is
exactly what tools like Prometheus do. That data can then be used to
create informative graphs, etc. using tools like Grafana. Of course, one
can skip all the tools and just do some simple math to calculate rates
based on the last time the counts were retrieved.
3. Use the broker's message counters. Message counters are the broker's
simple way of providing historical information about the queue. They
provide similar results to the previous solutions, but with less
flexibility since they only track data while the broker is up and
there's not really any good options for graphing.
The queue is missing access to the server,
recent changed functionality on temporary queues namespace needed
the server and now the unit test has to pass in the reference to fix the test.
In a cluster scenario where non durable subscribers fail over to
backup while another live node forwarding messages to it,
there is a chance that the the live node keeps the old remote
binding for the subs and messages go to those
old remote bindings will result in "binding not found".
Both authentication and authorization will hit the underlying security
repository (e.g. files, LDAP, etc.). For example, creating a JMS
connection and a consumer will result in 2 hits with the *same*
authentication request. This can cause unwanted (and unnecessary)
resource utilization, especially in the case of networked configuration
like LDAP.
There is already a rudimentary cache for authorization, but it is
cleared *totally* every 10 seconds by default (controlled via the
security-invalidation-interval setting), and it must be populated
initially which still results in duplicate auth requests.
This commit optimizes authentication and authorization via the following
changes:
- Replace our home-grown cache with Google Guava's cache. This provides
simple caching with both time-based and size-based LRU eviction. See more
at https://github.com/google/guava/wiki/CachesExplained. I also thought
about using Caffeine, but we already have a dependency on Guava and the
cache implementions look to be negligibly different for this use-case.
- Add caching for authentication. Both successful and unsuccessful
authentication attempts will be cached to spare the underlying security
repository as much as possible. Authenticated Subjects will be cached
and re-used whenever possible.
- Authorization will used Subjects cached during authentication. If the
required Subject is not in the cache it will be fetched from the
underlying security repo.
- Caching can be disabled by setting the security-invalidation-interval
to 0.
- Cache sizes are configurable.
- Management operations exist to inspect cache sizes at runtime.
This is allowing journal appends to happen in burst
during replication, by batching replication response
into the network at the end of the append burst.
I couldn't reproduce this with a test, but static code analysis led me
to this solution which is similar to the fix done for ARTEMIS-2592 via
e397a17796.
1 of 2) - Porting of HORNETMQ-1575
In a live-backup scenario, when live is down and backup becomes live, clients
using HA Connection Factories can failover automatically. However if a
client decides to create a new connection by itself (as in camel jms case)
there is a chance that the new connection is pointing to the dead live
and the connection won't be successful. The reason is that if the old
connection is gone the backup will not get a chance to announce itself
back to client so it fails on initial connection.
The fix is to let CF remember the old topology and use it on any
initial connection attempts.
Since getDiskStoreUsage on the ActiveMQServerControl is converting a
double to a long the value is always 0 in the management API. It should
return a double instead.
Adding this metric required moving the meter registration code from the
AddressInfo class to the ManagementService in order to get clean access
to both the AddressInfo and AddressControl classes.
The calculation used by
ActiveMQServerControlImpl.getDiskStoreUsagePercentage() is incorrect. It
uses disk space info with global-max-size which is for address memory.
Also, the existing getDiskStoreUsage() method *already* returns a
percentage of total disk store usage so this method seems redundant.
Now it is possible to reset queue parameters to their defaults by removing them
from broker.xml and redeploying the configuration.
Originally this PR covered the "filter" parameter only.
ORIG message propertes like _AMQ_ORIG_ADDRESS are added to messages
during various broker operations (e.g. diverting a message, expiring a
message, etc.). However, if multiple operations try to set these
properties on the same message (e.g. administratively moving a message
which eventually gets sent to a dead-letter address) then important
details can be lost. This is particularly problematic when using
auto-created dead-letter or expiry resources which use filters based on
_AMQ_ORIG_ADDRESS and can lead to message loss.
This commit simply over-writes the existing ORIG properties rather than
preserving them so that the most recent information is available.
- when sending messages to DLQ or Expiry we now use x-opt legal names
- we now support filtering thorugh annotations if using m. as a prefix.
- enabling hyphenated_props: to allow m. as a prefix
DivertBindings are now properly cleaned up when a queue binding is
removed that matches the divert. The correct key is now used to remove
the queue address from the set and the correct address is now used to
remove the remote consumer.
Test fails with the primary server being killed by the crash and the backup server is killed
by the tearDown before ScaleDownHandler can kick in. This commit adds a wait method to allow
ScaleDownHandler to process before the test completes.
AmqpExpiredMessageTest will expire messages, eventually the counter could be 0
so it is invalid to assertEquals(1, queue.getMessageCount()) as it will be 0 eventually.
This one should improve eventual failures on ClusterConnectionControlTest
to validate this, run ClusterConnectionControlTest::testNotifications in a loop
you will see eventually the test taking longer to shutdown the executor as the call could be blocked.
This won't be an issue on a real server (Production system)
however, on the testsuite or while embedded this could cause issues,
if a same instance is stopped then started.
This is the reason why BackupAuthenticationTest was intermittently failing.
I also adapted the test since I would need to stop the server and reactivate it in order to change the configuration.
The previous test wasn't acting like a real server.
This commit does the following:
- Deprecates existing overloaded createQueue, createSharedQueue,
createTemporaryQueue, & updateQueue methods for ClientSession,
ServerSession, ActiveMQServer, & ActiveMQServerControl where
applicable.
- Deprecates QueueAttributes, QueueConfig, & CoreQueueConfiguration.
- Deprecates existing overloaded constructors for QueueImpl.
- Implements QueueConfiguration with JavaDoc to be the single,
centralized configuration object for both client-side and broker-side
queue creation including methods to convert to & from JSON for use in
the management API.
- Implements new createQueue, createSharedQueue & updateQueue methods
with JavaDoc for ClientSession, ServerSession, ActiveMQServer, &
ActiveMQServerControl as well as a new constructor for QueueImpl all
using the new QueueConfiguration object.
- Changes all internal broker code to use the new methods.
Due to the changes in 6b5fff40cb the
config parameter message-expiry-thread-priority is no longer needed. The
code now uses a ScheduledExecutorService and a thread pool rather than
dedicating a thread 100% to the expiry scanner. The pool's size can be
controlled via scheduled-thread-pool-max-size.
Using a property on AMQPLargeMessage instead of a ThreadLocal
This was causing issues on the journal as the message may transverse different threads on the journal.
There is no guarantee that the encodeSize size is the same in AMQP right after read.
As the protocol may add additional bytes right after decoded such as header, extra properties.. etc.
The drain control has to immediately flush
otherwise a next flow control event may remove the previous status from Proton.
So, this really cannot wait the next executor, and it has to be done immediately.
Historically speaking, all message properties starting with AMQ HDR
would not be passed to OpenWire messages. However, that blocked the
properties from management notifications so ARTEMIS-1209 was raised and
the solution there was to pass properties that started with _AMQ *if*
the consumer was connected to the management notification address.
However, in this case messages are diverted to a different address so
this check fails and the properties are removed. My solution will be to
check the message itself to see if it has the _AMQ_NotifType property
(which all notification messages do) rather than checking where the
consumer is connected.
In case there is a hardware, firewal or any other thing making the UDP connection to go deaf
we will now reopen the connection in an attempt to go over possible issues.
This is also improving locking around DiscoveryGroup initial connection.
This is a Large commit where I am refactoring largeMessage Body out of CoreMessage
which is now reused with AMQP.
I had also to fix Reference Counting to fix how Large Messages are Acked
And I also had to make sure Large Messages are transversing correctly when in cluster.
Update netty version to 4.1.43.Final and netty-tcnative version to 2.0.26.Final.
Change restricted-security-client.policy because Netty 4.1.43.Final requires
access to two more files: /etc/os-release and /usr/lib/os-release.
There is an optimization in AMQP, that properties are only parsed over demand.
It happens that after ARTEMIS-2294 (commit 2dd0671698),
every send would request for the property on the message, resulting the properties to always be parsed upon send.
Even when there's no use of application properties.
This is a surprisingly large change just to fix some log messages, but
the changes were necessary in order to get the relevant data to where it
was being logged. The fact that the data wasn't readily available is
probably why it wasn't logged in the first place.
Add a paged message to the tail, when the QueueIterateAction doesn't handle it, to avoid removing unhandled paged message. Move the refRemoved calls from the QueueIterateActions to the iterQueue to fix the queue stats.
When AMQPMessages are redistributed from one node to
another, the internal property of message is not
cleaned up and this causes a message to be routed
to a same queue more than once, causing duplicated
messages.
This commit introduces the ability to configure a downstream connection
for federation. This works by sending information to the remote broker
and that broker will parse the message and create a new upstream back
to the original broker.
A new feature to preserve messages sent to an address for queues that will be
created on the address in the future. This is essentially equivalent to the
"retroactive consumer" feature from 5.x. However, it's implemented in a way
that fits with the address model of Artemis.
The parameter failoverOnInitialConnection wouldn't seem to be used and
makes no sense any more, because the connectors are retried in a loop.
So someone can just add the backup in the initial connection.
In LargeMessageImpl.copy(long) it need to open the underlying
file in order to read and copy bytes into the new copied message.
However there is a chance that another thread can come in and close
the file in the middle, making the copy failed
with "channel is null" error.
This is happening in cases where a large message is sent to a jms
topic (multicast address). During delivery it to multiple
subscribers, some consumer is doing delivery and closed the
underlying file after. Some other consumer is rolling back
the messages and eventually move it to DLQ (which will call
the above copy method). So there is a chance this bug being hit on.
The crititical analyser trigger the broker shutdown if try to
removeAllMessages with a huge queue. The iterQueue is split so as
not to keep the lock too time.
It use RandomAccessFile to allow using heap buffers without additional
copies and/or leaks of direct buffers, as performed by FileChannel JDK
implementation (see https://bugs.openjdk.java.net/browse/JDK-8147468)
When an openwire client closes the session, the broker doesn't
clean up its server consumer references even though the core
consumers are closed. This results a leak when sessions within
a connection are created and closed when the connection keeps open.
It would fail on cannot destroy queue
as the failure could be asynchronous, introducing a quick race, which is acceptable
you just need to make sure the async operation will finish before removing the queue.
Fix is to introduce a Wait.assertEquals call.
After a node is scaled down to a target node, the sf queue in the
target node is not deleted.
Normally this is fine because may be reused when the scaled down
node is back up.
However in cloud environment many drainer pods can be created and
then shutdown in order to drain the messages to a live node (pod).
Each drainer pod will have a different node-id. Over time the sf
queues in the target broker node grows and those sf queues are
no longer reused.
Although use can use management API/console to manually delete
them, it would be nice to have an option to automatically delete
those sf queue/address resources after scale down.
In this PR it added a boolean configuration parameter called
cleanup-sf-queue to scale down policy so that if the parameter
is "true" the broker will send a message to the
target broker signalling that the SF queue is no longer
needed and should be deleted.
If the parameter is not defined (default) or is "false"
the scale down won't remove the sf queue.
When converting from AMQP to core and back again support annotations that
aren't able to be placed into Core message properties by storing the bytes
from encoding the types to AMQP encodings and then decoding them again
when converting back into AMQP messages.
Requires update to proton-j 0.33.2 for encoding fix
The core server session tracks details about producers like what
addresses have had messages sent to them, the most recent message ID
sent to each address, and the number of messages sent to each address.
This information is made available to users via the
listProducersInfoAsJSON method on the various management interfaces
(JMX, web console, etc.). However, in situations where a server session
is long lived (e.g. in a pool) and is used to send to many different
addresses (e.g. randomly named temporary JMS queues) this info can
accumulate to a problematic degree. Therefore, we should limit the
amount of producer details saved by the session.
this test was basically broken, it was silently failing as it was ignoring results and taking a long time to finish.
As this test is multiplied along many options (Netty, Replicated, JDBC) this was taking considerable extra time
on the testsuite.
Wait netty event loop group shutdown to avoid too many opened FDs after
server stops, when netty configuration is used. Clear server
activateCallbacks to avoid reactivation of previous nodeManager and
consequent FD leaks on restart. Fix LargeServerMessageImpl.copy to avoid
FD leaks when a large message expiry or it is sent to DLA. Terminate
HawtDispatcher global queue to avoid pipes and eventpolls leaks after a
MQTT test.
cherry-picking commit 9617058ba0649af4eea15ce8793f86de827c4b7f
NO-JIRA adding check for open FD on the testsuite
cherry-picking commit 0facb7ddf4d3baa14a3add4290684aff7fd46053
NO-JIRA addressing connections leaks on integration tests
If a jms client (be it openwire, amqp, or core jms) receives a message that
is from a different protocol, the JMSMessageID maybe null when the
jms client expects it.
Add max record size check before adding a record to prevent that the
broker shuts down, when there is one really large header sent with the
message. Add message size check before allocating large message resource
if it can't be stored.