# Travis Integration
Apache Druid uses Travis to manage builds, including running the integration
tests. You can find the Travis build file at `$DRUID_DEV/.travis.yml`, where
`DRUID_DEV` is the root of your Druid development directory. Information
about Travis can be found in the [Travis CI documentation](https://docs.travis-ci.com/).
## Running ITs In Travis
Travis integration is still experimental. The latest iteration is:
- name: "experimental docker tests"
stage: Tests - phase 1
script: ${MVN} install -P test-image,docker-tests -rf :it-tools ${MAVEN_SKIP} -DskipUTs=true
after_failure:
- docker-tests/check-results.sh
The above is a Travis job definition. The job "inherits" an `install` task defined
earlier in the file. That `install` task builds all of Druid and creates the distribution
tarball. Since the tests are isolated in specialized Maven profiles, the `install`
task does not build any of the IT-related artifacts.
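For reference, the `install` task amounts to a plain Maven build of the distribution.
A minimal local equivalent might look like the sketch below, assuming `MVN` and
`MAVEN_SKIP` are set as in `.travis.yml` (the exact flags in the build file may differ):

```bash
# Build all of Druid and the distribution tarball, skipping tests and
# static checks. Because no IT profile is activated, none of the
# IT-related artifacts are built.
${MVN} clean install -q ${MAVEN_SKIP} -DskipTests
```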
We've placed the test run in "Phase 1" for debugging convenience. Later, the tests will run in "Phase 2" along with the other ITs. Once conversion is complete, the "previous generation" IT tests will be replaced by the newer revisions.
The `script` runs the ITs. The components of the command line are as follows;
a sketch of running the same command locally appears after the list:

* `install` - Runs Maven through the `install` lifecycle phase for each module. This
  allows us to build and install the "testing tools" (see the Maven notes). The test
  image is also built during the `install` phase. The tests themselves need only the
  `verify` phase, which occurs before `install`; `install` does nothing extra for the ITs.
* `-P test-image,docker-tests` - Activates the profile that builds the test image
  (`test-image`), then the profile that runs the ITs (`docker-tests`).
* `-rf :it-tools` - The `it-tools` module is the first of the IT modules: it contains
  the "testing tools" added into the image. Using `-rf` skips all the other projects,
  which we already built in the Travis `install` step. Doing so saves the time Maven
  would otherwise spend figuring out that it has nothing to do for those modules.
* `${MAVEN_SKIP}` - Omits the static checks: they are not needed for ITs.
* `-DskipUTs=true` - The ITs use the Maven Failsafe plugin, which shares code with the
  Maven Surefire plugin. We don't want to run unit tests. If we used the usual
  `-DskipTests`, we would also disable the ITs. `-DskipUTs=true` uses a bit of Maven
  trickery to skip only the Surefire tests, not the Failsafe tests.
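Putting the pieces together, roughly the same command can be run locally. The
variable values below are illustrative assumptions, not the authoritative
definitions; check `.travis.yml` for the real `MVN` and `MAVEN_SKIP` settings:

```bash
# Illustrative expansions of the Travis variables; see .travis.yml
# for the authoritative definitions.
MVN=mvn
MAVEN_SKIP="-P skip-static-checks"

# Build the test image and run the ITs, resuming from the it-tools
# module and skipping the unit (Surefire) tests.
${MVN} install -P test-image,docker-tests -rf :it-tools ${MAVEN_SKIP} -DskipUTs=true
```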
## Travis Diagnostics
A common failure when running ITs is that they uncover a bug in a Druid service, typically in the code you added and want to test. Or, if you are changing the Docker or Docker Compose infrastructure, the tests will often fail because the Druid services are misconfigured. (Bad configuration tends to result in services that don't start, or that start and immediately exit.)
The standard way to diagnose such failures is to look at the Druid logs. However, Travis provides no support for attaching files to a build. The best alternative seems to be to upload the files somewhere else. As a compromise, the Travis build will append to the build log a subset of the Druid logs.
Travis has a limit of 4MB per build log, so we can't append the entire log for every Druid service for every IT. We have to be selective. In most cases, we only care about the logs for ITs that fail.
Now, it turns out to be very hard indeed to capture failures! Eventually, we want
Maven to run many ITs for each test run: we need to know which failed. Each IT
creates its own "shared" directory, so to find the logs, we need to know which IT
failed. Travis does not have this information: Travis only knows that Maven itself
exited with a non-zero status. Maven doesn't know either: it only knows that Failsafe
failed the build. Failsafe is designed to run all ITs, then check the results in
the `verify` phase, so Maven doesn't even know about the failures.
## Failsafe Error Reports
To work around all this, we mimic Failsafe: we look at the Failsafe error report
in `$DRUID_DEV/docker-tests/<module>/target/failsafe-reports/failsafe-summary.xml`,
which looks like this:
```xml
<failsafe-summary ... result="null" timeout="false">
    <completed>3</completed>
    <errors>1</errors>
    <failures>0</failures>
    <skipped>0</skipped>
    <failureMessage xsi:nil="true" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"/>
</failsafe-summary>
```
The above shows one error and no failures. A successful run will show 0 for the
`errors` tag. This example tells us "something didn't work", so the corresponding
Druid service logs are candidates for review.
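One quick way to spot the failing modules is to scan each summary for non-zero
counts. The following is a minimal sketch of that idea, not the actual
`check-results.sh`; it assumes the simple one-value-per-line layout shown above:

```bash
#!/usr/bin/env bash
# Report each IT module whose Failsafe summary records errors or failures.
for summary in docker-tests/*/target/failsafe-reports/failsafe-summary.xml; do
  [ -f "$summary" ] || continue
  # Pull the counts out of the <errors> and <failures> elements.
  errors=$(sed -n 's:.*<errors>\([0-9]*\)</errors>.*:\1:p' "$summary")
  failures=$(sed -n 's:.*<failures>\([0-9]*\)</failures>.*:\1:p' "$summary")
  if [ "${errors:-0}" -ne 0 ] || [ "${failures:-0}" -ne 0 ]; then
    echo "Failures recorded in $summary"
  fi
done
```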
## Druid Service Failures
The Druid logs are in `$DRUID_DEV/docker-tests/<module>/target/shared/logs`.
We could append all of them, but recall the 4MB limit. We are generally
interested only in the services that failed. So, we look at the logs and
see that a successful run is indicated by a normal `Lifecycle` shutdown:
```text
2022-04-16T20:54:37,997 INFO [Thread-56] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [INIT]
```
The key bit of text is:

```text
Stopping lifecycle [module] stage [INIT]
```
This says that 1) we're shutting down the lifecycle (which means no exception was thrown),
and 2) we got all the way to the end (`[INIT]`). Since Druid emits no final
"exited normally" message, we take the above as the next-best thing.
So, we only care about logs that don't contain the above line. For those, we want to append the log to the build output; because of the size limit, we append only the last 100 lines.
All of this is encapsulated in the `docker-tests/check-results.sh` script, which
is run if the build fails (in the `after_failure` tag).
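The core of that check can be sketched as follows. This is a simplified
illustration of the approach, not the actual script, and the module name is
just an example:

```bash
#!/usr/bin/env bash
# Append the tail of each Druid service log that lacks a clean
# lifecycle shutdown. A simplified sketch of what check-results.sh does.
MODULE=it-high-availability   # example; substitute the failed module
for log in docker-tests/$MODULE/target/shared/logs/*.log; do
  if ! grep -q 'Stopping lifecycle \[module\] stage \[INIT\]' "$log"; then
    echo "$(basename "$log") logtail ========================"
    tail -n 100 "$log"
  fi
done
```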
## Druid Log Output
For a failed test, the build log will end with something like this:
```text
======= it-high-availability Failed ==========
broker.log logtail ========================
2022-04-16T03:53:10,492 INFO [CoordinatorRuleManager-Exec--0] org.apache.druid.discovery.DruidLeaderClient - Request[http://coordinator-one:8081/druid/coordinator/v1/rules] received redirect response to location [http://coordinator-two:8081/druid/coordinator/v1/rules].
...
```
To stay below the limit, only the first failed test is reported.
The above won't catch all cases: a service might have exited normally, yet still have log lines of interest. Since all tests run, those lines could be anywhere in the file, and the scripts can't know which might be of interest. To handle that case, we either have to upload all logs somewhere, or you can use the convenience of the new IT framework to rerun the tests on your development machine.
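For example, to rerun a single IT module locally after a full build, something
along these lines should work (hedged: the module name comes from the earlier
example, and the test image must already have been built with the `test-image`
profile):

```bash
# Rerun one IT module against a previously built test image. The logs
# then land in docker-tests/it-high-availability/target/shared/logs,
# where they can be inspected directly.
mvn verify -P docker-tests -pl :it-high-availability
```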