druid/integration-tests-ex/docs/travis.md

169 lines
7.9 KiB
Markdown

<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
# Travis Integration
Apache Druid uses Travis to manage builds, including running the integration
tests. You can find the Travis build file at `$DRUID_DEV/.travis.yml`, where
`DRUID_DEV` is the root of your Druid development directory. Information
about Travis can be found at:
* [Documentation](https://docs.travis-ci.com/)
* [Job lifecycle](https://docs.travis-ci.com/user/job-lifecycle/)
* [Environment variables](https://docs.travis-ci.com/user/environment-variables/)
* [Travis file reference](https://config.travis-ci.com/)
* [Travis YAML](https://docs.travis-ci.com/user/build-config-yaml)
## Running ITs In Travis
Travis integration is still experimental. The latest iteration is:
```yaml
- name: "experimental docker tests"
stage: Tests - phase 1
script: ${MVN} install -P test-image,docker-tests -rf :it-tools ${MAVEN_SKIP} -DskipUTs=true
after_failure:
- docker-tests/check-results.sh
```
The above is a Travis job definition. The job "inherits" an `install` task defined
earlier in the file. That install task builds all of Druid and creates the distribution
tarball. Since the tests are isolated in specialized Maven profiles, the `install`
task does not build any of the IT-related artifacts.
We've placed the test run in "Phase 1" for debugging convenience. Later, the tests
will run in "Phase 2" along with the other ITs. Once conversion is complete, the
"previous generation" IT tests will be replaced by the newer revisions.
The `script` runs the ITs. The components of the command line are:
* `install` - Run Maven though the install [lifecycle phase](
https://maven.apache.org/guides/introduction/introduction-to-the-lifecycle.html)
for each module. This allows us to build and install the "testing tools"
(see the [Maven notes](maven.md)). The test image is also built during the
`install` phase. The tests themselves only need the `verify` phase, which occurs
before `install`. `install` does nothing for ITs.
* `-P test-image,docker-tests` - activates the image to build the image
(`test-image`) and then runs the ITs (`docker-tests`).
* `-rf :it-tools` - The `it-tools` module is the first of the IT modules: it contains
the "testing tools" added into the image. Using `-rf` skips all the other projects
which we already built in the Travis `install` step. Doing so saves the time
otherwise required for Maven to figure out it has nothing to do for those modules.
* `${MAVEN_SKIP}` - Omits the static checks: they are not needed for ITs.
* `-DskipUTs=true` - The ITs use the [Maven Failsafe plugin](
https://maven.apache.org/surefire/maven-failsafe-plugin/index.html)
which shares code with the [Maven Surefire plugin](
https://maven.apache.org/surefire/maven-surefire-plugin/index.html). We don't want
to run unit tests. If we did the usual `-DskipTests`, then we'd also disable the
ITs. The `-DskipUTs=true` uses a bit of [Maven trickery](
https://stackoverflow.com/questions/6612344/prevent-unit-tests-but-allow-integration-tests-in-maven)
to skip only the Surefire, but not Faisafe tests.
## Travis Diagnostics
A common failure when running ITs is that they uncover a bug in a Druid service;
typically in the code you added that you want to test. Or, if you are changing the
Docker or Docker Compose infratructure, then the tests will often fail because the
Druid services are mis-configured. (Bad configuration tends to result in services
that don't start, or start and immediately exit.)
The standard way to diagnose such failures is to look at the Druid logs. However,
Travis provides no support for attaching files to a build. The best alternative
seems to be to upload the files somewhere else. As a compromise, the Travis build
will append to the build log a subset of the Druid logs.
Travis has a limit of 4MB per build log, so we can't append the entire log for
every Druid service for every IT. We have to be selective. In most cases, we only
care about the logs for ITs that fail.
Now, it turns out to be *very hard* indeed to capture failues! Eventually, we want
Maven to run many ITs for each test run: we need to know which failed. Each IT
creates its own "shared" directory, so to find the logs, we need to know which IT
failed. Travis does not have this information: Travis only knows that Maven itself
exited with a non-zero status. Maven doesn't know: it only knows that Failsafe
failed the build. Failsafe is designed to run all ITs, then check the results in
the `verify` state, so Maven doesn't even know about the failures.
### Failsafe Error Reports
To work around all this, we mimic Failsafe: we look at the Failsafe error report
in `$DRUID_DEV/docker-tests/<module>/target/failsafe-reports/failsafe-summary.xml`
which looks like this:
```xml
<failsafe-summary ... result="null" timeout="false">
<completed>3</completed>
<errors>1</errors>
<failures>0</failures>
<skipped>0</skipped>
<failureMessage xsi:nil="true" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"/>
</failsafe-summary>
```
The above shows one error and no failures. A successful run will show 0 for the
`errors` tag. This example tells us "something didn't work". The corresponding
Druid service logs are candidates for review.
### Druid Service Failures
The Druid logs are in `$DRUID_DEV/docker-tests/<module>/target/shared/logs`.
We could append all of them, but recall the 4MB limit. We generally are
interested only in those services that failed. So, we look at the logs and
see that a successful run is indicated by a normal Lifecycle shutdown:
```text
2022-04-16T20:54:37,997 INFO [Thread-56] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [INIT]
```
The key bit of text is:
```text
Stopping lifecycle [module] stage [INIT]
```
This says that 1) we're shutting down the lifecycle (which means no exception was thrown),
and 2) that we got all the way to the end (`[INIT]`). Since Druid emits no final
"exited normally" message, we take the above as the next-best thing.
So, we only care about logs that *don't* have the above line. For those, we want to
append the log to the build output. Or, because of the size limit, we append the
last 100 lines.
All of this is encapsulated in the `docker-tests/check-results.sh` script which
is run if the build fails (in the `after_failure`) tag.
### Druid Log Output
For a failed test, the build log will end with something like this:
```text
======= it-high-availability Failed ==========
broker.log logtail ========================
022-04-16T03:53:10,492 INFO [CoordinatorRuleManager-Exec--0] org.apache.druid.discovery.DruidLeaderClient - Request[http://coordinator-one:8081/druid/coordinator/v1/rules] received redirect response to location [http://coordinator-two:8081/druid/coordinator/v1/rules].
...
```
To keep below the limit, on the first failed test is reported.
The above won't catch all cases: maybe the service exited normally, but might still have
log lines of interest. Since all tests run, those lines could be anywhere in the file
and the scripts can't know which might be of interest. To handle that, we either
have to upload all logs somewhere, or you can use the convenience of the new
IT framework to rerun the tests on your development machine.