Scott Williams 2129a8b2a0 Update buildpack		2022-11-01 16:23:00 +01:00
hbase-client-bundle	Update buildpack	2022-11-01 16:23:00 +01:00
hbase-mapreduce-bundle	Update buildpack	2022-11-01 16:23:00 +01:00
.blazar.yaml	Update buildpack	2022-11-01 16:23:00 +01:00
.build-jdk8	Add client bundles for hbase2 (#11 )	2022-05-31 10:55:55 -04:00
README.md	Add client bundles for hbase2 (#11 )	2022-05-31 10:55:55 -04:00
pom.xml	Fix client bundle metrics dep issues (#14 )	2022-06-16 16:58:17 -04:00

README.md

hubspot-client-bundles

Bundles up the hbase client in the way that is most friendly to the hubspot dependency trees.

Why?

HBase provides some shaded artifacts, but they don't really work for us for two reasons:

We have little control over what's included in them, so the jars end up being unnecessarily fat and/or leaking dependencies we don't want.
The shaded artifacts have significant class overlaps because one is a superset of the other. This bloats our classpath and makes mvn dependency:analyze complain, which can cause the classic flappy "Unused declared"/"Used undeclared" dependency issue.
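
The overlap described above can be checked directly by comparing jar listings. The following is a minimal sketch, assuming each jar's entries have been saved to a text file with `jar tf`. The function name overlapping_classes and the file names in the usage comment are placeholders, not the names of the actual upstream artifacts.

```shell
# Print the .class entries that appear in both listings.
# $1 and $2 are files holding one jar entry path per line,
# for example the saved output of `jar tf some.jar`.
overlapping_classes() {
  a_sorted=$(mktemp)
  b_sorted=$(mktemp)
  # Keep only class files; sort and de-duplicate so comm can compare them.
  grep '\.class$' "$1" | sort -u > "$a_sorted"
  grep '\.class$' "$2" | sort -u > "$b_sorted"
  # -12 suppresses lines unique to either file, leaving the common ones.
  comm -12 "$a_sorted" "$b_sorted"
  rm -f "$a_sorted" "$b_sorted"
}

# Example usage (jar and file names are placeholders):
#   jar tf first-shaded.jar  > first.txt
#   jar tf second-shaded.jar > second.txt
#   overlapping_classes first.txt second.txt | wc -l
```

A non-zero count means both jars would put the same classes on the classpath if a project depended on both of them.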

One option would be to fix those existing artifacts to work how we'd like. I tried that in hbase2, but it was very complicated without fully redoing how the shading works. Rather than maintain a large rewrite of those poms, I'd rather start fresh with our own artifacts. This will also give us greater flexibility in the future for changing the includes/excludes as we see fit.

Why here?

The other design choice here was to include these artifacts in this repo as opposed to a separate repo. One pain point with developing on hbase has been the number of repos necessary to develop and/or test any change -- the client fork has historically had 2 branches (staging and master), and similarly for hbase-shading. In order to get a branch out there for testing you need to modify two repos. Iterating on those branches is annoying because builds are not automatically in sync.

Putting the bundling here makes it part of the build, so we automatically have client artifacts created for every branch.

One new guiding principle of our forking strategy is to minimize the number of customizations in our forks, instead aiming to get things upstreamed. The goal is to eliminate the tech debt inherent in having to re-analyze, copy patches, handle merge conflicts, etc., every time we upgrade. This module is an exception to that rule -- regardless of where it lives, we will want to be cognizant of dependency changes in new releases. Putting it here gives us the option to bake that process directly into our build and introduces no potential for merge conflicts because it's entirely isolated in a new module.

How it works

These artifacts are produced with the usual maven-shade-plugin. Some understanding of that plugin is helpful, but I wanted to give a little clarity on a few techniques used.

In general our goal with shading is to control two things:

Which classes end up in the jar, and the fully qualified class names (i.e. including package) of those classes.
Which dependencies are exposed in the resulting pom.

At a very high level, the shade plugin does the following:

Collect all the dependencies in your pom.xml, including transitive dependencies. It's worth noting that this flattens your dependency tree, so if your project A previously depended on project B which depended on project C, your project A now directly depends on B and C.
Include any selected dependencies (via artifactSet) directly into your jar by copying the class files in.
Rewrite those class packages and imports, if configured via relocations.
Write a new dependency-reduced-pom.xml, which only includes the dependencies that weren't included in the jar. This pom becomes the new pom for your artifact.
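
After the plugin has run, one way to see the effect of artifactSet and relocations is to list the built jar and count class files per package prefix. This is a rough sketch that assumes the listing comes from `jar tf`. The function name summarize_packages and the jar path in the usage comment are hypothetical.

```shell
# Read a jar listing (one entry path per line) on stdin and print
# "count prefix" pairs, where prefix is the first N path segments of each
# .class entry. N is the first argument and defaults to 3.
summarize_packages() {
  depth="${1:-3}"
  grep '\.class$' \
    | awk -F/ -v depth="$depth" '{
        n = NF - 1                 # package segments, excluding the file name
        if (n > depth) n = depth
        if (n == 0) {
          prefix = "(default package)"
        } else {
          prefix = $1
          for (i = 2; i <= n; i++) prefix = prefix "/" $i
        }
        count[prefix]++
      }
      END { for (p in count) print count[p], p }' \
    | sort -k2
}

# Example usage (the jar path is a placeholder):
#   jar tf target/my-bundle.jar | summarize_packages 4
```

Prefixes that are neither expected nor relocated are candidates for an artifactSet exclude or a relocation.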

In terms of our two goals, choosing which classes end up in the jar is easy via artifactSet. Controlling which dependencies end up in your final pom is a lot trickier:

Exclusions - Since the shade plugin starts with your initial dependencies, you can eliminate transitive dependencies by excluding them from your direct dependencies. This is effective but typically means applying the same exclusions to all direct dependencies, because the ones you're trying to exclude will often come in through more than one of them.
Marking a dependency as scope provided - The shade plugin seems to ignore scope provided dependencies, as well as all of their transitive dependencies (as long as they aren't converted to compile scope by some other dependency). This sometimes doesn't work and seems kind of magic, so it might make sense to use it only for cases where your jar actually provides that dependency.
Inclusion in the jar - Any dependencies included in the jar will be removed from the resulting pom. In general if you include something in the jar, it should be relocated or filtered. Otherwise, you run the risk of duplicate class conflicts. You can include something in the jar and then filter out all classes, which sort of wipes it out. But it requires configuring in multiple places and is again sort of magic, so it is another last resort.

My strategy has evolved here over time since none of these are perfect and there's no easy answer as far as I can tell. But I've listed the above in approximately the order I chose to solve each dependency. So I mostly preferred exclusions here, then marked some stuff as scope provided, and mostly didn't use the last strategy.

How to make changes

In general the best way I've found to iterate here is:

1. Create a simple downstream project which depends on one or both of these bundles
2. Run mvn dependency:list -DoutputFile=dependencies.out to see a full list of dependencies
3. You can pass that through something like cat dependencies.out | sed -E -e 's/^ +//' | sed -E -e 's/:(compile|runtime|provided|test).*/:\1/' | sed -E -e 's/:(compile|runtime)$/:compile/' | sort | uniq > dependencies.sorted to get a file that can be compared with another such-processed file
4. Make the change you want in the bundle, then mvn clean install
5. Re-run steps 2 and 3, outputting to a new file
6. Run comm -13 first second to see what might be newly added after your change, or comm -23 to see what might have been removed
7. If trying to track a specific dependency from the list, go back here and run mvn dependency:tree -Dincludes=<coordinates>. This might show you what dependency you need to add an exclusion to
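
The normalize-and-compare steps above can be wrapped in a few shell functions so that each iteration is only a couple of commands. This is a minimal sketch and not part of the repo: the function names (normalize_deps, deps_added, deps_removed) and the file names in the usage comments are made up for illustration, and the sed expressions are the ones given above.

```shell
# Normalize the output of `mvn dependency:list -DoutputFile=...` so that two
# runs can be compared line by line.
# $1 is the raw output file; the normalized list goes to stdout.
normalize_deps() {
  sed -E -e 's/^ +//' "$1" \
    | sed -E -e 's/:(compile|runtime|provided|test).*/:\1/' \
    | sed -E -e 's/:(compile|runtime)$/:compile/' \
    | sort | uniq
}

# Print dependencies that appear only in the second (after-change) list.
deps_added() {
  comm -13 "$1" "$2"
}

# Print dependencies that appear only in the first (before-change) list.
deps_removed() {
  comm -23 "$1" "$2"
}

# Example usage (file names are placeholders):
#   mvn dependency:list -DoutputFile=dependencies.out
#   normalize_deps dependencies.out > before.sorted
#   ... change the bundle, mvn clean install, re-run dependency:list ...
#   normalize_deps dependencies.out > after.sorted
#   deps_added before.sorted after.sorted
#   deps_removed before.sorted after.sorted
```

One caveat with the second sed expression: it matches the first ":compile", ":runtime", ":provided" or ":test" on the line, so a coordinate whose artifactId starts with one of those words (for example a hypothetical org.foo:test-utils) would be cut short. The truncation is the same in both files, so the comparison still lines up, but the printed coordinate will be incomplete.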

This ends up being pretty iterative and trial-and-error, but can eventually get you to a jar which has what you want (and doesn't have what you don't).