The documentation says we support EPUB, but the parser is not enabled.
This parser does not require any external dependencies, so I think it's OK to enable?
Separately, test-framework drags in an ancient commons-codec (via httpclient), which Gradle
"upgrades", but IDEs can't handle this case and just hit jar hell. So just wire that to 1.9,
which allows running tests in the IDE for this plugin.
We used to test only JSON parsing, as we relied on QueryBuilder#toString, which uses the JSON format. This commit makes sure that we now output the randomly generated queries using a random format, and that we are always able to parse them correctly.
This revealed a couple of issues with binary objects that haven't yet been migrated to structured Writeable objects. We used to keep them in the format in which they were sent, which led to problems when printing them out, as we expected them to always be in JSON format. Also, we can't compare different BytesReference objects that hold the same content in different formats (unless we parse them as part of equals and hashCode, which doesn't seem like a good idea), so we can't verify that we parsed the right objects if their formats can differ. The fix is to always keep binary objects in JSON format. The best fix would be not to have binary objects at all, which we'll get to once we are done with the search refactoring.
Closes #14415
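
A rough sketch of the round-trip idea described above, in plain Java. Everything here is hypothetical: the two toy formats and the `write`/`parse` helpers stand in for the real content types and query parsers and are not the actual test code. It only shows the shape of the check: generate random content, pick a random format, write it out, and assert that parsing gives back the same thing.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

final class RandomFormatRoundTripSketch {
    // Hypothetical formats; the real tests would pick among the supported content types.
    enum Format { SEMICOLON, LINES }

    private static String pairSeparator(Format format) {
        return format == Format.SEMICOLON ? ";" : "\n";
    }

    private static String keyValueSeparator(Format format) {
        return format == Format.SEMICOLON ? "=" : ":";
    }

    // Serialize the fields in the requested format.
    static String write(Map<String, String> fields, Format format) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (out.length() > 0) {
                out.append(pairSeparator(format));
            }
            out.append(e.getKey()).append(keyValueSeparator(format)).append(e.getValue());
        }
        return out.toString();
    }

    // Parse text that was written in the given format.
    static Map<String, String> parse(String text, Format format) {
        Map<String, String> fields = new LinkedHashMap<>();
        if (text.isEmpty()) {
            return fields;
        }
        for (String pair : text.split(pairSeparator(format))) {
            int split = pair.indexOf(keyValueSeparator(format));
            fields.put(pair.substring(0, split), pair.substring(split + 1));
        }
        return fields;
    }

    // Random content: zero to four fields with alphanumeric keys and values.
    static Map<String, String> randomFields(Random random) {
        Map<String, String> fields = new LinkedHashMap<>();
        int count = random.nextInt(5);
        for (int i = 0; i < count; i++) {
            fields.put("k" + i, "v" + random.nextInt(1000));
        }
        return fields;
    }

    // Write in a randomly chosen format, parse back, compare with the original.
    static boolean roundTrips(long seed) {
        Random random = new Random(seed);
        Map<String, String> fields = randomFields(random);
        Format format = Format.values()[random.nextInt(Format.values().length)];
        return parse(write(fields, format), format).equals(fields);
    }

    public static void main(String[] args) {
        for (long seed = 0; seed < 100; seed++) {
            if (!roundTrips(seed)) {
                throw new AssertionError("round trip failed for seed " + seed);
            }
        }
        System.out.println("all round trips passed");
    }
}
```

Seeding the `Random` keeps any failure reproducible, which matters once the format is no longer fixed.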
The latest version of Lucene deprecated Query#setBoost and Query#getBoost, which makes queries effectively immutable. Those methods need to be replaced with `BoostQuery`, which wraps any query that needs boosting.
This commit replaces usages of setBoost with BoostQuery and adds setBoost to forbidden-apis for prod code.
Usages of `getBoost` are only partially removed, as some will have to stay for backwards compatibility.
Closes #14264
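
A minimal stdlib-only model of the wrapping pattern, so it can be read without Lucene on the classpath. The `SketchQuery`, `SketchTermQuery` and `SketchBoostQuery` types are hypothetical stand-ins, not Lucene classes, and skipping the wrapper for a boost of 1.0 is an assumption of this sketch. The point is only that the boost moves out of a mutable field and into an immutable wrapper.

```java
import java.util.Objects;

final class BoostWrapperSketch {
    // Hypothetical stand-in for an immutable query type (not Lucene's Query).
    interface SketchQuery {
        String describe();
    }

    // A leaf query with no setters: once built, it never changes.
    static final class SketchTermQuery implements SketchQuery {
        private final String field;
        private final String value;

        SketchTermQuery(String field, String value) {
            this.field = Objects.requireNonNull(field);
            this.value = Objects.requireNonNull(value);
        }

        @Override
        public String describe() {
            return field + ":" + value;
        }
    }

    // The boost lives in a wrapper instead of being set on the wrapped query.
    static final class SketchBoostQuery implements SketchQuery {
        private final SketchQuery inner;
        private final float boost;

        SketchBoostQuery(SketchQuery inner, float boost) {
            this.inner = Objects.requireNonNull(inner);
            this.boost = boost;
        }

        @Override
        public String describe() {
            return "(" + inner.describe() + ")^" + boost;
        }
    }

    // Assumption of this sketch: a boost of 1.0 is a no-op, so no wrapper is needed.
    static SketchQuery withBoost(SketchQuery query, float boost) {
        return boost == 1.0f ? query : new SketchBoostQuery(query, boost);
    }

    public static void main(String[] args) {
        SketchQuery term = new SketchTermQuery("title", "lucene");
        System.out.println(withBoost(term, 2.0f).describe());
    }
}
```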
If you run tests under a 32-bit JVM, you will get a test failure in IndexStoreTests:
the logic there is wrong in the 32-bit case (it's NIOFSDirectory on Linux).
Also, if mlockall fails, you'll see huge bogus values (because of the use of `long` instead of `NativeLong`).
Finally, add seccomp support for 32-bit too, and clean up all its `long` usage as well.
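
To illustrate why the width matters: a Java `long` is always 64 bits, while a C `long` is 32 bits on a 32-bit platform, which is the mismatch JNA's `NativeLong` exists to absorb. The sketch below is a plain-Java model of that mismatch using a `ByteBuffer`. It is not the actual native binding code, and the two-field layout is made up for the example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class NativeLongWidthSketch {
    // Model two adjacent 32-bit fields, the way a struct of two C `long`s
    // would be laid out on a 32-bit little-endian platform (hypothetical layout).
    static ByteBuffer twoFields32(int first, int second) {
        ByteBuffer buf = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(first);
        buf.putInt(second);
        return buf;
    }

    // Wrong on 32-bit: reads 8 bytes, so the neighbouring field leaks into the value.
    static long readFirstAs64(ByteBuffer buf) {
        return buf.getLong(0);
    }

    // Matches the 32-bit layout: read 4 bytes and widen them as unsigned.
    static long readFirstAs32(ByteBuffer buf) {
        return Integer.toUnsignedLong(buf.getInt(0));
    }

    public static void main(String[] args) {
        ByteBuffer buf = twoFields32(65536, 0x7FFFFFFF);
        System.out.println("32-bit read: " + readFirstAs32(buf));
        System.out.println("64-bit read: " + readFirstAs64(buf));
    }
}
```

The 64-bit read returns the neighbouring field in its upper half, which is the kind of huge bogus number described above.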
There have been security issues with Tika's parsers in the past...
Let's take away the network, filesystem, everything we can.
In some ways, parsing these docs is a lot like executing untrusted code.
I know it's not pretty, but I think it's worth it.
This patch adds a zip of about 200 files from Tika's test suite,
and we assert some content comes back from each. This is a good exercise
of the various formats.
I removed any huge files to try to keep the size reasonable, but we want
a bit of variety so we know stuff is working.
I fixed issues with the parser config by running this.
This removes a lot of obscure parsers, and leaves us with the basics.
This includes at least all of the formats listed on
https://github.com/elastic/elasticsearch-mapper-attachments/issues/163
I will start adding tests for each one of these document formats,
and take it as it goes and see what trouble we run into.
Closes #163
The plugin name currently defaults to the Gradle project name. But the
Gradle project name for a standalone repo (like an external plugin would
be) defaults to the directory name of the repo. This is trappy, since it
depends on how the repo was checked out.
This change enforces that the plugin name is always set.
Closes #14603