[ML] Fix character set finder bug with unencodable charsets (#33234)

Some character sets cannot be encoded and this was tripping
up the binary data check in the ML log structure character
set finder.

The fix is to assume that if ICU4J identifies that some bytes
correspond to a character set that cannot be encoded and those
bytes contain zeroes then the data is binary rather than text.

Fixes #33227
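
The rule described above can be sketched in isolation as follows. This is a minimal illustration and not the code from LogStructureFinderManager: the class name UnencodableCharsetCheck and the methods spaceEncodingContainsZeroByte and rejectAsBinary are hypothetical names chosen for the example. It assumes only the standard java.nio.charset API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating the check; names are not from the original source.
final class UnencodableCharsetCheck {

    // Returns true if encoding a single space in the given charset produces a zero byte.
    // A charset that cannot encode yields false, so zero bytes in the input will cause
    // that charset to be rejected by rejectAsBinary below.
    static boolean spaceEncodingContainsZeroByte(Charset charset) {
        if (charset.canEncode() == false) {
            // Calling getBytes with such a charset may throw, so skip the encoding step.
            return false;
        }
        for (byte b : " ".getBytes(charset)) {
            if (b == 0) {
                return true;
            }
        }
        return false;
    }

    // The input is treated as binary when it contains zero bytes but the candidate
    // charset would not produce zero bytes when encoding ordinary text.
    static boolean rejectAsBinary(boolean inputContainsZeroBytes, Charset charset) {
        return inputContainsZeroBytes && spaceEncodingContainsZeroByte(charset) == false;
    }

    public static void main(String[] args) {
        // UTF-8 encodes a space as the single byte 0x20, so zero bytes imply binary data.
        System.out.println(rejectAsBinary(true, StandardCharsets.UTF_8));
        // UTF-16LE encodes a space as 0x20 0x00, so zero bytes are expected in text.
        System.out.println(rejectAsBinary(true, StandardCharsets.UTF_16LE));
        // List the charsets on this JVM that cannot encode; the set varies by JDK.
        for (Charset cs : Charset.availableCharsets().values()) {
            if (cs.canEncode() == false) {
                System.out.println("Cannot encode: " + cs.name());
            }
        }
    }
}
```

Guarding on canEncode() before encoding keeps the check from failing on decode-only charsets, at the cost of rejecting any genuine text in such a charset that happens to contain zero bytes.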
David Roberts 2018-08-29 14:56:02 +01:00 committed by GitHub
parent dd1956cf19
commit 22415fa2de
1 changed file with 9 additions and 3 deletions

@@ -163,10 +163,16 @@ public final class LogStructureFinderManager {
 // deduction algorithms on binary files is very slow as the binary files generally appear to
 // have very long lines.
 boolean spaceEncodingContainsZeroByte = false;
-byte[] spaceBytes = " ".getBytes(name);
-for (int i = 0; i < spaceBytes.length && spaceEncodingContainsZeroByte == false; ++i) {
-    spaceEncodingContainsZeroByte = (spaceBytes[i] == 0);
+Charset charset = Charset.forName(name);
+// Some character sets cannot be encoded. These are extremely rare so it's likely that
+// they've been chosen based on incorrectly provided binary data. Therefore, err on
+// the side of rejecting binary data.
+if (charset.canEncode()) {
+    byte[] spaceBytes = " ".getBytes(charset);
+    for (int i = 0; i < spaceBytes.length && spaceEncodingContainsZeroByte == false; ++i) {
+        spaceEncodingContainsZeroByte = (spaceBytes[i] == 0);
+    }
 }
 if (containsZeroBytes && spaceEncodingContainsZeroByte == false) {
     explanation.add("Character encoding [" + name + "] matched the input with [" + charsetMatch.getConfidence() +
         "%] confidence but was rejected as the input contains zero bytes and the [" + name + "] encoding does not");