[ERROR] Medium: The field
org.apache.commons.lang3.builder.DiffBuilder$SDiff.leftSupplier is
transient but isn't set by deserialization
[org.apache.commons.lang3.builder.DiffBuilder$SDiff] In DiffBuilder.java
SE_TRANSIENT_FIELD_NOT_RESTORED
[ERROR] Medium: The field
org.apache.commons.lang3.builder.DiffBuilder$SDiff.rightSupplier is
transient but isn't set by deserialization
[org.apache.commons.lang3.builder.DiffBuilder$SDiff] In DiffBuilder.java
SE_TRANSIENT_FIELD_NOT_RESTORED
* [StringUtils::indexOfAnyBut] redesign due to inconsistent/faulty
behaviour regarding UTF-16 surrogates
Both signatures of StringUtils::indexOfAnyBut currently match UTF-16
supplementary characters and single UTF-16 surrogate characters (i.e.
paired and unpaired surrogates) inconsistently: their algorithmic
implementations differ unnecessarily, each relies on its own incomplete
and faulty interpretation of UTF-16, and neither takes full advantage of
the standard library.
The example cases below show that they may yield contradictory results
or correct results for the wrong reasons.
This proposal gives a unified algorithmic implementation of both
signatures that
a) is much easier to grasp thanks to a clear mathematical set approach
   and safe iteration, without becoming entangled in index arithmetic;
   it stresses the set semantics of the 2nd argument
b) relies fully on the standard library for defined UTF-16
   handling/interpretation;
   paired surrogates are merged into one codepoint, unpaired surrogates
   are left as they are
c) scales much better with input sizes and result index position
d) can benefit from current and future improvements in the standard
   library and JVM
   (streams implementation, parallelization, JIT optimization, JEP 218,
   ???…)
The algorithm boils down to:
  find index i of first char in cs such that
  cs.codePointAt(i) ∈ { x ∈ codepoints(cs) ∣ x ∉ codepoints(searchChars) }
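The set codepoints(searchChars) used above can be obtained directly from the standard library. The following is a minimal sketch, assuming only java.lang/java.util APIs; the class name CodePointSetSketch and the helper codePointSet are hypothetical and not taken from the commit. It shows that CharSequence.codePoints() merges a paired surrogate into one code point and passes an unpaired surrogate through unchanged.

```java
import java.util.Set;
import java.util.stream.Collectors;

public class CodePointSetSketch {

    // Hypothetical helper: the set of code points occurring in a sequence.
    public static Set<Integer> codePointSet(final CharSequence cs) {
        return cs.codePoints().boxed().collect(Collectors.toSet());
    }

    public static void main(final String[] args) {
        final String pair = "\uD840\uDC00"; // a valid supplementary character, U+20000
        final char high = '\uD840';         // its high surrogate on its own
        final char low = '\uDC00';          // its low surrogate on its own

        final Set<Integer> searchSet = codePointSet(pair + "abcd");

        // The pair is a single member of the set ...
        System.out.println(searchSet.contains(0x20000));
        // ... while its two halves are not members on their own.
        // The cast matters: a char would be boxed to Character and never match.
        System.out.println(searchSet.contains((int) high));
        System.out.println(searchSet.contains((int) low));
    }
}
```

Testing membership with int code points rather than char values is what keeps a supplementary character in searchChars from accidentally matching a lone surrogate in the input.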
Examples:
---------
<H>: high-surrogate character
<L>: low-surrogate character
(<H><L>): valid supplementary character
signature 1: StringUtils::indexOfAnyBut(final CharSequence seq,
final CharSequence searchChars)
signature 2: StringUtils::indexOfAnyBut(final CharSequence cs,
final char... searchChars)
Case 1: matching of unpaired high-surrogate
-----seq/cs--------searchChars-------exp./new--sig.1---sig.2---
1.1  <H>aaaa       <H>abcd           !found    !found  !found
     sig.2: 'a' happens to follow <H> in searchChars;
     sig.1: 'a' is somewhere in searchChars
1.2  <H>baaa       <H>abcd           !found    !found  0
     sig.1: 'b' is somewhere in searchChars
1.3  <H>aaaa       (<H><L>)abcd      0         !found  0
     sig.1: 'a' is somewhere in searchChars
1.4  aaaa<H>       (<H><L>)abcd      4         !found  !found
     sig.1+2 don't interpret suppl. character
Case 2: matching of unpaired low-surrogate
-----seq/cs--------searchChars-------exp./new--sig.1---sig.2---
2.1  <L>aaaa       (<H><L>)abcd      0         !found  !found
     sig.1+2 don't interpret suppl. character
2.2  aaaa<L>       (<H><L>)abcd      4         !found  !found
     sig.1+2 don't interpret suppl. character
Case 3: matching of supplementary character
-----seq/cs--------searchChars-------exp./new--sig.1---sig.2---
3.1  (<H><L>)aaaa  <L>ab<H>cd        0         !found  0
     sig.1: <L> is somewhere in searchChars
3.2  (<H><L>)aaaa  abcd              0         1       0
     sig.1 always points to low-surrogate of (fully) unmatched
     suppl. character
3.3  (<H><L>)aaaa  abcd<H>           0         0       1
3.4  (<H><L>)aaaa  abcd<L>           0         !found  0
     sig.1: <H> skipped by algorithm
* [StringUtils::indexOfAnyBut] further reduction of algorithm
by simplifying set consideration:
  find index i of first char in seq such that
  seq.codePointAt(i) ∉ { x ∈ codepoints(searchChars) }
* [StringUtils::indexOfAnyBut] simplify input-sequence iteration
by transforming the ListIterator loop into an index-based loop that
advances by Character.charCount(codepoint);
this enables short-circuit processing and avoids processing the whole
input sequence in advance
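The index-based loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the committed code: the class name IndexOfAnyButSketch is hypothetical, INDEX_NOT_FOUND is assumed to be -1, and null or empty arguments are assumed to yield INDEX_NOT_FOUND.

```java
import java.util.Set;
import java.util.stream.Collectors;

public class IndexOfAnyButSketch {

    // Assumption: "not found" is reported as -1.
    public static final int INDEX_NOT_FOUND = -1;

    // CharSequence variant: index of the first code point of seq
    // that is not contained in the code point set of searchChars.
    public static int indexOfAnyBut(final CharSequence seq, final CharSequence searchChars) {
        if (seq == null || seq.length() == 0 || searchChars == null || searchChars.length() == 0) {
            return INDEX_NOT_FOUND; // assumed handling of null/empty input
        }
        final Set<Integer> searchSet =
                searchChars.codePoints().boxed().collect(Collectors.toSet());
        int i = 0;
        while (i < seq.length()) {
            // Reads a pair as one code point, an unpaired surrogate as itself.
            final int codepoint = Character.codePointAt(seq, i);
            if (!searchSet.contains(codepoint)) {
                return i; // short-circuit: the rest of seq is never inspected
            }
            i += Character.charCount(codepoint); // 2 for a pair, otherwise 1
        }
        return INDEX_NOT_FOUND;
    }

    // char... variant: delegates so that both variants share one algorithm.
    public static int indexOfAnyBut(final CharSequence cs, final char... searchChars) {
        return searchChars == null ? INDEX_NOT_FOUND : indexOfAnyBut(cs, new String(searchChars));
    }

    public static void main(final String[] args) {
        final String h = "\uD840"; // unpaired high surrogate
        final String l = "\uDC00"; // unpaired low surrogate
        // Rows 1.4 and 3.2 of the example tables, checked against "exp./new".
        System.out.println(indexOfAnyBut("aaaa" + h, h + l + "abcd"));
        System.out.println(indexOfAnyBut(h + l + "aaaa", "abcd".toCharArray()));
    }
}
```

Because the loop returns at the first non-member, its cost depends on the position of the result rather than on the length of the input.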
* [StringUtils::indexOfAnyBut] parameterization of test functions
providing a single source of truth (arguments stream) for the two
function variants
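The idea of one arguments stream feeding both variants can be sketched without any test framework. Everything named here is hypothetical (SharedArgumentsSketch, Case, cases, failures, and the two stand-in variants); the commit's actual test code is not shown in this message. The rows are taken from the example tables above, with -1 standing for "!found".

```java
import java.util.List;
import java.util.Set;
import java.util.function.BiFunction;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class SharedArgumentsSketch {

    /** Hypothetical holder for one row: input, search characters, expected index. */
    public static final class Case {
        final String seq;
        final String searchChars;
        final int expected;

        Case(final String seq, final String searchChars, final int expected) {
            this.seq = seq;
            this.searchChars = searchChars;
            this.expected = expected;
        }
    }

    static final String H = "\uD840"; // unpaired high surrogate
    static final String L = "\uDC00"; // unpaired low surrogate

    /** Single source of truth: every variant is checked against these rows. */
    public static Stream<Case> cases() {
        return Stream.of(
                new Case(H + "aaaa", H + "abcd", -1),      // row 1.1
                new Case("aaaa" + H, H + L + "abcd", 4),   // row 1.4
                new Case(L + "aaaa", H + L + "abcd", 0),   // row 2.1
                new Case(H + L + "aaaa", "abcd" + H, 0));  // row 3.3
    }

    /** Stand-in for the CharSequence variant (an assumption, not library code). */
    public static int viaCharSequence(final String seq, final String searchChars) {
        final Set<Integer> searchSet =
                searchChars.codePoints().boxed().collect(Collectors.toSet());
        int i = 0;
        while (i < seq.length()) {
            final int codepoint = seq.codePointAt(i);
            if (!searchSet.contains(codepoint)) {
                return i;
            }
            i += Character.charCount(codepoint);
        }
        return -1;
    }

    /** Stand-in for the char... variant: converts the array and delegates. */
    public static int viaCharArray(final String seq, final char... searchChars) {
        return viaCharSequence(seq, new String(searchChars));
    }

    /** Runs all rows against one variant and describes the rows that fail. */
    public static List<String> failures(final BiFunction<String, String, Integer> variant) {
        return cases()
                .filter(c -> variant.apply(c.seq, c.searchChars) != c.expected)
                .map(c -> "expected " + c.expected + " for input of length " + c.seq.length())
                .collect(Collectors.toList());
    }

    public static void main(final String[] args) {
        System.out.println(failures(SharedArgumentsSketch::viaCharSequence));
        System.out.println(failures((seq, chars) -> viaCharArray(seq, chars.toCharArray())));
    }
}
```

Adapting each variant to the same functional shape means a new row is added in one place and automatically exercised against both signatures.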
* [StringUtils::indexOfAnyBut] remove comment
Set::contains of immutable Set has unexplained, disastrous performance
issues when searching for large values (here: >0xffff) in a set of
smaller values (including JDK 23)
---------
Co-authored-by: IBue <>