Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[StringUtils::indexOfAnyBut] redesign due to inconsistent/faulty behaviour regarding UTF-16 surrogates #1327

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

IBue
Copy link

@IBue IBue commented Dec 3, 2024

depends/stacked on #1326 (tests only)

Both signatures of StringUtils::indexOfAnyBut currently behave
inconsistently in matching UTF-16 supplementary characters and single
UTF-16 surrogate characters (i.e. paired and unpaired surrogates), since
they differ unnecessarily in their algorithmic implementations, use
their own incomplete and faulty interpretation of UTF-16 and don't take
full advantage of the standard library.

The example cases below show that they may yield contradictory results
or correct results for the wrong reasons.

This proposal gives a unified algorithmic implementation of both
signatures that

  • a) is much easier to grasp due to a clear mathematical set approach and
    safe iteration and doesn't become entangled in index arithmetic;
    stresses the set semantics of the 2nd argument
  • b) fully relies on the standard library for defined UTF-16
    handling/interpretation;
    paired surrogates are merged into one codepoint, unpaired surrogates
    are left as they are
  • c) scales much better with input sizes and result index position
  • d) can benefit from current and future improvements in the standard
    library and JVM
    (streams implementation, parallelization, JIT optimization, JEP 218, ???…)

The algorithm boils down to:
find index i of first char in cs such that
(cs.codePointAt(i) ∈ {x ∈ codepoints(cs) ∣ x ∉ codepoints(searchChars) })
(cs.codePointAt(i) ∉ { x ∈ codepoints(searchChars) })

Examples:

<H>: high-surrogate character
<L>: low-surrogate character
(<H><L>): valid supplementary character
signature 1: StringUtils::indexOfAnyBut(final CharSequence seq, final CharSequence searchChars)
signature 2: StringUtils::indexOfAnyBut(final CharSequence cs, final char... searchChars)

Case 1: matching of unpaired high-surrogate

seq/cs searchChars expected/new sig.1 sig.2
1.1 <H>aaaa <H>abcd !found !found !found
sig.1: 'a' is somewhere
in searchChars
sig.2: 'a' happens to follow
<H> in searchChars
1.2 <H>baaa <H>abcd !found !found 0
sig.1: 'b' is somewhere
in searchChars
1.3 <H>aaaa (<H><L>)abcd 0 !found 0
sig.1: 'a' is somewhere
in searchChars
1.4 aaaa<H> (<H><L>)abcd 4 !found !found

Case 2: matching of unpaired low-surrogate

seq/cs searchChars expected/new sig.1 sig.2
2.1 <L>aaaa (<H><L>)abcd 0 !found !found
sig.1+2 don't interpret suppl. character
2.2 aaaa<L> (<H><L>)abcd 4 !found !found
sig.1+2 don't interpret suppl. character

Case 3: matching of supplementary character

seq/cs searchChars expected/new sig.1 sig.2
3.1 (<H><L>)aaaa <L>ab<H>cd 0 !found 0
sig.1: <L> is somewhere
in searchChars
3.2 (<H><L>)aaaa abcd 0 1 0
sig.1 always points to low-surrogate
of (fully) unmatched suppl. character
3.3 (<H><L>)aaaa abcd<H> 0 0 1
3.4 (<H><L>)aaaa abcd<L> 0 !found 0
sig.1: <H> skipped by algorithm

Copy link
Member

@garydgregory garydgregory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @IBue

Please rebate on git master and see my comments. It's hard to read the changes since some changes make the PR noisier than it has to be, like unnecessary parameter name changes. Also the PR contains duplicate work from your other PR.

TY!

@@ -2829,41 +2835,23 @@ public static int indexOfAny(final CharSequence cs, final String searchChars) {

* </pre>
*
* @param cs the CharSequence to check, may be null
* @param srcChars the CharSequence to check, may be null
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't rename parameters in methods, it makes the PR noisy and makes it harder to review.

*/
static final String CharU20000 = "\uD840\uDC00";
static final String CharU20000SupplCharLow = "\uDC00";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes duplicate your changes in your other PR. Rebase on git master and don't make these changes here.

public static int indexOfAnyBut(final CharSequence seq, final CharSequence searchChars) {
if (isEmpty(seq) || isEmpty(searchChars)) {
public static int indexOfAnyBut(final CharSequence srcChars, final CharSequence searchChars) {
if (srcChars == null || searchChars == null || srcChars.length() == 0 || searchChars.length() == 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why you inlined these calls. Keep the PR to only the required changes to fix a bug.

}
} else if (!chFound) {
return i;
final Set<Integer> srcSetCodePoints = srcChars.codePoints().boxed().collect(Collectors.toSet()); // >=10: Collectors::toUnmodifiableSet
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does comments like >=10, 11, ... mean here? Can you be more explicit? Are these magic numbers?

@IBue IBue force-pushed the pr-indexOfAnyBut-redesign branch from f74224a to 70ae28d Compare December 5, 2024 07:05
@IBue
Copy link
Author

IBue commented Dec 5, 2024

I have rebased onto master and have followed your comments.

@garydgregory
Copy link
Member

This one needs a second pair of eyes... @ppkarwasz @chtompki @struberg ?

One big difference is that the current implementation does not allocate any extra memory. I wonder what happens to performance of existing apps that call these methods on large strings.

@garydgregory
Copy link
Member

I have rebased onto master and have followed your comments.

Hi @IBue

It looks like you forgot to run mvn (by itself) before pushing, see the checkstyle error in StringUtils.

Comment on lines 2879 to 2883
final Set<Integer> seqSetCodePoints = seq.codePoints().boxed().collect(Collectors.toSet()); // JDK >=10: Collectors::toUnmodifiableSet
final Set<Integer> searchSetCodePoints = searchChars.codePoints().boxed()
.collect(Collectors.toSet()); // JDK >=10: Collectors::toUnmodifiableSet
final Set<Integer> complSetCodePoints = seqSetCodePoints.stream().filter(((Predicate<Integer>) searchSetCodePoints::contains).negate()) // JDK >=11: Predicate.not(searchSetCodePoints::contains)
.collect(Collectors.toSet()); // JDK >=10: Collectors::toUnmodifiableSet

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This creates 3 temporary sets. Since seq is potentially a very long sequence, we can end up with a very large memory footprint, while the previous code was garbage-free.

The advantage of Streams (or for loops 😉) is that you can exit early without consuming seq.codePoints().
Using Java 9, we could implement this method using:

Set<Integer> searchCodePoints = searchChars.codePoints().boxed().collect(Collectors.toSet());
int idx = seq.codePoints()
        .takeWhile(c -> !searchCodePoints.contains(c))
        .reduce(0, (count, codePoint) -> count + Character.charCount(codePoint));
return idx < seq.length() ? idx : INDEX_NOT_FOUND;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't jump off of Java 8 yet. I think we'd leave too many people behind. At least based on current surveys.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to use Java 9. The above stream expression is almost identical to the loop I posted below.

In this case the use of a Stream seems pretty much syntactic sugar.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Set<Integer> searchCodePoints = searchChars.codePoints().boxed().collect(Collectors.toSet());
int idx = seq.codePoints()
        .takeWhile(c -> !searchCodePoints.contains(c))
        .reduce(0, (count, codePoint) -> count + Character.charCount(codePoint));
return idx < seq.length() ? idx : INDEX_NOT_FOUND;

This is a really clever stream-native solution (though !searchCodePoints.contains… is the StringUtils::indexOfAny() variant)
that appears to have almost no memory footprint penalty and is even 2-3× faster than this pull-request.

Comment on lines -2892 to -2901
for (int i = 0; i < strLen; i++) {
final char ch = seq.charAt(i);
final boolean chFound = CharSequenceUtils.indexOf(searchChars, ch, 0) >= 0;
if (i + 1 < strLen && Character.isHighSurrogate(ch)) {
final char ch2 = seq.charAt(i + 1);
if (chFound && CharSequenceUtils.indexOf(searchChars, ch2, 0) < 0) {
return i;
}
} else if (!chFound) {
return i;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we just need to replace this code with:

int codePoint;
for (int i = 0; i < strLen; i += Character.charCount(codePoint)) {
    codePoint = Character.codePointAt(seq, i);
    if (CharSequenceUtils.indexOf(searchChars, codePoint, 0) == -1) {
        return i;
    }
}

This way, surrogate pairs HL in seq are only looked up once in searchChars.

Not sure how to handle the case, when a surrogate character appears alone in seq, but IMHO in this case the seq string is corrupted and undefined behavior is justified.

IBue added 2 commits December 6, 2024 18:56
…behaviour regarding UTF-16 surrogates

Both signatures of StringUtils::indexOfAnyBut currently behave
inconsistently in matching UTF-16 supplementary characters and single
UTF-16 surrogate characters (i.e. paired and unpaired surrogates), since
they differ unnecessarily in their algorithmic implementations, use
their own incomplete and faulty interpretation of UTF-16 and don't take
full advantage of the standard library.

The example cases below show that they may yield contradictory results
or correct results for the wrong reasons.

This proposal gives a unified algorithmic implementation of both
signatures that
a) is much easier to grasp due to a clear mathematical set approach and
   safe iteration and doesn't become entangled in index arithmetic;
   stresses the set semantics of the 2nd argument
b) fully relies on the standard library for defined UTF-16
   handling/interpretation;
   paired surrogates are merged into one codepoint, unpaired surrogates
   are left as they are
c) scales much better with input sizes and result index position
d) can benefit from current and future improvements in the standard
   library and JVM
   (streams implementation, parallelization, JIT optimization, JEP 218,
   ???…)

The algorithm boils down to:
find index i of first char in cs such that
(cs.codePointAt(i) ∈ {x ∈ codepoints(cs) ∣ x ∉
codepoints(searchChars) })

Examples:
---------

<H>: high-surrogate character
<L>: low-surrogate character
(<H><L>): valid supplementary character
signature 1: StringUtils::indexOfAnyBut(final CharSequence seq,
final CharSequence searchChars)
signature 2: StringUtils::indexOfAnyBut(final CharSequence cs,
final char... searchChars)

Case 1: matching of unpaired high-surrogate
---------seq/cs-------searchChars------exp./new-----sig.1-------sig.2---

 1.1     <H>aaaa      <H>abcd          !found       !found      !found
  sig.2: 'a' happens to follow <H> in searchChars;
  sig.1: 'a' is somewhere in searchChars

 1.2     <H>baaa      <H>abcd          !found       !found      0
  sig.1: 'b' is somewhere in searchChars

 1.3     <H>aaaa      (<H><L>)abcd     0            !found      0
  sig.1: 'a' is somewhere in searchChars

 1.4     aaaa<H>      (<H><L>)abcd     4            !found      !found
  sig.1+2 don't interpret suppl. character

Case 2: matching of unpaired low-surrogate
---------seq/cs-------searchChars------exp./new-----sig.1-------sig.2---

 2.1     <L>aaaa      (<H><L>)abcd     0            !found      !found
  sig.1+2 don't interpret suppl. character

 2.2     aaaa<L>      (<H><L>)abcd     4            !found      !found
  sig.1+2 don't interpret suppl. character

Case 3: matching of supplementary character
---------seq/cs-------------searchChars-----exp./new----sig.1-----sig.2-

 3.1     (<H><L>)aaaa       <L>ab<H>cd      0           !found    0
  sig.1: <L> is somewhere in searchChars

 3.2     (<H><L>)aaaa       abcd            0           1         0
  sig.1 always points to low-surrogate of (fully) unmatched
  suppl. character

 3.3     (<H><L>)aaaa       abcd<H>         0           0         1
 3.4     (<H><L>)aaaa       abcd<L>         0           !found    0
  sig.1: <H> skipped by algorithm
by simplifying set consideration:
find index i of first char in seq such that (seq.codePointAt(i) ∉ { x ∈
codepoints(searchChars) })
@IBue IBue force-pushed the pr-indexOfAnyBut-redesign branch from 70ae28d to 0e372a6 Compare December 6, 2024 18:57
Comment on lines +2880 to +2888
for (final ListIterator<Integer> seqListIt = seq.chars().boxed().collect(Collectors.toList()) // JDK >=16: Stream::toList, JDK >=10: Collectors::toUnmodifiableList
.listIterator(); seqListIt.hasNext(); seqListIt.next()) {
final int curSeqCharIdx = seqListIt.nextIndex();
final int curSeqCodePoint = Character.codePointAt(seq, curSeqCharIdx);
if (!searchSetCodePoints.contains(curSeqCodePoint)) {
return curSeqCharIdx;
}
if (Character.isSupplementaryCodePoint(curSeqCodePoint)) {
seqListIt.next(); // skip subsequent low-surrogate in next loop, since it merged into curSeqCodePoint

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for (final ListIterator<Integer> seqListIt = seq.chars().boxed().collect(Collectors.toList()) // JDK >=16: Stream::toList, JDK >=10: Collectors::toUnmodifiableList
.listIterator(); seqListIt.hasNext(); seqListIt.next()) {
final int curSeqCharIdx = seqListIt.nextIndex();
final int curSeqCodePoint = Character.codePointAt(seq, curSeqCharIdx);
if (!searchSetCodePoints.contains(curSeqCodePoint)) {
return curSeqCharIdx;
}
if (Character.isSupplementaryCodePoint(curSeqCodePoint)) {
seqListIt.next(); // skip subsequent low-surrogate in next loop, since it merged into curSeqCodePoint
int seqLength = seq.length();
int curSeqCodePoint;
// Skips the second character of a surrogate pair
for (curSeqCharIdx = 0; curSeqCharIdx < seqLength; curSeqCharIdx += Character.charCount(curSeqCodePoint)) {
curSeqCodePoint = Character.codePointAt(seq, curSeqCharIdx);
if (!searchSetCodePoints.contains(curSeqCodePoint)) {
return curSeqCharIdx;
}

Using a ListIterator has IMHO the same disadvantage of creating sets: you need first to create a List (i.e. consume all the code points in seq) and only then you start to iterate. I don't see why we can't use a simple loop.

@IBue
Copy link
Author

IBue commented Dec 6, 2024

note on performance:
boxing is the main responsible for increase in memory footprint here, that is +3GB for a 100M input sequence,
but the current implementation takes already a 10² order of magnitude more runtime than the proposed one for a 10M input sequence (minutes!) and a result index at half of the input size.

@ppkarwasz
Copy link

@IBue,

note on performance: boxing is the main responsible for increase in memory footprint here, that is +3GB for a 100M input sequence, but the current implementation takes already a 10² order of magnitude more runtime than the proposed one for a 10M input sequence (minutes!) and a result index at half of the input size.

Can you tell us more on the usage you have for this method? That will tell us a lot on how it should be optimized.

A common usage for this method could be extracting contiguous series of digits from a long string. With series of 10-20 digits it doesn't really make sense to go parallel, but if you want to validate a 100M input to make sure it only contains digits, you can sacrifice more memory.

Note: The indexOfAnyBut method can have multiple implementations and switched between them based on the input size.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants