Add a new SuffixParser to handle UTF-8 characters with more than one byte #82

baldawar · 2023-03-31T03:54:31Z

Issue #, if available: #73

Description of changes:

Add a new suffix parser that reverses UTF-8 characters correctly.

Before, we would convert "雨 to [34, -23, -101, -88], while addSuffixMatch(...) in ByteMachine expected the value to be [34, -88, -101, -23].

In this change, we're changing how we build InputCharacter[] within the default parser by introducing SuffixParser for the current quirky "reverse and then match backwards" behaviour within the suffix and anything-but-suffix matchers.
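To make the byte-order mismatch concrete, here is a minimal sketch that reproduces the two arrays quoted above. The class and method names are hypothetical and are not the SuffixParser added in this PR. It only assumes that the stored suffix is the value 雨" with its trailing quote, and compares reversing the characters before encoding with reversing the encoded bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of Ruler.
class SuffixBytesDemo {

    // Reverse the characters, then encode to UTF-8.
    // Each multi-byte character keeps its internal byte order.
    static byte[] reverseCharsThenEncode(String s) {
        return new StringBuilder(s).reverse().toString().getBytes(StandardCharsets.UTF_8);
    }

    // Encode to UTF-8, then reverse the whole byte sequence.
    // This is the order a matcher sees when it walks the value backwards byte by byte.
    static byte[] encodeThenReverseBytes(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        byte[] reversed = new byte[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            reversed[i] = bytes[bytes.length - 1 - i];
        }
        return reversed;
    }

    public static void main(String[] args) {
        // \u96E8 is 雨 (UTF-8: E9 9B A8), followed by the closing quote of the JSON string.
        String value = "\u96E8\"";
        System.out.println(Arrays.toString(reverseCharsThenEncode(value))); // [34, -23, -101, -88]
        System.out.println(Arrays.toString(encodeThenReverseBytes(value))); // [34, -88, -101, -23]
    }
}
```

For single-byte (ASCII) suffixes the two approaches give identical output, which would explain why the mismatch only shows up with multi-byte characters.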

I decided against switching to a wildcard-based pattern like *雨 within suffix because:

Patterns.suffixMatch() is a public method, so any change could be breaking.
Wildcard hasn't been significantly tested in the wild yet and introduces a new layer of complexity.
It would take a significant amount of refactoring (mostly deletion), which would lead to rebasing issues with "Exponential memory improvement by re-using NameState across multiple patterns" (#75) that I don't want to introduce at the moment.

No issues detected within benchmarks.

Benchmark / Performance (for source code changes):

Before

EXACT events/sec: 201387.5
WILDCARD events/sec: 160201.5
PREFIX events/sec: 214138.7
SUFFIX events/sec: 197651.2
EQUALS_IGNORE_CASE events/sec: 175798.7
NUMERIC events/sec: 129840.3
ANYTHING-BUT events/sec: 127356.8
ANYTHING-BUT-IGNORE-CASE events/sec: 129524.6
ANYTHING-BUT-PREFIX events/sec: 135971.9
ANYTHING-BUT-SUFFIX events/sec: 136319.9
COMPLEX_ARRAYS events/sec: 4285.7
PARTIAL_COMBO events/sec: 56712.3
COMBO events/sec: 2421.7

After

EXACT events/sec: 205664.1
WILDCARD events/sec: 154062.2
PREFIX events/sec: 213281.3
SUFFIX events/sec: 209506.4
EQUALS_IGNORE_CASE events/sec: 183048.1
NUMERIC events/sec: 125852.3
ANYTHING-BUT events/sec: 127738.6
ANYTHING-BUT-IGNORE-CASE events/sec: 129367.3
ANYTHING-BUT-PREFIX events/sec: 129998.8
ANYTHING-BUT-SUFFIX events/sec: 131280.3
COMPLEX_ARRAYS events/sec: 4411.9
PARTIAL_COMBO events/sec: 44781.0
COMBO events/sec: 2360.7

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

#73

jonessha · 2023-03-31T15:53:31Z

src/test/software/amazon/event/ruler/MachineTest.java

+        Machine m = new Machine();
+        String rule = "{\n" +
+                "   \"status\": {\n" +
+                "       \"weatherText\": [{\"suffix\": \"雨\"}]\n" +


Would it be worth having a suffix with more than one multi-byte character?

jonessha · 2023-03-31T15:55:17Z

src/test/software/amazon/event/ruler/MachineTest.java

+                "  }\n" +
+                "}";
+        m.addRule("r1", rule);
+        List<String> matchRules = m.rulesForJSONEvent(eventStr);


What I have been doing recently is adding rulesForJSONEvent tests to ACMachineTest and adding rulesForEvent tests (otherwise identical) to MachineTest. I think that is the intended difference between the two test classes? So change this and add a version to ACMachineTest? Of course, we should probably think through a path forward to stop duplicating all tests.

I hadn't thought about it. Let me follow it as well. I filed #83 to keep track of duplicate effort.

baldawar added 4 commits March 30, 2023 20:19

Add a new SuffixParser to handle UTF-8 characters with more than one byte

ce88571

#73

remove commented code

17cbe02

remove unused imports

f59f367

minor version bump

c6b584a

jonessha approved these changes Mar 31, 2023

baldawar mentioned this pull request Mar 31, 2023

Reduce Duplication between ACMachineTest and MachineTest #83

Open

Addressing PR feedback

9b979f3

baldawar requested a review from jonessha March 31, 2023 20:36

jonessha approved these changes Mar 31, 2023

baldawar merged commit 7d588b1 into main Mar 31, 2023

baldawar deleted the suffix-cn-bug branch March 31, 2023 21:42
