Splitting benchmark for numbers and complex arrays. #40

baldawar · 2022-09-10T17:55:28Z

Issue #, if available: #25

Description of changes:

While working on #25, I realized that the numeric matchers looked orders of magnitude slower not because of an inherent issue in the matcher-specific code, but because the benchmark was also stressing matching against complex arrays (I don't have a better term for this yet; think "JSON arrays within arrays within ...").

After splitting the numeric and complex-array matchers into separate benchmarks and introducing a second PARTIAL_COMBO benchmark, I was able to identify a regression that I would otherwise have introduced in ByteMachine.java for numeric ranges.

As part of this change, we're also modifying the citylots2.json.gz file to add a new firstCoordinates key that is used for numeric matching only. I tried other existing properties first, but none of them contain floating-point or large numbers, so the benchmark results did not match my expectations. There are no licensing concerns with modifying the dataset: the original citylots data was in the public domain, and the citylots2 dataset was created as part of this library.
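To illustrate the difference between matching inside "arrays within arrays" and matching on a dedicated numeric key, here is a minimal sketch of how a value like firstCoordinates could be derived from deeply nested coordinate arrays. The class name, the method, and the sample numbers are hypothetical; the actual shape of the firstCoordinates key and the script used to produce it are not shown in this PR.

```java
import java.util.List;

// Hypothetical helper, not part of the library: pulls the first innermost
// coordinate tuple out of arbitrarily nested arrays.
final class FirstCoordinatesSketch {

    // Follows the first element of each nested list until it reaches
    // a list whose first element is no longer a list, then copies that
    // list out as doubles.
    static double[] firstCoordinates(List<?> nested) {
        List<?> current = nested;
        while (!current.isEmpty() && current.get(0) instanceof List<?>) {
            current = (List<?>) current.get(0);
        }
        double[] out = new double[current.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = ((Number) current.get(i)).doubleValue();
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up polygon ring: one ring containing two (x, y, z) points.
        List<?> coordinates = List.of(
                List.of(
                        List.of(-122.42, 37.80, 0.0),
                        List.of(-122.43, 37.81, 0.0)));
        double[] first = firstCoordinates(coordinates);
        System.out.println(first[0] + ", " + first[1] + ", " + first[2]);
    }
}
```

With the first point lifted into its own key, a numeric benchmark can target plain numbers without also paying for nested-array traversal, which is the separation this change is after.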

Benchmark / Performance (for source code changes):

Reading citylots2
Read 213068 events
EXACT events/sec: 213495.0
WILDCARD events/sec: 176526.9
PREFIX events/sec: 201387.5
SUFFIX events/sec: 226909.5
EQUALS_IGNORE_CASE events/sec: 197102.7
NUMERIC events/sec: 123804.8
ANYTHING-BUT events/sec: 139625.2
COMPLEX_ARRAYS events/sec: 1973.4
PARTIAL_COMBO events/sec: 72079.8
COMBO events/sec: 1178.5
Reading citylots2
Read 213068 events
Finding Rules...
Lots: 10000
Lots: 20000
Lots: 30000
Lots: 40000
Lots: 50000
Lots: 60000
Lots: 70000
Lots: 80000
Lots: 90000
Lots: 100000
Lots: 110000
Lots: 120000
Lots: 130000
Lots: 140000
Lots: 150000
Lots: 160000
Lots: 170000
Lots: 180000
Lots: 190000
Lots: 200000
Lots: 210000
Lines: 213068, Msec: 12807
Events/sec: 16636.8
 Rules/sec: 116457.9
Before: 175.0
After: 1559.8
Per rule: -3462
Turning JSON into field-lists...
Finding Rules...
Lines: 213068, Msec: 2763
Events/sec: 77114.7
Before: 223.0
After: 1410.1
Per rule: -2967
Reading lines...
Finding Rules...
Lots: 10000
Lots: 20000
Lots: 30000
Lots: 40000
Lots: 50000
Lots: 60000
Lots: 70000
Lots: 80000
Lots: 90000
Lots: 100000
Lots: 110000
Lots: 120000
Lots: 130000
Lots: 140000
Lots: 150000
Lots: 160000
Lots: 170000
Lots: 180000
Lots: 190000
Lots: 200000
Lots: 210000
Lines: 213068, Msec: 1195
Events/sec: 178299.6
 Rules/sec: 649367076.2
Reading citylots2
Read 213068 events
Lots: 10000
Lots: 20000
Lots: 30000
Lots: 40000
Lots: 50000
Lots: 60000
Lots: 70000
Lots: 80000
Lots: 90000
Lots: 100000
Lots: 110000
Lots: 120000
Lots: 130000
Lots: 140000
Lots: 150000
Lots: 160000
Lots: 170000
Lots: 180000
Lots: 190000
Lots: 200000
Lots: 210000
Matched: 52527
Lines: 213068, Msec: 21613
Events/sec: 9858.3
Reading lines...
Finding Rules...
Lots: 10000
Lots: 20000
Lots: 30000
Lots: 40000
Lots: 50000
Lots: 60000
Lots: 70000
Lots: 80000
Lots: 90000
Lots: 100000
Lots: 110000
Lots: 120000
Lots: 130000
Lots: 140000
Lots: 150000
Lots: 160000
Lots: 170000
Lots: 180000
Lots: 190000
Lots: 200000
Lots: 210000
Lines: 213068, Msec: 11904
Events/sec: 17898.9
 Rules/sec: 125292.0

Process finished with exit code 0
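For anyone reading the numbers above, the Events/sec values appear to be the line count divided by the elapsed time, and the gap between NUMERIC and COMPLEX_ARRAYS can be expressed as a simple ratio. The sketch below assumes that formula; the class and method names are made up for illustration and are not taken from the benchmark code.

```java
import java.util.Locale;

// Hypothetical helper for reading the benchmark log; not part of the library.
final class ThroughputSketch {

    // Events per second, assuming throughput = lines / (msec / 1000).
    static double eventsPerSec(long lines, long msec) {
        return lines * 1000.0 / msec;
    }

    // How many times faster one benchmark is than another.
    static double slowdown(double fasterEventsPerSec, double slowerEventsPerSec) {
        return fasterEventsPerSec / slowerEventsPerSec;
    }

    public static void main(String[] args) {
        // Values copied from the log in this PR description.
        System.out.printf(Locale.ROOT, "%.1f%n", eventsPerSec(213068, 12807));
        System.out.printf(Locale.ROOT, "%.1f%n", eventsPerSec(213068, 2763));
        // NUMERIC vs. COMPLEX_ARRAYS throughput.
        System.out.printf(Locale.ROOT, "%.1f%n", slowdown(123804.8, 1973.4));
    }
}
```

Under that assumption, 213068 lines in 12807 ms works out to about 16636.8 events/sec, which agrees with the logged value, and NUMERIC is roughly 60 times faster than COMPLEX_ARRAYS once the two are measured separately.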

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

timbray

Since it's hard to see diffs on huge binary files, could you show a bit of diff output just so it's clear what you did to the sample data? Thanks in advance.

baldawar · 2022-09-12T21:19:50Z

Sure.

I got these by running:

head citylots2_old.json | jq | pbcopy

timbray reviewed Sep 12, 2022

timbray approved these changes Sep 12, 2022

baldawar merged commit fe60ed9 into main Sep 12, 2022

baldawar deleted the x_arr_bench branch September 12, 2022 22:23
