performance problem #7

michaelfruth · 2020-07-10T13:25:09Z

Hello,
I noticed a performance problem as soon as the schema contains the following structure:

... "anyOf": [ {"enum": ["aa", "bb", "cc"]}, {"pattern": "pattern1"}, {"pattern": "pattern2"}, {"pattern": "pattern3"}, ... ] ...

The performance can be massively improved by preprocessing the schema: all enum values and patterns should be combined into a single pattern, as shown in the example below:

... "anyOf": [ {"pattern": "^aa$|^bb$|^cc$|pattern1|pattern2|pattern3"} ] ...
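
For illustration, a minimal Python sketch of such a preprocessing step. This is not the attached script; the helper name `combine_anyof_strings` is hypothetical, and the sketch assumes the `anyOf` branches of interest contain only an `enum` of strings or only a `pattern`. Enum values are merged only when at least one pattern-only branch exists, because a bare `pattern` also accepts non-string values while an `enum` does not.

```python
import re


def combine_anyof_strings(schema):
    """Merge string enums and patterns of an anyOf into one pattern.

    Hypothetical helper for illustration; not part of jsonsubschema.
    """
    enum_alternatives = []  # anchored, escaped enum values
    patterns = []           # patterns taken over unchanged
    rest = []               # every other kind of subschema
    for sub in schema.get("anyOf", []):
        keys = set(sub)
        if keys == {"enum"} and all(isinstance(v, str) for v in sub["enum"]):
            enum_alternatives.extend(
                "^" + re.escape(v) + "$" for v in sub["enum"]
            )
        elif keys == {"pattern"}:
            patterns.append(sub["pattern"])
        else:
            rest.append(sub)

    # Without a pattern-only branch the merge could change which
    # non-string values are accepted, so leave the schema alone.
    if not patterns:
        return schema

    merged = dict(schema)
    merged["anyOf"] = [{"pattern": "|".join(enum_alternatives + patterns)}] + rest
    return merged


schema = {
    "anyOf": [
        {"enum": ["aa", "bb", "cc"]},
        {"pattern": "pattern1"},
        {"pattern": "pattern2"},
        {"pattern": "pattern3"},
    ]
}
combined = combine_anyof_strings(schema)
print(combined)
```

For the schema above, this produces a single `anyOf` branch with the pattern `^aa$|^bb$|^cc$|pattern1|pattern2|pattern3`.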

Currently, you iteratively append the enum values and regex patterns to a single regex and, in every iteration, compute the intersection between the current pattern and ".*". This is very expensive and results in bad performance (for this specific kind of schema).

I added an example JSON file (anyOf.json) that shows the problem. On my machine, checking anyOf.json against itself (command jsonsubschema anyOf.json anyOf.json) takes about 50-60 seconds to produce the result (LHS :< RHS and RHS :< LHS). With preprocessing applied, it takes about 0.04 seconds. I also attached a Python script (smaller_anyOf.py) that contains the preprocessing. The script combines the string enum values and all patterns into a single pattern, as shown in the example above.

AnyOf.zip

When transforming the string enum values to a regex, special regex characters (e.g. ".", "-", ...) are escaped so that the resulting regex matches exactly the same strings as the enum.

... "enum": ["ab-c"] ...
will be transformed to
... "pattern": "^ab\\-c$" ...
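
A small sketch of why the escaping matters, using Python's standard `re.escape` (the attached script may escape differently; the variable names here are only illustrative). Without escaping, a "." in an enum value would match any character.

```python
import re

# Escaping an enum value before anchoring it.
value = "ab-c"
escaped = "^" + re.escape(value) + "$"
print(escaped)  # ^ab\-c$  (written as "^ab\\-c$" inside a JSON string)

# An unescaped "." is a wildcard, so the pattern accepts too much.
dotted = "a.c"
unescaped_pattern = "^" + dotted + "$"
escaped_pattern = "^" + re.escape(dotted) + "$"

matches_unescaped = re.search(unescaped_pattern, "abc") is not None
matches_escaped = re.search(escaped_pattern, "abc") is not None
matches_literal = re.search(escaped_pattern, "a.c") is not None
print(matches_unescaped, matches_escaped, matches_literal)  # True False True
```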

Be careful, this can currently lead to another problem, see #6.

Best Regards
Michael

andrewhabib · 2020-07-15T13:33:54Z

Hi Michael,

Thank you for this issue.

I am not sure your description of the root cause is correct.
Isn't your suggestion what is already being done here for string enums

jsonsubschema/jsonsubschema/_canonicalization.py

Lines 264 to 266 in 165f893

    
    if t == "string":
        pattern = "|".join(map(lambda x: "^"+str(x)+"$", enum))
        ret = {"type": "string", "pattern": pattern}

and here for anyOf with several patterns?

jsonsubschema/jsonsubschema/_checkers.py

Lines 349 to 351 in 165f893

    
    if s1_new_pattern and s2_new_pattern:
        ret["pattern"] = "^" + s1_new_pattern + \
            "$|^" + s2_new_pattern + "$"
