performance problem #7

Open
michaelfruth opened this issue Jul 10, 2020 · 1 comment

@michaelfruth

Hello,
I noticed a performance problem whenever the schema contains the following structure:

... "anyOf": [ {"enum": ["aa", "bb", "cc"]}, {"pattern": "pattern1"}, {"pattern": "pattern2"}, {"pattern": "pattern3"}, ... ] ...

Performance improves massively if the schema is preprocessed first: all enum values and patterns are combined into a single pattern, as shown in the example below:

... "anyOf": [ {"pattern": "^aa$|^bb$|^cc$|pattern1|pattern2|pattern3"} ] ...
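
For illustration, a minimal sketch of this kind of preprocessing in Python. The helper name merge_string_alternatives is hypothetical and not part of jsonsubschema. The sketch assumes the anyOf is already known to apply to strings only, and it only touches subschemas that consist of exactly one "enum" keyword (with string values) or exactly one "pattern" keyword; everything else is passed through unchanged.

```python
import re


def merge_string_alternatives(any_of):
    """Collapse string enums and patterns of an anyOf list into one pattern.

    Hypothetical helper for illustration only. Assumes the surrounding
    schema is restricted to strings, so replacing an enum by an anchored
    pattern does not change which instances are accepted.
    """
    parts = []  # regex alternatives collected so far
    rest = []   # subschemas this sketch does not know how to merge
    for sub in any_of:
        keys = set(sub)
        if keys == {"enum"} and all(isinstance(v, str) for v in sub["enum"]):
            # Each enum value becomes an anchored, escaped literal.
            parts.extend("^" + re.escape(v) + "$" for v in sub["enum"])
        elif keys == {"pattern"}:
            # Existing patterns are kept as-is; "|" has the lowest precedence.
            parts.append(sub["pattern"])
        else:
            rest.append(sub)
    if not parts:
        return rest
    return [{"pattern": "|".join(parts)}] + rest


any_of = [
    {"enum": ["aa", "bb", "cc"]},
    {"pattern": "pattern1"},
    {"pattern": "pattern2"},
    {"pattern": "pattern3"},
]
merged = merge_string_alternatives(any_of)
print(merged)
# [{'pattern': '^aa$|^bb$|^cc$|pattern1|pattern2|pattern3'}]
```

The merged list then replaces the original "anyOf" value before the schema is handed to the subschema check.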

Currently, the enum values and regex patterns are appended to a single regex one at a time, and on every iteration the intersection between the current pattern and ".*" is computed. This is very expensive and leads to poor performance for this specific kind of schema.

I attached an example JSON file (anyOf.json) that shows the problem. On my machine, checking the file against itself (command jsonsubschema anyOf.json anyOf.json) takes about 50-60 seconds to produce the result (LHS :< RHS and RHS :< LHS). With the preprocessing applied, it takes about 0.04 seconds. I also attached a Python script (smaller_anyOf.py) that performs the preprocessing: it combines the string enum values and all patterns into a single pattern, as shown in the example above.

AnyOf.zip

When the string enum values are transformed into a regex, special regex characters (e.g. ".", "-", ...) are escaped so that the regex matches exactly the same strings as the original enum.

... "enum": ["ab-c"] ...
will be transformed to
... "pattern": "^ab\\-c$" ...
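
A short sketch of why the escaping matters, using Python's standard re.escape. The function name enum_value_to_pattern is hypothetical and chosen only for this example; it is not taken from jsonsubschema or from the attached script.

```python
import json
import re


def enum_value_to_pattern(value):
    """Turn one string enum value into an anchored, escaped regex.

    Hypothetical helper for illustration only.
    """
    return "^" + re.escape(value) + "$"


# Serialized as JSON, the backslash is doubled, as in the example above.
print(json.dumps({"pattern": enum_value_to_pattern("ab-c")}))
# {"pattern": "^ab\\-c$"}

# Without escaping, "." is a wildcard, so the pattern accepts more strings
# than the enum value it was derived from.
naive = "^" + "a.c" + "$"
print(re.search(naive, "abc") is not None)                         # True
print(re.search(enum_value_to_pattern("a.c"), "abc") is not None)  # False
print(re.search(enum_value_to_pattern("a.c"), "a.c") is not None)  # True
```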

Be careful: this can currently lead to another problem, see #6.

Best Regards
Michael

@andrewhabib
Contributor

Hi Michael,

Thank you for this issue.

I am not sure your description of the root cause is correct.
Isn't your suggestion what is already being done here for string enums:

if t == "string":
    pattern = "|".join(map(lambda x: "^"+str(x)+"$", enum))
    ret = {"type": "string", "pattern": pattern}

and here for anyOf with several patterns:

if s1_new_pattern and s2_new_pattern:
    ret["pattern"] = "^" + s1_new_pattern + \
        "$|^" + s2_new_pattern + "$"
