Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tests for integrity of merged contexts #26

Closed
Tracked by #11
cthoyt opened this issue Oct 26, 2022 · 1 comment · Fixed by #35
Closed
Tracked by #11

Add tests for integrity of merged contexts #26

cthoyt opened this issue Oct 26, 2022 · 1 comment · Fixed by #35
Labels

Comments

@cthoyt
Copy link
Collaborator

cthoyt commented Oct 26, 2022

#15 introduced data integrity tests, but did not apply them to the merged contexts, because there seems to be some issues with the logic that generates them.

The merged and merged.oak strings should be removed from the following (effectively leaving skip = set() to go along with any updates that it takes to generate merged contexts with bijectiveness guarantees.

skip = {"merged", "merged.oak"}

@cmungall
Copy link
Member

If we don't skip these, then

    def test_namespace_aliases(self):
        """Test that prefix aliases have a valid namespace."""

will fail, but this is expected.

this is because we have:

merged,sdo,https://schema.org/,namespace_alias

from prefix.cc -- which is junk because, the correct semantic URL uses http not https, so there is no direct deterministic join point with the correct canonical prefix or IRI:

merged,schema,http://schema.org/,canonical

so this is effectively orphaned

now, we could extend the test to do a transitive walk over aliases and use the following as a join point:

merged,schema,https://schema.org/,prefix_alias

but this would be overkill, and not necessary, as the fundamental assumption here doesn't hold - the merged context does have integrity, it provides the correct bijective mapping for schema.org, according to the precedence rules in which the merged context is built (with the junky prefix.cc having lowest priority).

The thing to remember here is that the only thing the API exposes is the canonical mappings, the rest is just there for debugging purposes. It could be argued that a cleaner design would be for the CSVs to only include the canonical mappings for that context, and to put additional anciliary metadata that arises from the ETL elsewhere - a separate issue could be made for this if it's a priority (this would only be a priority for making the library easier to understand, as it wouldn't affect the users of the library)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants