LSIDs for taxonomic names live again #117

Open
rdmpage opened this issue Mar 10, 2021 · 42 comments

Comments

@rdmpage

rdmpage commented Mar 10, 2021

TL;DR You can resolve LSIDs for taxonomic names here: https://lsid.herokuapp.com

Sorry for gatecrashing, but this might be of interest. Given that there are millions of taxonomic names with LSIDs, most of which no longer resolve using the LSID protocol, it's always bothered me that we've let LSIDs die. So, I've made a website, Life Science Identifier (LSID) Resolver, that serves up the metadata for each LSID for names from three datasets (IPNI, Index Fungorum, and ION). These are all sources that used to support LSIDs, still display LSIDs, and in some cases still make the metadata available using the TDWG LSID vocabulary (if not via the LSID protocol).

The metadata is cached so the LSIDs resolve regardless of whether the source database supports the LSID protocol. Might be fun to compare the metadata from these LSIDs with what any new TNC comes up with. Note that there are some issues with the metadata, including mistakes and/or inconsistencies in the namespaces, and how the XML was constructed. I suspect these occurred because nobody ever actually used it.

I hope to add other LSIDs as time permits, and also depending on whether the database still provides metadata for LSIDs in TDWG LSID RDF.
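For readers who have not met LSIDs before, here is a minimal Python sketch of how one breaks down. It assumes the usual `urn:lsid:authority:namespace:object[:revision]` layout and uses the ZooBank LSID quoted later in this thread; the `parse_lsid` helper is hypothetical and is not part of the resolver.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split urn:lsid:authority:namespace:object[:revision] into its parts."""
    parts = text.strip().split(":")
    # The "urn:lsid" prefix is compared case-insensitively.
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {text!r}")
    if any(part == "" for part in parts[2:5]):
        raise ValueError(f"LSID has an empty component: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


example = parse_lsid("urn:lsid:zoobank.org:act:A1AE7A00-32C6-4510-A1D6-6DDDA9129D8B")
print(example.authority, example.namespace, example.object_id)
```

The authority and namespace are what a cache like this one would key on to decide which source a given LSID belongs to.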

@baskaufs

This is marvelous, @rdmpage ! Do you have any plans to add ZooBank to the list of supported sources? Perhaps there is already a resolver for it somewhere else, but if so, it isn't apparent from the ZooBank website.

@rdmpage
Author

rdmpage commented Mar 10, 2021

@baskaufs Glad you like it! By default I'm concentrating on sources that have RDF XML currently (or recently) available. I'm also biased towards integer identifiers (it makes storing the data in chunks a bit easier; the whole thing, data and all, is on GitHub: https://github.com/rdmpage/lsid-cache).

ZooBank stopped resolving LSIDs a long time ago :( If @deepreef restores that feature (even if just the RDF XML output) I could add ZooBank to the list. Alternatively, I'd have to make my own mapping between the JSON currently served by ZooBank and the TDWG LSID vocabulary, which is possible but slightly undermines the notion that I'm caching authoritative LSID metadata.

Personally I'm still baffled how our community decided that (a) the best identifier for a taxonomic name is an LSID and yet (b) made no attempt to persist either the identifiers or their associated metadata...

@rdmpage
Author

rdmpage commented Mar 10, 2021

FYI I've managed to find a copy of a ZooBank LSID record in XML:

<?xml version="1.0"?>
<rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:tto="http://rs.tdwg.org/ontology/voc/Specimen#" xmlns:tc="http://rs.tdwg.org/ontology/voc/TaxonConcept#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:tn="http://rs.tdwg.org/ontology/voc/TaxonName#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:tpub="http://rs.tdwg.org/ontology/voc/PublicationCitation#" xmlns:trank="http://rs.tdwg.org/ontology/voc/TaxonRank#" xmlns:tcom="http://rs.tdwg.org/ontology/voc/Common#">
  <tn:TaxonName rdf:about="urn:lsid:zoobank.org:act:A1AE7A00-32C6-4510-A1D6-6DDDA9129D8B">
    <dc:title>Ectenopsis mackerrasi</dc:title>
    <owl:versionInfo>1.1.2.1</owl:versionInfo>
    <tn:nameComplete>Ectenopsis mackerrasi</tn:nameComplete>
    <tn:genusPart>Ectenopsis</tn:genusPart>
    <tn:specificEpithet>mackerrasi</tn:specificEpithet>
    <tn:year>1996</tn:year>
    <tn:publication>
      <tpub:PublicationCitation>
        <tpub:publicationType rdf:resource="JournalArticle" />
        <tpub:parentPublication rdf:resource="urn:lsid:zoobank.org:pub:2B273330-A0BE-4BA7-8D41-5F49A5099DFC" />
        <dc:identifier>urn:lsid:zoobank.org:pub:1A71CBE3-0D39-471A-8F05-A5D87573591D</dc:identifier>
        <tpub:authorship>Burger, John F.</tpub:authorship>
        <tpub:year>1996</tpub:year>
        <tpub:title>A new species of Ectenopsis (Paranopsis) (Diptera: Tabanidae) from New Zealand and a key to species of the subgenus Paranopsis</tpub:title>
        <tpub:parentPublicationString>Proceedings of the Entomological Society of Washington</tpub:parentPublicationString>
        <tpub:volume>98</tpub:volume>
        <tpub:number>2</tpub:number>
        <tpub:pages>264-266</tpub:pages>
      </tpub:PublicationCitation>
    </tn:publication>
    <tn:rank rdf:resource="http://rs.tdwg.org/ontology/voc/TaxonRank#Species" />
    <tn:rankString>Species</tn:rankString>
    <tn:nomenclaturalCode rdf:resource="http://rs.tdwg.org/ontology/voc/TaxonName#ICZN" />
  </tn:TaxonName>
  <tpub:PublicationTypeTerm rdf:about="JournalArticle" />
  <tpub:PublicationCitation rdf:about="urn:lsid:zoobank.org:pub:2B273330-A0BE-4BA7-8D41-5F49A5099DFC" />
  <trank:TaxonRankTerm rdf:about="http://rs.tdwg.org/ontology/voc/TaxonRank#Species" />
  <tn:NomenclaturalCodeTerm rdf:about="http://rs.tdwg.org/ontology/voc/TaxonName#ICZN" />
</rdf:RDF>

The current ZooBank JSON API returns this for A1AE7A00-32C6-4510-A1D6-6DDDA9129D8B:

[
  {
    "tnuuuid": "a1ae7a00-32c6-4510-a1d6-6ddda9129d8b",
    "OriginalReferenceUUID": "1a71cbe3-0d39-471a-8f05-a5d87573591d",
    "protonymuuid": "a1ae7a00-32c6-4510-a1d6-6ddda9129d8b",
    "label": "mackerrasi Burger 1996",
    "value": "mackerrasi Burger 1996",
    "lsid": "urn:lsid:zoobank.org:act:A1AE7A00-32C6-4510-A1D6-6DDDA9129D8B",
    "parentname": "Ectenopsis",
    "namestring": "mackerrasi",
    "rankgroup": "Species",
    "usageauthors": "Burger",
    "taxonnamerankid": "70",
    "parentusageuuid": "de501acd-28db-42b9-9ed8-f1ff0926bda5",
    "cleanprotonym": "Ectenopsis mackerrasi Burger, 1996",
    "NomenclaturalCode": "ICZN"
  }
]

So, the mapping is less than straightforward :(
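To make the gap concrete, here is a rough Python sketch of such a mapping, using only the standard library. The field-to-term choices are my assumptions (for example that `parentname` is always the genus and that the year can be read off the end of `cleanprotonym`), not anything ZooBank documents.

```python
import re
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
TN = "http://rs.tdwg.org/ontology/voc/TaxonName#"

for prefix, uri in (("rdf", RDF), ("dc", DC), ("tn", TN)):
    ET.register_namespace(prefix, uri)


def zoobank_json_to_rdf(record: dict) -> ET.Element:
    """Build a partial tn:TaxonName from one record of the ZooBank JSON API."""
    root = ET.Element(f"{{{RDF}}}RDF")
    name = ET.SubElement(root, f"{{{TN}}}TaxonName", {f"{{{RDF}}}about": record["lsid"]})
    # Assumption: parentname is the genus and namestring the specific epithet.
    binomial = f'{record["parentname"]} {record["namestring"]}'
    ET.SubElement(name, f"{{{DC}}}title").text = binomial
    ET.SubElement(name, f"{{{TN}}}nameComplete").text = binomial
    ET.SubElement(name, f"{{{TN}}}genusPart").text = record["parentname"]
    ET.SubElement(name, f"{{{TN}}}specificEpithet").text = record["namestring"]
    # Assumption: the year is the trailing four digits of cleanprotonym.
    match = re.search(r"(\d{4})\s*$", record.get("cleanprotonym", ""))
    if match:
        ET.SubElement(name, f"{{{TN}}}year").text = match.group(1)
    ET.SubElement(name, f"{{{TN}}}rankString").text = record["rankgroup"]
    ET.SubElement(
        name,
        f"{{{TN}}}nomenclaturalCode",
        {f"{{{RDF}}}resource": TN + record["NomenclaturalCode"]},
    )
    return root


record = {
    "lsid": "urn:lsid:zoobank.org:act:A1AE7A00-32C6-4510-A1D6-6DDDA9129D8B",
    "parentname": "Ectenopsis",
    "namestring": "mackerrasi",
    "rankgroup": "Species",
    "cleanprotonym": "Ectenopsis mackerrasi Burger, 1996",
    "NomenclaturalCode": "ICZN",
}
print(ET.tostring(zoobank_json_to_rdf(record), encoding="unicode"))
```

Note what is missing: the JSON carries only `OriginalReferenceUUID` for the publication, so the title, journal, volume and pages in the XML record cannot be rebuilt from this response alone.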

@deepreef
Contributor

deepreef commented Mar 10, 2021

Thanks for tagging me on this! Like @rdmpage , I have been bothered by the state of LSIDs -- but from the opposite direction. I am bothered that we still mint them as though the community uses them (in the way they were intended to be used). The reason I haven't bothered to maintain the LSID resolver for ZooBank is that there was basically only one person who ever accessed them using the LSID resolution protocol (hint: it's the same person who started this thread). I'm more than happy to get it working again, if there is some desire to actually use that protocol for resolving content. @rdmpage : you used to have a WONDERFUL LSID resolution testing service -- is that still functional? If you can point me to that service (which I'll need to test the ZooBank LSID resolver service), I'll get the ZooBank LSID resolver working again.

The last time we discussed this, we ended up agreeing to abandon the LSID protocol, and instead create an RSS feed:
http://zoobank.org/rss/rss.xml
That continued to work up until last July, when we moved ZooBank to a new server and I forgot to update the login credentials for the service, so it stopped working. And the outcry from the user community in response to it no longer functioning can best be described as "deafening silence". But it was super easy to update the login credentials for the service just now, so it's now working again. I won't know for 24 hours whether it is correctly refreshing every 24 hours. But if it's not, I expect the outcry from the community to be the same as it was last July.

Snarky commentary aside, this is actually PERFECT timing. I had a long chat this morning with the COL ISG and one of the key topics was mobilizing ZooBank and GNUB to be more tightly integrated with COL/GBIF ChecklistBank. We had done a lot of work on that before, which stalled on November 18 2019 when our server system was hit by ransomware. After we solved that issue, we found ourselves in the middle of a global pandemic and re-adjusted priorities. As it happens, the cycle of priorities has looped around again such that ZooBank is back near the top.

The key next step is to get these two datasets live again:
https://www.gbif.org/dataset/c8227bb4-4143-443f-8cb2-51f9576aff14
https://www.gbif.org/dataset/34a96ebe-e51c-4222-9d08-5c2043c39dec

IPT is already up and running on our server for both, and the last step needed to flip the switch is to find a moment for @mdoering and me to connect and hack the config file and make them live again (perhaps next week). ZooBank was last refreshed on the day of the ransomware attack, and GNUB has been down since May 2015 (the outcry from that was the same as the RSS feed going down). Both should be up and live again by next week.

So, after getting the RSS feed and the two IPT datasets up and running again, this is my question to @rdmpage and @baskaufs and anyone else who is interested: What would you like next?

  • Re-establish the LSID resolver?
  • Harmonize the JSON in the ZooBank API with the LSID record?
  • Harmonize the RSS content feed with the API and/or LSID record?
  • Normalize them all to the DWC template published through the IPT?
  • Something else?
  • All of the above?

Tell me what you want, and I'll get 'er done. I don't even mind doing it for a client base of one (or two or three) -- I just want to make it easy to access and use the content. The reason the JSON API looks so clunky is that it never got past the "proof of concept" phase. If you give me the exact JSON output template you want, I can have that up and running for you. Of course, if I change the existing API, it will break any code that was built around the existing structure. But I have no doubt what the outcry to that will be (more deafening silence), so I'm ready to completely change the output template of all of these data access services (IPT, RSS, JSON API, and even LSID -- if people really want that) so they provide exactly the same content.

Let's do this.

@deepreef
Contributor

deepreef commented Mar 10, 2021

By the way, another reason this is perfect timing is that we're planning the next-generation ZooBank in the context of the 5th Edition of the ICZN Code. One of the items on that list of improvements was to (once and for all) abandon the LSID protocol for identifiers. We're still committed to maintaining the ones we've already minted into perpetuity (at least as identifiers; if not as an LSID resolution service); but after a certain date in the year 202X, the plan is to only issue the UUIDs. This approach was based on the assumption that LSIDs were dead in our community. But based on this thread, I'm wondering if news of the demise of LSIDs has been greatly exaggerated... Do we want to recommit to them? Or should we drive the wooden stake through its heart once and for all and embrace something else (my preference: UUIDs for everything, wrapped within the DOI dereferencing infrastructure)?

@deepreef
Contributor

deepreef commented Mar 10, 2021

One final thing:

Personally I'm still baffled how our community decided that (a) the best identifier for a taxonomic name is an LSID and yet (b) made no attempt to persist either the identifiers or their associated metadata...

You and me both! We all got together for two different workshops to discuss it. @rdmpage gave a great presentation on why LSIDs suck, DOIs suck and PURLs suck (I think those were the three -- I just remember that they all definitely sucked). At the end of those workshops, we all decided that LSIDs sucked the least, so we decided to go for it (largely because they were developed and backed by "IBM" -- so clearly were going to be around for the long haul -- yeah, right...). Lee Belbin convinced Paul Kirk, Nicky Nicholson and I to embrace LSIDs as a way to kick-start the community interest and understanding in them by showcasing them in IF, IPNI and ZooBank (respectively). COL wasn't far behind in adopting them. It all seemed so promising at the time.

Sigh

@rdmpage
Author

rdmpage commented Mar 10, 2021

@deepreef Hi Rich, from my perspective it would be great to have the LSID XML available, even if just via an API call rather than full blown LSID resolution. That way I could cache it and have essentially instantaneous access to the four main LSID providers. The LSID tester you mention is long dead, but some of its code lives on in http://www.lsid.info/resolver/ which could be used to help debug LSIDs.

@rdmpage
Author

rdmpage commented Mar 10, 2021

On the other things it seems to me inevitable that any serious attempt to issue identifiers for taxonomic names should use DOIs. I have never liked UUIDs, I think they are anti-user and send exactly the wrong message if you want to encourage adoption (identifiers are ugly, for computers only, and disposable), but I know @deepreef and I will never agree on this ;)

@baskaufs

@deepreef As far as I'm concerned LSIDs are dead and it does not seem like it is worth maintaining an infrastructure that mints any more of them. I'm mostly concerned about them as a sort of archival issue. In other words, if someone reads an old paper that used them, is there ANY way to recover the information they were supposed to provide? That is what @rdmpage's tool does, subject to actually having access to the underlying data.

@deepreef
Contributor

OK, thanks @rdmpage -- so it's not so much about the LSID resolution protocol as it is to get the content in XML format similar to the LSID template? That should be a lot easier, I imagine.

Above you gave two examples of output, the LSID template and the JSON template. Again, the latter was just a proof of concept that we never finished (mostly Rob Whitton wanting to get his head around how to implement JSON). After we built it, we put out the call for feedback on how to modify the structure to represent it in ways that people would find useful. Again, the response was deafening silence, so we never followed up with it.

So... let's assume that nobody is using the LSID resolution protocol, so we don't need to resurrect that. Let's also assume that nobody is using the ZooBank APIs, so I can re-develop those without breaking anyone's existing code (or I can keep a legacy version if people really want and use that crappy JSON template). And finally, let's assume I will commit to doing the necessary work to make it happen (like I said, the timing is good as I'm mucking around with the IPT now anyway).

If we assume all of those things, then it makes a lot of sense to me to harmonize at least the output content for IPT, XML and JSON. IPT is the only one following a real, active standard (DwC), so let's use that as the "core" content. DwC lacks a literature standard (something we've always wanted) so maybe I can just use the terms as they are in the LSID template. My thinking is that IPT will continue to do its thing (via http://ipt.zoobank.org:8080/ipt -- not quite functional yet, but it will be after I synch with @mdoering). Then I'll base two APIs off the same content, one that outputs in XML, and one that outputs in JSON.

Here's what I need help with:

  1. Someone needs to provide me with templates for both XML and JSON with a sample record showing me exactly what you want the output to look like. I have a rough idea, but I need the actual consumers of this stuff to tell me what they want -- rather than me trying to guess and hoping what I build meets your needs.

  2. Help me decide the endpoints to access the content. I'm a bit out of my depth on best practices here, but I think it's important to understand that ZooBank content is a subset of GNUB content. As discussed in another issue on GitHub, limiting the content to ZooBank is artificial, as links to parentUsageID don't work unless the genus and the species are both established within the same publication. I think a much better approach is to simply present all the content as GNUB, including both the ZooBank stuff and non-ZooBank stuff. So by my primitive thinking, the endpoints for these services would be something like:

http://gnub.org/a1ae7a00-32c6-4510-a1d6-6dddA9129d8b.xml
and
http://gnub.org/a1ae7a00-32c6-4510-a1d6-6dddA9129d8b.json
[UUIDs would be case-insensitive for the service, so it wouldn't matter if you used the uppercase or lowercase versions of the UUIDs].

Is that the best way to do it? Or would it be better to go with something like:
http://gnub.org/xml/a1ae7a00-32c6-4510-a1d6-6dddA9129d8b
and
http://gnub.org/json/a1ae7a00-32c6-4510-a1d6-6dddA9129d8b

Or maybe:

http://gnub.org/tnu/a1ae7a00-32c6-4510-a1d6-6dddA9129d8b.xml
and
http://gnub.org/tnu/a1ae7a00-32c6-4510-a1d6-6dddA9129d8b.json
[e.g., if we want to have different services for tnus vs. pubs vs. authors]

Maybe it doesn't matter (in which case I'll go with the first option, because it seems clean to me); or maybe it does matter (in which case someone needs to tell me what it should be).

I know it's bad GitHub etiquette to write such long posts, but you all know me well enough to know that I don't care about GitHub etiquette (I'm going to spell this stuff out explicitly no matter what, so get over it). But I'm serious about rebuilding this stuff right -- meaning in a way that is useful enough that the user base may eventually expand beyond two or three clients.

@deepreef
Contributor

@baskaufs : Yes! As I said, we're committed to maintaining the "identity" part of LSIDs into perpetuity (if not the resolution protocol part). That was one of the things I wanted to achieve through http://bioguid.org
Taking the example from @rdmpage :
http://bioguid.org/searchIdentifier?q=a1ae7a00-32c6-4510-a1d6-6dddA9129d8b&format=html

BTW, that is another @rdmpage - inspired service that almost got off the ground, then went into hibernation for a few years, but sometime within the next year or two I plan to bring it back to life again (with gusto!)

But that's a topic for another thread...

@rdmpage
Author

rdmpage commented Mar 10, 2021

@deepreef From the perspective of the LSID archive ideally XML like the example I showed above #117 (comment) (which was actually retrieved from ZooBank when its LSID service was live). The structure and vocabulary of that file closely match IPNI, Index Fungorum, and ION, which makes integrating all sources of data much easier.

If nothing else, if we get ZooBank added it means that the millions of LSIDs for names in the wild, including those which presumably have some nomenclatural significance would all be "resolvable".

So, would it be possible to serve XML like #117 (comment) for each taxon name? Maybe the original code for this still exists in the ZooBank source code? I have no preference for API interface, presumably something like http://zoobank.org/NomenclaturalActs.xml/6EA8BB2A-A57B-47C1-953E-042D8CD8E0E2 would be consistent with the current API?

@deepreef
Contributor

OK, I'll use that as a starting point. You said it "closely matches" IPNI, IF and ION. Can we bump that up to "exactly matches" to make it even easier? If I'm going to need to build it anyway, I might as well add any additional tweaks to improve it in any way you wish. I'll start with the template as you presented it above, but I assume it won't break anything if I add additional properties (as long as I don't change the existing ones) -- is that a safe assumption?

Framing it as ZooBank is artificially constrictive, and will lead to broken links from parentUsageID (assuming I add that property to the XML output). Why not apply it to the entire GNUB content? Here are some comparisons of numbers:

Class              ZooBank   GNUB
Protonym TNUs      279,245   385,967
Non-Protonym TNUs        0   831,532
References         121,742   156,341

All of the ZooBank content is included among the GNUB content. The only difference is that the ZooBank records have both a UUID and an LSID (and also a little bit of metadata, such as when the content was registered), whereas the GNUB records only have the UUIDs. If we could just add one property to indicate that a given record was registered in ZooBank, then it seems to me that the GNUB content would make the most sense to scope the service for.

I guess it wouldn't hurt to do both as separate services (one at zoobank.org, and one at gnub.org), but that seems pretty redundant when the gnub version already includes everything in the zoobank version.

Yes, the original code does already exist, so it won't be too hard to resurrect it exactly as is.

One last thing, though: we're talking about "resolving" the LSIDs, but your example uses the UUID. My assumption is that both will work, but my question is whether the UUID should be presented in the output as a separate identifier, or just left to the end-user to harvest from the LSID.

So, just to be clear: my current plan is to implement an interface that returns the exact same XML as you listed above, but directly (rather than through the LSID protocol). I'll make it so you get the same results for any of these:
http://zoobank.org/NomenclaturalActs.xml/6ea8bb2a-a57b-47c1-953e-042d8cd8e0e2
http://zoobank.org/NomenclaturalActs.xml/6EA8BB2A-A57B-47C1-953E-042D8CD8E0E2
http://zoobank.org/NomenclaturalActs.xml/urn:lsid:zoobank.org:act:6EA8BB2A-A57B-47C1-953E-042D8CD8E0E2
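One way to get the same result for all three forms is to normalise whatever arrives to a single canonical identifier before looking anything up. A minimal Python sketch, assuming the canonical form is the uppercase LSID; the function name is made up for illustration and this is not the ZooBank code.

```python
import uuid

LSID_PREFIX = "urn:lsid:zoobank.org:act:"


def normalize_act_identifier(identifier: str) -> str:
    """Return the canonical act LSID for a bare UUID or an LSID, ignoring case."""
    text = identifier.strip()
    if text.lower().startswith(LSID_PREFIX):
        text = text[len(LSID_PREFIX):]
    # uuid.UUID raises ValueError for anything that is not a well-formed UUID.
    return LSID_PREFIX + str(uuid.UUID(text)).upper()


forms = [
    "6ea8bb2a-a57b-47c1-953e-042d8cd8e0e2",
    "6EA8BB2A-A57B-47C1-953E-042D8CD8E0E2",
    "urn:lsid:zoobank.org:act:6EA8BB2A-A57B-47C1-953E-042D8CD8E0E2",
]
# All three request forms collapse to one lookup key.
print({normalize_act_identifier(form) for form in forms})
```

Doing the normalisation once at the edge also means a cache only ever stores one copy per record.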

Once I get that working, then we can move on to the next questions:

  1. Should I make the JSON API provide the same content and terms?
  2. Should we create similar services for all of GNUB content?
  3. What's the best interface system to use for the new (GNUB) APIs?

@rdmpage
Author

rdmpage commented Mar 11, 2021

@deepreef From my perspective I'd just like the ZooBank LSIDs (I think of my services as a "wayback machine" for LSIDs). So my preference is not to include additional links to GNUB, but obviously that's up to you. I'd only harvest ZooBank LSIDs as they are the only ones that are likely to be in the wild (e.g., in publications or referred to in external databases such as Wikidata).

Regarding XML, the original example I gave above could be tweaked as it has some issues. In particular, it doesn't link the publication to the name (except indirectly via a bnode). At the moment you have something like this:

<?xml version="1.0"?>
<rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:tto="http://rs.tdwg.org/ontology/voc/Specimen#" xmlns:tc="http://rs.tdwg.org/ontology/voc/TaxonConcept#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:tn="http://rs.tdwg.org/ontology/voc/TaxonName#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:tpub="http://rs.tdwg.org/ontology/voc/PublicationCitation#" xmlns:trank="http://rs.tdwg.org/ontology/voc/TaxonRank#" xmlns:tcom="http://rs.tdwg.org/ontology/voc/Common#">
  <tn:TaxonName rdf:about="urn:lsid:zoobank.org:act:A1AE7A00-32C6-4510-A1D6-6DDDA9129D8B">
    <dc:title>Ectenopsis mackerrasi</dc:title>
    <tn:publication>
      <tpub:PublicationCitation>
        <tpub:publicationType rdf:resource="JournalArticle" />
        <tpub:parentPublication rdf:resource="urn:lsid:zoobank.org:pub:2B273330-A0BE-4BA7-8D41-5F49A5099DFC" />
        <dc:identifier>urn:lsid:zoobank.org:pub:1A71CBE3-0D39-471A-8F05-A5D87573591D</dc:identifier>
        <tpub:title>A new species of Ectenopsis (Paranopsis) (Diptera: Tabanidae) from New Zealand and a key to species of the subgenus Paranopsis</tpub:title>
      </tpub:PublicationCitation>
    </tn:publication>
  </tn:TaxonName>
  <tpub:PublicationCitation rdf:about="urn:lsid:zoobank.org:pub:2B273330-A0BE-4BA7-8D41-5F49A5099DFC" />
</rdf:RDF>


whereas I think you want something like this:

<?xml version="1.0"?>
<rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:tto="http://rs.tdwg.org/ontology/voc/Specimen#" xmlns:tc="http://rs.tdwg.org/ontology/voc/TaxonConcept#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:tn="http://rs.tdwg.org/ontology/voc/TaxonName#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:tpub="http://rs.tdwg.org/ontology/voc/PublicationCitation#" xmlns:trank="http://rs.tdwg.org/ontology/voc/TaxonRank#" xmlns:tcom="http://rs.tdwg.org/ontology/voc/Common#">
  <tn:TaxonName rdf:about="urn:lsid:zoobank.org:act:A1AE7A00-32C6-4510-A1D6-6DDDA9129D8B">
    <dc:title>Ectenopsis mackerrasi</dc:title>
    <tn:publication>
      <rdf:Description rdf:about="urn:lsid:zoobank.org:pub:1A71CBE3-0D39-471A-8F05-A5D87573591D">
        <rdf:type rdf:resource="http://rs.tdwg.org/ontology/voc/PublicationCitation#PublicationCitation"/>
        <dc:identifier>urn:lsid:zoobank.org:pub:1A71CBE3-0D39-471A-8F05-A5D87573591D</dc:identifier>
        <tpub:title>A new species of Ectenopsis (Paranopsis) (Diptera: Tabanidae) from New Zealand and a key to species of the subgenus Paranopsis</tpub:title>
      </rdf:Description>
    </tn:publication>
  </tn:TaxonName>
</rdf:RDF>


The difference is that now we are explicitly making the link between the taxon name and publication LSIDs. RDF XML is horrible, the W3C validator is useful for figuring out if you're doing it right (it took me a few goes).
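If you want a quick check without the validator, something like the following Python sketch reports whether the node under tn:publication carries an identifier or is a blank node. It uses cut-down versions of the two records above and plain ElementTree rather than a real RDF parser, so treat it as a structural check only; the helper name is made up.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
TN = "http://rs.tdwg.org/ontology/voc/TaxonName#"
TPUB = "http://rs.tdwg.org/ontology/voc/PublicationCitation#"

HEADER = f'<rdf:RDF xmlns:rdf="{RDF}" xmlns:tn="{TN}" xmlns:tpub="{TPUB}">'
NAME = "urn:lsid:zoobank.org:act:A1AE7A00-32C6-4510-A1D6-6DDDA9129D8B"
PUB = "urn:lsid:zoobank.org:pub:1A71CBE3-0D39-471A-8F05-A5D87573591D"

# Cut-down version of the first record: the citation has no rdf:about.
BLANK_NODE = f"""{HEADER}
  <tn:TaxonName rdf:about="{NAME}">
    <tn:publication>
      <tpub:PublicationCitation/>
    </tn:publication>
  </tn:TaxonName>
</rdf:RDF>"""

# Cut-down version of the second record: the citation is named by its LSID.
LINKED = f"""{HEADER}
  <tn:TaxonName rdf:about="{NAME}">
    <tn:publication>
      <rdf:Description rdf:about="{PUB}"/>
    </tn:publication>
  </tn:TaxonName>
</rdf:RDF>"""


def publication_links(rdf_xml: str) -> list:
    """Return (name, publication) pairs; publication is None for a blank node."""
    root = ET.fromstring(rdf_xml)
    pairs = []
    for name in root.iter(f"{{{TN}}}TaxonName"):
        for prop in name.findall(f"{{{TN}}}publication"):
            for node in prop:
                pairs.append((name.get(f"{{{RDF}}}about"), node.get(f"{{{RDF}}}about")))
    return pairs


print(publication_links(BLANK_NODE))
print(publication_links(LINKED))
```

In the first case the publication comes back as `None`, which is the bnode problem; in the second the name LSID is paired directly with the publication LSID.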

The sooner we all move to JSON-LD and Bioschemas the better ;)

@cboelling
Member

two different workshops to discuss it. @rdmpage gave a great presentation on why LSIDs suck, DOIs suck and PURLs suck (I think those were the three -- I just remember that they all definitely sucked).

Are the presentations from these workshops or the presentation by @rdmpage still available somewhere, by any chance?

@rdmpage
Author

rdmpage commented Mar 11, 2021

@cboelling This was 2005-2006 as I recall, and whatever I said then is probably stuck somewhere on a ZIP file or a DVD! My recollection at the time was that we looked at DOIs, Handles, PURLs, and LSIDs.

The discussion was heavily driven by costs, so DOIs were seen as problematic as they were expensive. Ironically, DOIs were already in use at the time by NamesforLife (N4L), a company set up by George M. Garrity (who was at the meeting) to manage bacterial names and taxonomy. For example, doi:10.1601/nm.3093 is the name Escherichia coli, and doi:10.1601/tx.3093 is the corresponding taxon. Imagine if we'd gone down this route and had DOIs for every Eukaryote taxonomic name... oh well.

Handles are DOIs without the branding and with minimal costs, but you have to manage them using clunky software. PURLs just move managing persistence somewhere else using someone else's brand and worse tools. LSIDs had the advantage of being free, they keep your organisation brand, and by serving RDF they forced nomenclators to standardise on a data format (the TDWG LSID vocabulary). But their dependency on messing with DNS and using SOAP made them beyond the reach of many biodiversity developers.

As is typical in these discussions when the participants have no money, the free solution won. If you don't value the solution (i.e., won't spend money on it) then why would anyone else value it?

Personally I think we missed the absolutely key challenges, which are to:

  1. provide value for users (what do you get if you resolve the identifier?)
  2. provide services on top of the identifier (what else does the identifier give me?)
  3. engineer network effects to encourage identifier adoption (so that people feel left out if they don't use the identifier).

DOIs are the shiny example of doing this right, LSIDs, not so much. The challenge is to make sure you have 1-3, once you have that then the actual identifier technology doesn't matter so much (but of course, some have brand recognition, which is why DOIs are taking over the world).

@deepreef
Contributor

@rdmpage : EXCELLENT! This is exactly the sort of feedback I was hoping for.

OK, I decided sleep wasn't necessary tonight, so I went ahead and built version 1 of the service, incorporating your requested tweaks. I made a couple of other minor changes:

  • authorship is now included in the dc:title for the name.
  • I added dc:identifier for the name as well as the pub
  • I added your suggested publication structure, but for now I also left the original <tpub:PublicationCitation>...</tpub:PublicationCitation> content in place so that the parsed reference citation data are included. I can obviously remove these if they represent a problem -- which I think they do, based on the results from the W3C Validator.

In any case, have a look and let me know if this works for your needs:
http://zoobank.org/NomenclaturalActs.xml/A1AE7A00-32C6-4510-A1D6-6DDDA9129D8B
The heavy lifting is done, so modifications are super easy to make from here.

I have NOT tested this extensively! I tried to trap for ampersands and html tags and whatnot, but I might have missed some, so there may be errors. Please let me know if you find problematic records.
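For anyone wanting to test the trapping on their own side, here is a small Python sketch of the general idea: escape `&`, `<` and `>` in text content, then confirm a parser returns the original string. The title is invented for the test and `text_element` is a hypothetical helper, not the ZooBank implementation.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def text_element(tag: str, value: str) -> str:
    """Wrap value in <tag>...</tag>, escaping &, < and > so the result stays well-formed."""
    return f"<{tag}>{escape(value)}</{tag}>"


# Hypothetical title, invented to exercise the escaping.
raw_title = "Notes on <i>Aus bus</i> & related taxa"
fragment = text_element("title", raw_title)
print(fragment)

# A parser should hand back exactly the text that went in.
assert ET.fromstring(fragment).text == raw_title
```

If the round trip fails on a record, that record is a good candidate for the "problematic" list.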

Questions:

  1. I'm assuming that I should only include tags for actual content, correct? If there is no volume (for example), then I should not include an empty <tpub:volume></tpub:volume> pair of tags -- correct?

  2. Do I want to include additional dc:identifiers when I have them? E.g. include the uuid separately from the LSID? Include DOIs when I have them for the pubs? Include other identifiers when I have them for the other stuff?
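On question 1, if the answer turns out to be "omit them", the logic is simple enough to sketch in Python. This assumes the citation fields arrive as a dictionary keyed by tpub term name, which is my invention for the example.

```python
import xml.etree.ElementTree as ET

TPUB = "http://rs.tdwg.org/ontology/voc/PublicationCitation#"
ET.register_namespace("tpub", TPUB)


def citation_element(fields: dict) -> ET.Element:
    """Build a tpub:PublicationCitation, skipping fields with no content."""
    citation = ET.Element(f"{{{TPUB}}}PublicationCitation")
    for term, value in fields.items():
        if value is None or str(value).strip() == "":
            continue  # no empty <tpub:volume></tpub:volume> pairs
        ET.SubElement(citation, f"{{{TPUB}}}{term}").text = str(value)
    return citation


element = citation_element(
    {"year": "1996", "volume": "", "number": "2", "pages": "264-266"}
)
print(ET.tostring(element, encoding="unicode"))
```

Here the empty volume is dropped while year, number and pages are kept, so a consumer can treat "element absent" as "value unknown".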

On a final note: Within the next couple years (coinciding with Code-5), ZooBank will likely stop wrapping the uuids within the cumbersome and unnecessary LSID prefixes. From that point forward, the plain uuids will be in the wild (they already are -- they just happen to be prefixed by the LSID stuff).

OK, probably time for some sleep now.

@deepreef
Contributor

Imagine if we'd gone down this route and had DOIs for every Eukaryote taxonomic name... oh well.

It's not too late! I can always replace the urn:lsid:zoobank.org:[pub|act|author]: prefix with a 10.xxxxx/ prefix. All I need to do is get a xxxxx for ZooBank. Right?
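As a string operation the swap really is that small. A Python sketch, keeping `10.xxxxx` as the placeholder it is (no prefix has been assigned) and ignoring everything that actually registering DOIs would involve; the function name is hypothetical.

```python
def lsid_to_doi(lsid: str, doi_prefix: str) -> str:
    """Swap the urn:lsid:zoobank.org:<namespace>: prefix for a DOI prefix."""
    parts = lsid.split(":")
    if len(parts) != 5 or [p.lower() for p in parts[:3]] != ["urn", "lsid", "zoobank.org"]:
        raise ValueError(f"not a ZooBank LSID: {lsid!r}")
    if parts[3] not in ("pub", "act", "author"):
        raise ValueError(f"unexpected namespace: {parts[3]!r}")
    return f"{doi_prefix}/{parts[4]}"


# "10.xxxxx" is a placeholder, not a real DOI prefix.
print(lsid_to_doi("urn:lsid:zoobank.org:act:A1AE7A00-32C6-4510-A1D6-6DDDA9129D8B", "10.xxxxx"))
```

Note that the pub/act/author namespace is lost in the swap, which only works because the UUIDs are unique across all three.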

@rdmpage
Author

rdmpage commented Mar 11, 2021

@deepreef Cool, I will take a look. From my perspective, in a ZooBank LSID it's not the LSID prefix that is cumbersome... it's the UUID. I think if you (a) adopt DOIs for names and (b) drop the UUID and have a nice short user-friendly string (can be opaque) you would do wonders for the adoption of persistent identifiers for zoological names.

@mdoering

I agree. Even though it feels silly I have the same reservation for UUIDs. That's why we decided in COL to use short alphanumerical strings that try not to resemble real words and avoid easily confused char pairs: CatalogueOfLife/backend#491

They can also be converted to ints for a more memory or db friendly incarnation.
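To illustrate the idea (not the actual COL scheme, for which see the linked issue), here is a Python sketch of a short identifier that round-trips to an integer. The alphabet is my own assumption: no vowels, so no accidental words, and no 0 or 1, so no confusion with O, I or l.

```python
# Hypothetical alphabet: digits 2-9 plus consonants (29 symbols).
ALPHABET = "23456789BCDFGHJKLMNPQRSTVWXYZ"
BASE = len(ALPHABET)


def encode(number: int) -> str:
    """Render a non-negative integer as a short string over ALPHABET."""
    if number < 0:
        raise ValueError("identifier numbers must be non-negative")
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number:
        number, remainder = divmod(number, BASE)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def decode(text: str) -> int:
    """Inverse of encode; raises ValueError on characters outside ALPHABET."""
    number = 0
    for char in text.upper():
        number = number * BASE + ALPHABET.index(char)
    return number


short_id = encode(123456)
print(short_id, decode(short_id))
```

Four characters already cover 29**4 = 707,281 values and six cover roughly 595 million, so the strings stay short even for millions of names.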

@rdmpage
Author

rdmpage commented Mar 11, 2021

OK @deepreef I'm hoping that you're getting some sleep now ;)

Here is my version of what ZooBank XML should look like, with comments to explain why I've made the changes.

<?xml version="1.0"  encoding="UTF-8"?>
<rdf:RDF 
    xmlns:dc="http://purl.org/dc/elements/1.1/" 
    xmlns:owl="http://www.w3.org/2002/07/owl#" 
    xmlns:tto="http://rs.tdwg.org/ontology/voc/Specimen#" 
    xmlns:tc="http://rs.tdwg.org/ontology/voc/TaxonConcept#" 
    xmlns:dcterms="http://purl.org/dc/terms/" 
    xmlns:tn="http://rs.tdwg.org/ontology/voc/TaxonName#" 
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
    xmlns:tpub="http://rs.tdwg.org/ontology/voc/PublicationCitation#" 
    xmlns:trank="http://rs.tdwg.org/ontology/voc/TaxonRank#" 
    xmlns:tcom="http://rs.tdwg.org/ontology/voc/Common#">
    <tn:TaxonName rdf:about="urn:lsid:zoobank.org:act:A1AE7A00-32C6-4510-A1D6-6DDDA9129D8B">
        <dc:title>Ectenopsis mackerrasi Burger, 1996</dc:title>
        <dc:identifier>urn:lsid:zoobank.org:act:A1AE7A00-32C6-4510-A1D6-6DDDA9129D8B</dc:identifier>
        <owl:versionInfo>1.1.2.1</owl:versionInfo>
        <tn:nameComplete>Ectenopsis mackerrasi</tn:nameComplete>
        <tn:genusPart>Ectenopsis</tn:genusPart>
        <tn:specificEpithet>mackerrasi</tn:specificEpithet>
        <tn:year>1996</tn:year>
        <!-- there isn't any such term as tn:publication, even though Index Fungorum uses it, it should be tcom:publishedInCitation -->
        <!-- <tn:publication> -->
        <tcom:publishedInCitation>
            <rdf:Description rdf:about="urn:lsid:zoobank.org:pub:1A71CBE3-0D39-471A-8F05-A5D87573591D">
                <rdf:type rdf:resource="http://rs.tdwg.org/ontology/voc/PublicationCitation#PublicationCitation"/>
                <dc:identifier>urn:lsid:zoobank.org:pub:1A71CBE3-0D39-471A-8F05-A5D87573591D</dc:identifier>
                <tpub:title>A new species of Ectenopsis (Paranopsis) (Diptera: Tabanidae) from New Zealand and a key to species of the subgenus Paranopsis</tpub:title>
            <!-- the rdf:Description tag encloses everything about the publication, and already says it is of type tpub:PublicationCitation -->     
            <!-- </rdf:Description>
            <tpub:PublicationCitation> -->
                <!-- need to add namespace for publication type -->
                <tpub:publicationType rdf:resource="http://rs.tdwg.org/ontology/voc/PublicationCitation#Journal Article" />
                <tpub:parentPublication rdf:resource="urn:lsid:zoobank.org:pub:2B273330-A0BE-4BA7-8D41-5F49A5099DFC" />
                <tpub:authorship>Burger, John F.</tpub:authorship>
                <tpub:year>1996</tpub:year>
                <tpub:title>A new species of Ectenopsis (Paranopsis) (Diptera: Tabanidae) from New Zealand and a key to species of the subgenus Paranopsis</tpub:title>
                <tpub:parentPublicationString>Proceedings of the Entomological Society of Washington,  (Proc. Ent. Soc. Wash.)</tpub:parentPublicationString>
                <tpub:volume>98</tpub:volume>
                <tpub:number>2</tpub:number>
                <tpub:pages>264-266</tpub:pages>
            <!-- </tpub:PublicationCitation> -->
                </rdf:Description>
        <!-- </tn:publication> -->
        </tcom:publishedInCitation>
        <tn:rank rdf:resource="http://rs.tdwg.org/ontology/voc/TaxonRank#Species" />
        <tn:rankString>Species</tn:rankString>
        <tn:nomenclaturalCode rdf:resource="http://rs.tdwg.org/ontology/voc/TaxonName#ICZN" />
    </tn:TaxonName>
    <!-- These are all superfluous and are outside the scope of the document (i.e., they don't refer to the tn:TaxonName) -->
    <!--
    <tpub:PublicationTypeTerm rdf:about="Journal Article" />
    <tpub:PublicationCitation rdf:about="urn:lsid:zoobank.org:pub:2B273330-A0BE-4BA7-8D41-5F49A5099DFC" />
    <trank:TaxonRankTerm rdf:about="http://rs.tdwg.org/ontology/voc/TaxonRank#Species" />
    <tn:NomenclaturalCodeTerm rdf:about="http://rs.tdwg.org/ontology/voc/TaxonName#ICZN" />
    -->
</rdf:RDF>

I've also made it into a gist https://gist.github.com/rdmpage/ea25baf487a17af4a2184f0ca5bef98b and you can look at the revisions to see the steps I took to change it. The RDF now validates.

Biggest change was to tidy up the publication, and use the correct TDWG term tcom:publishedInCitation ("tn:publication" isn't a thing, even though Index Fungorum uses it). There was also some stuff at the end of the document that needed to go. I'd forgotten just how awful RDFXML is to work with.
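One quick way to sanity-check a record like this is to parse it and pull out a few fields. A minimal sketch using only Python's standard library, run against a trimmed copy of the RDF above (the function name and the keys of the returned dict are made up for illustration):

```python
import xml.etree.ElementTree as ET

# Prefix map copied from the rdf:RDF element in the template.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "tn": "http://rs.tdwg.org/ontology/voc/TaxonName#",
    "tcom": "http://rs.tdwg.org/ontology/voc/Common#",
}

# Trimmed copy of the template, keeping only the fields read below.
RDF_XML = """
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:tn="http://rs.tdwg.org/ontology/voc/TaxonName#"
    xmlns:tcom="http://rs.tdwg.org/ontology/voc/Common#">
  <tn:TaxonName rdf:about="urn:lsid:zoobank.org:act:A1AE7A00-32C6-4510-A1D6-6DDDA9129D8B">
    <dc:title>Ectenopsis mackerrasi Burger, 1996</dc:title>
    <tn:nameComplete>Ectenopsis mackerrasi</tn:nameComplete>
    <tcom:publishedInCitation>
      <rdf:Description rdf:about="urn:lsid:zoobank.org:pub:1A71CBE3-0D39-471A-8F05-A5D87573591D"/>
    </tcom:publishedInCitation>
  </tn:TaxonName>
</rdf:RDF>
"""


def summarize_taxon_name(rdf_xml: str) -> dict:
    """Return the LSID, title, nameComplete and publication LSID of the first tn:TaxonName."""
    root = ET.fromstring(rdf_xml.strip())
    name = root.find("tn:TaxonName", NS)
    if name is None:
        raise ValueError("no tn:TaxonName element found")
    about = "{%s}about" % NS["rdf"]  # attribute name in ElementTree's {namespace}local form
    citation = name.find("tcom:publishedInCitation/rdf:Description", NS)
    return {
        "lsid": name.get(about),
        "title": name.findtext("dc:title", namespaces=NS),
        "nameComplete": name.findtext("tn:nameComplete", namespaces=NS),
        "publication": citation.get(about) if citation is not None else None,
    }
```

This treats the document as plain XML rather than as RDF, so it only works while the nesting matches the template; an RDF library would be the more robust choice.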

@rdmpage
Author

rdmpage commented Mar 11, 2021

@deepreef Oops, forgot your other questions. If there's no info then I would simply not include the corresponding tag, so no volume, no tag.
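The "no info, no tag" rule could be sketched like this in Python. The helper names and the record layout are hypothetical; only the tpub element names come from the template above:

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
TPUB = "http://rs.tdwg.org/ontology/voc/PublicationCitation#"
ET.register_namespace("tpub", TPUB)


def add_if_present(parent: ET.Element, tag: str, value) -> None:
    """Append <tpub:tag>value</tpub:tag> only when value is non-empty."""
    if value is None or str(value).strip() == "":
        return  # no info, so no tag at all
    child = ET.SubElement(parent, "{%s}%s" % (TPUB, tag))
    child.text = str(value).strip()


def build_citation(record: dict) -> ET.Element:
    """Build an rdf:Description holding only the fields that have values."""
    citation = ET.Element("{%s}Description" % RDF)
    for tag in ("title", "year", "volume", "number", "pages"):
        add_if_present(citation, tag, record.get(tag))
    return citation
```

A record with an empty volume then serializes without any tpub:volume element, rather than with an empty one.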

Other identifiers, yes please, especially DOIs (elsewhere I'm harvesting ZooBank's DWCA to add DOIs and other identifiers, but it would be nice to have the ones ZooBank already knows about).

@cboelling
Member

The challenge is to make sure you have 1-3, once you have that then the actual identifier technology doesn't matter so much (but of course, some have brand recognition, which is why DOIs are taking over the world).

Thanks for this context @rdmpage. I agree and it's vital to keep this in mind in current decision-making on identifier systems. Of course, the governance structures to actually achieve persistence of data and services are an essential part of any relevant solution.

@rdmpage
Author

rdmpage commented Mar 11, 2021

@cboelling Yes, governance matters, but I would argue providing value to users should be the primary driver. If something isn't useful and doesn't help people do what they want to do, then all the governance in the world won't help.

@deepreef
Contributor

@rdmpage @mdoering : On the uuid thing; well... we're just going to have to agree to disagree. Especially in taxonomy, we already have the "identifier" that is human-friendly (it's the scientific name itself). From the perspective of humans, these identifiers have worked spectacularly well (otherwise they wouldn't still be in use a quarter-millennium after they were launched). Humans have no problem accommodating things like misspellings, alternate genus combinations, homonyms and the like. Computers, of course, have different needs in identifiers. They need to be globally unique and explicitly attached to the associated metadata, and above all, they should never change. Sure, integers work great for things like foreign keys and such -- which is why every database I create (including GNUB/ZooBank) uses integer fields for primary and foreign keys. I even have a system that unambiguously links each integer primary key to its corresponding UUID. But there's a reason it's a very (VERY) bad idea to use a value of a primary key field as your globally unique identifier. We could debate this indefinitely (as we have for years already before, and as we no doubt will for years to come); but I'm much more interested in focusing this discussion on this:

Yes, governance matters, but I would argue providing value to users should be the primary driver. If something isn't useful and doesn't help people do what they want to do, then all the governance in the world won't help.

YES! YES! YES! Let's make stuff that people actually find useful! That's exactly why I was up until 2am this morning tweaking the XML service -- because someone might find it useful. It's also why I want to get the IPT up and running again, and why I'm eager to create a JSON-LD service and leverage Bioschemas. I'm going to need a bit of hand-holding to get those up and running, asking lots of rookie-level questions like "should I include the tags if the content is empty" and such.

@rdmpage : THANK YOU -- that's EXACTLY what I needed: an explicit template to implement. I'll stop typing this post and start coding now. Back in a bit.

@deepreef
Contributor

deepreef commented Mar 11, 2021

Of course, the moment after I posted that last note, I realized I was late for my first (of many) Zoom meetings for the day, so coding got delayed. However, I just now had my first break, and went straight to the coding.

I followed your template:
http://zoobank.org/NomenclaturalActs.xml/A1AE7A00-32C6-4510-A1D6-6DDDA9129D8B
It seems to pass the W3C Validator, so thank you for correcting the errors.

I also added additional identifiers, when I have them. I can display the identifiers either with the dereferencing metadata, or without. In some cases, it's obvious that I should include the dereferencing metadata, for example:
Without: <dc:identifier>8831844</dc:identifier>
With: <dc:identifier>http://www.gbif.org/species/8831844</dc:identifier>

In the case of LSIDs, the dereferencing metadata is built into the identifier itself (i.e., the urn:lsid:zoobank.org:act: part)
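To make the "built into the identifier" point concrete: an LSID has the form urn:lsid:authority:namespace:object, with an optional revision at the end. A minimal Python sketch that splits one apart (the Lsid type and parse_lsid function are illustrative names, not part of any library):

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str           # e.g. zoobank.org
    namespace: str           # e.g. act, pub, author
    object_id: str           # for ZooBank, the UUID
    revision: Optional[str]  # optional trailing component


def parse_lsid(lsid: str) -> Lsid:
    """Split urn:lsid:authority:namespace:object[:revision] into its parts."""
    parts = lsid.strip().split(":")
    if len(parts) not in (5, 6) or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError("not an LSID: %r" % lsid)
    revision = parts[5] if len(parts) == 6 else None
    return Lsid(parts[2], parts[3], parts[4], revision)
```

For the ZooBank example, the authority and namespace are the dereferencing hints and object_id is the bare UUID.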

But what about DOIs? Should I include the dereferencing metadata, or not:
Without: <dc:identifier>10.3897/zookeys.641.11500</dc:identifier>
With: <dc:identifier>https://doi.org/10.3897/zookeys.641.11500</dc:identifier>

For now, I'm including it:
http://zoobank.org/NomenclaturalActs.xml/18c72d73-00c3-40e4-b27f-fa7748a1251e
But I can very easily remove it.

One Rookie question: Among the declared references in the opening RDF tag, some of the URLs have a hash at the end, and some don't. Is that a thing? Should I strip the ending hash characters? Add them to the ones that lack them? Leave them as is? Probably not important, but I'm just letting my OCD run wild on this.

Awaiting further instructions to do even more stuff that people will find useful....

@deepreef
Contributor

One other note: there are some data quality issues because users enter data inconsistently. For example, a DOI should be stored in the database as 10.3897/zookeys.641.11500; but people will sometimes enter it as "https://doi.org/10.3897/zookeys.641.11500" or "doi: 10.3897/zookeys.641.11500". It's on my to-do list to clean all these up in the master database, but for now there is a lot of noise in there, so you'll get things that look like these:
https://doi.org/https://doi.org/10.3897/zookeys.641.11500
or
https://doi.org/doi: 10.3897/zookeys.641.11500

If this is a problem, I'll bump the clean-up task up higher in the priority list.
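Until the database is cleaned up, a consumer could normalize these on the way in. A minimal Python sketch, assuming the only noise is repeated doi.org / dx.doi.org URL prefixes and "doi:" labels as in the examples above (the function names are made up):

```python
import re

# Matches one leading resolver URL or "doi:" label, case-insensitively.
_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def bare_doi(raw: str) -> str:
    """Strip any number of stacked prefixes and return the bare DOI."""
    value = raw.strip()
    while True:
        stripped = _DOI_PREFIX.sub("", value, count=1).strip()
        if stripped == value:
            break
        value = stripped
    if not value.startswith("10."):
        raise ValueError("does not look like a DOI: %r" % raw)
    return value


def doi_url(raw: str) -> str:
    """Return the DOI as an https://doi.org/ URL."""
    return "https://doi.org/" + bare_doi(raw)
```

Anything that does not reduce to a string starting with "10." is rejected rather than guessed at.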

@rdmpage
Author

rdmpage commented Mar 11, 2021

@deepreef Regarding the namespaces in the rdf:RDF tag, they can end in either a forward slash / or a hash #, depending on the choice made by whoever created that vocabulary. Given that this is the delimiter between the namespace name and the property you need to keep them, for example, http://purl.org/dc/elements/1.1/identifier (= dc:identifier) and http://www.w3.org/1999/02/22-rdf-syntax-ns#Description (= rdf:Description). See HashVsSlash for background.

@rdmpage
Author

rdmpage commented Mar 11, 2021

@deepreef Regarding identifiers there are a bunch of ways to include and represent DOIs (that there are so many ways to do things is yet another reason RDF is hard work).

If you are going to use dc:identifier then my suggestion is to store it as a URL with the prefix https://doi.org/, so <dc:identifier>https://doi.org/10.3897/zookeys.641.11500</dc:identifier>.

@deepreef
Contributor

Thanks, @rdmpage

Given that this is the delimiter between the namespace name and the property you need to keep them, for example, http://purl.org/dc/elements/1.1/identifier (= dc:identifier) and http://www.w3.org/1999/02/22-rdf-syntax-ns#Description (= rdf:Description).

I get that part (when used as a delimiter). I was talking about the terminal character in the URL; e.g.:
"http://rs.tdwg.org/ontology/voc/TaxonConcept#"
vs.
"http://rs.tdwg.org/ontology/voc/TaxonConcept"

I'll assume they're there for a reason.

RE: DOIs: OK, I'll leave them with the http://doi.org/ prefix (dereferencing metadata)

@jgerbracht
Contributor

jgerbracht commented Mar 12, 2021 via email

@rdmpage
Author

rdmpage commented Mar 12, 2021

@deepreef

I'll assume they're there for a reason.

Yes! A term such as tc:TaxonConcept is a QName where tc is the prefix for the namespace http://rs.tdwg.org/ontology/voc/TaxonConcept# and TaxonConcept is the local name. When tc:TaxonConcept is expanded tc is replaced by the namespace, so if you leave off the trailing / or # it won't be expanded correctly. It's a small thing but it will completely bugger things if left off (I've done this more than a couple of times when coding and wondered why my RDF didn't look right). A good way to detect errors like this is to convert the RDF into different formats (like triples and JSON-LD) and make sure it still makes sense. An even more powerful way is to aggregate the RDF into a triple store and start to query it... once you do that you discover that pretty much every LSID provider has broken their XML in one way or another. This is what happens if people produce stuff that no one uses :( This means you need to be careful in using existing LSIDs as a guide because they all get some things wrong.
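The expansion is plain string concatenation, which is why the trailing character matters. A small Python sketch (expand_qname is an illustrative helper, not a library call):

```python
# Prefix map as declared in the rdf:RDF element.
PREFIXES = {
    "tc": "http://rs.tdwg.org/ontology/voc/TaxonConcept#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Same map with the trailing # dropped, to show what goes wrong.
BROKEN_PREFIXES = {
    "tc": "http://rs.tdwg.org/ontology/voc/TaxonConcept",
}


def expand_qname(qname: str, prefixes: dict) -> str:
    """Replace the prefix of prefix:localName with its namespace."""
    prefix, local_name = qname.split(":", 1)
    return prefixes[prefix] + local_name


good = expand_qname("tc:TaxonConcept", PREFIXES)
bad = expand_qname("tc:TaxonConcept", BROKEN_PREFIXES)
print(good)  # ends in TaxonConcept#TaxonConcept
print(bad)   # ends in TaxonConceptTaxonConcept, a different (and meaningless) IRI
```

Nothing flags the second result as an error; it is simply a different IRI, so the mistake only shows up when the data is converted or queried.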

If easy. You might change the doi url to https

Rich, the convention is now to use https as @jgerbracht points out (as you know the Achilles heel of HTTP identifiers is that they keep changing). Most RDF now being minted that's relevant to biodiversity (e.g., JSON-LD in Zenodo) uses https DOIs. If and when we move to http://schema.org there are nice ways to encode identifiers independently of resolution. But for now, it would be great to use https://doi.org/.

@deepreef
Contributor

@jgerbracht :

If easy. You might change the doi url to https

Actually, no change needed. The service already represents DOIs with the https:// prefix -- I had just mistyped my post above. Sorry about that!

@rdmpage :

as you know the Achilles heel of HTTP identifiers is that they keep changing

You have no idea how hard it was for me to resist the urge to rant on this (actually, maybe you do know how hard it was for me...). With all due respect to TBL, conflating dereferencing metadata with identification is a really bad idea. 'nuff said.

Yes! A term such as tc:TaxonConcept is a QName where tc is the prefix for the namespace

Excellent! Thanks! Your explanation makes perfect sense.

This is what happens if people produce stuff that no one uses :(

Indeed! That was my experience setting up the LSID resolution protocol for ZooBank way-back-when. It quickly became clear that only one client ever noticed when it stopped working or was otherwise broken. It's hard to justify committing time and resources to developing services that only have one client. On the other hand, if that one client who directly accesses that service turns around and makes it useful to hundreds of other clients through awesome services like this, it makes it all worth it (hence my enthusiasm to fix this ZooBank XML service). Speaking of which, let me know if you'd like an svg of the ZB logo.

But the most intriguing thing to me is this:

If and when we move to http://schema.org there are nice ways to encode identifiers independently of resolution.

This is exactly the idea behind http://bioguid.org (as you know): decouple the role of identification from the resolution and dereferencing mechanisms. Maybe my next "2am project" will be to build a service on that website that produces JSON-LD following schema.org. Nobody is using it right now anyway, so with over a billion identifiers to play with, it might make a nice sandbox for fleshing out an identifier cross-referencing system using these next-gen approaches. I'd definitely need some hand-holding; but I'm game if you are.

@rdmpage
Author

rdmpage commented Mar 12, 2021

@deepreef

conflating dereferencing metadata with identification is a really bad idea. 'nuff said.

Can't help myself: arguably "conflating" the two is the genius move that makes it possible to build networks of easily discoverable, inter-connected data. I think it's one of those classic tradeoffs, and TBL picked the one that gave us the web. But we can argue this point endlessly.

In practice I think we can program defensively and include both types of identifiers in our metadata, for example this is how ORCID does it:

 {
      "@type" : "CreativeWork",
      "@id" : "https://doi.org/10.3417/2020586",
      "name" : "Two New Species of Begonia sect. Erminea (Begoniaceae) from the Masoala Peninsula, Madagascar",
      "identifier" : {
        "@type" : "PropertyValue",
        "propertyID" : "doi",
        "value" : "10.3417/2020586"
      }
}

https://doi.org/10.3417/2020586 gets you connectivity and the ability to "follow your nose", identifier.value gets you robustness from the vagaries of resolution.
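On the consuming side, "programming defensively" might look like the following Python sketch: prefer the explicit identifier.value, and fall back to the @id URL only if it is missing. The function name and the fallback order are assumptions for illustration:

```python
from typing import Optional

DOI_URL_PREFIX = "https://doi.org/"


def extract_doi(node: dict) -> Optional[str]:
    """Return the bare DOI from a schema.org-style JSON-LD node, or None."""
    identifiers = node.get("identifier", [])
    if isinstance(identifiers, dict):
        identifiers = [identifiers]  # a single PropertyValue instead of a list
    for item in identifiers:
        if isinstance(item, dict) and str(item.get("propertyID", "")).lower() == "doi":
            return item.get("value")
    # Fallback: recover the DOI from the resolvable @id.
    node_id = str(node.get("@id", ""))
    if node_id.startswith(DOI_URL_PREFIX):
        return node_id[len(DOI_URL_PREFIX):]
    return None


work = {
    "@type": "CreativeWork",
    "@id": "https://doi.org/10.3417/2020586",
    "identifier": {"@type": "PropertyValue", "propertyID": "doi", "value": "10.3417/2020586"},
}
print(extract_doi(work))
```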

Speaking of which, let me know if you'd like an svg of the ZB logo.

That would be great! I have one I made, but having an original from source would be better.

fleshing out an identifier cross-referencing system

My sense is that this is the role Wikidata is playing, and that it's our best bet as an "identity broker" to map between different identifiers for the "same" things.

@rdmpage
Author

rdmpage commented Mar 12, 2021

Just thought I'd post the https://bioschemas.org JSON-LD for Ectenopsis mackerrasi that is embedded in the GBIF page for this species if anyone wanted to compare it to the RDF above from ZooBank.

{
  "@context": [
    "https://schema.org/",
    {
      "dwc": "http://rs.tdwg.org/dwc/terms/",
      "dwc:vernacularName": {
        "@container": "@language"
      }
    }
  ],
  "@type": "Taxon",
  "additionalType": [
    "dwc:Taxon",
    "http://rs.tdwg.org/ontology/voc/TaxonConcept#TaxonConcept"
  ],
  "identifier": [
    {
      "@type": "PropertyValue",
      "name": "GBIF taxonKey",
      "propertyID": "http://www.wikidata.org/prop/direct/P846",
      "value": 1494907
    },
    {
      "@type": "PropertyValue",
      "name": "dwc:taxonID",
      "propertyID": "http://rs.tdwg.org/dwc/terms/taxonID",
      "value": 1494907
    }
  ],
  "name": "Ectenopsis mackerrasi Burger, 1996",
  "scientificName": {
    "@type": "TaxonName",
    "name": "Ectenopsis mackerrasi",
    "author": "Burger, 1996",
    "taxonRank": "SPECIES",
    "isBasedOn": {
      "@type": "ScholarlyArticle",
      "name": "Burger, John F. 1996. A new species of Ectenopsis (Paranopsis) (Diptera: Tabanidae) from New Zealand and a key to species of the subgenus Paranopsis. Proceedings of the Entomological Society of Washington 98(2): 264-266."
    }
  },
  "taxonRank": [
    "http://rs.gbif.org/vocabulary/gbif/rank/species",
    "species"
  ],
  "parentTaxon": {
    "@type": "Taxon",
    "name": "Ectenopsis Macquart, 1838",
    "scientificName": {
      "@type": "TaxonName",
      "name": "Ectenopsis",
      "author": "Macquart, 1838",
      "taxonRank": "GENUS",
      "isBasedOn": {
        "@type": "ScholarlyArticle",
        "name": "Mém. Soc. R. Sci. Lille, 1838 (2)"
      }
    },
    "identifier": [
      {
        "@type": "PropertyValue",
        "name": "GBIF taxonKey",
        "propertyID": "http://www.wikidata.org/prop/direct/P846",
        "value": 1494898
      },
      {
        "@type": "PropertyValue",
        "name": "dwc:taxonID",
        "propertyID": "http://rs.tdwg.org/dwc/terms/taxonID",
        "value": 1494898
      }
    ],
    "taxonRank": [
      "http://rs.gbif.org/vocabulary/gbif/rank/genus",
      "genus"
    ]
  }
}

@baskaufs

Just a quick question about something I haven't kept up on. When Crossref first announced that it would support content negotiation with DOIs, they said to use dx.doi.org (can't remember if it was http:// or https://). I dutifully started exposing DOIs using that subdomain. But now I mostly see doi.org, when DOIs are shown as an HTTP IRI. Is this a change and do those IRIs dereference to RDF as well? Or does nobody actually care about 303 redirects and 'doi.org' is just a cleaner identifier? I could experiment with dereferencing tests myself but I just wondered if you knew what the convention was now, @rdmpage ?

@rdmpage
Author

rdmpage commented Mar 12, 2021

@baskaufs https://doi.org is the current way to do things (see https://www.crossref.org/education/metadata/persistent-identifiers/doi-display-guidelines/ ). It supports content-negotiation as before, but the RDF sucks. I don't think it's been worked on recently, for example it uses the old style http://dx.doi.org/ prefix for the identifier for the article, not the new HTTPS style. I think most people ignore the RDF and use the CSL JSON version of the metadata which is much richer.
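For anyone wanting to try the content negotiation, here is a minimal Python sketch using only the standard library. It asks https://doi.org for CSL JSON via the Accept header; the function names are made up, and what comes back depends on the registration agency behind the DOI:

```python
import json
import urllib.request

# Media type for CSL JSON in DOI content negotiation.
CSL_JSON = "application/vnd.citationstyles.csl+json"


def build_doi_request(doi: str, accept: str = CSL_JSON) -> urllib.request.Request:
    """Build (but do not send) a content-negotiation request for a bare DOI."""
    return urllib.request.Request("https://doi.org/" + doi, headers={"Accept": accept})


def fetch_csl_json(doi: str) -> dict:
    """Send the request, follow redirects, and parse the JSON response."""
    with urllib.request.urlopen(build_doi_request(doi), timeout=30) as response:
        return json.load(response)
```

Calling fetch_csl_json("10.3897/zookeys.641.11500") needs network access; passing accept="application/rdf+xml" to build_doi_request would request the RDF version instead.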

@baskaufs

Thanks, good to know. I can see I'm way behind the times. I haven't paid that much attention to the content-negotiation because as you say, they don't really give you much useful info, particularly compared with what you can just get from the Crossref API.

@deepreef
Contributor

@rdmpage :

Can't help myself:

Fair points, all.

But we can argue this point endlessly.

Indeed; but we should save that for another context. I'll cease and desist on the opportunities for snarky commentary on that subject. And BTW, when I said "all due respect to TBL", I meant it genuinely. His role in history is, and forever will be, monumental. It's just that one thing I have a quibble with.

this is how ORCID does it

Excellent! But I think they need to add one more property; something like:

{
     "@type" : "CreativeWork",
     "@id" : "https://doi.org/10.3417/2020586",
     "name" : "Two New Species of Begonia sect. Erminea (Begoniaceae) from the Masoala Peninsula, Madagascar",
     "identifier" : {
       "@type" : "PropertyValue",
       "propertyID" : "doi",
       "value" : "10.3417/2020586",
       "dereferencingService" : "https://doi.org/"
     }
} 

I'd actually nest an array of properties (things like "dereferencingProtocol" : "https", etc.) within a dereferencingServices object; but the above would be a start.

[svg for ZB logo] That would be great! I have one I made, but having an original from source would be better.

Cool! I'll get on that later today. I'll make two; one for this:
http://zoobank.org/images/ZooBankBanner.jpg
and one for this:
http://zoobank.org/images/favicon.ico
I notice that the others you have are squarish; hence the second option.

My sense is that this is the role Wikidata is playing, and that it's our best bet as an "identity broker" to map between different identifiers for the "same" things.

That was the original intent of bioguid.org (thanks again, btw, for letting me hijack that name): to map "sameAs" relationships among identifiers in the biodiversity space. I eventually expanded it to accommodate other predicates -- mostly so I could capture links between TNUs and BHL pages. But the main thing it does, which I haven't seen anyone else do (at least not well), is to parse out dereferencing service metadata from the actual identifiers (see above). Any given identifier might have more than one dereferencing service. For example:
http://zoobank.org/18c72d73-00c3-40e4-b27f-fa7748a1251e
http://zoobank.org/NomenclaturalActs/18c72d73-00c3-40e4-b27f-fa7748a1251e
http://zoobank.org/NomenclaturalActs.xml/18c72d73-00c3-40e4-b27f-fa7748a1251e
http://zoobank.org/NomenclaturalActs.json/18c72d73-00c3-40e4-b27f-fa7748a1251e
https://www.google.com/search?q=18c72d73-00c3-40e4-b27f-fa7748a1251e
https://scholar.google.com/scholar?q=18c72d73-00c3-40e4-b27f-fa7748a1251e
[etc.]

Does Wikidata have a mechanism for doing that? Or would it treat all of these as discrete "identifiers", without parsing out the part that represents dereferencing metadata from the part that represents identity?
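The parsing step described here can be sketched in a few lines of Python for the UUID case. This is not how bioguid.org is implemented; it is only an assumed illustration that treats everything before the UUID as the dereferencing service and ignores anything after it:

```python
import re

# Canonical 8-4-4-4-12 hexadecimal UUID, in either case.
UUID_PATTERN = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.IGNORECASE
)


def split_identifier(url: str) -> tuple:
    """Return (dereferencing service prefix, lowercase UUID) for a URL."""
    match = UUID_PATTERN.search(url)
    if match is None:
        raise ValueError("no UUID found in %r" % url)
    return url[: match.start()], match.group(0).lower()


def group_by_identifier(urls: list) -> dict:
    """Map each UUID to the list of services that dereference it."""
    services = {}
    for url in urls:
        service, identifier = split_identifier(url)
        services.setdefault(identifier, []).append(service)
    return services
```

Run over the six URLs above, this yields a single identifier with six service prefixes, which is the distinction the question to Wikidata is about.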

Just thought I'd post the https://bioschemas.org JSON-LD for Ectenopsis mackerrasi that is embedded in the GBIF page for this species if anyone wanted to compare it to the RDF above from ZooBank.

Nice! You just gave me my 2am project for the day (technically tomorrow). Bonus: it's a weekend, so no early-morning Zoom calls tomorrow morning to show up for groggy and disheveled!

Question: Should I simply update the existing service at http://zoobank.org/NomenclaturalActs.json/ (which will no-doubt prompt howls from the thousands of people who currently use that service every day...not...); or would it be better to start anew with something like:
http://zoobank.org/NomenclaturalActs.jsonld/
?

Ask, and ye shall receive.

@rdmpage
Author

rdmpage commented May 3, 2021

My how time flies :O Ok @deepreef I've added support for ZooBank "act" LSIDs. I've had to create a second resolver https://lsid-two.herokuapp.com, which currently has WoRMS and ZooBank LSIDs. The reason for this is that I store all the LSID metadata as disk files (no databases) and I'm limited by Heroku to GitHub repositories that are < 500 MB in size. The "no database" thing is partly to avoid dependencies on other servers, and partly because the whole idea is to have a backup of the data. This achieves both goals of a backup and a service.

@rdmpage
Author

rdmpage commented May 3, 2021

@deepreef I grabbed all the LSIDs I could find using a recent DwCA file from GBIF as the source for the list. Oh how I miss simple integer ids that I can count up when I'm fetching data ;)

Anyway, one thing it would be nice to fix about ZooBank LSIDs is the nameComplete field which according to the spec. should be:

The complete uninomial, binomial or trinomial name without any authority or year components. Every TaxonName should have a DublinCore:title property that contains the complete name string including authors and year (where appropriate).

But which in ZooBank has the authorship information as well. This breaks my other reason for doing all this LSID stuff, which is to build another "no database" search engine for taxonomic names which relies on nameComplete being a simple string with just the name. Any thoughts on whether you could extract just the name and have that in nameComplete?
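Since the records already carry the name parts separately, one way to get an authorship-free nameComplete is to assemble it from those parts instead of trimming the title. A Python sketch under that assumption (the function and its third parameter are hypothetical; the record above only shows tn:genusPart and tn:specificEpithet):

```python
from typing import Optional


def name_complete(
    genus: str,
    specific_epithet: Optional[str] = None,
    infraspecific_epithet: Optional[str] = None,
) -> str:
    """Join the name parts that are present into a uninomial, binomial or trinomial."""
    parts = [genus, specific_epithet, infraspecific_epithet]
    return " ".join(part.strip() for part in parts if part and part.strip())


def title_is_consistent(title: str, complete: str) -> bool:
    """Check that dc:title is the bare name followed by the authorship, if any."""
    return title == complete or title.startswith(complete + " ")
```

For the example record, the parts "Ectenopsis" and "mackerrasi" give the bare name, and the check confirms that the dc:title "Ectenopsis mackerrasi Burger, 1996" begins with it. Names with a subgenus or other infrageneric part are not handled here.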

@deepreef
Contributor

deepreef commented May 3, 2021

Thanks, @rdmpage ! Alas, I'm deep down other rabbit holes at the moment, so it may be a while before I can get back to this.

Anyway, one thing it would be nice to fix about ZooBank LSIDs is the nameComplete

OK, that should be easy enough. I think I just used the same Code/Logic as for dwc:scientificName, and forgot that nameComplete is supposed to be sans-authorship.

If you really like integer ids, you can always render the 128-bit identifiers in decimal, instead of in the canonical UUID form (e.g., instead of representing those 128 bits as something like 'e593838a-f7a9-5ef2-a04a-2bfc7c90771f', they could be represented as '305159146678742414161168577211252373279'; see here). But I suspect that's not really what you meant... ;-)
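In Python the conversion between the two renderings is a one-liner each way with the standard uuid module (the two helper names below are just for illustration):

```python
import uuid


def uuid_to_decimal(value: str) -> str:
    """Render the 128 bits of a UUID as a decimal string."""
    return str(uuid.UUID(value).int)


def decimal_to_uuid(value: str) -> str:
    """Render a decimal string back in the canonical lowercase UUID form."""
    return str(uuid.UUID(int=int(value)))


decimal = uuid_to_decimal("e593838a-f7a9-5ef2-a04a-2bfc7c90771f")
print(decimal)
print(decimal_to_uuid(decimal))
```

Both forms carry the same 128 bits, so the round trip is lossless; the decimal form is up to 39 digits long, which is too large for a 64-bit integer column.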
