What makes code a research object? #18
Hi. So, in terms of code becoming a research object: the DOI guarantees persistence of the metadata, whereas a private-sector interest makes no such guarantee beyond "we won't annoy our main earnings source too much".
Thank you for these excellent points. I was not aware of the solvency guarantee condition for DOI providers; that is certainly a tip in favour of DOIs. The idea of guaranteed metadata persistence is clearly valuable at the systems-interoperability level, but why this helps science should, I think, be made explicit. It still seems to me that it's ultimately the services that build on that layer, rather than the layer itself, that provide the visible value. And DataCite doesn't seem to have many of those services, at least not researcher-facing ones. Perhaps this isn't the place to ask, but would it be possible to build an open citation-tracking service on top of both Crossref and DataCite? Perhaps this is something we at ContentMine could work on. I think it would be worthwhile to explain these and any other reasons why a DOI is useful on the site. I'll wait a while to see if anyone else contributes ideas, then I'll make a PR adding such an explanation.
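(To make the cross-registry idea concrete, here is a minimal sketch of pulling metadata for one DOI from both registries. It assumes the public Crossref and DataCite REST APIs at api.crossref.org and api.datacite.org; the DOI shown is a placeholder, not from this thread.)

```python
# Sketch: fetch metadata for a DOI from Crossref and from DataCite,
# the two layers an open citation-tracking service would build on.
import requests

def crossref_metadata(doi: str) -> dict:
    """Metadata for a DOI registered with Crossref."""
    r = requests.get(f"https://api.crossref.org/works/{doi}")
    r.raise_for_status()
    return r.json()["message"]

def datacite_metadata(doi: str) -> dict:
    """Metadata for a DOI registered with DataCite."""
    r = requests.get(f"https://api.datacite.org/dois/{doi}")
    r.raise_for_status()
    return r.json()["data"]["attributes"]

# Placeholder DOI for illustration only:
print(datacite_metadata("10.5281/zenodo.12345").get("titles"))
```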
@blahah The Making Data Count project has been building just such an open citation index for the last year. We recently announced that our prototype citation-tracking index has moved to DataCite servers, and we'll be adding additional reports there soon. This work is based on the Lagotto tool by @mfenner, and on data-usage statistics collation and reporting services that we developed at @DataONEorg. We learned several lessons in building this prototype, some of which are being presented by J. Lin at this week's DataCite PID workshop. These include issues with the effectiveness of current citation practices with respect to complex data objects, data versioning, and support for multiple persistent identifier standards (DOIs are not magic, nor are they ubiquitous, but they do have some desirable features). The hard part of a citation index is making it open. We index a variety of open sources for citations (e.g., PubMed Central), but of course there are many closed repositories of articles to which we do not have text-mining rights. An ongoing struggle with academic publishing models...
@mbjones this is a very interesting project! Thanks for the information; I have been looking at Lagotto but did not know about Making Data Count. I believe there is a solution to the text-mining rights problem. In the UK we now have a copyright exception for non-commercial text and data mining. In addition, a citation is un-copyrightable, being a fact (specifically an entity relation: article X cites article Y). Thus, provided a collaborator can access the works to read, they have the right to mine them in the UK for non-commercial purposes, and any contractual restrictions that try to prevent it are not enforceable. Subsequently releasing those facts is protected. I'd be happy to explore how I can help with this - [email protected]
DataCite DOIs also carry linkage information. The second point is that CrossRef simply doesn't allow you to register DOIs for datasets or software. If it did, data repositories like Zenodo and figshare could just mint CrossRef DOIs.
It's not GitHub longevity I'm worried about; it's the individual researchers who worry me the most. Researchers can rewrite their history so a commit is no longer available, they can move their repository (GitHub will redirect you if you did it correctly), or, worse, they can simply delete the repository. A DOI offers protection against this, because those who mint DOIs (either via CrossRef or DataCite) agree to adhere to certain principles, e.g. not changing the underlying files. Also, as mentioned before, even if the files are deleted, the metadata persists. On Zenodo we've archived on the order of 2000 repositories and 4000 individual releases so far, and after a year roughly 40 repositories had been deleted (not just moved).
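(As a rough illustration of the kind of link-rot check behind a number like that, here is a sketch that classifies a repository as live, moved, or deleted. It assumes the public GitHub REST API; the owner and repository names are hypothetical, and this is not Zenodo's actual tooling.)

```python
# Sketch: classify an archived GitHub repository via the public REST API.
import requests

def repo_status(owner: str, repo: str) -> str:
    url = f"https://api.github.com/repos/{owner}/{repo}"
    r = requests.get(url, allow_redirects=False)
    if r.status_code == 200:
        return "live"
    if r.status_code == 301:   # GitHub redirects renamed/moved repositories
        return "moved"
    if r.status_code == 404:   # deleted (or made private)
        return "deleted"
    return f"unknown ({r.status_code})"

# Hypothetical example:
print(repo_status("someuser", "some-archived-repo"))
```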
Long-term archiving is pretty difficult no matter whether you're a company, an academic institution, or a state library, and the way GitHub is built now, it's not a place for long-term archiving (and I doubt they have any interest in it either).
I think the important point here is that archiving != identification. Long-term archiving is difficult: say you archive a git repository; then you'll likely need git version X to be able to read and understand that archive 10-20 years from now. For citations, I'd venture to say that any persistent globally unique identifier is better than none at all, and DOIs are just one option. See more in https://dx.doi.org/10.7717/peerj-cs.1 Also you might be interested in http://dliservice.research-infrastructures.eu/#/
Where are these "DOI guarantees" documented? Section 3.2 of the handbook (http://www.doi.org/doi_handbook/3_Resolution.html) makes it clear that the entity in question "may or may not be an Internet-accessible file". So how is that any different from a git checksum?
@drj11: Depends on the DOI registration agency (http://www.doi.org/registration_agencies.html). The difference between a git checksum and a DOI is that you can resolve a DOI, i.e. stick it in a browser and get redirected to a landing page (which can tell you how to get access to the object if it's not Internet-accessible). And in case the digital object is lost, CrossRef/DataCite will still have the metadata of the object.
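(A minimal sketch of what that resolution looks like in HTTP terms, using the doi.org resolver and its content negotiation. The DOI is the PeerJ CS article linked above; the exact status code and landing page are whatever the resolver currently returns.)

```python
# Sketch: resolve a DOI, then fetch its registry metadata via doi.org.
import requests

doi = "10.7717/peerj-cs.1"

# 1. Resolution: doi.org answers with a redirect to the landing page.
r = requests.get(f"https://doi.org/{doi}", allow_redirects=False)
print(r.status_code, r.headers.get("Location"))

# 2. Metadata persistence: the registration agency serves bibliographic
#    metadata via content negotiation, independent of the object itself.
m = requests.get(f"https://doi.org/{doi}",
                 headers={"Accept": "application/vnd.citationstyles.csl+json"})
print(m.json().get("title"))
```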
From my perspective, any kind of object, including software, can qualify as a research object. It can obtain that status by claiming an official place in the scholarly record. From the perspective of the ongoing discussions, this entails two core requirements:
For example, in the paper-based journal system:
Moving on to code as a research object in the GitHub/Zenodo scenario:
Based on my work with web archiving, Memento and Robust Links, I can sketch another approach that meets the requirements:
Link decoration results in the following behavior, as long as the GitHub code repository remains available:
Link decoration, combined with Memento infrastructure, results in the following behavior, if the GitHub code repository becomes unavailable:
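(As an aside on the mechanics: under Memento, RFC 7089, a client negotiates in time by sending an Accept-Datetime header to a TimeGate. A rough sketch follows, assuming the Internet Archive's TimeGate at web.archive.org; the repository URL and date are illustrative, not from the original post.)

```python
# Sketch of Memento datetime negotiation against a public TimeGate.
import requests

target = "https://github.com/someuser/some-repo"   # hypothetical repository
r = requests.get(f"http://web.archive.org/web/{target}",
                 headers={"Accept-Datetime": "Thu, 01 Jan 2015 00:00:00 GMT"})

# The TimeGate redirects to the memento closest to the requested datetime;
# the snapshot's own timestamp is reported in the Memento-Datetime header.
print(r.url)
print(r.headers.get("Memento-Datetime"))
```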
Notes:
Interesting discussion! I was glad to see it; thank you for starting it!

I would say code is a research object that should be cited when it enables research. Clearly there's a continuum for software: from small scripts that don't manipulate data/output but perhaps simply move it, and general tools such as Excel that are not research-specific, which one probably wouldn't consider research objects, to scientist-written code that uniquely enables research results, which clearly is (or should be) considered a research object.

I agree with you that having a DOI is not necessary for code to be a research object, and indeed, as the editor of the Astrophysics Source Code Library (ASCL), I could not possibly defend DOI necessity! Out of our over 1200 site links, only two are DOI links. We link to the GitHub repos for both of those codes in addition to their DOI pages. Though an archived version is useful for identifying the specific version of a code that enabled specific research reported in a specific paper, anyone interested in using a code for his/her own research would likely want to get the software from its development site rather than an archive site, as code may undergo additional development (or bug fixes) after archiving.

The ASCL is citable and is indexed by the main indexing service for astrophysics, the Astrophysics Data System (ADS). You might be interested in this (incomplete) Google doc about astronomy software citation; it includes a short section called Citable works that pulls in discussion from elsewhere on what makes something citable and the difference between attribution and citation.

The ASCL started out in 1999 as a repository, and though it can and does store codes, we have found that most authors prefer to keep their software close to them rather than on a site they don't control. I suspect this is why there is little uptake of code archiving in astronomy. We do recommend software authors take steps to preserve their codes, and in fact the founder of the ASCL will be presenting at this January's AAS meeting on what to do with a dead code.
This is an excellent discussion. Inspired by @owlice's recent post, I have these thoughts on when code is a research object: I tend to think of research objects as a synonym for research 'outputs', i.e. anything that the researcher has had to work to produce. This is perhaps restrictive, but it stems from two notions.

First, the need to improve research efficiency (see Ioannidis, 2014, "85% of research resources are wasted", PLOS) makes me want to reduce duplication of effort by improving the accessibility and reuse of anything a researcher has created. So metadata, indexing, and persistence are important for researcher 'outputs', but not so much for pre-existing code that is an 'input'.

Second, the traditional research object/output is the publication, and that has become the thing that gets researchers funding, leading to quick-and-dirty, irreproducible publications being preferred over more time-consuming reproducible ones. If we can provide citations and tracking for all the other outputs/objects, perhaps we can get researchers rewarded (via funding/employment) for sharing these useful outputs (which feeds back into my first point).

These two (closely linked) considerations make me focus on outputs when considering all possible research objects. Of course, it's also important to report the research objects that are 'inputs', but this can be done as part of the metadata for the work that led to the 'outputs'.

With that in mind, I'd say code is a research object when a researcher has put work into creating or adapting it. A researcher has not put work into creating Excel, but they have put work into creating an Excel spreadsheet and/or any macros or Visual Basic code that might be used with Excel. Thus, by my limited definition, Excel wouldn't be a research object, but the Excel version and other metadata would be reported alongside the research data object (the Excel spreadsheet), which may include research code objects (Excel macros/Visual Basic).
Regarding author retention of their code objects: at GigaScience, we encourage authors to have a project page so they can better control the definitive version of their code, but we also encourage a GitHub location (and will fork/host at our GitHub group) to facilitate community engagement and support, AND we take a snapshot of the code at time of publication so that we ensure we have control over the persistence of the version linked to our publication. Perhaps this is overkill, but we feel each avenue has its merits and that they complement one another.

That level of control by the research repository (GigaScience) stems from concerns that are well highlighted in a report by the University of Arizona's computer science department. They tried to systematically contact authors of comp-sci papers and conference proceedings and found that fewer than half of the results were even 'weakly repeatable'. Responses when trying to contact authors included the classics 'oh, that was a PhD student/post-doc who has left', 'oh, I've lost it', and... no response. Thus, leaving research outputs/objects solely in the loving arms of authors is a real problem for the reuse of those research outputs.

That said, the astrophysics community is particularly awesome at this sort of thing, and there are no broken links in ASCL's list of homepages (going by a quick online link-checker test), so the need to supplement author management of research code objects may be field-dependent (or perhaps ASCL has excellent stipulations when registering a project that ensure persistence; if so, please share!)
I like that. I'd exempt small scripts or other outputs that anyone with basic computer skills could produce, but yeah, overall, that works.
I like this approach! Other journals would do well to emulate your practice of taking a snapshot of the code at publication and controlling that file.
Thanks! We do have links fail, of course, but Associate Editor Kim DuPrie runs a link checker regularly and does an excellent job keeping the links up to date. We've had very, very few codes disappear permanently over the years. Our policies for registering a code are pretty simple: it must be used in refereed research or research submitted for refereeing, and it has to be available for download.
Following on from this Twitter discussion, I figure this is a better place to ask.
I'm interested in what makes some bundle of code a research object. It seems as though this project treats 'having a DOI' as the criterion. If so, why is this?
My position, from which I would love to be moved, and which may be factually wrong, is this:
If a research object is a fixed archive of some artefacts of research, then what is wrong with a git commit or, better, a tagged release?
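(For concreteness: a git object ID is a checksum derived purely from content, so it fixes exactly what the artefact is while saying nothing about where to obtain it, which is the gap the resolution discussion above turns on. A minimal sketch of git's blob hashing, using only the standard library; the example content is arbitrary.)

```python
# Sketch: compute the SHA-1 object ID git assigns to a file's content.
# The ID pins the bytes exactly, but nothing in it points at a location.
import hashlib

def git_blob_id(content: bytes) -> str:
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()

# Arbitrary example content:
print(git_blob_id(b"print('hello')\n"))
```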