Relax extension requirements, maybe rename to formats #64

Open
TheChymera opened this issue Feb 27, 2024 · 9 comments

Comments

@TheChymera

With some file formats supporting data across modalities (any volumetric data can be NIfTI, any raster image can be TIFF, anything at all can be ZARR), I wonder if it makes sense to restrict these “extensions”.

I'm also wondering whether the terminology shouldn't be renamed to “formats”.

More generally, I'm also not sure why the emergence of a new format would need to be “accepted” by BIDS first before a dataset using it can be BIDS-compliant.
Is there any reason why we would ever say no?
If not, why not allow any data format?

I'm mentioning data format specifically, because for metadata files, which BIDS as a standard controls the contents of, we can't just have people using participants.xlsx. But BIDS does not control the analysis of TIFF, or NWB, or MNAF (my new amazing format), so why not let people use whatever fits their use case?

I see some utility in discouraging bad practices, such as proprietary or .m files for everything, or compressed .jpeg for optical imaging — so maybe allowing anything would go too far. But in any case I think open formats with no compression could be globally accepted.

@sappelhoff
Member

sappelhoff commented Feb 27, 2024

Is there any reason why we would ever say no?
If not, why not allow any data format?

The main reason is to provide an incentive for people to flock around a few data formats and make them better ... as opposed to everyone "brewing their own thing". The latter increases the burden of maintaining IO software, and also the burden on end users, who have to "know" more data formats.

@TheChymera
Author

the main reason is to provide an incentive for people to flock around a few datasets and make them better

Did you mean “data formats” instead of “datasets”? If not I don't think that's something I ever saw as a goal of BIDS, i.e. consolidating a small number of datasets as opposed to allowing better access to as many as possible — nor do I see what it has to do with data formats.

If you meant “data formats”... it's not just that we're limiting the total number of formats we support, but also constraining them per modality for reasons that are more historical than anything. In a sense that constrains the IO software. Lots of data can be represented as NIfTI, and thereby analyzed with the rich NIfTI tools, so why restrict that? Blanket permitting all/some formats would allow formats to better spread across use cases based on the tooling support they have.

@sappelhoff
Member

Did you mean “data formats” instead of “datasets”?

yes, sorry.

but also constraining them per modality for reasons that are more historical than anything.

I wouldn't say that we do that for "historical" reasons. To my understanding we do that to reflect the most common practices in the field where a particular modality is used. For example, NIfTI is used in MRI ... but not in EEG, even though you probably could somehow encode EEG data in NIfTI.

Blanket permitting all/some formats would allow formats to better spread across use cases based on the tooling support they have.

yes, but it will also invite edge cases, where a single dataset curator is exceptionally well versed in a particular data format and uses/applies it ... however the large majority of the community won't be able to use it because they lack the tools/skills.


I am playing a bit of a devil's advocate here. I personally don't have a big horse in this race. But I do think that fewer, rather than more, data formats are a good idea. I am saying this coming from a project like MNE-Python, where every few months somebody requests support for yet another data format that is entirely unnecessary, as the data could be represented in an already existing (open) format.

@yarikoptic
Contributor

yarikoptic commented Feb 28, 2024

The whole point of any standardization is to minimize variability. BIDS minimized variability not only in how people name their files, but also in which file formats they use. Hence you, @TheChymera, can always open participants.tsv and not some participants.xyz of an unknown nature. Allowing any file format immediately opens up unlimited variability, and thus makes the standard much less valuable. Hence in BIDS we limit ourselves to the most common format(s).

Someone in turn could establish some "BIDS naming convention" or "BIDS naming principles" which would then allow for arbitrary file formats to be used and rather just promote use of schema and the rest of the logic behind files organization. But it would be a different project.

I think it is time for us to add some indicators to issues so we can get a sense of which ones to keep open and which to close, so 👎 on this one, as IMHO I do not think it would be wanted or end up being implemented. Apparently I already added something on that to README.md 7 months ago ;)

@poldrack

+1 to closing this.

@TheChymera
Author

TheChymera commented Feb 28, 2024

can always open participants.tsv and not some participants.xyz of an unknown nature

@yarikoptic but that's exactly not what I meant. I was referring to data formats specifically. I even gave the exact same example:

I'm mentioning data format specifically, because for metadata files, which BIDS as a standard controls the contents of, we can't just have people using participants.xlsx.

Yes, the metadata files, like the file naming conventions, are optimized for easy browsing and readability, and (maybe on purpose, maybe incidentally) are very convenient to manipulate with GNU coreutils or other ubiquitous CLI packages.
The point is that data files are different, because they require additional tooling anyway:

  • Proposal 1: So if we already “support” a format, why not support it wherever the experimenter might find it useful?
  • Proposal 2: If we do not support a format yet, why not auto-support any data format? Do we have any examples where we have vetoed a file format?

I already mentioned that proposal 2 was probably not as good as proposal 1, because there are some reasons to exclude e.g. proprietary formats.
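
To make Proposal 1 concrete, here is a minimal Python sketch contrasting a per-modality extension check with a pooled one. The table `ALLOWED_BY_MODALITY` and both function names are hypothetical, and the table lists only a handful of extensions for illustration; this is not the BIDS schema and not how the validator is implemented.

```python
# Hypothetical, heavily abbreviated table: NOT the actual BIDS schema.
ALLOWED_BY_MODALITY = {
    "mri": {".nii", ".nii.gz"},
    "eeg": {".edf", ".vhdr"},
    "micr": {".ome.tif", ".ome.zarr"},
}


def allowed_per_modality(modality: str, extension: str) -> bool:
    """Per-modality rule: the extension must be listed for this modality."""
    return extension in ALLOWED_BY_MODALITY.get(modality, set())


def allowed_pooled(modality: str, extension: str) -> bool:
    """Proposal 1 as a sketch: any extension already accepted for some
    modality is accepted for every known modality."""
    if modality not in ALLOWED_BY_MODALITY:
        return False
    pooled = set().union(*ALLOWED_BY_MODALITY.values())
    return extension in pooled
```

Under these assumptions, `allowed_per_modality("micr", ".nii.gz")` is `False` while `allowed_pooled("micr", ".nii.gz")` is `True`; an extension that appears nowhere in the table, such as `.xlsx`, is rejected by both.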


@sappelhoff

To my understanding we do that to reflect the most common practices in the field where a particular modality is used.

But should a dataset be “invalid” for using an uncommon practice, even if it's still open source and useful to the experimenter?

yes, but it will also invite edge cases, where a single dataset curator is exceptionally well versed in a particular data format and uses/applies it ... however the large majority of the community won't be able to use it because they lack the tools/skills.

Isn't that an edge case we want? Think of the following: an MRI expert wants to integrate data with microscopy and use NIfTI for everything, including the microscopy data, so it can all be handled in the same space with the same tools. Why block that?

I am playing a bit of a devil's advocate here. I personally don't have a big horse in this race. But I do think that fewer, rather than more, data formats are a good idea.

In a sense, that's addressed by proposal 1. The guy from the NIfTI example is me. I'd like to use NIfTI for more things. Broader acceptance of formats that are already accepted could lead to consolidation around fewer formats. I also think there are other people who would like to .zarr everything.

@sappelhoff
Member

Do we have any examples where we have vetoed a file format?

During several BEP processes (e.g., EEG, iEEG), several file formats were vetoed.

@TheChymera
Author

@sappelhoff oh, I was unaware, thanks for telling me. Do you remember which ones they were, or have a link to the discussions? I'm curious what demonstrably disqualifies a format.

@sappelhoff
Member

Most of these discussions happened on the old BEP006 Google Doc, and in 2018 there was a community survey about the data formats in use. The survey results used to be reported here: https://bids.berkeley.edu/news/bids-megeegieeg-data-format-survey. Unfortunately this archive did not preserve the images: https://web.archive.org/web/20230130152808/https://bids.berkeley.edu/news/bids-megeegieeg-data-format-survey ... but perhaps you can do some digging and find something.

I'm curious what demonstrably disqualifies a format.

we wanted the file formats to:

  • have an open specification and be usable (in terms of a potential license)
  • be widely used in the community (no niche formats)
  • be able to hold most data the community would possibly want to save (and all data considering the union of all accepted formats)
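
As a rough illustration, the three criteria above could be written down as a checklist like the following. This is only a sketch: `FormatCandidate`, its fields, the two helper functions, and the example formats are all hypothetical, and the actual BEP discussions were not run through any such code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FormatCandidate:
    """Hypothetical record describing a candidate file format."""
    name: str
    open_specification: bool
    usable_license: bool
    widely_used: bool
    data_kinds: frozenset  # kinds of data the format can hold


def meets_openness_and_adoption(fmt: FormatCandidate) -> bool:
    """Criteria 1 and 2: open, usable specification and wide community use."""
    return fmt.open_specification and fmt.usable_license and fmt.widely_used


def uncovered_kinds(needed: frozenset, accepted: list) -> frozenset:
    """Criterion 3, union part: data kinds that no accepted format can hold."""
    covered = frozenset().union(*(f.data_kinds for f in accepted))
    return needed - covered


# Made-up example formats, for illustration only.
format_a = FormatCandidate("format_a", True, True, True,
                           frozenset({"continuous", "events"}))
format_b = FormatCandidate("format_b", True, True, False,
                           frozenset({"continuous", "epochs"}))
needed = frozenset({"continuous", "events", "epochs"})
```

Here `format_b` would fail the adoption criterion, and accepting only `format_a` would leave `"epochs"` uncovered, which is the kind of gap the third criterion is meant to catch.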

Projects
Status: Punted
4 participants