`resource.infer(stats=True)` changes value of `dialect.delimiter` depending on resource data #1671

fjuniorr · 2024-07-09T17:20:25Z

I did not find this information in the documentation, but by extrapolating this behavior:

> Also, won't resource.infer() reset the whole schema of the resource?
>
> No, it will not change existing properties except for the recalculation of resource.stats.
>
> -- #450 (comment)

I assumed that resource.infer(stats=True) would not change dialect.delimiter (even if it was not explicitly set).

However, this is not the case, and data like:

pkey,year,code,desc
2008|1001,2008,1001,PROGRAMA LARES HABITACAO POPULAR
2008|1003,2008,1003,DIVULGACAO DE MINAS GERAIS COMO ESTADO DESCOMPLICADO

is causing resource.infer(stats=True) to set dialect.delimiter to |.

There is a reproducible example here.
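For illustration only, the two possible readings of the rows above can be compared with Python's standard `csv` module. This is not frictionless itself, and the names `SAMPLE` and `parse` exist only for this sketch. It shows what is at stake in the choice of delimiter: with `,` every line splits into the same number of fields, while with `|` the header stays as a single field and `pkey` is cut in two.

```python
import csv
import io

# The three lines quoted in the report above.
SAMPLE = (
    "pkey,year,code,desc\n"
    "2008|1001,2008,1001,PROGRAMA LARES HABITACAO POPULAR\n"
    "2008|1003,2008,1003,DIVULGACAO DE MINAS GERAIS COMO ESTADO DESCOMPLICADO\n"
)


def parse(text, delimiter):
    """Parse CSV text with an explicitly chosen delimiter."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


rows_comma = parse(SAMPLE, ",")
rows_pipe = parse(SAMPLE, "|")

# Field counts per line under each delimiter.
print([len(row) for row in rows_comma])
print([len(row) for row in rows_pipe])

# With "," the composite key survives as one value in the first column.
print(rows_comma[1][0])
```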


see frictionlessdata/frictionless-py#1671

* Add a single-column primary key chave_* to each resource according to the schemas.
* Add the key fields to the resources' schemas
* Fixes for datapackage.json validation
* refactor: remove dead code
* fix: build_package() should not be called in scripts/build.py
* feat: move chave_* columns to first position
* hardcode delimiter to prevent bug in resource.infer see frictionlessdata/frictionless-py#1671
* reset conflict files

Co-authored-by: Francisco Júnior <[email protected]>

pierrecamilleri · 2024-09-05T15:55:27Z

Hi, thanks for the report and reproducible example.

In your example, I do indeed get dialect.delimiter set to | for the first test (with data.csv), but not with the sample or with an explicitly specified delimiter (I guess that was intended).

I am not very familiar with the frictionless codebase, so sorry if I'm off base; I am guessing a bit. My understanding is that when the delimiter is not explicitly set, frictionless guesses it, and here the guess is | (not the best, I agree). infer needs that guess, because without a delimiter there are, e.g., no possible stats about the number of fields. I observe the same delimiter if I call resource.read_rows() instead of infer (with a TableResource).

So do I understand correctly that the issue is not that it changes any behaviour, but that it explicitly sets the delimiter in the resource? If so, why is that an issue?
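As a rough sketch of what "guessing the delimiter" can look like, here is content-based detection with `csv.Sniffer` from Python's standard library. Whether frictionless detects delimiters this way is an assumption, not something confirmed in this thread; the candidate set `CANDIDATES` and the helper `guess_delimiter` are hypothetical names for this example. The point is only that such a guess depends on the text it is given, so a short excerpt and a full file need not produce the same answer.

```python
import csv

# The three lines quoted in the original report.
SAMPLE = (
    "pkey,year,code,desc\n"
    "2008|1001,2008,1001,PROGRAMA LARES HABITACAO POPULAR\n"
    "2008|1003,2008,1003,DIVULGACAO DE MINAS GERAIS COMO ESTADO DESCOMPLICADO\n"
)

# Assumed candidate delimiters; chosen for this sketch only.
CANDIDATES = ",;\t|"


def guess_delimiter(text, candidates=CANDIDATES):
    """Return the delimiter csv.Sniffer guesses, or None if it cannot decide."""
    try:
        return csv.Sniffer().sniff(text, delimiters=candidates).delimiter
    except csv.Error:
        return None


# The result is derived from the data passed in, not from any spec default.
guessed = guess_delimiter(SAMPLE)
print(repr(guessed))
```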

fjuniorr · 2024-09-07T13:19:28Z

I did not notice that resource.read_rows() was also inferring the delimiter of data.csv to be |.

In general¹ I would expect frictionless-py to use , as delimiter for operations such as read_rows because , is the default value per the data package standard. The wording in CSV Dialect is even more explicit:

> delimiter - specifies the character sequence which should separate fields (aka columns). Default = ,. Example \t. If not present, consumers should assume that it’s ,.

For my particular use case the inference behaviour is problematic because I have several descriptors in which I do not explicitly set delimiter when the proper value is ,. Mostly it works just fine except when it breaks.
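A minimal sketch of the workaround mentioned in this thread (hardcoding the delimiter), written in plain Python over a descriptor dictionary so that it does not depend on any frictionless API. It assumes the dialect is an inline object with a top-level `delimiter` key, following the `dialect.delimiter` wording in the issue title; descriptors that nest the key elsewhere or reference the dialect by path would need adapting. The helper name `with_explicit_delimiter` is hypothetical.

```python
import copy

# Default stated in the CSV Dialect text quoted above.
SPEC_DEFAULT_DELIMITER = ","


def with_explicit_delimiter(descriptor):
    """Return a copy of a resource descriptor whose dialect names its delimiter.

    An existing delimiter is left untouched; only a missing one is filled in.
    Assumes descriptor["dialect"], when present, is a dict (not a path string).
    """
    result = copy.deepcopy(descriptor)
    dialect = result.setdefault("dialect", {})
    dialect.setdefault("delimiter", SPEC_DEFAULT_DELIMITER)
    return result


implicit = {"name": "data", "path": "data.csv"}
explicit = with_explicit_delimiter(implicit)
print(explicit["dialect"])

# A descriptor that already declares a delimiter keeps its own value.
tabbed = with_explicit_delimiter({"name": "t", "path": "t.tsv", "dialect": {"delimiter": "\t"}})
print(repr(tabbed["dialect"]["delimiter"]))
```

Making the delimiter explicit in every descriptor removes the dependence on detection, at the cost of repeating a value the standard already defines as the default.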

¹ The exception would be functionality that is explicitly created for inference purposes, such as describe and maybe infer (which was my original question).

pierrecamilleri · 2024-09-09T07:17:43Z

Ok thanks, I understand what bothers you, and I agree that using , as the default instead of guessing the delimiter would probably be a less error-prone behaviour, and more consistent with the documentation.

I also understand why you specifically asked about infer. Any change on this (infer or read_rows) would however be a breaking change.

In any case I'll try to find the time to understand the exact behaviour of the delimiter detection.

fjuniorr · 2024-09-09T11:34:20Z

Knowing that this is also the behaviour in read_rows I agree that changing it is not a good idea.

Maybe in the future the delimiter detection could be made more robust (though I'm not sure that is high priority, TBH).

fjuniorr added a commit to splor-mg/dados-aux-classificadores that referenced this issue Jul 9, 2024

hardcode delimiter to prevent bug in resource.infer

068adfc

see frictionlessdata/frictionless-py#1671
