Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

resource.infer(stats=True) changes value of dialect.delimiter depending on resource data #1671

Open
fjuniorr opened this issue Jul 9, 2024 · 4 comments

Comments

@fjuniorr
Copy link
Contributor

fjuniorr commented Jul 9, 2024

I did not find this information in the documentation, but by extrapolating this behavior:

Also, won't resource.infer() reset the whole schema of the resource?

No, it will not change existing properties except for the recalculation of resource.stats.

-- #450 (comment)

I assumed that resource.infer(stats=True) would not change dialect.delimiter (even if it was not explicitly set).

However, this is not the case, and data like:

pkey,year,code,desc
2008|1001,2008,1001,PROGRAMA LARES HABITACAO POPULAR
2008|1003,2008,1003,DIVULGACAO DE MINAS GERAIS COMO ESTADO DESCOMPLICADO

is causing resource.infer(stats=True) to set dialect.delimiter to |.

There is a reproducible example here.

fjuniorr added a commit to splor-mg/dados-aux-classificadores that referenced this issue Jul 9, 2024
fjuniorr added a commit to splor-mg/dados-aux-classificadores that referenced this issue Jul 9, 2024
* Adiciona chave primária em coluna única chave_* para cada resource de acordo com schemas.

* Adiciona campos de chaves aos schemas das resources

* Correções para validação do datapackage.json

* refactor: remove dead code

* fix: build_package() should not be called in scripts/build.py

* feat: move chave_* columns to first position

* hardcode delimiter to prevent bug in resource.infer

see frictionlessdata/frictionless-py#1671

* reset confict files

---------

Co-authored-by: Francisco Júnior <[email protected]>
@pierrecamilleri
Copy link
Collaborator

Hi, thanks for the report and reproducible example.

In your example, I get indeed dialect.delimiter to | for the first test (with data.csv), but not with the sample or with explicitely specified delimiter (I guess that was intended).

I am not very familiar with the frictionless codebase, so sorry if I'm off base, I am guessing a bit. My understanding is that when it is not explicitly set, frictionless makes a guess about the delimiter being | (not the best, I agree), and infer requires to do so, as without delimiter there are e.g. no possible stats about the number of fields. I observe the same delimiter if I resource.read_rows() instead of infer (with a TableResource)

So do I understand correctly that the issue is not that it changes any behaviour, but that it explicitely sets the delimiter in the resource? If yes, why is it an issue?

@fjuniorr
Copy link
Contributor Author

fjuniorr commented Sep 7, 2024

I did not notice that resource.read_rows() was also inferring the delimiter of data.csv to be |.

In general1 I would expect frictionless-py to use , as delimiter for operations such as read_rows because , is the default value per the data package standard. The wording in CSV Dialect is even more explicit:

delimiter - specifies the character sequence which should separate fields (aka columns). Default = ,. Example \t. If not present, consumers should assume that it’s ,.

For my particular use case the inference behaviour is problematic because I have several descriptors in which I do not explicitly set delimiter when the proper value is ,. Mostly it works just fine except when it breaks.

Footnotes

  1. The exception would be in functionality that is explicitly created for inference purposes, such as describe and maybe infer (which was my original question).

@pierrecamilleri
Copy link
Collaborator

pierrecamilleri commented Sep 9, 2024

Ok thanks, I understand what bothers you, and I agree that using , as default instead of guessing the delimiter would probably a less error-prone behaviour, and more consistent with the documentation.

I also understand why you specifically asked about infer. Any change on this (infer or read_rows) would however be a breaking change.

In any case I'll try to find the time to understand the exact behaviour of the delimiter detection.

@fjuniorr
Copy link
Contributor Author

fjuniorr commented Sep 9, 2024

Knowing that this is also the behaviour in read_rows I agree that changing it is not a good idea.

Maybe in the future making the delimiter detection more robust (which I'm not sure is high priority TBH).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants