Use of [:cntrl:] character class in tokenize() #5

Closed
hpeifer opened this issue Oct 17, 2016 · 7 comments


hpeifer commented Oct 17, 2016

The POSIX [:cntrl:] character class does not cover exactly the characters that must be escaped according to https://tools.ietf.org/html/rfc7159#section-7 (i.e. the control characters U+0000 through U+001F). [:cntrl:] also covers U+007F and, when used in a UTF-8 locale, all C1 control characters. See the example below, where I get an error in my locale "en_US.UTF-8". Apart from using LC_ALL=C, the error can be avoided by changing [:cntrl:] to the range defined in the spec: \x00-\x1F.

$ echo world_bank109.json | awk -f JSON.awk > /dev/null
world_bank109.json: expected <value> but got <"> at input token 263
, "productlinetype" : "L" , "project_abstract" : { "cdata" : <<">> T h e o b j e c t i
$ echo world_bank109.json | LC_ALL=C awk -f JSON.awk > /dev/null
(no error message here)

Attachment: world_bank109.json.txt, which is line 109 of the World Bank sample file at http://jsonstudio.com/resources/
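The mismatch described above can be reproduced outside awk. The following is a minimal Python sketch, not part of JSON.awk: it assumes that Unicode general category Cc is a fair stand-in for what [:cntrl:] matches in a UTF-8 locale (the real answer depends on the C library's locale data), and it uses Python's own json module as a reference parser. The names RFC7159_MUST_ESCAPE, UNICODE_CONTROLS, EXTRA and parses are made up for this illustration.

```python
import json
import unicodedata

# Control characters that RFC 7159 section 7 requires to be escaped in a string.
# (The quotation mark and the backslash must be escaped too; they are not the topic here.)
RFC7159_MUST_ESCAPE = set(range(0x00, 0x20))

# Assumption: Unicode category Cc approximates [:cntrl:] in a UTF-8 locale.
UNICODE_CONTROLS = {
    cp for cp in range(0x110000) if unicodedata.category(chr(cp)) == "Cc"
}

# Control characters that may appear unescaped in a JSON string.
EXTRA = sorted(UNICODE_CONTROLS - RFC7159_MUST_ESCAPE)


def parses(raw_char: str) -> bool:
    """Return True if a JSON string holding raw_char unescaped is accepted."""
    try:
        json.loads('"' + raw_char + '"')
        return True
    except json.JSONDecodeError:
        return False


if __name__ == "__main__":
    print("extra controls: U+%04X through U+%04X" % (EXTRA[0], EXTRA[-1]))
    for ch in ("\x1f", "\x7f", "\x92"):
        print("U+%04X accepted unescaped: %s" % (ord(ch), parses(ch)))
```

Under these assumptions, a tokenizer that rejects everything in the wider class will refuse strings that a parser following the RFC accepts, which matches the error shown above.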

Owner

step- commented Oct 17, 2016

Thank you for reporting this issue so thoroughly. I am traveling, so I can't make changes right away. It's good that you found the LC_ALL=C workaround, so other people who need an immediate solution can use it. For a permanent fix I think I will change [[:cntrl:]] to [\x00-\x1F] so the behavior is consistent regardless of locale settings. I will ponder this matter while I'm traveling. Thanks again.

step- added the bug label Oct 17, 2016
step- self-assigned this Oct 17, 2016
Author

hpeifer commented Oct 18, 2016

Thanks for taking care of this. Here is what I did locally to fix the issue:
gsub(/\"[^\x00-\x1F\"\\]*((\\[^u\x00-\x1F]|\\u[0-9a-fA-F]{4})[^\x00-\x1F\"\\]*)*\"|-?(0|[1-9][0-9]*)([.][0-9]*)?([eE][+-]?[0-9]*)?|null|false|true|[[:space:]]+|./, "\n&", a1)
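To see what the string-token part of this pattern accepts, it can be transcribed for Python's re module. This is only a sketch: it assumes that Python's \xNN escapes and {4} interval behave like gawk's in this pattern, and JSON_STRING and is_string_token are hypothetical names chosen for the illustration.

```python
import re

# String-token alternative of the gsub() pattern, split over three lines for readability.
JSON_STRING = re.compile(
    r'"[^\x00-\x1F"\\]*'
    r'((\\[^u\x00-\x1F]|\\u[0-9a-fA-F]{4})[^\x00-\x1F"\\]*)*'
    r'"'
)


def is_string_token(text: str) -> bool:
    """Return True if text as a whole is matched by the string-token pattern."""
    return JSON_STRING.fullmatch(text) is not None


if __name__ == "__main__":
    samples = {
        "raw U+0092": '"it\u0092s"',
        "raw U+007F": '"a\x7fb"',
        "raw U+001F": '"a\x1fb"',
        "escaped quote": '"a\\"b"',
        "short \\u escape": '"\\u00"',
    }
    for label, sample in samples.items():
        print(label, is_string_token(sample))
```

With the explicit range, U+007F and the C1 controls pass through unescaped while U+0000 through U+001F are still rejected, independent of any locale setting.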

Owner

step- commented Oct 18, 2016

Thank you very much. I will get to it after I return from my travels.

Update: while the proposed fix certainly works, the \xNM character syntax is gawk-specific, and it isn't POSIX compliant.
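One way around the hexadecimal escapes is to respell them as octal escapes, which POSIX awk does define. The Python sketch below only performs that respelling. The helper names are hypothetical, NUL is left out on the assumption that not every awk can match it in a pattern, and the sketch does not check how a byte range behaves in a given locale. It is not necessarily what the eventual fix used.

```python
def awk_octal(byte_value: int) -> str:
    """Spell one byte value (1..255) as a three-digit octal escape."""
    if not 1 <= byte_value <= 0xFF:
        raise ValueError("expected a byte value between 1 and 255")
    return "\\%03o" % byte_value


def awk_bracket_range(first: int, last: int) -> str:
    """Build a bracket expression covering the byte values first..last."""
    return "[" + awk_octal(first) + "-" + awk_octal(last) + "]"


if __name__ == "__main__":
    # Octal counterpart of \x01-\x1F from the proposed pattern.
    print(awk_bracket_range(0x01, 0x1F))
```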

Owner

step- commented Dec 22, 2016

See also dominictarr/JSON.sh#46

Owner

step- commented Oct 15, 2017

See also dominictarr/JSON.sh#48

step- referenced this issue Aug 15, 2020
Aimed at old "onetrueawk" versions as the preceding commit.
See #10 for details.
@mohd-akram
Contributor

The sample file incorrectly uses U+0092 for a single quote instead of U+2019. It can be fixed by running it through awk '{gsub(/\302\222/,"\342\200\231"); print}'.
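The octal escapes in that one-liner are the UTF-8 encodings of the two code points, which a short Python sketch can confirm. BAD, GOOD and fix_quotes are names invented here; the remark about Windows-1252 is a guess at how U+0092 got into the file, not something the thread states.

```python
# Octal escapes copied from the awk one-liner, written as Python bytes literals.
BAD = b"\302\222"        # bytes matched by the gsub() pattern
GOOD = b"\342\200\231"   # bytes substituted in their place


def fix_quotes(data: bytes) -> bytes:
    """Replace every UTF-8 encoded U+0092 with UTF-8 encoded U+2019."""
    return data.replace(BAD, GOOD)


if __name__ == "__main__":
    print("matched:  U+%04X" % ord(BAD.decode("utf-8")))
    print("replaced: U+%04X" % ord(GOOD.decode("utf-8")))
    # Possible origin (assumption): byte 0x92 is the right single quote in Windows-1252.
    print("cp1252 0x92: U+%04X" % ord(b"\x92".decode("cp1252")))
```

Since U+0092 is a C1 control character, it falls into the group that [:cntrl:] matches in a UTF-8 locale but that JSON allows unescaped.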

Owner

step- commented Aug 18, 2020

Closed by ac086a4

step- closed this as completed Aug 18, 2020