Loading text with BOM encoding #5268

Open
philrz opened this issue Sep 9, 2024 · 0 comments

philrz commented Sep 9, 2024

tl;dr

A community zync user mentioned the following in a Zui context, though it seems addressing it would likely have to start at the Zed layer since that's what handles data loading. In their own words:

I noticed Zui does not properly ingest JSON files which have been saved using UTF8BOM encoding, while UTF8NOBOM works well. I was also asking myself whether it would useful to be able to specify the file encoding in Zui, à-la iconv.

Details

Repro is with Zui Insiders 1.17.1-insiders.20 which uses Zed commit 556f586.

I was not personally familiar with the byte-order mark (BOM) but gave myself a quick crash course so I could repro the user's experience.

This first video shows an example in the Windows context. Notepad's default "UTF-8" selection in Save As creates a text file that Zui loads without complaint. However, if the "UTF-8 with BOM" option is selected, Zed's auto-detect indeed no longer recognizes the file as a format it can read.

Repro-Notepad.mp4

The Wikipedia page on the byte-order mark also mentions that Google Docs adds the BOM when selecting the "Plain Text" download format, and indeed, this second video shows this causing the same error message we just saw with the file saved from Notepad.

Google-Doc.mp4

I'm not certain if it's feasible for Zed's auto-detect to recognize and react to the BOM without disturbing the ability to auto-detect the other formats supported by Zed. If not, I suppose we could add an explicit reader option indicating to expect the BOM and read in the specified format, per the user's iconv comment.

I don't know if it would do the trick, but I did some web searches and found https://pkg.go.dev/github.com/dimchansky/utfbom, which sounds like it may be what's needed:

The package utfbom implements the detection of the BOM (Unicode Byte Order Mark) and removing as necessary. It can also return the encoding detected by the BOM.

If the returned encoding is accurate, then perhaps this is the info that would be needed to automatically react to the BOM when it's present and read in the encoded format.

Per the iconv comment, https://github.com/djimenez/iconv-go also looks like it might be relevant.

This issue also reminded me of #4348, though that one appears to be only about encoding in a non-BOM context.
