Windows: Switch ruby command line interface and console inputs to UTF-8 #11799
- script name
- include paths
- script input from stdin
✅ All tests passed! ✖️ no tests failed ✔️ 31897 tests passed (1 flake)
    return -1;
}
if (isconsole) {
    /* Direct console input, without pipe */
@nobu, @unak Can I get some help from you here?
`ReadConsoleW` retrieves console characters as UTF-16. I think the most useful way to pass the input data to the Ruby API is as a UTF-8 encoded string. There seems to be no easy way to achieve this through the Win32 API.
The implementation below requires the caller to read at least as many bytes as the UTF-8 converted string needs. Otherwise `STDIN.read` returns `nil`. In contrast, `STDIN.gets` provides a large buffer, so the string is converted to UTF-8 and returned.
To properly implement reading UTF-8 characters bytewise, some internal state has to be managed: the position within the UTF-8 bytes that were already delivered, and the pending/buffered bytes. Can you recommend a way to store this information?
Maybe the `socklist` hash table could be extended to store this data. But maybe there is some better place. Or do you have some other idea how to implement Unicode-aware console reading?
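The state described above could be sketched roughly as follows. This is a minimal, portable illustration and not the code from this PR: `struct console_utf8_state`, `encode_utf8`, `feed_utf16` and `console_read_utf8` are hypothetical names, and a plain `uint16_t` array stands in for the UTF-16 data that `ReadConsoleW` would deliver. Where such a struct would be stored (for example next to `socklist`) is left open.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-console state: UTF-8 bytes of the current character
 * that were converted but not yet delivered, plus a high surrogate that
 * is still waiting for its low surrogate. */
struct console_utf8_state {
    unsigned char pending[4];
    size_t pending_len;      /* number of valid bytes in pending */
    size_t pending_pos;      /* number of bytes already delivered */
    uint16_t high_surrogate; /* 0 if none is waiting */
};

/* Encode one code point as UTF-8 and return the byte count (1..4). */
static size_t encode_utf8(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Consume one UTF-16 code unit. A high surrogate is only remembered;
 * any other unit completes a code point and refills the pending bytes.
 * Simplification: a high surrogate that is not followed by a low
 * surrogate is dropped, and a lone low surrogate becomes U+FFFD. */
static void feed_utf16(struct console_utf8_state *st, uint16_t unit)
{
    uint32_t cp;

    if (unit >= 0xD800 && unit <= 0xDBFF) {
        st->high_surrogate = unit;
        return;
    }
    if (unit >= 0xDC00 && unit <= 0xDFFF) {
        if (st->high_surrogate)
            cp = 0x10000 + (((uint32_t)st->high_surrogate - 0xD800) << 10)
                 + ((uint32_t)unit - 0xDC00);
        else
            cp = 0xFFFD;
    }
    else {
        cp = unit;
    }
    st->high_surrogate = 0;
    st->pending_len = encode_utf8(cp, st->pending);
    st->pending_pos = 0;
}

/* Deliver up to len UTF-8 bytes into buf. src, src_len and src_pos stand
 * in for the UTF-16 console input. Returns the number of bytes written;
 * 0 means all input has been consumed. */
static size_t console_read_utf8(struct console_utf8_state *st,
                                const uint16_t *src, size_t src_len,
                                size_t *src_pos,
                                unsigned char *buf, size_t len)
{
    size_t n = 0;

    assert(st && src && src_pos && buf);
    while (n < len) {
        if (st->pending_pos == st->pending_len) {
            if (*src_pos == src_len) break;      /* no more input */
            st->pending_len = st->pending_pos = 0;
            feed_utf16(st, src[(*src_pos)++]);
            continue;
        }
        buf[n++] = st->pending[st->pending_pos++];
    }
    return n;
}
```

With a zero-initialized state, calling `console_read_utf8` repeatedly with `len` set to 1 hands out the bytes of a multi-byte character one at a time, because the remainder stays in `pending` between calls.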
Another idea is to rely on Ruby's encoding conversion layer to do the buffering. That experimental code was eventually abandoned and no working code remains, but I still remember some key points.
It would be great if the code could be organized so that the console input routines could be treated like a device that outputs UTF-8.
I found a fragment of my previous code and modified it so that it compiles and can be checked. The code is in #12055. My code seems to require too many modifications to io.c. Also, it behaves inconsistently when the binary read methods are used. We can avoid both problems by doing the processing on the win32.c side.
On Windows, a process can have at most one console, but multiple file descriptors can correspond to it. If we could limit the encoding conversion buffer to a single one, it would simplify the code, but would it cause any problems? It seems that perl5 does it that way.
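To make the single-buffer question concrete, here is a small sketch of what one process-wide buffer could look like. All names (`console_buf`, `console_buf_fill`, `console_buf_read`) are hypothetical, the fill function merely stores bytes that are assumed to be converted to UTF-8 already, and nothing here is taken from win32.c or perl5.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical process-wide buffer of converted UTF-8 bytes. There is
 * exactly one instance, no matter how many descriptors refer to the
 * console. */
static struct {
    unsigned char bytes[256];
    size_t len; /* number of valid bytes */
    size_t pos; /* number of bytes already delivered */
} console_buf;

/* Stand-in for the conversion step: store already converted UTF-8 bytes. */
static void console_buf_fill(const unsigned char *utf8, size_t len)
{
    assert(len <= sizeof(console_buf.bytes));
    memcpy(console_buf.bytes, utf8, len);
    console_buf.len = len;
    console_buf.pos = 0;
}

/* fd is accepted but deliberately ignored: every console descriptor
 * drains the same buffer. Returns the number of bytes copied to out. */
static size_t console_buf_read(int fd, unsigned char *out, size_t len)
{
    size_t avail = console_buf.len - console_buf.pos;
    size_t n = len < avail ? len : avail;

    (void)fd;
    memcpy(out, console_buf.bytes + console_buf.pos, n);
    console_buf.pos += n;
    return n;
}
```

One consequence visible in this sketch: if two descriptors read alternately, they consume consecutive bytes of the same stream, so a short read on one descriptor can leave the rest of a multi-byte character to be picked up by the other. Whether that matters in practice is exactly the open question.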
Implements https://bugs.ruby-lang.org/issues/20774 so that the encodings look like so:
The string `äöü€` was typed on the console and is passed as UTF-8 to `$stdin.read` (although `€` is not part of the locale encoding).
Fixes oneclick/rubyinstaller2#265