Windows: Switch ruby command line interface and console inputs to UTF-8 #11799

Draft
wants to merge 2 commits into
base: master

Contributor

@larskanis larskanis commented Oct 4, 2024

Implements https://bugs.ruby-lang.org/issues/20774 so that the encodings look like this:

$ ruby  -It€st -r .\täst-locale-enc.rb -e "pr STDIN, $0, __FILE__, __dir__, 'ä', '€', $:.first, $:.last, $stdin.read"
äöü€
[#<IO:<STDIN>>, "UTF-8"]
["-e", "UTF-8"]
["-e", "ASCII-8BIT"]
[".", "US-ASCII"]
["ä", "UTF-8"]
["", "UTF-8"]
["C:/Users/kanis/ruby/t€st", "UTF-8"]
["C:/Users/kanis/ruby/lib/ruby/3.4.0+0/x64-mingw-ucrt", "UTF-8"]
["äöü€\n", "UTF-8"]

The string äöü€ was typed on the console and is passed as UTF-8 to $stdin.read (although it is not part of the locale encoding).
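
The `pr` helper used in the command above is presumably defined in the required script, which is not shown in this thread. A minimal sketch of what such a helper might look like, assuming it simply prints each object together with the name of its encoding (the name `enc_pair` is hypothetical):

```ruby
# Hypothetical reconstruction; the real helper is not shown in this thread.
# Returns the object paired with the name of its encoding.
def enc_pair(obj)
  # IO objects report their external encoding; strings report their own.
  enc = obj.is_a?(IO) ? obj.external_encoding : obj.encoding
  [obj, enc&.name]
end

# Prints one [object, encoding name] pair per argument.
def pr(*objs)
  objs.each { |obj| p enc_pair(obj) }
end
```

Calling `pr STDIN, $0, 'ä'` would then print one line per argument in the same shape as the output listed above.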

Fixes oneclick/rubyinstaller2#265

- script name
- include paths
- script input from stdin


launchable-app bot commented Oct 4, 2024

All Tests passed!

✖️no tests failed ✔️31897 tests passed(1 flake)

return -1;
}
if (isconsole) {
/* Direct console input, without pipe */
Contributor Author

@nobu, @unak Can I get some help from you here?

ReadConsoleW retrieves console characters as UTF-16. I think the most useful way to pass the input data to the Ruby API is as a UTF-8 encoded string. There seems to be no easy way to achieve this through the Win32 API.

The implementation below requires the caller to read at least as many bytes as the UTF-8 converted string occupies. Otherwise STDIN.read returns nil. In contrast, STDIN.gets provides a buffer large enough that the string is converted to UTF-8 and returned.

To properly implement bytewise reading of UTF-8 characters, some internal state has to be managed: the character position of the UTF-8 bytes already delivered, and the pending (buffered) bytes. Can you recommend a way to store this information?

Maybe the socklist hash table could be extended to store this data, but maybe there is a better place. Or do you have another idea for how to implement Unicode-aware console reading?
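
To make the state in question concrete, here is a rough sketch in Ruby rather than in the C of win32.c. It is not the code from this PR. The class name `Utf8ConsoleBuffer` and the block that stands in for ReadConsoleW are hypothetical, and the sketch assumes each chunk consists of whole 16-bit units.

```ruby
# Hypothetical sketch: serve byte-sized reads from UTF-16LE console chunks.
class Utf8ConsoleBuffer
  # The block stands in for ReadConsoleW: it returns a chunk of
  # UTF-16LE bytes, or nil at end of input.
  def initialize(&read_utf16)
    @read_utf16 = read_utf16
    @pending = "".b         # UTF-8 bytes converted but not yet delivered
    @high_surrogate = "".b  # lone high surrogate held back until its pair arrives
    @eof = false
  end

  # Returns up to maxlen UTF-8 bytes (as a binary string), or nil at end of input.
  def read(maxlen)
    fill while @pending.empty? && !@eof
    return nil if @pending.empty?
    @pending.slice!(0, maxlen)
  end

  private

  def fill
    chunk = @read_utf16.call
    if chunk.nil?
      @eof = true
      return
    end
    data = @high_surrogate + chunk.b
    @high_surrogate = "".b
    # A chunk may end in the middle of a surrogate pair; keep the first
    # half until the next chunk delivers the second half.
    if data.bytesize >= 2
      last_unit = data.byteslice(-2, 2).unpack1("v")
      @high_surrogate = data.slice!(-2, 2) if (0xD800..0xDBFF).cover?(last_unit)
    end
    @pending << data.force_encoding("UTF-16LE").encode("UTF-8").b
  end
end
```

With this shape, a caller asking for a single byte gets the first byte of a multi-byte character and the remaining bytes stay in `@pending` for the next call, which is exactly the per-console state that needs a home somewhere.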

Contributor

YO4 commented Oct 20, 2024

Another idea is to rely on ruby's encoding conversion layer to do the buffering.
I was personally experimenting with this at the time of the following issue.
https://bugs.ruby-lang.org/issues/19191

That experimental code was eventually abandoned and no working code remains, but I still remember some key points.

  • I made console input a special case in make_readconv() to achieve this.
  • The NEED_READCONV() macro became so complicated that I felt there had to be another, better implementation.
  • miniruby requires UTF-16LE for console input, and an error occurs.

It would be great if the code could be organized so that the console input routines could be treated like a device that outputs UTF-8.
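
As an illustration of what relying on the encoding conversion layer means, the sketch below uses a pipe as a stand-in for the console and lets the IO read conversion transcode UTF-16LE input to UTF-8. It only shows the existing Ruby-level behaviour. It is not the experimental code mentioned above.

```ruby
# Sketch: let the IO read-conversion layer transcode UTF-16LE input to UTF-8.
# The pipe is only a stand-in for the console.
r, w = IO.pipe("UTF-16LE", "UTF-8") # external encoding, internal encoding
w.binmode
w.write("\u00E4\u20AC\n".encode("UTF-16LE").b) # raw UTF-16LE bytes, as the console would deliver
w.close

line = r.gets # buffering and conversion happen inside the IO layer
r.close
p [line, line.encoding.name]
```

Here the caller never sees UTF-16LE: the read side buffers the raw bytes and hands out UTF-8 strings.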

Contributor

YO4 commented Nov 11, 2024

I found a fragment of my previous code and modified it to compile and check. The code is in #12055.
The major difference is:

  • yours: prepare UTF-8 in win32.c.
  • mine: do the conversion in io.c.

My code seems to require too many modifications to io.c. Also, it is not consistent when the binary read methods are used.
It requires a build change to get the encoding and its transcoder built in, which is not needed on any platform except win32.

We can avoid these by doing the processing on the win32.c side.
I think that your solution is better.

Contributor

YO4 commented Nov 11, 2024

On Windows, a process can have at most one console, but multiple file descriptors can refer to it.

If we could limit the encoding conversion buffer to a single instance, it would simplify the code, but would it cause any problems?

It seems that perl5 does it that way.
Perl/perl5@dace60f
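
A rough Ruby sketch of the single-buffer idea, to show one consequence that may matter for the question above. The names `CONSOLE_PENDING` and `ConsoleFd` are hypothetical, and this says nothing about how the Perl commit is implemented.

```ruby
# Hypothetical sketch: one pending-byte buffer per process (one console),
# shared by every descriptor that refers to that console.
CONSOLE_PENDING = String.new(encoding: Encoding::BINARY)

ConsoleFd = Struct.new(:fd) do
  # Returns up to maxlen pending UTF-8 bytes, or nil if nothing is pending.
  def read(maxlen)
    piece = CONSOLE_PENDING.slice!(0, maxlen)
    piece.empty? ? nil : piece
  end
end
```

With a shared buffer, the bytes of one character can end up split across descriptors: if the three UTF-8 bytes of "€" are pending, a one-byte read on one descriptor followed by a read on another descriptor hands each of them only part of the character. Whether that is acceptable is the open question.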

Successfully merging this pull request may close these issues.

Ruby require fails when the path has special characters
2 participants