Windows: Switch ruby command line interface and console inputs to UTF-8 #11799

larskanis · 2024-10-04T15:42:29Z

Implements https://bugs.ruby-lang.org/issues/20774 so that the encodings look like this:

$ ruby  -It€st -r .\täst-locale-enc.rb -e "pr STDIN, $0, __FILE__, __dir__, 'ä', '€', $:.first, $:.last, $stdin.read"
äöü€
[#<IO:<STDIN>>, "UTF-8"]
["-e", "UTF-8"]
["-e", "ASCII-8BIT"]
[".", "US-ASCII"]
["ä", "UTF-8"]
["€", "UTF-8"]
["C:/Users/kanis/ruby/t€st", "UTF-8"]
["C:/Users/kanis/ruby/lib/ruby/3.4.0+0/x64-mingw-ucrt", "UTF-8"]
["äöü€\n", "UTF-8"]

The string äöü€ was typed on the console and is passed as UTF-8 to $stdin.read (although € is not part of the locale encoding).
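To reproduce a listing like the one above, a helper along the following lines would be enough. The actual `pr` from the required script is not shown in this thread, so `enc_pairs` and this version of `pr` are hypothetical stand-ins that only illustrate how each value is paired with the name of its encoding.

```ruby
# Hypothetical stand-in for the helper loaded with -r; the real one is not shown here.
def enc_pairs(*values)
  values.map do |v|
    # An IO reports its external encoding; strings report their own encoding.
    enc = v.is_a?(IO) ? v.external_encoding : v.encoding
    [v, enc.to_s]
  end
end

# Print one [value, encoding name] pair per line, as in the listing above.
def pr(*values)
  enc_pairs(*values).each { |pair| p pair }
end

pr "\u00E4", "abc".b
```

Passing `STDIN` as well shows whichever external encoding the interpreter assigned to the console stream.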

Fixes oneclick/rubyinstaller2#265

- script name
- include paths
- script input from stdin

launchable-app · 2024-10-04T15:54:30Z

✅ All Tests passed!

✖️ no tests failed ✔️ 31897 tests passed (1 flake)

larskanis · 2024-10-14T13:24:01Z

win32/win32.c

-        return -1;
-    }
+    if (isconsole) {
+        /* Direct console input, without pipe */


@nobu, @unak Can I get some help from you here?

ReadConsoleW retrieves console characters as UTF-16. I think the most useful way to pass the input data to the Ruby API is as a UTF-8 encoded string. There seems to be no easy way to achieve this through the Win32 API.

The implementation below requires the caller to request at least as many bytes as the UTF-8 converted string occupies. Otherwise STDIN.read returns nil. In contrast, STDIN.gets provides a large buffer, so the string is converted to UTF-8 and returned.
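The size constraint can be modelled in a few lines of Ruby. This is only a sketch of the behaviour described above, not the code in win32.c; `convert_if_fits` is a hypothetical name.

```ruby
# Hypothetical model of the constraint: convert everything at once,
# and give up (nil) when the caller's buffer cannot hold the UTF-8 result.
def convert_if_fits(utf16le_bytes, buffer_size)
  utf8 = utf16le_bytes.dup.force_encoding("UTF-16LE").encode("UTF-8")
  utf8.bytesize <= buffer_size ? utf8 : nil
end

typed = "\u00E4\u00F6\u00FC\u20AC\n".encode("UTF-16LE").b  # 5 UTF-16 code units
small = convert_if_fits(typed, 4)    # buffer too small for the converted text
large = convert_if_fits(typed, 15)   # 3 bytes per code unit is always enough
```

A UTF-16 code unit never needs more than three UTF-8 bytes (a surrogate pair is two units and four bytes), which is why a buffer of three times the unit count is a safe upper bound.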

To properly implement bytewise reading of UTF-8 characters, some internal state has to be managed: the position within the character whose UTF-8 bytes have already been partly delivered, and the pending/buffered bytes. Can you recommend a way to store this information?

Maybe the socklist hash table could be extended to store this data. But maybe there is some better place. Or do you have some other idea how to implement Unicode aware console reading?
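The state in question can be illustrated with a small Ruby model. This is a sketch under assumptions, not a proposal for win32.c: `Utf8ConsoleReader` and its `source` callable (standing in for ReadConsoleW) are hypothetical names, and error handling for invalid input is omitted.

```ruby
# Hypothetical model: delivers UTF-8 bytes in arbitrary amounts from a
# source that produces UTF-16LE chunks.
class Utf8ConsoleReader
  def initialize(source)
    @source = source  # callable returning UTF-16LE bytes, or nil at end of input
    # The converter keeps incomplete UTF-16 input (e.g. half a surrogate pair).
    @conv = Encoding::Converter.new("UTF-16LE", "UTF-8")
    # Converted UTF-8 bytes that have not been delivered yet.
    @pending = String.new(encoding: Encoding::BINARY)
  end

  # Returns up to n bytes, or nil once the source is exhausted.
  def read_bytes(n)
    while @pending.bytesize < n
      chunk = @source.call
      break if chunk.nil?
      dst = String.new
      src = chunk.dup.force_encoding("UTF-16LE")
      @conv.primitive_convert(src, dst, nil, nil, partial_input: true)
      @pending << dst.b
    end
    return nil if @pending.empty?
    @pending.slice!(0, n)
  end
end

# The second chunk starts in the middle of a UTF-16 code unit.
bytes = "\u00E4\u20AC\u{1F600}".encode("UTF-16LE").b
chunks = [bytes.byteslice(0, 5), bytes.byteslice(5..)]
reader = Utf8ConsoleReader.new(-> { chunks.shift })
out = String.new(encoding: Encoding::BINARY)
while (b = reader.read_bytes(1))
  out << b
end
text = out.force_encoding("UTF-8")
```

Two pieces of state are visible here: the converter holds incomplete UTF-16 input, and `@pending` holds UTF-8 bytes that were converted but not yet requested.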

YO4 · 2024-10-20T14:21:32Z

Another idea is to rely on ruby's encoding conversion layer to do the buffering.
I was personally experimenting with this at the time of the following issue.
https://bugs.ruby-lang.org/issues/19191

That experimental code was eventually abandoned and no working code remains, but I still remember some key points.

I made console input a special case in make_readconv() to achieve this.
The NEED_READCONV() macro became so complicated that I felt there must be another, better implementation.
With this approach miniruby requires UTF-16LE for console input, and an error occurs.

It would be great if the code could be organized so that the console input routines could be treated like a device that outputs UTF-8.
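What relying on the encoding conversion layer means can be shown at the Ruby level with a pipe standing in for the console. This is an illustration only: the helper `utf16_pipe` is a hypothetical name, and the experimental C change discussed above is not reproduced here.

```ruby
# Hypothetical helper: a pipe whose read end is declared as UTF-16LE input
# that Ruby should hand out as UTF-8.
def utf16_pipe(text)
  r, w = IO.pipe
  w.binmode
  w.write(text.encode("UTF-16LE").b)  # raw UTF-16LE bytes, as a console would supply
  w.close
  r.set_encoding("UTF-16LE", "UTF-8") # external encoding, internal encoding
  r
end

io = utf16_pipe("\u00E4\u00F6\u00FC\u20AC\n")
converted = io.read      # goes through the conversion layer
io.close

io = utf16_pipe("\u00E4\u00F6\u00FC\u20AC\n")
raw = io.read(2)         # a read with a length returns unconverted bytes
io.close
```

The text read comes back as UTF-8, while the read with an explicit length bypasses the conversion and returns the first UTF-16LE code unit as binary data.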

YO4 · 2024-11-11T15:32:33Z

I found a fragment of my previous code and modified it to compile and check. The code is in #12055.
The major difference is:
- yours: prepare UTF-8 in win32.c
- mine: do the conversion in io.c

My code seems to require too many modifications to io.c. It is also inconsistent when a binary read method is used.
It requires a build change to get the encoding and transcoding built in, which is not needed on any platform except win32.

We can avoid these by doing the processing on the win32.c side.
I think that your solution is better.

YO4 · 2024-11-11T15:39:27Z

On Windows, a process can have at most one console, but multiple file descriptors can refer to it.

If we could limit the encoding conversion buffer to a single one, it would simplify the code, but would it cause any problems?

It seems that perl5 does it that way.
Perl/perl5@dace60f
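The single-buffer idea can be sketched as follows. All names (`SharedConsole`, `Descriptor`) are hypothetical, and this models neither win32.c nor the perl5 commit; it only shows that bytes of one character stay in order when several descriptors drain one process-wide buffer.

```ruby
# Hypothetical model: one process-wide buffer of undelivered UTF-8 bytes.
module SharedConsole
  @pending = String.new(encoding: Encoding::BINARY)

  # Append converted console input to the single buffer.
  def self.feed(text)
    @pending << text.encode("UTF-8").b
  end

  # Hand out up to n bytes, regardless of which descriptor asks.
  def self.read(n)
    @pending.slice!(0, n)
  end
end

# Several descriptors may refer to the same console; none keeps its own buffer.
Descriptor = Struct.new(:fd) do
  def read(n)
    SharedConsole.read(n)
  end
end

SharedConsole.feed("\u20AC")
first = Descriptor.new(0).read(1)  # first byte of the 3-byte character
rest  = Descriptor.new(3).read(2)  # remaining bytes, via another descriptor
char  = (first + rest).force_encoding("UTF-8")
```

With one buffer per descriptor, the remaining two bytes would instead stay behind in the buffer of the descriptor that started the read.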

Windows: Change command line interface to UTF-8

d472152

- script name
- include paths
- script input from stdin

Windows: Use Unicode aware function to retrieve console inputs

805519b

larskanis force-pushed the utf8 branch from 0d1c23b to 805519b on October 5, 2024 09:47

deivid-rodriguez mentioned this pull request Nov 8, 2024

Mingw: Exclude failing tests due to the crt change #11991

Merged

YO4 mentioned this pull request Nov 11, 2024

[don't merge] windows console unicode input study #12055

Draft
