Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking Issue for utf8_chunks #99543

Closed
2 of 4 tasks
dylni opened this issue Jul 21, 2022 · 16 comments
Closed
2 of 4 tasks

Tracking Issue for utf8_chunks #99543

dylni opened this issue Jul 21, 2022 · 16 comments
Labels
C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC disposition-merge This issue / PR is in PFCP or FCP with a disposition to merge it. finished-final-comment-period The final comment period is finished for this PR / Issue. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@dylni
Copy link
Contributor

dylni commented Jul 21, 2022

Feature gate: #![feature(utf8_chunks)]

This is a tracking issue for an improved API for str::from_utf8.

Public API

// core::str

pub struct Utf8Chunks<'a> { ... }

impl<'a> Utf8Chunks<'a> {
    pub fn new(bytes: &'a [u8]) -> Self;
}

impl<'a> Iterator for Utf8Chunks<'a> {
    type Item = Utf8Chunk<'a>;
}

impl<'a> Clone for Utf8Chunks<'a>;
impl<'a> Debug for Utf8Chunks<'a>;
impl<'a> FusedIterator for Utf8Chunks<'a>;


pub struct Utf8Chunk<'a> { ... }

impl<'a> Utf8Chunk<'a> {
    pub fn valid(&self) -> &'a str;
    pub fn invalid(&self) -> &'a [u8];
}

impl<'a> Clone for Utf8Chunk<'a>;
impl<'a> Debug for Utf8Chunk<'a>;
impl<'a> PartialEq for Utf8Chunk<'a>;
impl<'a> Eq for Utf8Chunk<'a>;

Steps / History

Unresolved Questions

  • Should the constructor be Utf8Chunks::new or <[u8]>::utf8_chunks?
  • Should Utf8Chunks::debug or a similar method be exposed?

Footnotes

  1. https://std-dev-guide.rust-lang.org/feature-lifecycle/stabilization.html

@dylni dylni added C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Jul 21, 2022
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Aug 20, 2022
…lacrum

Expose `Utf8Lossy` as `Utf8Chunks`

This PR changes the feature for `Utf8Lossy` from `str_internals` to `utf8_lossy` and improves the API. This is done to eventually expose the API as stable.

Proposal: rust-lang/libs-team#54
Tracking Issue: rust-lang#99543
@dtolnay
Copy link
Member

dtolnay commented Nov 24, 2023

I'd be interested in using this to implement Display and Debug for CxxString in the cxx crate. Here is the current implementation without Utf8Chunks:

Here is what it looks like using Utf8Chunks as currently exists in nightly:

Are there other known use cases so far that we could look at before an FCP? One thing I am interested in is how the current Utf8Chunks API compares with this alternative one, not based on Iterator, with just 1 type:

pub struct Utf8Chunks<'a>;

impl<'a> Utf8Chunks<'a> {
    pub fn next_valid(&mut self) -> &'a str;
    pub fn next_invalid(&mut self) -> &'a [u8];
}

@dylni
Copy link
Contributor Author

dylni commented Dec 16, 2023

@dtolnay I am currently waiting on this ACP for stabilization.

Are there other known use cases so far that we could look at before an FCP?

I am aware of the following use cases.

  • Lossy conversion (String::from_utf8_lossy)
  • Debug formatting (as you mentioned)

I was originally going to use this feature in os_str_bytes for Debug formatting, but invalid returning individual "sequences" made this usage cumbersome. OsStr cannot be assumed to have the same invalid sequences. However, the individual sequences are required for lossy conversion to work with Utf8Chunks in its current form within libstd.

One thing I am interested in is how the current Utf8Chunks API compares with this alternative one, not based on Iterator, with just 1 type:

My concern is that the alternate API is easier to misuse (e.g., calling next_valid twice for two valid chunks). It also requires parsing each invalid sequence twice.

@Dylan-DPC
Copy link
Member

@dylni generally, you don't need an ACP for this to stabilise (unless the team explicitly asked for it which I don't think happened in this case).
The next step is an FCP. In which case, you can submit a stabilisation pr for it linking this issue and preferably putting the report you shared here in that pr and then the team will run an fcp either in the pr or the issue.

@dylni
Copy link
Contributor Author

dylni commented Mar 9, 2024

@Dylan-DPC Right, but the problem is that the ACP would change the API. Stabilizing at this point would prevent the API change from landing.

@dtolnay
Copy link
Member

dtolnay commented Apr 11, 2024

@rust-lang/libs-api:
@rfcbot fcp merge

I propose stabilizing core::str::Utf8Chunks and core::str::Utf8Chunk ❗with the minor modification described in rust-lang/libs-team#190 ❗.

I recently wanted this API for C++ string Debug impls in cxx as described in #99543 (comment), and also in libproc_macro for synthesizing C-string literals in #123769.

@rfcbot
Copy link

rfcbot commented Apr 11, 2024

Team member @dtolnay has proposed to merge this. The next step is review by the rest of the tagged team members:

No concerns currently listed.

Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

@rfcbot rfcbot added proposed-final-comment-period Proposed to merge/close by relevant subteam, see T-<team> label. Will enter FCP once signed off. disposition-merge This issue / PR is in PFCP or FCP with a disposition to merge it. final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. and removed proposed-final-comment-period Proposed to merge/close by relevant subteam, see T-<team> label. Will enter FCP once signed off. labels Apr 11, 2024
@rfcbot
Copy link

rfcbot commented Apr 11, 2024

🔔 This is now entering its final comment period, as per the review above. 🔔

@BurntSushi
Copy link
Member

Note that this same API, under the same name, is also in bstr. See also Utf8Chunks and Utf8Chunk. I can only identify two differences:

  • bstr doesn't have a Utf8Chunks::new constructor. This seems okay.
  • Utf8Chunk in std lacks bstr's Utf8Chunk::incomplete method. This seems okay and something we can add later if the use cases arise.

With that said, does this mean we also have an itertools situation here? bstr works by defining methods on [u8] via an extension trait.

@dtolnay dtolnay changed the title Tracking Issue for utf8_lossy Tracking Issue for utf8_chunks Apr 14, 2024
@dtolnay
Copy link
Member

dtolnay commented Apr 14, 2024

Yeah there would be 2 "itertools situations":

Code that calls .utf8_chunks() on [u8] and then calls .incomplete()

This can be mitigated by adding support for incomplete in libcore's Utf8Chunk.

use bstr::ByteSlice as _;

fn main() {
    for chunk in b"..."[..].utf8_chunks() {
        if chunk.incomplete() {
            /* ... */
        }
    }
}
error[E0599]: no method named `incomplete` found for struct `std::str::Utf8Chunk` in the current scope
 --> src/main.rs:5:18
  |
5 |         if chunk.incomplete() {
  |                  ^^^^^^^^^^ method not found in `Utf8Chunk<'_>`

warning: unused import: `bstr::ByteSlice`
 --> src/main.rs:1:5
  |
1 | use bstr::ByteSlice as _;
  |     ^^^^^^^^^^^^^^^
  |
  = note: `#[warn(unused_imports)]` on by default

Code that calls .utf8_chunks() on [u8] and expects the return type to be bstr::Utf8Chunks, as opposed to core::str::Utf8Chunks

This can be mitigated by having bstr check the compiler version and re-export libcore's Utf8Chunks and Utf8Chunk, instead of defining its own types.

use bstr::ByteSlice as _;

fn main() {
    let mut chunks: bstr::Utf8Chunks = b"..."[..].utf8_chunks();
    let chunk = chunks.next().unwrap();
    let _valid = bstr::Utf8Chunk::valid(&chunk);
}
error[E0308]: mismatched types
   --> src/main.rs:4:40
    |
4   |     let mut chunks: bstr::Utf8Chunks = b"..."[..].utf8_chunks();
    |                     ----------------   ^^^^^^^^^^^^^^^^^^^^^^^^ expected `bstr::Utf8Chunks<'_>`, found `std::str::Utf8Chunks<'_>`
    |                     |
    |                     expected due to this
    |
    = note: `std::str::Utf8Chunks<'_>` and `bstr::Utf8Chunks<'_>` have similar names, but are actually distinct types
note: `std::str::Utf8Chunks<'_>` is defined in crate `core`
   --> /rust/library/core/src/str/lossy.rs:175:1
    |
175 | pub struct Utf8Chunks<'a> {
    | ^^^^^^^^^^^^^^^^^^^^^^^^^
note: `bstr::Utf8Chunks<'_>` is defined in crate `bstr`
   --> ~/.cargo/registry/src/index.crates.io-6f17d22bba15001f/bstr-1.9.1/src/utf8.rs:227:1
    |
227 | pub struct Utf8Chunks<'a> {
    | ^^^^^^^^^^^^^^^^^^^^^^^^^

@BurntSushi
Copy link
Member

Thanks @dtolnay! I filed an issue on bstr: BurntSushi/bstr#181

I think it would be nice to have Utf8Chunk::incomplete included here, as that would make it easier to mitigate potential breakage. But I'm not sure it's worth blocking stabilization on that.

@dtolnay
Copy link
Member

dtolnay commented Apr 15, 2024

Do you happen to know any real-world usages of incomplete? If not, I can try crater on the stabilization PR to find them.

I would prefer not to blindly copy incomplete from bstr without having a sense of whether it's independently justifiable to include in core.

@BurntSushi
Copy link
Member

I don't have a specific example to point to unfortunately. But it's similar in concept at least to Utf8Error::error_len. My guess is that if you can find a real world use case for Utf8Error::error_len, then it would also serve as a use case for Utf8Chunk::incomplete. That is, it's useful in a streaming context where a single codepoint might straddle chunk boundaries. @m-ou-se actually added it to bstr long ago: BurntSushi/bstr#68

@m-ou-se
Copy link
Member

m-ou-se commented Apr 15, 2024

I don't remember exactly what I needed that function for, but I'm sure I sent that PR because I actually needed it. I can't find any uses of it in any of the code I currently have checked out, but it looks like at that time I was working on two relevant (never finished) projects where I might have needed this: 1. a terminal emulator, and 2. some embedded project with serial communication in chunks.

There is some separate discussion on use cases in this issue: BurntSushi/bstr#42

bors added a commit to rust-lang-ci/rust that referenced this issue Apr 15, 2024
Stabilize `Utf8Chunks`

Pending FCP in rust-lang#99543.

This PR includes the proposed modification in rust-lang/libs-team#190 as agreed in rust-lang#99543 (comment).
@coolreader18
Copy link
Contributor

Unless Utf8Chunks::new is intended as a temporary solution until there can be a utf8_chunks method on [u8] (à la array::IntoIter::new) I feel like I would expect the construction function to be core::str::utf8_chunks(&[u8]) -> Utf8Chunks<'_>. This is a different type of operation, but for example, std::fs::read_dir() returns std::fs::ReadDir, you don't call std::fs::ReadDir::new(). new implies an operation, just like any other function, and it semantically makes sense to me to create a new Vec or Cell or io::Error, but not to create a new Utf8Chunks. It's, like... a behavior type, not a data type.

@dtolnay
Copy link
Member

dtolnay commented Apr 19, 2024

GitHub search for "chunk.incomplete()" (guessing a likely variable name) shows 0 hits. GitHub search for ".incomplete()" shows 123 hits, of which I looked through all of them and 0 are related to utf8 chunks. In #123909, crater uncovered 0 uses of incomplete across crates.io and GitHub.

As such, I recommend that we do not import bstr's incomplete into core until such time as it is better justified.

Crater showed a single occurrence of the second type of expected break in #99543 (comment). The affected crate is actively developed (within the past 5 days) and is not depended on by any other crate on crates.io. This breakage is allowed by the standard library's API evolution policy (https://rust-lang.github.io/rfcs/1105-api-evolution.html#minor-change-adding-any-inherent-items).

I recommend that we move forward with #123909 as written (when the FCP completes in 2 days, of course).

@rfcbot rfcbot added finished-final-comment-period The final comment period is finished for this PR / Issue. to-announce Announce this issue on triage meeting and removed final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. labels Apr 21, 2024
@rfcbot
Copy link

rfcbot commented Apr 21, 2024

The final comment period, with a disposition to merge, as per the review above, is now complete.

As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed.

This will be merged soon.

bors added a commit to rust-lang-ci/rust that referenced this issue Apr 26, 2024
Stabilize `Utf8Chunks`

Pending FCP in rust-lang#99543.

This PR includes the proposed modification in rust-lang/libs-team#190 as agreed in rust-lang#99543 (comment).
RalfJung pushed a commit to RalfJung/miri that referenced this issue Apr 27, 2024
Stabilize `Utf8Chunks`

Pending FCP in rust-lang/rust#99543.

This PR includes the proposed modification in rust-lang/libs-team#190 as agreed in rust-lang/rust#99543 (comment).
RalfJung pushed a commit to RalfJung/rust-analyzer that referenced this issue Apr 27, 2024
Stabilize `Utf8Chunks`

Pending FCP in rust-lang/rust#99543.

This PR includes the proposed modification in rust-lang/libs-team#190 as agreed in rust-lang/rust#99543 (comment).
@dtolnay dtolnay closed this as completed Apr 28, 2024
@apiraino apiraino removed the to-announce Announce this issue on triage meeting label Apr 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC disposition-merge This issue / PR is in PFCP or FCP with a disposition to merge it. finished-final-comment-period The final comment period is finished for this PR / Issue. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

8 participants