CUE sheet UTF-8 encoding problem - broken track names with accented characters. #133
Comments
It looks like the character set is being misdetected. A single accented è is probably not sufficient. Could you provide me with the problem cue files? Logs are not required. Send me no audio files, of course, only the cuesheets.
Thanks for taking a look! I really <3 ffmpegfs :) Here are the CUE sheets from my examples. I've attached both the originals and the iconv converted ones, appending .txt to the filenames to satisfy GitHub. "Le Voyage dans la Lune" is the problematic one, but both have accented characters in their track lists. The attachments are: the original ISO-8859-1 encoded CUE sheet, the iconv converted UTF-8 CUE sheet with the problem, the original "unknown-8bit" / WINDOWS-1252 / CP-1252 encoded CUE sheet, and its iconv converted UTF-8 CUE sheet. Regards,
It is as I expected: the UTF-8 file is misdetected as ISO-8859-2. There is only a single non-ASCII character in the word "Décollage", so the underlying library (libchardet) gets misled. When I change the word to "Décolláge", the file is correctly detected as UTF-8. Sorry, there is no way to fix that on my side. libchardet uses heuristics to determine the character set, and these may sometimes fail. This is the first time I have seen that, though. You can work around the problem by adding a BOM (Byte Order Mark) to your UTF-8 files, which will avoid misdetections. You can do that with sed. Just take care not to update the original files, though.
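As an alternative to sed, the BOM workaround can be sketched in Python. This is only an illustration, not the command from the comment above: the function name `write_copy_with_bom` and the file names in the usage note are hypothetical. It writes a copy and leaves the original untouched, in line with the advice above.

```python
import codecs
from pathlib import Path


def write_copy_with_bom(src: Path, dst: Path) -> None:
    """Write a copy of a UTF-8 file to dst with a UTF-8 BOM prepended."""
    data = src.read_bytes()
    # Raise UnicodeDecodeError if the input is not valid UTF-8,
    # rather than tagging a differently encoded file as UTF-8.
    data.decode("utf-8")
    # Only add the BOM (bytes EF BB BF) if it is not already there.
    if not data.startswith(codecs.BOM_UTF8):
        data = codecs.BOM_UTF8 + data
    dst.write_bytes(data)
```

A call such as `write_copy_with_bom(Path("album.cue"), Path("album.bom.cue"))` (hypothetical file names) would then produce the BOM-prefixed copy.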
Thanks for looking into this and providing a solution! This works perfectly. I see other people have commented in the libchardet GitHub issue tracker about misdetection when there is only one accented or international UTF-8 character: Joungkyun/libchardet#17
You are welcome! Thanks for the hint, it looks like the problem is known. But the issue has been open since 2020 without progress; I hope there will be a fix one day.
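To illustrate what a misdetection as ISO-8859-2 does to a track name, here is a minimal Python sketch. The helper `misdecode` is hypothetical, and that this is exactly the path ffmpegfs takes internally is an assumption; the sketch only shows what happens when UTF-8 bytes are decoded with the wrong character set.

```python
def misdecode(text: str, wrong_encoding: str = "iso-8859-2") -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong charset."""
    return text.encode("utf-8").decode(wrong_encoding)


title = "Décollage"
# In UTF-8 the single "é" is stored as the two bytes C3 A9.
print(title.encode("utf-8"))
# Read as ISO-8859-2, each of those bytes becomes its own character,
# so the one accented letter turns into two unrelated ones.
print(misdecode(title))
```

Both byte values are also valid ISO-8859-2 characters, which is why a detector that relies on heuristics has little to go on when this is the only non-ASCII sequence in the file.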
Issue:
Certain UTF-8 encoded CUE sheets result in ffmpegfs transcoded flac.track/ files with broken track names.
Symptom:
For example, this CUE sheet:
Other UTF-8 encoded CUE sheets that have multiple accented characters, either in one track name or spread across multiple track names with one accented character each, do not suffer from this problem. For example:
Background:
My /etc/fstab:
I compiled ffmpegfs from git main:
Kernel:
I've been ripping CDs with CUERipper under wine on Devuan Daedalus (which is based on Debian Testing/Bookworm) to single FLAC files with embedded and separate CUE sheets. Releases whose track names include accented (i.e. non-ASCII / ANSI) characters such as é, ĉ or à usually result in CUE sheet files that are not UTF-8 encoded. For example:
I've determined that "unknown-8bit" encoding is actually WINDOWS-1252 as you might expect.
I want all my CUE sheets to be UTF-8 encoded and *nix compliant, so I used iconv and then dos2unix (to remove <CRLF>) to achieve this.

The single album FLAC files given in my examples both have embedded CUE sheets as well as a separate one in the same folder. For the buggy one, either removing the separate UTF-8 encoded CUE sheet from the album folder (so that ffmpegfs reads the embedded one) or using the original ISO-8859-1 encoded one results in no track name bug.
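For reference, the conversion step described above (iconv followed by dos2unix) can be approximated in Python. This is a rough sketch, not the commands actually used: the function name `to_utf8_unix` is hypothetical, and the default source encoding of WINDOWS-1252 is an assumption taken from the "unknown-8bit" case above.

```python
def to_utf8_unix(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Re-encode raw bytes as UTF-8 and convert CRLF line endings to LF."""
    # Decoding raises UnicodeDecodeError for bytes that are undefined
    # in the source encoding (cp1252 has a few such byte values).
    text = raw.decode(source_encoding)
    # Equivalent in spirit to dos2unix: drop the CR of each CRLF pair.
    return text.replace("\r\n", "\n").encode("utf-8")


# "é" is the single byte E9 in both ISO-8859-1 and WINDOWS-1252.
converted = to_utf8_unix(b'TITLE "D\xe9collage"\r\n')
print(converted)
```

Passing `source_encoding="iso-8859-1"` covers the other original file. The result carries no BOM, so a file with a single accented character converted this way would still be subject to the detection heuristics discussed in the comments.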
Strangely, if I edit the buggy UTF-8 CUE sheet to either add another accented character like é to the bugged track name or add one in /certain/ other positions within a different track name, then remount ffmpegfs, the bug disappears. For example:
My DEBUG logs do not have any information that makes me any wiser. Let me know if you'd like TRACE logs.