This repository has been archived by the owner on Jan 22, 2019. It is now read-only.

Invalid UTF8-character in non-UTF8 file is detected too early, so parsing can not be continued #132

Closed
flappingeagle opened this issue Aug 24, 2016 · 4 comments

@flappingeagle

flappingeagle commented Aug 24, 2016

The following code example can be tested with the attached file (test8.csv). The file is ISO-8859 encoded and contains a character that is invalid as UTF-8: é

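            import java.io.File;
            import java.io.InputStream;
            import java.nio.file.Files;
            import java.nio.file.StandardOpenOption;
            import java.util.Map;
            import com.fasterxml.jackson.databind.MappingIterator;
            import com.fasterxml.jackson.databind.ObjectReader;
            import com.fasterxml.jackson.dataformat.csv.CsvMapper;
            import com.fasterxml.jackson.dataformat.csv.CsvSchema;

            File file = new File("test8.csv");
            InputStream in = Files.newInputStream(file.toPath(), StandardOpenOption.READ);

            // Use the first row as the header, so each row is read as a Map keyed by column name.
            CsvSchema schema = CsvSchema.emptySchema().withHeader();
            CsvMapper mapper = new CsvMapper();
            ObjectReader reader = mapper.readerFor(Map.class).with(schema);
            MappingIterator<Map<String, String>> mappingIterator = reader.readValues(in);

            // Print every row; the exception is thrown from nextValue().
            while (mappingIterator.hasNextValue()) {
                Map<String, String> line = mappingIterator.nextValue();
                System.out.println(line);
            }
            mappingIterator.close();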

Parsing crashes at line 152 of the file, in the call to "nextValue()". But the problematic character is on line 185. So parsing does not fail at the position of the problematic character but much earlier (presumably because of buffering?).

I ask because if parsing failed at the exact position of the problematic character, we could simply ignore that line and continue with the next one. As it is, parsing fails earlier and cannot be recovered or continued.

The following parse exception is output:

java.io.CharConversionException: Invalid UTF-8 middle byte 0x65 (at char #4861, byte #3999): check content encoding, does not look like UTF-8

The problematic character in test8.csv can be found in the vi editor with ":goto 4861".

test8.csv.zip

@flappingeagle flappingeagle changed the title non-UTF8 content is detected to early, so parsing can not be continued non-UTF8 content is detected too early, so parsing can not be continued Aug 24, 2016
@flappingeagle flappingeagle changed the title non-UTF8 content is detected too early, so parsing can not be continued invalid UTF8-character in non-UTF8 file is detected too early, so parsing can not be continued Aug 24, 2016
@cowtowncoder
Member

Yes, this is due to buffering. In the CSV backend, character decoding is separate from tokenization (unlike with JSON, where the two are integrated for performance reasons, which also helps with exact error reporting), so decoding proceeds block by block, ahead of tokenization.

I wonder if it might actually be possible to improve the UTF-8 reader to postpone error reporting: if at least one character has already been decoded, that character (and whatever else was successfully decoded) would be returned, and an exception would be thrown only if the problem occurs with the first character to decode.
This should give much more accurate error reporting.

What do you think?
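
To make the idea concrete, here is a minimal sketch of postponed error reporting using only the JDK's `CharsetDecoder`. It is not the Jackson CSV reader's actual code; the class `DeferredDecodeSketch` and its method `readBlock` are hypothetical names invented for illustration. The point is only that the characters decoded before the bad byte are handed out first, and the exception surfaces on the following call.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of jackson-dataformat-csv.
class DeferredDecodeSketch {
    private final ByteBuffer in;
    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
    // Error hit while decoding a block that still produced some characters.
    private CoderResult pendingError;

    DeferredDecodeSketch(byte[] bytes) {
        this.in = ByteBuffer.wrap(bytes);
    }

    /**
     * Decodes up to maxChars characters (assumed to be at least 2).
     * Returns null at end of input. A decoding error is thrown only when
     * the failing byte would be the first one decoded by this call.
     */
    String readBlock(int maxChars) throws CharacterCodingException {
        if (pendingError != null) {
            CoderResult error = pendingError;
            pendingError = null;
            error.throwException(); // the postponed failure surfaces here
        }
        if (!in.hasRemaining()) {
            return null;
        }
        CharBuffer out = CharBuffer.allocate(maxChars);
        // The whole input is already in the buffer, so endOfInput is true.
        CoderResult result = decoder.decode(in, out, true);
        out.flip();
        if (result.isError()) {
            if (out.length() == 0) {
                result.throwException(); // nothing decoded: fail right away
            }
            pendingError = result; // return the valid prefix first
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "ab\n" followed by an ISO-8859-1 encoded 'é' (0xE9) and "e\n".
        byte[] bytes = {'a', 'b', '\n', (byte) 0xE9, 'e', '\n'};
        DeferredDecodeSketch reader = new DeferredDecodeSketch(bytes);
        try {
            String block;
            while ((block = reader.readBlock(16)) != null) {
                System.out.print(block);
            }
        } catch (CharacterCodingException e) {
            System.out.println("decoding failed after the valid prefix: " + e);
        }
    }
}
```

The sample bytes mirror the reported exception: 0xE9 starts a three-byte UTF-8 sequence, and the following 0x65 ('e') is not a valid continuation byte.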

@flappingeagle
Author

flappingeagle commented Aug 25, 2016

It would be nice to postpone error reporting if possible, as you said.

Say you have already buffered lines 50-100, and line 80 contains a character that cannot be decoded.
It would be nice if lines 50-79 could be processed without error, and the exception were thrown only when "nextValue()" is called for line 80 (for example).

So yes, I think what you say sounds good.

I could also understand the viewpoint that an error should be thrown as early as possible, but for my use case it would be good to postpone it.

It would be even better if the reader could recover from the error and continue with the following lines in the file, but this may not be possible.
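
For what it is worth, recovery of this kind can be approximated outside the parser. The following is a rough sketch using only JDK classes, not a Jackson feature: the decoder is told to substitute U+FFFD for malformed bytes instead of failing, and any line containing that replacement character is skipped. The class `LenientLineReaderSketch` and the method `readCleanLines` are hypothetical names; a line that legitimately contains U+FFFD would be dropped as well.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of jackson-dataformat-csv.
class LenientLineReaderSketch {
    // Character the decoder substitutes for bytes it cannot decode.
    static final char REPLACEMENT = '\uFFFD';

    /** Reads all lines, dropping those that contained undecodable bytes. */
    static List<String> readCleanLines(InputStream in) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        List<String> clean = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, decoder))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.indexOf(REPLACEMENT) >= 0) {
                    continue; // skip the damaged line, keep reading
                }
                clean.add(line);
            }
        }
        return clean;
    }

    public static void main(String[] args) throws IOException {
        // The second line holds an ISO-8859-1 encoded 'é', which is malformed as UTF-8.
        byte[] data = "a,b\n\u00e9e,x\nc,d\n".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(readCleanLines(new ByteArrayInputStream(data)));
    }
}
```

The surviving lines could then be handed to the CSV parser as text. Whether silently dropping rows is acceptable depends on the use case.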

@cowtowncoder
Member

@flappingeagle I agree with "synchronized" failure, and think too-early failure is not beneficial in most (or perhaps any) cases. So the question is just whether I can figure out how to make this work without adding processing overhead. I think that is possible; I just need to find time to play with the code.

Thank you again for reporting this. I think this would be a great improvement. In the CSV module, similar improvements have already been made to allow dealing with occasional malformed/mismapping rows, all in all vastly improving the developer experience.

@cowtowncoder cowtowncoder changed the title invalid UTF8-character in non-UTF8 file is detected too early, so parsing can not be continued Invalid UTF8-character in non-UTF8 file is detected too early, so parsing can not be continued Aug 25, 2016
cowtowncoder added a commit that referenced this issue Aug 25, 2016
@cowtowncoder cowtowncoder modified the milestones: 2.3.0, 2.7.7 Aug 25, 2016
@cowtowncoder
Member

Will be included in 2.7.7 and 2.8.2, when released.
