Invalid UTF8-character in non-UTF8 file is detected too early, so parsing can not be continued #132

flappingeagle · 2016-08-24T11:18:00Z

The following code example can be tested with the attached file (test8.csv). The file is in ISO-8859 format and contains a character that is not valid UTF-8 in that encoding: é

            File file = new File("test8.csv");
            InputStream in = Files.newInputStream(file.toPath(), StandardOpenOption.READ);

            CsvSchema schema = CsvSchema.emptySchema().withHeader();
            CsvMapper mapper = new CsvMapper();
            ObjectReader reader = mapper.readerFor(Map.class).with(schema);
            MappingIterator<Map<String, String>> mappingIterator = reader.readValues(in);

            while (mappingIterator.hasNextValue()) {
                Map<String, String> line = mappingIterator.nextValue();
                System.out.println(line);
            }
            mappingIterator.close();

The parsing crashes at line 152, on the call to "nextValue()". But the problematic character is in line 185, so the parsing does not crash at the position of the problematic character but much earlier (presumably because of buffering?).

I ask because if the parsing crashed at the exact position of the problematic character, we could simply ignore that line and continue with the next one. As it is, the parsing crashes earlier and cannot be recovered or continued.

The following parse exception is output:

java.io.CharConversionException: Invalid UTF-8 middle byte 0x65 (at char #4861, byte #3999): check content encoding, does not look like UTF-8

The problematic character in test8.csv can be found in the vi editor with ":goto 4861".

test8.csv.zip

cowtowncoder · 2016-08-24T18:21:15Z

Yes, this is due to buffering. In the CSV backend, character decoding is separate from tokenization, so decoding proceeds block by block, ahead of tokenization. (With JSON the two are integrated, mainly for performance reasons, which also helps with exact error reporting.)
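The same effect can be reproduced with plain JDK classes. The sketch below is only an illustration, not the CSV module's decoder: it assumes a strict UTF-8 `CharsetDecoder` behind a `BufferedReader`, and the class name, helper methods, and sample data are made up for the example. The malformed byte sits in the third row, yet the failure is expected to surface before the two clean rows in front of it have been handed out, because the whole block is decoded ahead of line splitting.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

class EarlyDecodeFailureDemo {

    /**
     * Counts the lines returned before the decoder reports malformed input.
     * Returns -1 if the input decodes cleanly.
     */
    static int linesReadBeforeFailure(byte[] data) {
        // A decoder from newDecoder() reports malformed input instead of replacing it.
        InputStreamReader decoding = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8.newDecoder());
        int lines = 0;
        try (BufferedReader reader = new BufferedReader(decoding)) {
            while (reader.readLine() != null) {
                lines++;
            }
            return -1;
        } catch (CharacterCodingException e) {
            // Decoding failed; report how far line reading had progressed.
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Two clean rows, then a row where ISO-8859-1 "é" (0xE9) is followed by 'e' (0x65). */
    static byte[] sample() {
        return "a,b\n1,2\n3,\u00e9e\n".getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // With exact error reporting this would print 2; block-wise decoding fails sooner.
        System.out.println("lines read before failure: " + linesReadBeforeFailure(sample()));
    }
}
```

The byte pair 0xE9 0x65 mirrors the "Invalid UTF-8 middle byte 0x65" message from the report: 0xE9 announces a multi-byte sequence, and the following byte is not a continuation byte.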

I wonder if it might be possible to improve the UTF-8 reader to postpone error reporting: if at least one character has already been decoded, that (and whatever else was successfully decoded) would be returned, and the exception would be thrown only if the problem occurs with the first character to decode.
This should give much more accurate error reporting.

What do you think?
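A rough sketch of that idea, using only JDK classes: this is not the actual reader in the CSV module, and `DeferredErrorUtf8Reader` and `DeferredErrorDemo` are hypothetical names invented for the illustration. It also assumes the whole input is already in memory, which a real streaming reader cannot do. When the decoder hits a malformed sequence after producing some characters, the reader returns those characters and raises the error on the next call.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

/** Hypothetical reader: returns what decoded cleanly, reports the error on the next read. */
class DeferredErrorUtf8Reader extends Reader {
    private final ByteBuffer input;
    // newDecoder() reports malformed input by default.
    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    private CoderResult pendingError;
    private boolean exhausted;

    DeferredErrorUtf8Reader(byte[] data) {
        this.input = ByteBuffer.wrap(data);
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (pendingError != null) {
            pendingError.throwException(); // the postponed failure surfaces here
        }
        if (exhausted) {
            return -1;
        }
        if (len == 0) {
            return 0;
        }
        CharBuffer out = CharBuffer.wrap(cbuf, off, len);
        CoderResult result = decoder.decode(input, out, true);
        int decoded = out.position() - off;
        if (result.isError()) {
            pendingError = result;
            if (decoded == 0) {
                result.throwException(); // nothing to return first: fail right away
            }
            return decoded; // hand out the clean prefix, fail on the next call
        }
        if (result.isUnderflow()) {
            exhausted = true; // all input consumed
            return decoded > 0 ? decoded : -1;
        }
        return decoded; // overflow: the caller's buffer is full
    }

    @Override
    public void close() {
        // Nothing to release: the input is an in-memory buffer.
    }
}

class DeferredErrorDemo {
    /** Counts the lines returned before the failure; -1 if the input decodes cleanly. */
    static int linesReadBeforeFailure(byte[] data) {
        int lines = 0;
        try (BufferedReader reader = new BufferedReader(new DeferredErrorUtf8Reader(data))) {
            while (reader.readLine() != null) {
                lines++;
            }
            return -1;
        } catch (CharacterCodingException e) {
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Two clean rows, then a row where ISO-8859-1 "é" (0xE9) is followed by 'e' (0x65). */
    static byte[] sample() {
        return "a,b\n1,2\n3,\u00e9e\n".getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println("lines read before failure: " + linesReadBeforeFailure(sample()));
    }
}
```

With the sample data, both clean rows are returned and the exception is raised only when the row containing the bad byte is requested. The extra cost is one field check per `read` call, which is in line with the goal of not adding processing overhead.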

flappingeagle · 2016-08-25T12:57:43Z

It would be nice to postpone error reporting if possible, as you said.

Say you have already buffered lines 50-100, of which line 80 has a character which cannot be decoded.
It would be nice if lines 50-79 could be processed without error, and the exception were thrown only when "nextValue()" is called for line 80 (for example).

So yes, I think what you say sounds good.

I could also understand the viewpoint that an error should be thrown as early as possible, but for my use case it would be good to postpone.

It would be even better if the reader could recover from the error and continue with the next lines in the file, but this may not be possible.
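Outside of the CSV module, one way to approximate "skip the bad line and continue" is to decode leniently and filter before parsing. The sketch below is a workaround under stated assumptions, not a feature of the library: it relies on `InputStreamReader` replacing malformed bytes with U+FFFD when it is given a `Charset`, it splits on line breaks (so quoted multi-line CSV values would break it), and `SkipUndecodableLines` is a made-up name.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class SkipUndecodableLines {
    // The replacement character the JDK substitutes for malformed input.
    static final char REPLACEMENT = '\uFFFD';

    /** Returns only the lines that decoded without any replacement character. */
    static List<String> readCleanLines(byte[] data) {
        List<String> clean = new ArrayList<>();
        // Given a Charset (not a strict decoder), InputStreamReader replaces
        // malformed bytes with U+FFFD instead of throwing.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.indexOf(REPLACEMENT) >= 0) {
                    continue; // this line contained bytes that are not valid UTF-8
                }
                clean.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return clean;
    }

    public static void main(String[] args) {
        // Third row holds ISO-8859-1 "é" (0xE9) followed by 'e'; the others are clean.
        byte[] data = "a,b\n1,2\n3,\u00e9e\n4,5\n".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(readCleanLines(data));
    }
}
```

A line that legitimately contains U+FFFD would be dropped as well, so this only suits data where that character is not expected. If the file is known to be ISO-8859-1 throughout, constructing the `InputStreamReader` with `StandardCharsets.ISO_8859_1` avoids the decoding error altogether.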

cowtowncoder · 2016-08-25T18:31:05Z

@flappingeagle I agree with "synchronized" failure, and think too-early failure is not beneficial for most (or perhaps any) cases. So the question is just whether I can figure out how to make this work without adding processing overhead. I think that is possible; I just need to find time to play with the code.

Thank you again for reporting this: I think this would be a great improvement. Similar improvements have been made in the CSV module to allow dealing with occasional malformed or mismapped rows, all in all vastly improving the developer experience.

cowtowncoder · 2016-08-26T05:49:58Z

Will be included in 2.7.7 and 2.8.2, when released.

flappingeagle changed the title ~~non-UTF8 content is detected to early, so parsing can not be continued~~ non-UTF8 content is detected too early, so parsing can not be continued Aug 24, 2016

flappingeagle changed the title ~~non-UTF8 content is detected too early, so parsing can not be continued~~ invalid UTF8-character in non-UTF8 file is detected too early, so parsing can not be continued Aug 24, 2016

cowtowncoder changed the title ~~invalid UTF8-character in non-UTF8 file is detected too early, so parsing can not be continued~~ Invalid UTF8-character in non-UTF8 file is detected too early, so parsing can not be continued Aug 25, 2016

cowtowncoder added a commit that referenced this issue Aug 25, 2016

Add failing test for #132

a3045b7

cowtowncoder closed this as completed in 9dbf008 Aug 25, 2016

cowtowncoder modified the milestones: 2.3.0, 2.7.7 Aug 25, 2016
