UnicodeDecodeError while fetching items #154

mijamo · 2021-02-17T09:36:03Z

It seems like I randomly get errors like this:

 UnicodeDecodeError: 'utf-8' codec can't decode byte 0xde in position 174: invalid continuation byte

        at msgpack._cmsgpack.Unpacker._unpack (_unpacker.pyx:443)
        at msgpack._cmsgpack.Unpacker.__next__ (_unpacker.pyx:518)
        at mpdecode (/usr/local/lib/python3.7/site-packages/scrapinghub/hubstorage/serialization.py:33)
        at iter (/usr/local/lib/python3.7/site-packages/scrapinghub/client/proxy.py:115)

This happens while iterating over the items through last_job.items.iter().
It seems to happen about 50% of the time from what I see. I scrape the same website every day and run that function; sometimes it works fine, sometimes it raises that error. I am not sure whether this is an issue with this library or with the ScrapingHub API, but it is very problematic.

This happens on the latest (2.3.1) version


Gallaecio · 2021-02-18T11:45:26Z

Could #151 be the answer to this?

mijamo · 2021-02-18T11:50:26Z

I am using msgpack v1.0.2 so I don't think this is the issue

Gallaecio · 2021-02-18T12:46:48Z

#121 also seems related. I would try uninstalling msgpack, see if that makes any difference.

mijamo · 2021-02-18T12:47:30Z

After checking my logs, it seems that when the error occurs I have received strings from the ScrapingHub API that are well-formed UTF-8 apart from \xde\x00\x18\xa4, \xde\x00\x16\xa4 or \xde\x00\x19\xa4. Those sequences seem to be inserted between some properties (for instance, in my case I have a description field that I get correctly, and then that sequence gets inserted before the next property starts). The weird thing is that I cannot trigger the error manually: every time I fetch the items through the command line it works, and the source data seems correct.

mijamo · 2021-02-18T13:09:35Z

After looking even deeper in the logs it seems that those sequences are not randomly inserted. Instead it looks like the description field is in those cases "cut" at some point and then the weird sequence is inserted and then it moves to another field.

After seeing that I suspect this might not be a problem with this library but maybe more with ScrapingHub API?
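For what it's worth, the byte sequences quoted above can be read against the msgpack format itself: 0xde is the marker for a map with a 16-bit entry count, and 0xa0 to 0xbf mark a short string whose length is in the low 5 bits. The sketch below uses only the standard library (the helper name describe_msgpack_prefix is made up for illustration, it is not part of msgpack or scrapinghub) and suggests that each sequence is the start of a new record rather than corrupt text, which would fit the "field is cut, then another record begins" observation.

```python
import struct


def describe_msgpack_prefix(data: bytes) -> str:
    """Describe a msgpack map16 header and, if present, a fixstr marker after it."""
    # 0xde is the msgpack marker for "map 16": a 2-byte big-endian pair count follows.
    if len(data) < 3 or data[0] != 0xDE:
        return "not a map16 header"
    (n_pairs,) = struct.unpack(">H", data[1:3])
    text = f"map16 with {n_pairs} key/value pairs"
    # 0xa0..0xbf is "fixstr": the low 5 bits hold the string length in bytes.
    if len(data) > 3 and 0xA0 <= data[3] <= 0xBF:
        text += f", first key is a fixstr of {data[3] & 0x1F} bytes"
    return text


for seq in (b"\xde\x00\x18\xa4", b"\xde\x00\x16\xa4", b"\xde\x00\x19\xa4"):
    print(seq.hex(), "->", describe_msgpack_prefix(seq))

# If a truncated string field swallows these bytes, UTF-8 decoding fails:
# 0xde announces a 2-byte character, but 0x00 is not a continuation byte.
try:
    b"\xde\x00".decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc.reason)
```

Read this way, the three sequences are headers of maps with 24, 22 and 25 entries whose first key is 4 bytes long, and the decoding failure matches the "invalid continuation byte" message in the traceback.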

Gallaecio · 2021-02-18T14:07:23Z

I think this is #121. Iterating with iter for a long time can be an issue. #121 explains it better, including how to work around it.
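This thread does not spell out the workaround from #121, so the following is only a generic illustration of one way to avoid a single long-running iteration: read the data in bounded pages, each fetched by its own short request. Everything here is an assumption for the sake of the sketch; fetch_page is a hypothetical callable taking (start, count), not a scrapinghub API, and #121 remains the reference for the actual recommendation.

```python
from typing import Callable, Dict, Iterator, List


def iter_in_pages(
    fetch_page: Callable[[int, int], List[Dict]],
    page_size: int = 1000,
) -> Iterator[Dict]:
    """Yield items page by page so that no single request stays open for long.

    fetch_page(start, count) is a hypothetical function that returns at most
    `count` items beginning at offset `start`.
    """
    start = 0
    while True:
        page = fetch_page(start, page_size)
        yield from page
        # A short (or empty) page means there is nothing left to fetch.
        if len(page) < page_size:
            return
        start += len(page)


# Stand-in data source used only to exercise the sketch.
fake_items = [{"id": i} for i in range(25)]


def fake_fetch(start: int, count: int) -> List[Dict]:
    return fake_items[start:start + count]


collected = list(iter_in_pages(fake_fetch, page_size=10))
print(len(collected))
```

Each page is fully received before the next one is requested, so slow per-item processing on the client side no longer keeps one streaming response open.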

mijamo · 2021-02-18T15:21:24Z

Thank you, I will try that solution and close this issue if it fixes it. It might take a few days though, as the error doesn't happen every day, as I mentioned.

mijamo · 2021-03-01T09:56:04Z

It does seem like it fixed the issue, thank you for the help.

This might be worth mentioning in the documentation somewhere though because the error doesn't make it easy to understand the problem.

mijamo closed this as completed Mar 1, 2021
