UnicodeDecodeError while fetching items #154

Closed
mijamo opened this issue Feb 17, 2021 · 8 comments

Comments

@mijamo

mijamo commented Feb 17, 2021

It seems like I randomly get errors like this:

 UnicodeDecodeError: 'utf-8' codec can't decode byte 0xde in position 174: invalid continuation byte

        at msgpack._cmsgpack.Unpacker._unpack (_unpacker.pyx:443)
        at msgpack._cmsgpack.Unpacker.__next__ (_unpacker.pyx:518)
        at mpdecode (/usr/local/lib/python3.7/site-packages/scrapinghub/hubstorage/serialization.py:33)
        at iter (/usr/local/lib/python3.7/site-packages/scrapinghub/client/proxy.py:115) 

This happens while iterating over the items through last_job.items.iter().
It seems to happen about 50% of the time from what I see. I scrape the same website every day and run that function; sometimes it works fine, sometimes it raises that error. I am not sure whether this is an issue with this library or with the ScrapingHub API, but it is very problematic.

This happens on the latest version (2.3.1).

@Gallaecio
Member

Could #151 be the answer to this?

@mijamo
Author

mijamo commented Feb 18, 2021

I am using msgpack v1.0.2, so I don't think this is the issue.

@Gallaecio
Member

#121 also seems related. I would try uninstalling msgpack to see if that makes any difference.

@mijamo
Author

mijamo commented Feb 18, 2021

After checking my logs, it seems that when the error occurs I have received strings from the ScrapingHub API that are well-formed UTF-8 apart from \xde\x00\x18\xa4, \xde\x00\x16\xa4 or \xde\x00\x19\xa4. Those sequences seem to be inserted between some properties (for instance, in my case I have a description field that I get correctly, and then that sequence gets inserted before the next property starts). The weird thing is that I cannot trigger the error manually: every time I fetch the items through the command line it works, and the source data seems correct.
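
For reference, the byte 0xde is the MessagePack map 16 marker, so these sequences can be read as the start of a new map rather than as text. The sketch below decodes the four-byte prefixes quoted above using only the standard library. `describe_msgpack_prefix` is a hypothetical helper written for this illustration; it is not part of python-scrapinghub or msgpack.

```python
import struct


def describe_msgpack_prefix(data: bytes) -> str:
    """Describe a map16 header and, if present, the fixstr header after it.

    Hypothetical helper for illustration only.
    """
    if len(data) < 3 or data[0] != 0xDE:
        return "not a map16 header"
    # map16: marker 0xde followed by a big-endian 16-bit pair count
    (n_pairs,) = struct.unpack(">H", data[1:3])
    if len(data) > 3 and 0xA0 <= data[3] <= 0xBF:
        # fixstr: 0b101xxxxx, where the low 5 bits are the byte length
        key_len = data[3] & 0x1F
        return f"map16 with {n_pairs} pairs; first key is a fixstr of length {key_len}"
    return f"map16 with {n_pairs} pairs"


for seq in (b"\xde\x00\x18\xa4", b"\xde\x00\x16\xa4", b"\xde\x00\x19\xa4"):
    print(seq, "->", describe_msgpack_prefix(seq))
```

Read this way, each sequence looks like the header of an item with 24, 22 or 25 fields whose first key is four bytes long.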

@mijamo
Author

mijamo commented Feb 18, 2021

After looking even deeper in the logs, it seems that those sequences are not randomly inserted. Instead, in those cases the description field is "cut" at some point, then the weird sequence appears, and then the data moves on to another field.
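
One way a cut like this could produce the exact error above, assuming the unpacker is still reading a string whose declared length runs past the point where the data was cut, is that the bytes of the next item header get decoded as UTF-8. The sketch below uses a made-up description and one of the header sequences from the logs. `decode_error_for` is a hypothetical helper.

```python
def decode_error_for(payload: bytes):
    """Return the UnicodeDecodeError raised by decoding payload as UTF-8, or None."""
    try:
        payload.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc
    return None


# Made-up description text, cut short before its declared length is reached.
truncated_description = "Some description tex".encode("utf-8")
# Header bytes of the following item, taken from the logs above.
next_item_header = b"\xde\x00\x18\xa4"

payload = truncated_description + next_item_header
err = decode_error_for(payload)
# 0xde announces a two-byte UTF-8 sequence, but 0x00 is not a continuation byte.
print(err)
```

The failing position is the offset of the 0xde byte, which matches the shape of the traceback in the first comment.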

After seeing that I suspect this might not be a problem with this library but maybe more with ScrapingHub API?

@Gallaecio
Member

I think this is #121. Using iter for a long time can be an issue. It is better explained in #121, including how to work around it.
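
This thread does not spell out the workaround from #121. As a rough sketch, assuming it amounts to reading items in smaller batches rather than keeping one long-lived iterator open, the pattern could look like this. `fetch_page` is a hypothetical stand-in for whatever call returns a slice of items; it is not a python-scrapinghub function.

```python
from typing import Callable, Dict, Iterator, List


def iter_in_chunks(
    fetch_page: Callable[[int, int], List[Dict]],
    chunksize: int = 1000,
) -> Iterator[Dict]:
    """Yield items by requesting bounded pages instead of one long stream.

    fetch_page(start, count) is a hypothetical callable that returns at most
    `count` items beginning at offset `start`.
    """
    start = 0
    while True:
        page = fetch_page(start, chunksize)
        if not page:
            return
        yield from page
        if len(page) < chunksize:
            # A short page means the end of the data was reached.
            return
        start += len(page)


# In-memory stand-in so the sketch runs without any network access.
ITEMS = [{"id": i} for i in range(7)]


def fake_fetch_page(start: int, count: int) -> List[Dict]:
    return ITEMS[start:start + count]


collected = list(iter_in_chunks(fake_fetch_page, chunksize=3))
print(len(collected))
```

Each request is short, so a stalled or truncated response affects only one page, which can then be retried on its own.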

@mijamo
Author

mijamo commented Feb 18, 2021

Thank you, I will try that solution and close this issue if it fixes it. It might take a few days though, since, as I mentioned, the error doesn't happen every day.

@mijamo
Author

mijamo commented Mar 1, 2021

It does seem like it fixed the issue, thank you for the help.

This might be worth mentioning somewhere in the documentation though, because the error message doesn't make it easy to understand the problem.

@mijamo mijamo closed this as completed Mar 1, 2021