
msgpack errors when using iter() with intervals between each batch call #121

Open
BurnzZ opened this issue May 4, 2019 · 11 comments

BurnzZ (Member) commented May 4, 2019

Good Day!

I've encountered this peculiar issue when trying to save memory by processing the items in chunks. Here's a stripped-down version of the code to reproduce the issue:

import pandas as pd

from scrapinghub import ScrapinghubClient

def read_job_items_by_chunk(jobkey, chunk=10000):
    """In order to prevent OOM issues, the jobs' data must be read in
    chunks.

    This will return a generator of pandas DataFrames.
    """

    client = ScrapinghubClient("APIKEY123")

    item_generator = client.get_job(jobkey).items.iter()

    while item_generator:
        yield pd.DataFrame(
            [next(item_generator) for _ in range(chunk)]
        )

for df_chunk in read_job_items_by_chunk('123/123/123'):
    # having a small chunk-size like 10000 won't have any problems
    pass

for df_chunk in read_job_items_by_chunk('123/123/123', chunk=25000):
    # having a big chunk-size like 25000 will throw errors like the one below
    pass

Here's the common error it throws:

<omitted stack trace above>

    [next(item_generator) for _ in range(chunk)]
  File "/usr/local/lib/python2.7/site-packages/scrapinghub/client/proxy.py", line 115, in iter
    _path, requests_params, **apiparams
  File "/usr/local/lib/python2.7/site-packages/scrapinghub/hubstorage/serialization.py", line 33, in mpdecode
    for obj in unpacker:
  File "msgpack/_unpacker.pyx", line 459, in msgpack._unpacker.Unpacker.__next__ (msgpack/_unpacker.cpp:459)
  File "msgpack/_unpacker.pyx", line 390, in msgpack._unpacker.Unpacker._unpack (msgpack/_unpacker.cpp:390)
  File "/usr/local/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 67: invalid start byte

Moreover, it throws out a different error when using a much bigger chunk-size, like 50000:

<omitted stack trace above>

    [next(item_generator) for _ in range(chunk)]
  File "/usr/local/lib/python2.7/site-packages/scrapinghub/client/proxy.py", line 115, in iter
    _path, requests_params, **apiparams
  File "/usr/local/lib/python2.7/site-packages/scrapinghub/hubstorage/serialization.py", line 33, in mpdecode
    for obj in unpacker:
  File "msgpack/_unpacker.pyx", line 459, in msgpack._unpacker.Unpacker.__next__ (msgpack/_unpacker.cpp:459)
  File "msgpack/_unpacker.pyx", line 390, in msgpack._unpacker.Unpacker._unpack (msgpack/_unpacker.cpp:390)
TypeError: unhashable type: 'dict'

I find that the workaround for this is to use a lower value for chunk. So far, 1000 works great.
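Concretely, the workaround is just the same generator with a smaller chunk size (the job key is the placeholder from above, and process() stands in for whatever consumes each DataFrame):

for df_chunk in read_job_items_by_chunk('123/123/123', chunk=1000):
    process(df_chunk)  # chunk sizes around 1000 have been stable so far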

This uses the scrapy:1.5 stack in Scrapy Cloud.

I'm guessing this might have something to do with the long wait that happens while the pandas DataFrame chunk is being processed: by the time the next batch of items is iterated, the server might have deallocated the pointer to it or something similar.

May I ask if there might be a solution for this? A much bigger chunk size would help with the speed of our jobs.

I've marked it as a bug for now, as this is quite unexpected/undocumented behavior.

Cheers!

BurnzZ changed the title from "Msgpack errors" to "msgpack errors when using iter() with intervals between each batch call" on May 4, 2019
BurnzZ added the bug label on May 4, 2019
vshlapakov (Contributor) commented

Hey Kevin, thanks for the report! Quick question: are you using the latest version, 2.1.1? There was a hidden bug up to that version which could lead to wrong iteration behavior.

vshlapakov (Contributor) commented

If so, there's something else you could try. In the current implementation you iterate through all the items in a single request, and given that the number of items is huge and processing the data takes time, it's possible to hit a timeout. But since you know the desired chunk size beforehand and it's large enough, you could send one request per chunk using the pagination parameters and handle the data with pandas for as long as you need, something like this:

def read_job_items_by_chunk(job, chunk=10000):
    processed = 0
    while True:
        next_key = job.key + '/' + str(processed)
        items = list(job.items.iter(count=chunk, start=next_key, meta='_key'))
        yield items
        processed += len(items)
        if len(items) < chunk:
            break
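For illustration, consuming those chunks with pandas might look like the following (the API key and job key are placeholders):

import pandas as pd

from scrapinghub import ScrapinghubClient

client = ScrapinghubClient("APIKEY123")
job = client.get_job('123/123/123')

for items in read_job_items_by_chunk(job, chunk=25000):
    df_chunk = pd.DataFrame(items)
    # work with df_chunk here, then let it go out of scope so memory stays bounded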

BurnzZ (Member, Author) commented May 19, 2019

Thanks @vshlapakov! The project was using 2.0.3. I'll try 2.1.1 to confirm whether it indeed fixes the wrong iteration behavior. I should have results to verify in a few weeks.

Otherwise, I'll try the pagination suggestion you've introduced. Cheers!

BurnzZ (Member, Author) commented Jun 10, 2019

Hi @vshlapakov, reporting in that version 2.1.1 still has this problem. I'll now proceed to try out the pagination approach you've suggested :)

vshlapakov (Contributor) commented

Gotcha, thanks for the update! Let me know when you test the approach 👍

BurnzZ (Member, Author) commented Oct 4, 2019

Hi @vshlapakov, I've opened a PR in #133 based on your suggestion. I think having this convenience method would be really helpful in cases where we're processing a large number of items.

@manycoding, I see that this might also be of use to arche from your issue in scrapinghub/arche#140.

Thanks!

hermit-crab (Contributor) commented Nov 23, 2019

Would it not be nicer to have this as the default behavior of the normal .iter() (at some point)? I imagine people will start using the new method for no other reason than to avoid connection issues, and everyone ends up with a useless nesting level in their code. A backward-compatible alternative:

def iter(..., buffer: Optional[int] = None, in_chunks: bool = False):
    if not buffer:
        ... # proceed as usual
        return

    for chunk in self._list_iter(...):
        if in_chunks:
            # for those actually needing chunks
            yield chunk
        else:
            yield from chunk

This also addresses #135, as it would implicitly enable pagination. (Never mind that last point; I confused the resource being iterated there.)
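From the caller's side, the idea would read roughly like this (a hypothetical signature for illustration, not the current API):

# Hypothetical usage if iter() grew buffer/in_chunks parameters (not the current API):

for item in job.items.iter(buffer=10000):
    ...  # flat stream of items, fetched in pages of 10000 under the hood

for chunk in job.items.iter(buffer=10000, in_chunks=True):
    ...  # lists of up to 10000 items each, for callers that actually want chunks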

BurnzZ (Member, Author) commented Nov 30, 2019

I do get where you're coming from @hermit-crab, though I think iter() and iter_by_chunks() will return differently structured results: iter() will be a generator of items in this case, whilst iter_by_chunks() is a generator of lists of items.

hermit-crab (Contributor) commented Nov 30, 2019

Thank you for the reply @BurnzZ. Yes, I understand they will generate different structures, but to elaborate on what I mean: I believe this solution creates a situation where you have two distinct methods that do roughly the same thing (retrieving a resource of a job), with one of them clearly preferable over the other despite its slightly different and less commonly needed output format of the same data. At that point, why would you ever use iter()? If the docs for iter() mention that, in case your data is large (TODO: define large), you should see iter_by_chunks() to avoid memory and connection issues, then the most reasonable thing a user can do is use that method right off the bat to avoid dealing with issues later when their data grows, even if that means an extra nesting level they don't need.

For instance, in the issue mentioned above (scrapinghub/shub-workflow#5), the solution will end up being something like this:

    def _process_job_items(self, scrapername, spider_job):
        first_keyprefix = None
        items_gen = (item for chunk in spider_job.items.iter_in_chunks() for item in chunk)
        # or any other variation of flattening the chunks back into a single stream
        for item in items_gen:
            ...  # per-item processing continues as before

Whereas I think it would be nicer to just provide a flag to spider_job.items.iter() to enable chunked reads, or to have it enabled implicitly via a flag passed to the SH client on creation.

That would be similar in style to how pandas allows IO/memory-efficient reads via the chunksize parameter on its read_*() methods. Pandas, though, doesn't have a choice but to generate chunks, since DataFrame/Series are more or less immutable objects that you cannot easily stream. In the case of iter() we can abstract the buffer away and provide a flag for the cases where chunks are really needed.
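For comparison, the pandas pattern being referenced is the standard chunksize idiom (the file name here is just a placeholder):

import pandas as pd

# With chunksize, read_csv returns an iterator of DataFrames rather than one
# big frame, which keeps peak memory bounded.
for chunk in pd.read_csv("items.csv", chunksize=10000):
    ...  # each chunk is a DataFrame with up to 10000 rows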

vshlapakov (Contributor) commented Dec 17, 2019

@hermit-crab That makes sense to me: the methods are very close to each other, the only major difference being the output format, and the list_iter "pagination" logic would be preferable as the default in the long term. At the same time, I would avoid the situation where the public iter method returns a different result type depending on its input, but that should be solvable via a common internal helper serving both iter and list_iter, along the lines of the sketch below.
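A minimal sketch of that shape, assuming a fetch(start, count) callable that stands in for the client's internal paginated request (the names are illustrative, not the actual client internals):

def _iter_pages(fetch, chunk):
    # Shared internal helper: yields one list of items per paginated request.
    start = 0
    while True:
        page = fetch(start, chunk)
        if not page:
            break
        yield page
        start += len(page)
        if len(page) < chunk:
            break

def list_iter(fetch, chunk=10000):
    # Public "chunked" API: a generator of lists.
    return _iter_pages(fetch, chunk)

def iter_items(fetch, chunk=10000):
    # Public flat API: the same helper flattened, so both share one code path.
    for page in _iter_pages(fetch, chunk):
        yield from page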

I'm going to close this issue since the original problem is solved, but I look forward to improving this if possible once we agree on the implementation.

theumairahmed commented Aug 15, 2023

Hi @BurnzZ, @hermit-crab, @vshlapakov, @Gallaecio,

I have observed a rare and seemingly random situation where using iter() even with a low value of count=1000 causes msgpack errors in this job: https://app.zyte.com/p/435191/873/1150/log?line=35838.

You can see in the urllib3 debug log that it's making the following API call in the backend:
[stderr] 2023-08-13 13:18:29 [urllib3.connectionpool] DEBUG: https://storage.scrapinghub.com:443 "GET /items/435191/897/44?start=435191%2F897%2F44%2F35697000&count=1000&meta=_key HTTP/1.1" 200 None

Worth mentioning that even though count=1000, the start value (435191/897/44/35697000) is huge, since the job is processing around 43M items in chunks of 1000. Would converting the iterator to a list help solve this issue? Let me know if I should open a new issue for this or if you need more input from me. Thanks.
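For reference, that logged request corresponds roughly to this client call (reconstructed from the URL, so treat it as an approximation rather than a verified reproduction):

job = client.get_job('435191/897/44')
items_iter = job.items.iter(count=1000, start='435191/897/44/35697000', meta='_key')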

theumairahmed reopened this Aug 16, 2023