
implement new iter_by_chunks() in items #133

Merged
11 commits merged into scrapinghub:master on Oct 23, 2019

Conversation

@BurnzZ (Member) commented Oct 4, 2019

Provides a convenient wrapper for #121

@@ -59,3 +59,30 @@ def _modify_iter_params(self, params):
        if offset:
            params['start'] = '{}/{}'.format(self.key, offset)
        return params

    def iter_by_chunks(self, chunksize=10000, *args, **kwargs):
Member Author

I have no idea what might be the best method name for this, so please suggest if you have something in mind.

Contributor

job.items.list_iter(n=10_000)
job.items.iter_by(n=10_000)
job.items.iter_by_chunks(chunksize=10_000)
I like the first two.

Member Author

I like the first one, @manycoding, as it gives better context about what's being returned overall: a generator of lists.
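
For context, a minimal usage sketch of the generator-of-lists idea (assuming the method lands as list_iter and using placeholder API/job keys; the 10000 chunksize is just the proposed default):

from scrapinghub import ScrapinghubClient

client = ScrapinghubClient('APIKEY')   # placeholder API key
job = client.get_job('123/1/56')       # placeholder job key

# Each iteration yields a list of up to 10000 items instead of single items,
# so downstream code can process them in batches.
for chunk in job.items.list_iter(chunksize=10000):
    print(len(chunk))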

Member Author

I'll omit the PEP 515 underscores for now for backwards compatibility. They would be a great addition in 2020 nonetheless, once Python 2 is dropped.

@manycoding (Contributor) left a comment

The Items class needs examples too.


@BurnzZ force-pushed the add-featureitems-iter_by_chunks branch from b43d3fd to 15d647d on October 5, 2019 06:02
@BurnzZ (Member Author) commented Oct 5, 2019

Hi @vshlapakov, I was wondering if there is a guide on adding new VCR cassettes for this repo? The missing cassettes for this new method are what's causing the test failures. Thanks!

@manycoding (Contributor) left a comment

I approve, keeping in mind that the tests need to be fixed before merging.

@vshlapakov (Contributor) left a comment

Thanks for the PR, looks good!

Re VCR cassettes, it's a shame we haven't documented it yet. The updated test-generation logic is defined and described in PR #112: basically, you need a test SC project and your credentials (they will be wiped from the cassettes in the end), and that's it. The last comment in that PR describes how to update existing cassettes.

@@ -59,3 +73,30 @@ def _modify_iter_params(self, params):
        if offset:
            params['start'] = '{}/{}'.format(self.key, offset)
        return params

    def list_iter(self, chunksize=10000, *args, **kwargs):
Contributor

I perfectly understand that the logic is a nice fit for iterating through huge item collections, but I'd suggest a lower default chunk size in the library, say 1k, for the general case. There are many parameters that should be taken into account when choosing a specific value, like the total number of items and the average item size. Does that make sense?

Contributor

From my experience, in most cases I would use a 100_000 chunksize for 5M+ item jobs (there are a few cases with heavily sized items), considering the average computer has at least 16 GB of memory. Thus, 100_000 makes a reasonable default to me.
Am I missing something?

Contributor

I'd say the average computer still has 8 GB nowadays, and many budget notebooks have only 4 GB. And it's not only about that: this is a thin client, it should be lightweight and fast, and I wouldn't expect it to eat even a few GBs of memory by default without my explicit approval. Also, a pretty large chunk size means you need a relatively good network, which often isn't the case, so the decision regarding the chunk size should be a conscious one.

My key point here is that the default settings should fit an average user's demands and expectations, not those of an experienced user processing millions of items in batches. It seems we have different opinions on "an average user", and I'm OK with that. But as an experienced user you can use whatever you need by setting a custom chunk size value, so I don't see a problem here.

Contributor

@manycoding @BurnzZ Guys, to keep it clear: I don't want to force a smaller default size at any cost, it's just something I'd like to get a bit more attention on and discuss rather than accept as-is. From a practical point of view, a chunk of 100k items at 10 KB each gives approximately 1 GB of memory; is that something you'd expect as a user of the client library on default settings? Maybe I'm missing something here?
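
As a quick back-of-the-envelope check of that estimate (a sketch only; the 10 KB average item size is the assumption from the comment above):

# Rough size of a single chunk held in memory at once, assuming ~10 KB/item.
avg_item_size_kb = 10
for chunksize in (1000, 10000, 100000):
    approx_mb = chunksize * avg_item_size_kb / 1024.0
    print('{} items -> ~{:.0f} MB'.format(chunksize, approx_mb))
# 1000 -> ~10 MB, 10000 -> ~98 MB, 100000 -> ~977 MB (roughly 1 GB)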

Contributor

Seems like a good time to send a questionnaire.

As an average user, I would use .iter() or work in an environment with enough memory first.

Member Author

Hey gents! To chime in on the discussion here: the default chunksize of 10_000 was chosen without much deliberation, since it was working for our particular project, which has a pretty typical number of items for a project. The size per item is pretty average, I'd say.

Now, when choosing the default chunksize value, I think we'll need to set it to whatever value upholds the objective of this brand-new method, which is to prevent the OOM issues users face when processing large jobs.

Speaking of large jobs, I have worked on projects with 1 MB per item, which was insane. Consuming the jobs on those types of projects would really benefit from this new method we're creating. This makes me lean toward choosing a lower chunksize altogether:

  • chunksize of 1_000 * 1 MB per item = ~1 GB
  • chunksize of 10_000 * 1 MB per item = ~10 GB

Lastly, I'm not sure if physical computers or laptops should be the basis for our benchmarks when setting this value. Nonetheless, we'll need to set the bar low. With that in mind, I'm proposing that we use Scrapy Cloud as the basis, where 1 Scrapy Cloud unit has 1 GB of memory. The typical maximum at the moment would be 6 SC units, which gives us at most 6 GB of memory per job. (P.S. We're using this heavily in Scrapy Cloud jobs, so it would really make sense for us to use that as the basis.)

In the example above, using chunksize=10_000 would result in OOM issues, as 10 GB > the 6 GB SC maximum. With that, I'm leaning towards setting chunksize=1_000 as the default.

But before doing that, what do you guys think? @vshlapakov @manycoding

@@ -37,6 +37,20 @@ class Items(_DownloadableProxyMixin, _ItemsResourceProxy):
'size': 100000,
}]

- retrieve items via a generator of lists. This is most useful in cases
Contributor

Special thanks for the nice docstrings in the PR ❤️

@BurnzZ force-pushed the add-featureitems-iter_by_chunks branch from 4909582 to 6ad4ee9 on October 10, 2019 08:28
@BurnzZ (Member Author) commented Oct 10, 2019

Hi @vshlapakov, thanks for linking #112. I've tried it out and it works well on my local machine. However, when it's run on CI it's going to fail, mostly because when we empty out the env vars HS_ENDPOINT and DASH_ENDPOINT, it's going to use the live endpoints: https://github.com/scrapinghub/python-scrapinghub/blob/master/tests/conftest.py#L68

When running in CI, these aren't emptied out and thus the default endpoints are used, like: https://github.com/scrapinghub/python-scrapinghub/blob/master/tests/conftest.py#L15

I'm not sure how to proceed with this, as it may entail updating how conftest.py works.
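
For reference, the endpoint selection pattern being described is roughly the following (a simplified sketch only, not the actual conftest.py; the default URLs here are placeholders):

import os

# If the environment variables are unset or empty, the defaults are used,
# which is why a CI run that doesn't clear or override them behaves
# differently from a local run.
DEFAULT_HS_ENDPOINT = 'http://hs-endpoint.example'      # placeholder
DEFAULT_DASH_ENDPOINT = 'http://dash-endpoint.example'  # placeholder

HS_ENDPOINT = os.getenv('HS_ENDPOINT') or DEFAULT_HS_ENDPOINT
DASH_ENDPOINT = os.getenv('DASH_ENDPOINT') or DEFAULT_DASH_ENDPOINT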

@manycoding manycoding self-requested a review October 10, 2019 15:22
Comment on lines 92 to 129
processed = 0
while True:
    next_key = self.key + '/' + str(processed)
    items = [
        item for item in self.iter(
            count=chunksize, start=next_key, *args, **kwargs)
    ]
    yield items
    processed += len(items)
    if len(items) < chunksize:
        break
@manycoding (Contributor) commented Oct 10, 2019

I tried the code and realised it doesn't support count and start. The first one is the most important I guess.

def list_iter(self, chunksize=10000, *args, **kwargs):
    # Requires `import sys` at module level for the sys.maxsize default.
    start = kwargs.pop("start", 0)
    count = kwargs.pop("count", sys.maxsize)
    processed = 0
    while True:
        next_key = self.key + "/" + str(start)
        # Trim the last chunk so no more than `count` items are returned.
        if processed + chunksize > count:
            chunksize = count - processed
        items = [
            item for item in self.iter(
                count=chunksize, start=next_key, *args, **kwargs)
        ]
        yield items
        processed += len(items)
        start += len(items)
        if processed >= count:
            break
        if len(items) < chunksize:
            break
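
A quick usage sketch of how count/start would behave with this version (job is a placeholder Job instance, and the numbers assume the job holds at least 350 items):

# Skip the first 100 items, then fetch at most 250 more, in chunks of 100.
# With the suggested logic this yields lists of 100, 100 and 50 items.
for chunk in job.items.list_iter(chunksize=100, start=100, count=250):
    handle_batch(chunk)  # handle_batch is a placeholder for user code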

Member Author

I hadn't considered this use case since we pretty much consume all of the items in a job. Much appreciated, @manycoding! I've added some docstrings and a test case as well.

The CI tests will still fail, though, because of the current project setup for VCR.py. @vshlapakov, is it okay to mock this for the moment, or should we standardize it to use cassettes like the current tests?

@vshlapakov (Contributor) commented Oct 11, 2019

@BurnzZ Thanks for checking! I see what you mean; it seems it doesn't work reliably 😕 I'm short on time right now, but I'll check it in detail next week and update the thread. And Valeriy is right, we still need to support the count/offset logic, so please take a look when you have time.

@BurnzZ force-pushed the add-featureitems-iter_by_chunks branch from 6ad4ee9 to ee40071 on October 16, 2019 11:34
@vshlapakov (Contributor)
@BurnzZ Sorry that the issue took so long; I finally found some time to go through the problem properly and have something to share.

The case is pretty tricky. You created the cassettes correctly, and the normalizing process works fine. The real problem is that the normalization covers only request data, not responses. So it correctly forms the first run-spider request, and VCR.py finds the related snapshot and provides the saved response, but that response contains a real job key with a real project ID as part of its key. As a result, the subsequent requests made via the related job instance have different paths and can't be matched with the saved snapshots, which leads to the test failures.

I made an attempt on top of your changes to fix this specific case by normalizing the job instances in tests; it works fine, so feel free to use it to finish your changes. I have some doubts whether this is the best approach to solve the puzzle, but that's clearly a completely different topic which will be addressed later.

@BurnzZ force-pushed the add-featureitems-iter_by_chunks branch from b3d5c73 to c4ef768 on October 21, 2019 12:03
@codecov bot commented Oct 21, 2019

Codecov Report

Merging #133 into master will increase coverage by 0.36%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #133      +/-   ##
==========================================
+ Coverage   93.59%   93.96%   +0.36%     
==========================================
  Files          28       28              
  Lines        1921     1938      +17     
==========================================
+ Hits         1798     1821      +23     
+ Misses        123      117       -6
Impacted Files                             Coverage        Δ
scrapinghub/client/items.py                100% <100%>     (ø) ⬆️
scrapinghub/hubstorage/batchuploader.py    94.15% <0%>     (+3.5%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 92ff096...9a67373.

@BurnzZ (Member Author) commented Oct 21, 2019

Thanks @vshlapakov! I cherry-picked your commit and updated my tests. Everything passes now! 🎉

@BurnzZ force-pushed the add-featureitems-iter_by_chunks branch from b5ebdea to 6bc35ad on October 22, 2019 05:17
scrapinghub/client/items.py
        item for item in self.iter(
            count=chunksize, start=next_key, *args, **kwargs)
    ]
    yield items
Contributor

Just as a side note: when the total number of items >= count and count isn't divisible by chunksize, with the current logic the function will return a number of items >= count. Say, list_iter(chunksize=2, count=3) will return 2 batches of 2 items, so 4 items in total, not 3. This is probably fine, but if keeping it as-is, I'd ask you to at least mention the behaviour in the function docstring.
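
To make the edge case concrete, a standalone sketch of the rounding behaviour (pure illustration over a plain list, not the library code):

# Without trimming the final chunk to `count`, chunking proceeds in whole
# chunks, so count=3 with chunksize=2 yields 4 items in total.
def naive_chunks(source, chunksize, count):
    taken = 0
    while taken < count:
        chunk = source[taken:taken + chunksize]  # no trim against `count`
        if not chunk:
            break
        yield chunk
        taken += len(chunk)

print(list(naive_chunks(list(range(10)), chunksize=2, count=3)))
# -> [[0, 1], [2, 3]]  i.e. 4 items, not 3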

@BurnzZ (Member Author) commented Oct 23, 2019

Thanks @vshlapakov! It turns out there was some logic that @manycoding suggested earlier that hadn't been added to this PR. This has now been fixed, and I've added a new test case for it.

@vshlapakov vshlapakov merged commit 32b7c04 into scrapinghub:master Oct 23, 2019
@vshlapakov (Contributor)
@BurnzZ Awesome, thanks, it's a pleasure to work with you! 🚀
@manycoding Thanks a lot for help in CR and your suggestions! 🙌

@BurnzZ (Member Author) commented Oct 23, 2019

Thanks for all the help @vshlapakov @manycoding! ❤️ Can't wait to use this in most of our projects! 🎉
