
Data download consumes too much memory #140

Closed
manycoding opened this issue Jul 11, 2019 · 7 comments

@manycoding
Contributor

manycoding commented Jul 11, 2019

Downloading 3M items consumes about 8 GB of memory. Not huge, but could it be lower?

Possible solution with chunks - scrapinghub/python-scrapinghub#121

@manycoding
Contributor Author

  1. Read raw data into numpy in chunks, using generators (`yield`). Then collate so the index stays in order.
  2. Do the same with the DataFrame.

We can't start from a maximum DataFrame size yet, but its memory consumption after initialization is possibly much lower than while it's being initialized.
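The chunked approach from the list above could look roughly like this. This is only a sketch: `iter_chunks` and `collate` are hypothetical helpers, and the generator of dicts stands in for the real item download stream.

```python
import itertools
import numpy as np

def iter_chunks(items, chunk_size=10_000):
    """Yield successive lists of at most chunk_size items.
    Being a generator, it keeps only one chunk resident at a time."""
    it = iter(items)
    while True:
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            return
        yield chunk

def collate(chunks):
    """Concatenate chunks into one object array, preserving input order
    so the index stays consistent with the download order."""
    parts = [np.array(c, dtype=object) for c in chunks]
    return np.concatenate(parts) if parts else np.empty(0, dtype=object)

# Pretend download stream of 25k items (stand-in for the real API call).
items = ({"id": i} for i in range(25_000))
raw = collate(iter_chunks(items, chunk_size=10_000))
```

Note that collating still materializes the full array at the end, which matches the later observation that keeping the raw data bounds how much memory can be saved.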

@manycoding
Contributor Author

manycoding commented Oct 10, 2019

The raw data is stored in a numpy array and has a fixed size. During downloading there's no memory overhead (memory consumption stays smaller than the resulting array).
Thus, I don't see any room for improvement memory-wise as long as the raw data is kept (it's used for JSON schema validation), no matter how we fetch it.

Pandas looks more memory efficient.

Code
https://jupyter.scrapinghub.com/user/v/lab/tree/shared/Arche/Memory%20consumption.ipynb
Bench notebook (to accurately measure memory consumption): https://jupyter.scrapinghub.com/user/v/lab/tree/shared/Arche/Memory%20consumption%20bench.ipynb
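A rough way to compare the footprint of raw records against a DataFrame, along the lines of the bench notebooks above (this is a sketch with made-up sample records, not the notebooks' actual code):

```python
import sys
import pandas as pd

# Synthetic stand-in for downloaded items.
records = [{"id": i, "name": f"item-{i}"} for i in range(10_000)]

# Raw side: shallow size of each dict plus the sizes of its values.
raw_bytes = sum(
    sys.getsizeof(r) + sum(sys.getsizeof(v) for v in r.values())
    for r in records
)

df = pd.DataFrame(records)
# deep=True also counts the Python strings held in object columns,
# which is what makes the comparison meaningful.
df_bytes = int(df.memory_usage(deep=True).sum())

print(f"raw: {raw_bytes / 2**20:.1f} MiB")
print(f"df:  {df_bytes / 2**20:.1f} MiB")
```

The dict-per-item representation carries per-object overhead that columnar storage avoids, which is consistent with pandas looking more memory efficient here.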

@manycoding
Contributor Author

@andersonberg @felipefin @alexandr1988 Have you hit a memory wall recently, in particular with Kosmonaut's high-memory profile?
Since I haven't found a way to optimize memory in the library, I can increase memory for our server if needed.

@felipefin

I remember having issues downloading more than 500k items, though I was not using Kosmonaut.
I'll keep that in mind next time I face this kind of situation.

@nerevaryn

nerevaryn commented Oct 15, 2019 via email

@andersonberg
Contributor

No, I haven't used Kosmonaut with high memory yet. Also, I didn't face any memory problems in the Arche profile.

@manycoding
Contributor Author

Closing, since the most straightforward fix is to increase the server memory.
