This looks really nice. Just a general comment on Dask at large scale. The issue we came across was that Dask will assign as many tasks as possible to a single worker, which means that if you set your cluster parameters to request 5 GB of memory because that is how much one job requires, the limit is violated immediately: that 5 GB goes to one worker, which then runs as many jobs as it can fit.
The way I am trying to avoid this is to assign worker resources that limit the number of tasks each worker can run, e.g. 1 GPU per model training means only one model can run on a worker at a time, or, for espresso, 1 "espresso". But in the case you have here, if your task were a large matrix computation and you parallelised heavily over nodes, I think you would hit a dead worker pretty fast.
Originally posted by @SamTov in #21 (review)
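For reference, a minimal sketch of the worker-resources pattern described above, shown with a `LocalCluster` for illustration rather than a real job-queue deployment; the resource name `"espresso"` and the `run_espresso` function are placeholders, not part of any existing API in this project.

```python
from dask.distributed import Client, LocalCluster


def run_espresso(config):
    """Placeholder for one memory-heavy job (hypothetical, for illustration)."""
    return config


if __name__ == "__main__":
    # Each worker advertises a single abstract "espresso" slot.
    cluster = LocalCluster(
        n_workers=2,
        threads_per_worker=4,
        resources={"espresso": 1},
    )
    client = Client(cluster)

    # Every submitted task claims the full slot, so a worker runs at most
    # one of these jobs at a time, regardless of how many threads it has.
    futures = [
        client.submit(run_espresso, cfg, resources={"espresso": 1})
        for cfg in range(10)
    ]
    print(client.gather(futures))
```

On a batch cluster the same idea applies by passing `--resources "espresso=1"` to each worker process when it starts, and annotating the submitted tasks with the matching `resources=` constraint.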