Will cmdstanr error out if model compilation/initialization hangs? #1044

Open
emstruong opened this issue Nov 24, 2024 · 2 comments

Comments

emstruong commented Nov 24, 2024

I've been running cmdstanr tens of thousands of times via brms on an HPC node. Very rarely, I get an error message like

! Native call to `processx_exec` failed
Caused by error in `chain_call(c_processx_exec, command, c(command, args), pty, pty_options, …` at initialize.R:138:3:
! cannot start processx process '/tmp/RtmpF0MfcO/model_a748e80cd26feeedb6f52f5458fcda4b' (system error 13, Permission denied) @unix/processx.c:611 (processx_exec)

Or

! Native call to `processx_exec` failed
Caused by error in `chain_call(c_processx_exec, command, c(command, args), pty, pty_options, …` at initialize.R:138:3:
! cannot start processx process '/tmp/RtmpF0MfcO/model_c245fc9080e08ffee6f5db7c5de9e950' (system error 2, No such file or directory) @unix/processx.c:611 (processx_exec)

Because this does not happen every time I compile a model through brms, and because it happens only rarely, I do not think it is an issue with the data I'm feeding the model or with any other aspect of the code. However, these errors seem to occur when many thousands of models have been fitted within one session and the system load is very high. I suspect that compilation of the model hangs when the system load is too high. Or perhaps it's not compilation at all, but initialization of the model that takes too long.

So my question is: How does cmdstanr react when it has waited too long for a model to compile? Is it possible that processx will give the errors I got above when the system has hung for too long?
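One way to narrow this down is to record the state of the executable just before it is launched. The sketch below uses only base R; `describe_executable()` is a hypothetical helper, not part of cmdstanr or processx. As a rough guide, system error 2 (no such file or directory) would usually show up as `exists = FALSE`, and system error 13 (permission denied) as a file that exists but is not executable, although a directory permission problem could produce error 13 as well.

```r
# Hypothetical helper: report whether a file exists and is executable.
describe_executable <- function(path) {
  exists <- file.exists(path)
  # file.access() returns 0 on success and -1 on failure;
  # mode = 1 tests for execute permission.
  executable <- exists && unname(file.access(path, mode = 1)) == 0
  size <- if (exists) file.size(path) else NA_real_
  list(path = path, exists = exists, executable = executable, size = size)
}

# Example: a path that has not been created yet.
str(describe_executable(tempfile("model_")))
```

Logging this next to each failure would show whether the file was missing, half-written (very small size), or present without execute permission at the moment of the call.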

I found some of the associated processx code used in cmdstanr here:

cmdstanr/R/run.R

Lines 755 to 760 in c681d32

poll = function(ms) { # time in milliseconds
  processx::poll(private$processes_, ms)
},
wait = function(s) { # time in seconds
  Sys.sleep(s)
},

The HPC node is a Linux system, so I don't think it's related to how cmdstanr uses processx on macOS or WSL systems.

@emstruong emstruong changed the title Will cmdstanr error out if model compilation hangs? Will cmdstanr error out if model compilation/initialization hangs? Nov 24, 2024
jgabry (Member) commented Nov 26, 2024

That’s a good question. I haven’t seen that particular error before.

> Is it possible that processx will give the errors I got above when the system has hung for too long?

I suppose it’s possible, but I haven’t seen it before. I think it would be determined by however processx is handling this internally, because I don’t think there’s any sort of limit imposed by cmdstanr itself. Unfortunately I don’t know enough about the internals of processx to speculate further.

emstruong (Author) commented Nov 26, 2024

Before I reply with a wall of text, I have some more concrete questions, which I would really appreciate any comments on:

  • If I tar the local .cmdstan directory, copy it to another machine (the node in my case), untar it, and set the cmdstan path to the untarred .cmdstan directory on the new machine, do you think everything will run properly? Putting aside differences in environment (g++ compiler, etc.), are there symlinks within the local .cmdstan directory that would break upon tar/untar?
    • Perhaps a simpler way of stating this is, "Is .cmdstan portable between machines, putting aside g++, make and CPU architecture?"
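The symlink part of that question can be checked directly. The sketch below is base R only, and `find_symlinks()` is a hypothetical helper rather than anything cmdstanr provides. It lists every symbolic link under a directory together with its stored target. tar normally archives links as links, so relative targets inside the tree should survive a copy, while absolute targets would point at paths that may not exist on the other machine.

```r
# Hypothetical helper: list symbolic links under `dir` and their targets.
find_symlinks <- function(dir) {
  paths <- list.files(dir, recursive = TRUE, full.names = TRUE,
                      all.files = TRUE, include.dirs = TRUE)
  # Sys.readlink() returns "" for entries that are not symbolic links.
  targets <- Sys.readlink(paths)
  is_link <- !is.na(targets) & nzchar(targets)
  data.frame(path = paths[is_link], target = targets[is_link],
             stringsAsFactors = FALSE)
}

# Example on a throwaway directory containing one file and one link to it.
demo_dir <- tempfile("cmdstan_demo_")
dir.create(demo_dir)
writeLines("x", file.path(demo_dir, "real.txt"))
file.symlink(file.path(demo_dir, "real.txt"), file.path(demo_dir, "link.txt"))
print(find_symlinks(demo_dir))
```

Running it on the actual .cmdstan directory and looking for targets that start with `/` would flag the links most likely to break.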

Looking at

https://github.com/r-lib/processx/blob/3cbce8443f58e59a9447bf1191b9e0b8c581bf96/R/run.R#L27-L36

and

https://github.com/r-lib/processx?tab=readme-ov-file#errors-1

I think it's true that some combination of cmdstanr's use of processx and high I/O latency/load might cause processx to error out and hence cause cmdstanr to fail. (Based on the documentation, processx seems to be able to error out if things take too long.)

So for future readers, I am thinking that my particular set of issues is being caused by a combination of the following:

  • When I push the cluster I/O too hard, there's some kind of timeout or out-of-sync operation, such that cmdstanr tries to access a Stan executable that doesn't exist yet.
  • Whether independently of that or related to it, this timeout or desynchronization may stem from how SimDesign does parallel processing, from cmdstanr's handling of timeouts, or from processx itself.

To address these hypothetical issues, I will try this combination of changes:

  1. Set this environment variable, so that processx plays nicely with parallel
  2. Tar the .cmdstan installation on the login node, then copy and untar it to each compute node. Use set_cmdstan_path() so that the compute node uses the untarred .cmdstan on the compute node's local storage. This will hopefully help with I/O load.
  3. Change the R options for cmdstanr_write_stan_file_dir and cmdstanr_output_dir to use the compute node's local storage to further reduce I/O load.
  4. Don't try to use R to remove the locally compiled Stan files, since this may increase I/O load, especially given that Stan produces many small files rather than one large file.
  5. Don't allocate 'too many' cores to running the simulation itself, so that there are spare cores to handle the compilation of the Stan executable (does it work that way?); or at the very least, handle I/O operations.
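Steps 2 and 3 could be sketched roughly as follows. `setup_node_local()`, `local_root`, and the two subdirectory names are hypothetical and chosen for illustration; only the two option names and `utils::untar()` are taken as given. The function creates node-local directories, points the cmdstanr options at them, and optionally unpacks a CmdStan tarball into the same location.

```r
# Sketch: route cmdstanr's files to node-local storage.
# `local_root` is an assumed path on the compute node's local disk.
setup_node_local <- function(local_root, cmdstan_tarball = NULL) {
  stan_dir <- file.path(local_root, "stan_files")   # hypothetical name
  out_dir  <- file.path(local_root, "stan_output")  # hypothetical name
  dir.create(stan_dir, recursive = TRUE, showWarnings = FALSE)
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

  # Step 3: the two options mentioned above.
  options(cmdstanr_write_stan_file_dir = stan_dir,
          cmdstanr_output_dir = out_dir)

  # Step 2: unpack a previously tarred .cmdstan directory, if one is given.
  if (!is.null(cmdstan_tarball)) {
    utils::untar(cmdstan_tarball, exdir = local_root)
  }
  invisible(list(stan_dir = stan_dir, out_dir = out_dir))
}

dirs <- setup_node_local(file.path(tempdir(), "node_local"))
print(dirs)
```

After unpacking, `set_cmdstan_path()` would still need to be called with the path of the untarred CmdStan directory, which depends on the installed version and is not guessed here.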

EDIT: I tried these changes and I still very rarely and randomly get the processx error, but at least the simulation is faster. My speculation was likely wrong.


Regarding what this has to do with cmdstanr, perhaps the way processx is handled could be made more robust or user-configurable? For example, if these issues really are due to high I/O latency, perhaps there could be an argument that tells it to wait longer?

I think implementing this would require knowledge of both how cmdstanr uses processx and the way processx itself works.
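Until something like that exists, a caller-side workaround is to retry the failing call after a pause. This is only a sketch of that idea: `with_retries()` is a hypothetical wrapper in base R, it does not change how processx or cmdstanr behave internally, and it assumes the failure is transient.

```r
# Hypothetical wrapper: call `fn`, retrying on error with exponential backoff.
with_retries <- function(fn, max_tries = 3, base_wait = 1) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(
      list(ok = TRUE, value = fn()),
      error = function(e) list(ok = FALSE, error = e)
    )
    if (result$ok) return(result$value)
    if (attempt < max_tries) {
      message("Attempt ", attempt, " failed: ", conditionMessage(result$error))
      Sys.sleep(base_wait * 2^(attempt - 1))  # wait 1x, 2x, 4x, ... base_wait
    }
  }
  # All attempts failed: re-signal the last error.
  stop(result$error)
}

# Example with a function that fails twice and then succeeds.
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("transient failure")
  "done"
}
print(with_retries(flaky, max_tries = 3, base_wait = 0))
```

In a simulation, the model-fitting call would be passed in as `fn`, so that a rare process-start failure costs one retry instead of the whole replication.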
