Unable to train controller on colab #26
Comments
I have recently updated the codebase. Can you retry with the updated version?
The error is still occurring. It crashes with the following output:
2020-07-05 14:24:49.292902: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
Hi,
You need to allow mpirun to run as root on Colab, i.e. add the --allow-run-as-root option to the mpirun command.
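The mpirun message quoted in this issue names the --allow-run-as-root option. Below is a minimal sketch of passing it from Python, assuming the command is assembled the way the mpi_fork line in the traceback shows. The helper name build_mpirun_cmd and its allow_root parameter are hypothetical and not part of the repository.

```python
import sys


def build_mpirun_cmd(n, argv, allow_root=False):
    """Assemble an mpirun command list for n processes.

    Hypothetical helper: mirrors the list passed to subprocess.check_call
    in the traceback, with the root override made optional.
    """
    cmd = ["mpirun"]
    if allow_root:
        # Open MPI refuses to start as root unless this option is given.
        # The mpirun message itself warns that this is at your own risk.
        cmd.append("--allow-run-as-root")
    cmd += ["-np", str(n), sys.executable, "-u"] + list(argv)
    return cmd


# Example: the command list that would be handed to subprocess.check_call
example = build_mpirun_cmd(17, ["05_train_controller.py", "car_racing"], allow_root=True)
```

On Colab the notebook runs as root, so the option would be needed there. On a normal user account it can stay off.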
@MUHAMMAD0KASHIF Thanks, it worked. However, if I train the controller on Colab with a GPU runtime, the runtime crashes after some time (it also appears to use a lot of RAM before crashing). Does this mean I should train the controller on a CPU runtime instead? Any suggestions would be appreciated.
I used to run it on Colab Pro with 25 GB of RAM, but I only tried the code with a very small number of episodes.
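The command in this issue starts 17 MPI processes (--num_worker 16 plus the parent), and each one loads TensorFlow, so memory use may grow with the worker count. As a rough sketch only, and assuming that fewer workers is acceptable for this script (not confirmed in this thread), the worker count could be capped at the CPUs the runtime reports. The function suggest_num_worker is hypothetical.

```python
import os


def suggest_num_worker(requested, cpu_count=None, reserve=1):
    """Cap the requested worker count at the available CPUs.

    `reserve` CPUs are left for the parent process. At least one worker
    is always returned. This is a heuristic, not a measured memory limit.
    """
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    return max(1, min(requested, cpu_count - reserve))


# Example: value to pass as --num_worker on the current machine
num_worker = suggest_num_worker(16)
```

Whether the crash is actually caused by the number of workers is an assumption. Watching RAM usage while lowering --num_worker would be one way to check.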
@MUHAMMAD0KASHIF Could you share the parameters you used to train the controller and how long training took? I also haven't been able to work out what the stopping criterion is. Please share if you understood it.
!xvfb-run -a -s "-screen 0 1400x900x24" python 05_train_controller.py car_racing --num_worker 16 --num_worker_trial 2 --num_episode 4 --max_length 1000 --eval_steps 25
/usr/local/lib/python3.6/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
['mpirun', '-np', '17', '/usr/bin/python3', '05_train_controller.py', 'car_racing', '--num_worker', '16', '--num_worker_trial', '2', '--num_episode', '4', '--max_length', '1000', '--eval_steps', '25']
mpirun has detected an attempt to run as root.
Running at root is strongly discouraged as any mistake (e.g., in
defining TMPDIR) or bug can result in catastrophic damage to the OS
file system, leaving your system in an unusable state.
You can override this protection by adding the --allow-run-as-root
option to your cmd line. However, we reiterate our strong advice
against doing so - please do so at your own risk.
Traceback (most recent call last):
File "05_train_controller.py", line 525, in <module>
if "parent" == mpi_fork(args.num_worker+1): os.exit()
File "05_train_controller.py", line 492, in mpi_fork
subprocess.check_call(["mpirun", "-np", str(n), sys.executable] +['-u']+ sys.argv, env=env)
File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['mpirun', '-np', '17', '/usr/bin/python3', '-u', '05_train_controller.py', 'car_racing', '--num_worker', '16', '--num_worker_trial', '2', '--num_episode', '4', '--max_length', '1000', '--eval_steps', '25']' returned non-zero exit status 1.
[36c641d9ccde:04530] *** Process received signal ***
[36c641d9ccde:04530] Signal: Segmentation fault (11)
[36c641d9ccde:04530] Signal code: Address not mapped (1)
[36c641d9ccde:04530] Failing at address: 0x7f395ad0320d
[36c641d9ccde:04530] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7f395ddb2890]
[36c641d9ccde:04530] [ 1] /lib/x86_64-linux-gnu/libc.so.6(getenv+0xa5)[0x7f395d9f1785]
[36c641d9ccde:04530] [ 2] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(_ZN13TCMallocGuardD1Ev+0x34)[0x7f395e25ce44]
[36c641d9ccde:04530] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xf5)[0x7f395d9f2615]
[36c641d9ccde:04530] [ 4] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(+0x13cb3)[0x7f395e25acb3]
[36c641d9ccde:04530] *** End of error message ***