Unable to train controller on colab #26

Open
MUHAMMAD0KASHIF opened this issue Apr 11, 2020 · 6 comments

Comments

@MUHAMMAD0KASHIF

MUHAMMAD0KASHIF commented Apr 11, 2020

!xvfb-run -a -s "-screen 0 1400x900x24" python 05_train_controller.py car_racing --num_worker 16 --num_worker_trial 2 --num_episode 4 --max_length 1000 --eval_steps 25

/usr/local/lib/python3.6/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
from ._conv import register_converters as _register_converters
Using TensorFlow backend.
['mpirun', '-np', '17', '/usr/bin/python3', '05_train_controller.py', 'car_racing', '--num_worker', '16', '--num_worker_trial', '2', '--num_episode', '4', '--max_length', '1000', '--eval_steps', '25']

mpirun has detected an attempt to run as root.
Running at root is strongly discouraged as any mistake (e.g., in
defining TMPDIR) or bug can result in catastrophic damage to the OS
file system, leaving your system in an unusable state.

You can override this protection by adding the --allow-run-as-root
option to your cmd line. However, we reiterate our strong advice
against doing so - please do so at your own risk.

Traceback (most recent call last):
File "05_train_controller.py", line 525, in <module>
if "parent" == mpi_fork(args.num_worker+1): os.exit()
File "05_train_controller.py", line 492, in mpi_fork
subprocess.check_call(["mpirun", "-np", str(n), sys.executable] +['-u']+ sys.argv, env=env)
File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['mpirun', '-np', '17', '/usr/bin/python3', '-u', '05_train_controller.py', 'car_racing', '--num_worker', '16', '--num_worker_trial', '2', '--num_episode', '4', '--max_length', '1000', '--eval_steps', '25']' returned non-zero exit status 1.
[36c641d9ccde:04530] *** Process received signal ***
[36c641d9ccde:04530] Signal: Segmentation fault (11)
[36c641d9ccde:04530] Signal code: Address not mapped (1)
[36c641d9ccde:04530] Failing at address: 0x7f395ad0320d
[36c641d9ccde:04530] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7f395ddb2890]
[36c641d9ccde:04530] [ 1] /lib/x86_64-linux-gnu/libc.so.6(getenv+0xa5)[0x7f395d9f1785]
[36c641d9ccde:04530] [ 2] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(_ZN13TCMallocGuardD1Ev+0x34)[0x7f395e25ce44]
[36c641d9ccde:04530] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xf5)[0x7f395d9f2615]
[36c641d9ccde:04530] [ 4] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(+0x13cb3)[0x7f395e25acb3]
[36c641d9ccde:04530] *** End of error message ***

@davidADSP

I have recently updated the codebase - can you retry with the updated version?

@NeoBoy

NeoBoy commented Jul 5, 2020

The error is still occurring. It crashes with the following output:

2020-07-05 14:24:49.292902: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
['mpirun', '-np', '17', '/usr/bin/python3', '05_train_controller.py', 'car_racing', '--num_worker', '16', '--num_worker_trial', '2', '--num_episode', '4', '--max_length', '1000', '--eval_steps', '25']

mpirun has detected an attempt to run as root.
Running at root is strongly discouraged as any mistake (e.g., in
defining TMPDIR) or bug can result in catastrophic damage to the OS
file system, leaving your system in an unusable state.

You can override this protection by adding the --allow-run-as-root
option to your cmd line. However, we reiterate our strong advice
against doing so - please do so at your own risk.

Traceback (most recent call last):
File "05_train_controller.py", line 525, in <module>
if "parent" == mpi_fork(args.num_worker+1): os.exit()
File "05_train_controller.py", line 492, in mpi_fork
subprocess.check_call(["mpirun", "-np", str(n), sys.executable] +['-u']+ sys.argv, env=env)
File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['mpirun', '-np', '17', '/usr/bin/python3', '-u', '05_train_controller.py', 'car_racing', '--num_worker', '16', '--num_worker_trial', '2', '--num_episode', '4', '--max_length', '1000', '--eval_steps', '25']' returned non-zero exit status 1.
[30cff116c958:16537] *** Process received signal ***
[30cff116c958:16537] Signal: Segmentation fault (11)
[30cff116c958:16537] Signal code: Address not mapped (1)
[30cff116c958:16537] Failing at address: 0x7fc5429bb20d
[30cff116c958:16537] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7fc545c6f890]
[30cff116c958:16537] [ 1] /lib/x86_64-linux-gnu/libc.so.6(getenv+0xa5)[0x7fc5458ae785]
[30cff116c958:16537] [ 2] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(_ZN13TCMallocGuardD1Ev+0x34)[0x7fc546119e44]
[30cff116c958:16537] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xf5)[0x7fc5458af615]
[30cff116c958:16537] [ 4] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(+0x13cb3)[0x7fc546117cb3]
[30cff116c958:16537] *** End of error message ***

CalledProcessError Traceback (most recent call last)
in ()
----> 1 get_ipython().run_cell_magic('shell', '', '\nxvfb-run -s "-screen 0 1400x900x24" python 05_train_controller.py car_racing --num_worker 16 --num_worker_trial 2 --num_episode 4 --max_length 1000 --eval_steps 25')

2 frames
/usr/local/lib/python3.6/dist-packages/google/colab/_system_commands.py in check_returncode(self)
136 if self.returncode:
137 raise subprocess.CalledProcessError(
--> 138 returncode=self.returncode, cmd=self.args, output=self.output)
139
140 def _repr_pretty_(self, p, cycle): # pylint:disable=unused-argument

CalledProcessError: Command '
xvfb-run -s "-screen 0 1400x900x24" python 05_train_controller.py car_racing --num_worker 16 --num_worker_trial 2 --num_episode 4 --max_length 1000 --eval_steps 25' returned non-zero exit status 1.

@MUHAMMAD0KASHIF
Author

Hi,
I did not try the updated code, but I was able to run the previous version by changing the mpirun call at line 492 so that it passes --allow-run-as-root. Please let me know whether it works for you.

Before:
subprocess.check_call(["mpirun", "-np", str(n), sys.executable] +['-u']+ sys.argv, env=env)

After:
subprocess.check_call(["mpirun", "--allow-run-as-root", "-np", str(n), sys.executable] +['-u']+ sys.argv, env=env)

Colab runs the notebook as root, so mpirun has to be allowed to run as root there.
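
For readers who want the same script to work both on Colab and on a machine where it does not run as root, the change can be made conditional. The following is only a sketch: `build_mpirun_cmd` is a hypothetical helper that does not exist in the repository, and it assumes a Unix system where `os.geteuid()` is available. It builds the same argument list as the `mpi_fork` call in the traceback and adds `--allow-run-as-root` only when the effective user is root.

```python
import os
import sys


def build_mpirun_cmd(n, argv, executable=sys.executable, running_as_root=None):
    """Build the mpirun argument list for n processes.

    Hypothetical helper, not part of 05_train_controller.py.
    """
    if running_as_root is None:
        # Unix only: an effective UID of 0 means the process runs as root,
        # which is the situation mpirun complains about on Colab.
        running_as_root = os.geteuid() == 0

    cmd = ["mpirun"]
    if running_as_root:
        # Override mpirun's root protection only when it is actually needed.
        cmd.append("--allow-run-as-root")
    # Same layout as the original call: -np <n> <python> -u <script and args>
    cmd += ["-np", str(n), executable, "-u"] + list(argv)
    return cmd
```

Under these assumptions, the call at line 492 could become `subprocess.check_call(build_mpirun_cmd(n, sys.argv), env=env)`. Keeping the flag conditional avoids disabling mpirun's safety check on machines where the script is not run as root.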

@NeoBoy

NeoBoy commented Jul 6, 2020

@MUHAMMAD0KASHIF Thanks, it worked. However, when I train the controller on Colab with a GPU runtime, the runtime crashes after some time (it also appears to use a lot of RAM before crashing). Does this mean I should train the controller on a CPU runtime instead? Any suggestions would be appreciated.
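
One way to check whether memory is the problem is to log how many processes the run starts and how much memory each one peaks at. The sketch below is a diagnostic aid under stated assumptions, not a confirmed explanation of the crash: `mpi_process_count`, `peak_rss_mb`, and `report` are hypothetical helpers, the `num_worker + 1` rule is taken from the `mpi_fork(args.num_worker+1)` line in the traceback (16 workers give `-np 17`), and the unit conversion assumes Linux, where `ru_maxrss` is reported in kilobytes.

```python
import os
import resource


def mpi_process_count(num_worker):
    """Number of processes mpirun starts: the workers plus one extra process."""
    return num_worker + 1


def peak_rss_mb():
    """Peak resident memory of the current process in MB (Linux: ru_maxrss is in KB)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0


def report(num_worker):
    """Return a one-line summary that can be printed from each worker."""
    cpus = os.cpu_count() or 1  # os.cpu_count() may return None
    return "processes={} cpus={} peak_rss_mb={:.1f}".format(
        mpi_process_count(num_worker), cpus, peak_rss_mb()
    )


print(report(16))
```

If the per-process peak multiplied by the process count approaches the RAM of the runtime, lowering `--num_worker` is a reasonable first thing to try.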

@MUHAMMAD0KASHIF
Author

I ran it on Colab Pro with 25 GB of RAM, but I only tried the code with a very small number of episodes.

@NeoBoy

NeoBoy commented Jul 11, 2020

@MUHAMMAD0KASHIF Could you share the parameters you used for training the controller and how long it took to complete? Also, I haven't been able to work out what the stopping criterion is. Please share if you understood it.

3 participants