I've been trying to get the quadruped_walk demo working to see how to integrate a custom robot (instead of a vehicle object type) into the gym-chrono architecture. However, it's been giving me nan errors during the training process.
The demo would not run as-is since the Go1 URDF is not included in the feature/robot_model branch, but copying it from the primary Chrono repo to the right place in the file system worked fine. At runtime I also got warnings from the Parser module because the links with the suffix _shoulder_thigh have no inertial values, but that does not immediately cause the sim to fail. Before even a single iteration completes, I get the following error:
Traceback (most recent call last):
File "/home/josh/wheel_limb_robotics/simulation/chrono/build_chrono/../gym-chrono/gym_chrono/train/quadruped_walk_train.py", line 126, in <module>
model.learn(training_steps_per_save, callback=TensorboardCallback()) # This is where the errors happen
File "/home/josh/miniforge3/envs/chrono/lib/python3.10/site-packages/stable_baselines3/ppo/ppo.py", line 315, in learn
return super().learn(
File "/home/josh/miniforge3/envs/chrono/lib/python3.10/site-packages/stable_baselines3/common/on_policy_algorithm.py", line 300, in learn
continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps)
File "/home/josh/miniforge3/envs/chrono/lib/python3.10/site-packages/stable_baselines3/common/on_policy_algorithm.py", line 179, in collect_rollouts
actions, values, log_probs = self.policy(obs_tensor)
File "/home/josh/miniforge3/envs/chrono/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/josh/miniforge3/envs/chrono/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/josh/miniforge3/envs/chrono/lib/python3.10/site-packages/stable_baselines3/common/policies.py", line 654, in forward
distribution = self._get_action_dist_from_latent(latent_pi)
File "/home/josh/miniforge3/envs/chrono/lib/python3.10/site-packages/stable_baselines3/common/policies.py", line 694, in _get_action_dist_from_latent
return self.action_dist.proba_distribution(mean_actions, self.log_std)
File "/home/josh/miniforge3/envs/chrono/lib/python3.10/site-packages/stable_baselines3/common/distributions.py", line 164, in proba_distribution
self.distribution = Normal(mean_actions, action_std)
File "/home/josh/miniforge3/envs/chrono/lib/python3.10/site-packages/torch/distributions/normal.py", line 57, in __init__
super().__init__(batch_shape, validate_args=validate_args)
File "/home/josh/miniforge3/envs/chrono/lib/python3.10/site-packages/torch/distributions/distribution.py", line 70, in __init__
raise ValueError(
ValueError: Expected parameter loc (Tensor of shape (14, 12)) of distribution Normal(loc: torch.Size([14, 12]), scale: torch.Size([14, 12])) to satisfy the constraint Real(), but found invalid values:
tensor([[-8.0589e-03, 4.3196e-02, -7.4162e-02, 5.9881e-02, 4.0528e-02,
-4.7503e-02, 8.0593e-03, 2.0091e-02, 1.1184e-01, -4.7284e-02,
-8.9032e-02, -1.0514e-01],
[-2.2090e-03, 6.7961e-02, -7.1827e-02, 7.1473e-02, 3.7119e-02,
-7.2184e-02, 4.3422e-03, 1.5177e-03, 1.3603e-01, -3.3372e-02,
-1.1141e-01, -1.3102e-01],
[ 1.3250e-02, 5.9128e-02, -5.6648e-02, 5.1138e-02, 2.0293e-02,
-6.5499e-02, 1.2423e-02, 1.0288e-03, 1.0119e-01, -2.8113e-02,
-9.3130e-02, -1.0186e-01],
[-2.1020e-02, 5.4918e-02, -6.5764e-02, 6.9619e-02, 4.7154e-02,
-4.2845e-02, 1.1443e-02, 1.8706e-02, 1.2611e-01, -4.3891e-02,
-8.8186e-02, -1.0046e-01],
[-1.0927e-02, 4.6719e-02, -7.1205e-02, 5.6766e-02, 3.9871e-02,
-4.6968e-02, 1.1437e-02, 1.8659e-02, 1.0487e-01, -4.7093e-02,
-8.7585e-02, -1.0848e-01],
[ 8.4263e-03, 4.9921e-02, -3.3118e-02, 4.1117e-02, 2.0859e-02,
-4.6968e-02, 1.7116e-02, -1.6121e-04, 8.4842e-02, -2.1907e-02,
-7.1619e-02, -5.7212e-02],
[ 2.2725e-02, 7.3390e-02, -6.7675e-02, 2.6447e-02, 5.9733e-03,
-7.6256e-02, 1.5032e-02, 1.7301e-02, 1.2674e-01, -2.5455e-02,
-1.1069e-01, -9.8125e-02],
[ 2.6836e-03, 5.3712e-02, -4.2878e-02, 4.3726e-02, 1.7628e-02,
-5.7632e-02, 1.6966e-02, 4.4743e-03, 1.1360e-01, -2.1165e-02,
-6.6384e-02, -6.1909e-02],
[ 1.3325e-02, 5.5623e-02, -5.3349e-02, 4.4153e-02, 8.4252e-03,
-8.2826e-02, 1.1303e-02, -7.5424e-03, 1.1314e-01, -3.6277e-02,
-9.4814e-02, -1.0011e-01],
[ 3.5857e-02, 5.7642e-02, -6.2666e-02, 4.0946e-02, 1.6901e-02,
-9.1604e-02, 1.8820e-02, 2.3655e-03, 1.3066e-01, -3.3751e-02,
-1.0431e-01, -9.9027e-02],
[ 1.2128e-02, 6.9653e-02, -6.1580e-02, 3.1485e-02, 5.9991e-03,
-8.1966e-02, 7.6969e-04, 4.0367e-03, 1.2343e-01, -3.2923e-02,
-1.0237e-01, -1.1420e-01],
[-2.3512e-02, 6.2824e-02, -5.2266e-02, 5.2913e-02, 3.8880e-02,
-4.2128e-02, -8.0685e-03, 6.9352e-03, 1.1351e-01, -3.2397e-02,
-7.8904e-02, -8.8390e-02],
[-1.7412e-02, 1.1974e-01, -8.6609e-02, 5.2703e-02, 2.8904e-02,
-8.8414e-02, 2.5642e-03, 1.1933e-02, 1.9067e-01, -1.8322e-02,
-1.4315e-01, -1.6094e-01],
[ nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan]], device='cuda:0')
Destructor called, No device to delete.
I've seen no consistency as to which element of the tensor ends up with nan values, but whichever row contains a nan ends up entirely nan. Numerical values elsewhere in the tensor are random, as expected.
In case it was an issue with the inertia-free links, I tried adding very small (but larger than the "ignore this link" threshold in the Parser code) inertial values to the _shoulder_thigh links, then ran the sim with and without these changes (and with and without commenting out the warning in the Parser); no change in behavior was observed.
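For reference, this is the kind of quick check I ran to confirm which links lack inertial data; it's just a sketch, and the URDF path below is a placeholder for wherever the Go1 model gets copied, not the exact path in the repo.

```python
# Sketch: list URDF links that have no <inertial> element.
# The path is a placeholder; point it at wherever the Go1 URDF was copied.
import xml.etree.ElementTree as ET

URDF_PATH = "data/robot/unitree/go1.urdf"  # hypothetical location

tree = ET.parse(URDF_PATH)
for link in tree.getroot().iter("link"):
    if link.find("inertial") is None:
        print(f"Link without inertial data: {link.get('name')}")
```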
I've managed to get two bits of insight while playing around with setup parameters:
(1) Changing the number of CPU cores/environments trained in parallel also changes the number of iterations before failure. At a timestep of 5e-4 with no other changes besides the number of cores:
num_cpu = 1 : Fails in 5 iterations
num_cpu = 2 : Fails in 56 iterations
num_cpu = 3 : Fails in 13 iterations
num_cpu = 4 : Fails in 19 iterations
num_cpu = 5 : Fails in 17 iterations
num_cpu >= 6 : Fails immediately (mostly*)
*EDIT: Some num_cpu selections > 6 manage to succeed for a few iterations, with no discernible pattern. For instance, 6 iterations are completed when running on 26 cores.
(2) The number of iterations before failure also increases as the timestep decreases; for instance, with a timestep of 1e-3 (twice the default), using 14 CPU cores yields failure after only 5 iterations.
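For context on what num_cpu controls in my setup, here is a minimal sketch of how the parallel environments are built, assuming the usual stable-baselines3 SubprocVecEnv pattern; "Pendulum-v1" is just a stand-in for the gym-chrono quadruped env, whose exact construction I've omitted.

```python
# Sketch of what num_cpu controls: one simulation process per core.
# "Pendulum-v1" is a stand-in for the gym-chrono quadruped environment.
import gymnasium as gym
from stable_baselines3.common.vec_env import SubprocVecEnv

num_cpu = 4  # number of parallel environments / Chrono processes

def make_env(rank):
    def _init():
        env = gym.make("Pendulum-v1")  # stand-in for the quadruped env
        env.reset(seed=rank)           # seed each worker differently
        return env
    return _init

if __name__ == "__main__":
    # Each PPO iteration collects n_steps transitions from every one of
    # these num_cpu environments, so total env steps scale with num_cpu.
    vec_env = SubprocVecEnv([make_env(i) for i in range(num_cpu)])
    print(vec_env.num_envs)
```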
The fact that the number of successful iterations depends on the number of CPU cores (i.e., on the total steps accomplished) suggests to me that it might be a memory issue, as opposed to a core dynamics-engine problem where there is simply some chance of the dynamics breaking on any given step. However, the lack of any consistently decreasing relationship between the number of cores used and the number of successful iterations confuses me.
If both the number of CPU cores and the environment step size are held constant, failure occurs at exactly the same number of iterations each time; for instance, with the default step size and 1 CPU core, it always fails after 5 successful iterations. To me, this also points toward a memory issue, but I'm not entirely sure how to deal with that if it is the case. Since not even a single iteration completes under the default configuration, the memory-leak workaround addressed in Issue #14 wouldn't help, as I could never save even a single checkpoint.
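To narrow down whether the nan first appears in the observations coming back from Chrono or inside the policy itself, one check I'm planning to try is wrapping the vectorized environment with stable-baselines3's VecCheckNan and turning on PyTorch's anomaly detection. A minimal sketch, again using "Pendulum-v1" as a stand-in for the gym-chrono quadruped env:

```python
# Sketch of trapping the first nan, using standard SB3/PyTorch tooling.
# "Pendulum-v1" is a stand-in for the gym-chrono quadruped environment.
import torch
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv, VecCheckNan

torch.autograd.set_detect_anomaly(True)  # locate the op that produced the nan

if __name__ == "__main__":
    vec_env = make_vec_env("Pendulum-v1", n_envs=4, vec_env_cls=SubprocVecEnv)
    # Raise as soon as a nan/inf shows up in observations, rewards, or
    # actions, which should reveal whether the nan originates on the
    # simulation side or inside the policy network.
    vec_env = VecCheckNan(vec_env, raise_exception=True)
    model = PPO("MlpPolicy", vec_env, verbose=1)
    model.learn(10_000)
```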
Any idea why this could be happening? Is this a memory-leak issue, a dynamics-engine issue, something unique to the Go1 robot model, or something completely different? Any tips on how to go about debugging?
The custom system I'm interested in using in this simulator has leg-like suspension, so this simple example is pretty important to me; if the core dynamics engine has trouble handling these sorts of systems in general, rather than just this particular model, that would be useful to know. Thanks in advance!
Edit 2: The cobra_wpts.py example seems to run without this issue (though it's pretty slow given the use of the system's camera?); looks like this might be a Unitree-specific problem, but I'm not sure how to debug it.
System info: Ubuntu 22.04