Frequent nan errors in quadruped_walk_train.py example #17

Open · jpmartin42 opened this issue Oct 8, 2024 · 0 comments
jpmartin42 commented Oct 8, 2024

System info: Ubuntu 22.04

I've been trying to get the quadruped_walk demo working to see how to integrate a custom robot (instead of a vehicle object type) into the gym-chrono architecture. However, it's been giving me nan errors during the training process.

The demo would not run as-is, since the Go1 URDF is not included in the feature/robot_model branch, but copying it from the primary Chrono repo to the right place in the file system worked fine. During runtime I also got warnings from the Parser module because the links with the suffix _shoulder_thigh have no inertial values, but this does not immediately cause the sim to fail. Before even a single iteration completes, I get the following error:

Traceback (most recent call last):
  File "/home/josh/wheel_limb_robotics/simulation/chrono/build_chrono/../gym-chrono/gym_chrono/train/quadruped_walk_train.py", line 126, in <module>
    model.learn(training_steps_per_save, callback=TensorboardCallback()) # This is where the errors happen
  File "/home/josh/miniforge3/envs/chrono/lib/python3.10/site-packages/stable_baselines3/ppo/ppo.py", line 315, in learn
    return super().learn(
  File "/home/josh/miniforge3/envs/chrono/lib/python3.10/site-packages/stable_baselines3/common/on_policy_algorithm.py", line 300, in learn
    continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps)
  File "/home/josh/miniforge3/envs/chrono/lib/python3.10/site-packages/stable_baselines3/common/on_policy_algorithm.py", line 179, in collect_rollouts
    actions, values, log_probs = self.policy(obs_tensor)
  File "/home/josh/miniforge3/envs/chrono/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/josh/miniforge3/envs/chrono/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/josh/miniforge3/envs/chrono/lib/python3.10/site-packages/stable_baselines3/common/policies.py", line 654, in forward
    distribution = self._get_action_dist_from_latent(latent_pi)
  File "/home/josh/miniforge3/envs/chrono/lib/python3.10/site-packages/stable_baselines3/common/policies.py", line 694, in _get_action_dist_from_latent
    return self.action_dist.proba_distribution(mean_actions, self.log_std)
  File "/home/josh/miniforge3/envs/chrono/lib/python3.10/site-packages/stable_baselines3/common/distributions.py", line 164, in proba_distribution
    self.distribution = Normal(mean_actions, action_std)
  File "/home/josh/miniforge3/envs/chrono/lib/python3.10/site-packages/torch/distributions/normal.py", line 57, in __init__
    super().__init__(batch_shape, validate_args=validate_args)
  File "/home/josh/miniforge3/envs/chrono/lib/python3.10/site-packages/torch/distributions/distribution.py", line 70, in __init__
    raise ValueError(
ValueError: Expected parameter loc (Tensor of shape (14, 12)) of distribution Normal(loc: torch.Size([14, 12]), scale: torch.Size([14, 12])) to satisfy the constraint Real(), but found invalid values:
tensor([[-8.0589e-03,  4.3196e-02, -7.4162e-02,  5.9881e-02,  4.0528e-02,
         -4.7503e-02,  8.0593e-03,  2.0091e-02,  1.1184e-01, -4.7284e-02,
         -8.9032e-02, -1.0514e-01],
        [-2.2090e-03,  6.7961e-02, -7.1827e-02,  7.1473e-02,  3.7119e-02,
         -7.2184e-02,  4.3422e-03,  1.5177e-03,  1.3603e-01, -3.3372e-02,
         -1.1141e-01, -1.3102e-01],
        [ 1.3250e-02,  5.9128e-02, -5.6648e-02,  5.1138e-02,  2.0293e-02,
         -6.5499e-02,  1.2423e-02,  1.0288e-03,  1.0119e-01, -2.8113e-02,
         -9.3130e-02, -1.0186e-01],
        [-2.1020e-02,  5.4918e-02, -6.5764e-02,  6.9619e-02,  4.7154e-02,
         -4.2845e-02,  1.1443e-02,  1.8706e-02,  1.2611e-01, -4.3891e-02,
         -8.8186e-02, -1.0046e-01],
        [-1.0927e-02,  4.6719e-02, -7.1205e-02,  5.6766e-02,  3.9871e-02,
         -4.6968e-02,  1.1437e-02,  1.8659e-02,  1.0487e-01, -4.7093e-02,
         -8.7585e-02, -1.0848e-01],
        [ 8.4263e-03,  4.9921e-02, -3.3118e-02,  4.1117e-02,  2.0859e-02,
         -4.6968e-02,  1.7116e-02, -1.6121e-04,  8.4842e-02, -2.1907e-02,
         -7.1619e-02, -5.7212e-02],
        [ 2.2725e-02,  7.3390e-02, -6.7675e-02,  2.6447e-02,  5.9733e-03,
         -7.6256e-02,  1.5032e-02,  1.7301e-02,  1.2674e-01, -2.5455e-02,
         -1.1069e-01, -9.8125e-02],
        [ 2.6836e-03,  5.3712e-02, -4.2878e-02,  4.3726e-02,  1.7628e-02,
         -5.7632e-02,  1.6966e-02,  4.4743e-03,  1.1360e-01, -2.1165e-02,
         -6.6384e-02, -6.1909e-02],
        [ 1.3325e-02,  5.5623e-02, -5.3349e-02,  4.4153e-02,  8.4252e-03,
         -8.2826e-02,  1.1303e-02, -7.5424e-03,  1.1314e-01, -3.6277e-02,
         -9.4814e-02, -1.0011e-01],
        [ 3.5857e-02,  5.7642e-02, -6.2666e-02,  4.0946e-02,  1.6901e-02,
         -9.1604e-02,  1.8820e-02,  2.3655e-03,  1.3066e-01, -3.3751e-02,
         -1.0431e-01, -9.9027e-02],
        [ 1.2128e-02,  6.9653e-02, -6.1580e-02,  3.1485e-02,  5.9991e-03,
         -8.1966e-02,  7.6969e-04,  4.0367e-03,  1.2343e-01, -3.2923e-02,
         -1.0237e-01, -1.1420e-01],
        [-2.3512e-02,  6.2824e-02, -5.2266e-02,  5.2913e-02,  3.8880e-02,
         -4.2128e-02, -8.0685e-03,  6.9352e-03,  1.1351e-01, -3.2397e-02,
         -7.8904e-02, -8.8390e-02],
        [-1.7412e-02,  1.1974e-01, -8.6609e-02,  5.2703e-02,  2.8904e-02,
         -8.8414e-02,  2.5642e-03,  1.1933e-02,  1.9067e-01, -1.8322e-02,
         -1.4315e-01, -1.6094e-01],
        [        nan,         nan,         nan,         nan,         nan,
                 nan,         nan,         nan,         nan,         nan,
                 nan,         nan]], device='cuda:0')
Destructor called, No device to delete.

There's no consistency as to which row of the tensor ends up with nan values, but whichever row goes nan (each row presumably corresponding to one of the 14 parallel environments) is nan across all of its elements. Numerical values elsewhere in the tensor are random, as expected.
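One way to localize the failure: stable-baselines3 ships a VecCheckNan wrapper that raises as soon as a nan/inf appears in an observation, reward, or action, which should at least distinguish nans coming out of the Chrono simulation from nans produced by the policy weights themselves. A minimal sketch (Pendulum-v1 stands in for the gym-chrono environment; make_env and num_cpu are placeholders for whatever quadruped_walk_train.py actually does):

```python
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import SubprocVecEnv, VecCheckNan

def make_env(rank):
    # Stand-in factory: the real script would construct the gym-chrono
    # quadruped environment here instead of Pendulum-v1.
    def _init():
        return gym.make("Pendulum-v1")
    return _init

if __name__ == "__main__":
    num_cpu = 4
    env = SubprocVecEnv([make_env(i) for i in range(num_cpu)])
    # Raise immediately when a nan/inf shows up in an observation, reward,
    # or action, instead of failing later inside the Normal() constructor.
    env = VecCheckNan(env, raise_exception=True)
    model = PPO("MlpPolicy", env, verbose=1)
    model.learn(total_timesteps=10_000)
```

If VecCheckNan fires on an observation before the policy ever outputs a nan, the problem is upstream in the simulation; if the actions go nan first, it's on the training side.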

In case it was an issue with the inertia-free links, I tried adding very small inertial values (larger than the "ignore this link" threshold in the Parser code) to the _shoulder_thigh links, then running the sim with and without these changes (and with and without commenting out the warning in Parser); no change was observed.
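For reference, a rough, illustrative version of that kind of URDF patch (the file paths and the mass/inertia values here are placeholders, not the exact numbers I used):

```python
# Illustrative one-off patch: give each *_shoulder_thigh link a tiny
# inertial block so the Parser no longer treats it as massless.
import xml.etree.ElementTree as ET

tree = ET.parse("go1.urdf")  # placeholder path
for link in tree.getroot().iter("link"):
    if link.get("name", "").endswith("_shoulder_thigh") \
            and link.find("inertial") is None:
        inertial = ET.SubElement(link, "inertial")
        ET.SubElement(inertial, "origin", xyz="0 0 0", rpy="0 0 0")
        ET.SubElement(inertial, "mass", value="1e-4")
        ET.SubElement(inertial, "inertia",
                      ixx="1e-6", ixy="0", ixz="0",
                      iyy="1e-6", iyz="0", izz="1e-6")
tree.write("go1_patched.urdf")
```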

I've managed to get two bits of insight while playing around with setup parameters:

(1) Changing the number of CPU cores/environments trained in parallel also changes the number of iterations before failure. At a timestep of 5e-4 with no other changes besides the number of cores:

  • num_cpu = 1 : Fails in 5 iterations
  • num_cpu = 2 : Fails in 56 iterations
  • num_cpu = 3 : Fails in 13 iterations
  • num_cpu = 4 : Fails in 19 iterations
  • num_cpu = 5 : Fails in 17 iterations
  • num_cpu >= 6 : Fails immediately (mostly*)

*EDIT: Some num_cpu selections >= 6 manage to succeed for a few iterations, with no discernible pattern. For instance, 6 iterations are completed when running on 26 cores.

(2) The number of iterations before failure also increases as the timestep decreases; for instance, with a timestep of 1e-3 (twice the default), using 14 CPU cores yields failure after 5 iterations.
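To pin down exactly when and where the first nan appears, one option is to instrument each environment with a thin wrapper along these lines (NanTracer is a made-up name; this assumes array observations and the gymnasium five-tuple step API):

```python
import numpy as np
import gymnasium as gym

class NanTracer(gym.Wrapper):
    """Debugging sketch: report the first step at which the observation
    or reward of this environment instance goes nan."""

    def __init__(self, env, rank):
        super().__init__(env)
        self.rank = rank
        self.steps = 0
        self.reported = False

    def reset(self, **kwargs):
        self.steps = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self.steps += 1
        if not self.reported and (np.any(np.isnan(obs)) or np.isnan(reward)):
            self.reported = True
            print(f"[env {self.rank}] first nan at local step {self.steps}")
        return obs, reward, terminated, truncated, info
```

Correlating those step counts across different num_cpu and timestep settings would show whether failure really tracks the total number of simulated steps.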

The fact that the number of successful iterations corresponds directly to the number of CPU cores (i.e., to the total steps accomplished) suggests to me that it might be a memory issue, as opposed to a core dynamics-engine problem where the dynamics simply have some chance of breaking. But the absence of a consistently decreasing relationship between the number of cores and the number of successful iterations confuses me.

If both the number of CPU cores and the environment step size are held constant, failure seems to occur at exactly the same number of iterations each time; for instance, with the default step size and 1 CPU core, it always fails after 5 successful iterations. To me, this implies a memory issue, but I'm not entirely sure how to deal with that if it were the case. And since not even a single iteration is completable with the default configuration, the memory-leak sidestepping addressed in Issue #14 wouldn't help: I couldn't even save a single checkpoint.
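If it is a memory issue, a callback along these lines could test the hypothesis by logging resident memory once per rollout (MemoryLogger is hypothetical; it needs psutil):

```python
import os
import psutil
from stable_baselines3.common.callbacks import BaseCallback

class MemoryLogger(BaseCallback):
    """Sketch: log the learner process's resident memory at the end of
    every rollout, to see whether failure coincides with a climbing RSS."""

    def _on_rollout_end(self) -> None:
        rss_mb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2
        print(f"{self.num_timesteps} steps: RSS = {rss_mb:.1f} MB")

    def _on_step(self) -> bool:
        # BaseCallback requires this hook; returning True keeps training going.
        return True
```

It can be passed alongside the existing callback, e.g. model.learn(training_steps_per_save, callback=[MemoryLogger(), TensorboardCallback()]), since SB3 accepts a list of callbacks. Note this only sees the learner process; with SubprocVecEnv the Chrono simulations live in worker subprocesses, whose memory would have to be checked separately (e.g. via psutil.Process().children()).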

Any idea why this could be happening? Is this a memory leak issue, a dynamics engine issue, something unique to the Go1 robot model, or something completely different? Any tips for how to go about debugging?

The custom system I'm interested in using in this simulator has leg-like suspension, so getting this simple example working is pretty important to me; if the core dynamics engine has trouble with these sorts of systems in general, rather than just this particular one, that would be useful to know. Thanks in advance!

Edit 2: The cobra_wpts.py example seems to run without this issue (though it's pretty slow, presumably because of its use of the camera sensor?); it looks like this might be a Unitree-specific problem, but I'm not sure how to debug it.
