owl_vit training problem: unable to connect to Google during training, so data cannot be retrieved. How can the data be stored locally and read in? #1091

Open
lxyzler opened this issue Aug 8, 2024 · 0 comments

lxyzler commented Aug 8, 2024

python -m scenic.projects.owl_vit.main --alsologtostderr=true --workdir=/tmp/training --config=scenic/projects/owl_vit/configs/clip_b32_finetune.py

2024-08-08 01:14:33.266603: W external/xla/xla/service/gpu/nvptx_compiler.cc:836] The NVIDIA driver's CUDA version is 12.2 which is older than the PTX compiler version (12.5.82). Because the driver is older than the PTX compiler version, XLA is disabling parallel compilation, which may slow down compilation. You should update your NVIDIA driver or use the NVIDIA-provided CUDA forward compatibility packages.
I0808 01:14:37.952140 140605538547520 app.py:92] JAX host: 0 / 1
I0808 01:14:37.952368 140605538547520 app.py:93] JAX devices: [CudaDevice(id=0), CudaDevice(id=1), CudaDevice(id=2), CudaDevice(id=3), CudaDevice(id=4), CudaDevice(id=5), CudaDevice(id=6), CudaDevice(id=7)]
I0808 01:14:37.952456 140605538547520 local.py:45] Setting task status: host_id: 0, host_count: 1
I0808 01:14:37.952512 140605538547520 local.py:50] Created artifact Workdir of type ArtifactType.DIRECTORY and value /tmp/training.
I0808 01:14:37.954501 140605538547520 app.py:104] RNG: [0 0]
I0808 01:14:38.603692 140605538547520 checkpoints.py:1101] Found no checkpoint files in /tmp/training with prefix checkpoint_
&&&&&&&&&&&&&&&&&&&&&&&&&&&&&
I0808 01:14:38.604115 140605538547520 train_utils.py:380] device_count: 8
I0808 01:14:38.604308 140605538547520 train_utils.py:381] num_hosts : 1
I0808 01:14:38.604445 140605538547520 train_utils.py:382] host_id : 0
I0808 01:14:38.605386 140605538547520 train_utils.py:405] local_batch_size : 256
I0808 01:14:38.605548 140605538547520 train_utils.py:406] device_batch_size : 32
2024-08-08 01:14:38.973571: W external/local_tsl/tsl/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata.google.internal".
2024-08-08 01:15:39.983812: E external/local_tsl/tsl/platform/cloud/curl_http_request.cc:610] The transmission of request 0xdc1a0d0 (URI: https://www.googleapis.com/storage/v1/b/tfds-data/o/dataset_info%2Flvis%2F1.3.0?fields=size%2Cgeneration%2Cupdated) has been stuck at 0 of 0 bytes for 61 seconds and will be aborted. CURL timing information: lookup time: 0.010952 (No error), connect time: 0 (No error), pre-transfer time: 0 (No error), start-transfer time: 0 (No error)
