Crash on GPU in Roberta branch #727

Open
dlwh opened this issue Sep 12, 2024 · 0 comments

dlwh commented Sep 12, 2024

https://github.com/JulienDarve/levanter/tree/broken_gpu

F0912 11:59:01.941641  741969 shape_tree.cc:54] Check failed: result->children_start_id >= 0 (-1 vs. 0)
*** Check failure stack trace: ***
    @     0x7f0e2365c1d4  absl::lts_20230802::log_internal::LogMessage::SendToLog()
    @     0x7f0e2365c0d4  absl::lts_20230802::log_internal::LogMessage::Flush()
    @     0x7f0e2365c579  absl::lts_20230802::log_internal::LogMessageFatal::~LogMessageFatal()
    @     0x7f0e2076c1f1  xla::internal::IndexTable::operator[]()
    @     0x7f0e204fd1d4  xla::HloDataflowAnalysis::GetValueSet()
    @     0x7f0e2025905c  xla::BufferAssignment::GetUniqueSlice()
    @     0x7f0e200667ce  xla::gpu::GetAllocationSlice()
    @     0x7f0e1f270b3b  xla::gpu::(anonymous namespace)::GetResultSlice()
    @     0x7f0e1f274217  xla::ShapeUtil::ForEachSubshapeWithStatus<>()::{lambda()#1}::operator()()
    @     0x7f0e1f274088  xla::ShapeUtil::ForEachMutableSubshapeWithStatusHelper<>()
    @     0x7f0e1f27410f  xla::ShapeUtil::ForEachMutableSubshapeWithStatusHelper<>()
    @     0x7f0e1f26bbed  xla::gpu::DynamicSliceFusion::Emit()
    @     0x7f0e1c0cc4a4  xla::gpu::IrEmitterUnnested::EmitFusion()
    @     0x7f0e1c0d7135  xla::gpu::IrEmitterUnnested::EmitHloInstruction()
    @     0x7f0e1c0b9cbe  xla::gpu::IrEmitterUnnested::EmitHloComputation()
    @     0x7f0e1be94d2a  xla::gpu::CompileModuleToLlvmIr()
    @     0x7f0e1be75920  xla::gpu::GpuCompiler::CompileToBackendResult()
    @     0x7f0e1be784fe  xla::gpu::GpuCompiler::RunBackend()
    @     0x7f0e1be39982  xla::Service::BuildExecutable()
    @     0x7f0e1be01f55  xla::LocalService::CompileExecutables()
    @     0x7f0e1bdf5a14  xla::LocalClient::Compile()
    @     0x7f0e1bd9adcb  xla::PjRtStreamExecutorClient::CompileInternal()
    @     0x7f0e1bd9be7e  xla::PjRtStreamExecutorClient::Compile()
    
Stack (most recent call first):
  File "/sailhome/jdarve/miniconda3/envs/levanter/lib/python3.10/site-packages/jax/_src/compiler.py", line 260 in backend_compile
  File "/sailhome/jdarve/miniconda3/envs/levanter/lib/python3.10/site-packages/jax/_src/profiler.py", line 333 in wrapper
  File "/sailhome/jdarve/miniconda3/envs/levanter/lib/python3.10/site-packages/jax/_src/compiler.py", line 654 in _compile_and_write_cache
  File "/sailhome/jdarve/miniconda3/envs/levanter/lib/python3.10/site-packages/jax/_src/compiler.py", line 426 in compile_or_get_cached
  File "/sailhome/jdarve/miniconda3/envs/levanter/lib/python3.10/site-packages/jax/_src/interpreters/pxla.py", line 2639 in _cached_compilation
  File "/sailhome/jdarve/miniconda3/envs/levanter/lib/python3.10/site-packages/jax/_src/interpreters/pxla.py", line 2827 in from_hlo
  File "/sailhome/jdarve/miniconda3/envs/levanter/lib/python3.10/site-packages/jax/_src/interpreters/pxla.py", line 2313 in compile
  File "/sailhome/jdarve/miniconda3/envs/levanter/lib/python3.10/site-packages/jax/_src/pjit.py", line 1651 in _pjit_call_impl_python
  File "/sailhome/jdarve/miniconda3/envs/levanter/lib/python3.10/site-packages/jax/_src/pjit.py", line 1721 in call_impl_cache_miss
  File "/sailhome/jdarve/miniconda3/envs/levanter/lib/python3.10/site-packages/jax/_src/pjit.py", line 1739 in _pjit_call_impl
  File "/sailhome/jdarve/miniconda3/envs/levanter/lib/python3.10/site-packages/jax/_src/core.py", line 949 in process_primitive
  File "/sailhome/jdarve/miniconda3/envs/levanter/lib/python3.10/site-packages/jax/_src/core.py", line 443 in bind_with_trace
  File "/sailhome/jdarve/miniconda3/envs/levanter/lib/python3.10/site-packages/jax/_src/core.py", line 2782 in bind
  File "/sailhome/jdarve/miniconda3/envs/levanter/lib/python3.10/site-packages/jax/_src/pjit.py", line 190 in _python_pjit_helper
  File "/sailhome/jdarve/miniconda3/envs/levanter/lib/python3.10/site-packages/jax/_src/pjit.py", line 332 in cache_miss
  File "/sailhome/jdarve/miniconda3/envs/levanter/lib/python3.10/site-packages/jax/_src/traceback_util.py", line 180 in reraise_with_filtered_traceback
  File "/sailhome/jdarve/miniconda3/envs/levanter/lib/python3.10/site-packages/haliax/partitioning.py", line 337 in _call
  File "/sailhome/jdarve/miniconda3/envs/levanter/lib/python3.10/site-packages/equinox/_module.py", line 1078 in __call__
  File "/sailhome/jdarve/miniconda3/envs/levanter/lib/python3.10/site-packages/haliax/partitioning.py", line 261 in __call__
  File "/sailhome/jdarve/levanter/src/levanter/trainer.py", line 364 in train_step
  File "/sailhome/jdarve/levanter/src/levanter/trainer.py", line 380 in training_steps
  File "/sailhome/jdarve/levanter/src/levanter/trainer.py", line 397 in train
  File "/sailhome/jdarve/levanter/src/levanter/main/train_mlm.py", line 215 in main
  File "/sailhome/jdarve/levanter/src/levanter/config.py", line 84 in wrapper_inner
  File "/sailhome/jdarve/levanter/src/levanter/main/train_mlm.py", line 218 in <module>

Extension modules: jaxlib.cpu_feature_guard, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, zstandard.backend_c, pyarrow.lib, pyarrow._parquet, pyarrow._fs, pyarrow._azurefs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, ray._raylet, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, multidict._multidict, yarl._helpers_c, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, xxhash._xxhash, pyarrow._json, PIL._imaging, kiwisolver._cext, 
regex._regex (total: 86)
Aborted (core dumped)