Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent panic: Custom error class must have a builder registered: Uncaught null #730

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

mlafeldt
Copy link
Contributor

@mlafeldt mlafeldt commented May 6, 2024

Note: This isn't so much a fix (yet) but a starting point for discussion.

We've been running into a panic in to_v8_error: Custom error class must have a builder registered: Uncaught null. This panic has the unfortunate side effect of being unrecoverable, taking down one of our production services.

Here's a stack trace from lldb:

Unexpectedly panicked!: Custom error class must have a builder registered: Uncaught null
...
fatal runtime error: failed to initiate panic, error 5
Process 34421 stopped
* thread #21, stop reason = signal SIGABRT
    frame #0: 0x000000019e462a60 libsystem_kernel.dylib`__pthread_kill + 8
libsystem_kernel.dylib`__pthread_kill:
->  0x19e462a60 <+8>:  b.lo   0x19e462a80    ; <+40>
    0x19e462a64 <+12>: pacibsp
    0x19e462a68 <+16>: stp    x29, x30, [sp, #-0x10]!
    0x19e462a6c <+20>: mov    x29, sp
(lldb) bt
* thread #21, stop reason = signal SIGABRT
  * frame #0: 0x000000019e462a60 libsystem_kernel.dylib`__pthread_kill + 8
    frame #1: 0x000000019e49ac20 libsystem_pthread.dylib`pthread_kill + 288
    frame #2: 0x000000019e3a7a20 libsystem_c.dylib`abort + 180
    frame #3: 0x00000001064e8ffc my_project`std::sys::pal::unix::abort_internal::h7fdb07780559b948 at mod.rs:369:14 [opt]
    frame #4: 0x00000001064e2a80 my_project`rust_panic at panicking.rs:832:5 [opt]
    frame #5: 0x00000001064e26d0 my_project`std::panicking::rust_panic_with_hook::h84760468187ddc85 at panicking.rs:801:5 [opt]
    frame #6: 0x00000001064e2034 my_project`std::panicking::begin_panic_handler::_$u7b$$u7b$closure$u7d$$u7d$::he666a5eb600a7203 at panicking.rs:657:13 [opt]
    frame #7: 0x00000001064e0b38 my_project`std::sys_common::backtrace::__rust_end_short_backtrace::h592f44d2bf9f843f at backtrace.rs:171:18 [opt]
    frame #8: 0x00000001064e1dac my_project`rust_begin_unwind at panicking.rs:645:5 [opt]
    frame #9: 0x0000000106595280 my_project`core::panicking::panic_fmt::h98bbf7bdf4994454 at panicking.rs:72:14 [opt]
    frame #10: 0x0000000103e021cc my_project`core::panicking::panic_display::h49a5f85cc12b168a(x=0x0000000171fe9230) at panicking.rs:197:5
    frame #11: 0x000000010656e3f8 my_project`deno_core::error::to_v8_error::panic_cold_display::h5afd472d602e0190(arg=0x0000000171fe9230) at panic.rs:99:13
    frame #12: 0x0000000103dbedc4 my_project`deno_core::error::to_v8_error::h815a6b2d2ccf79fc(scope=0x0000000171feaf78, get_class=&dyn core::ops::function::Fn<(&anyhow::Error), Output=&str> @ 0x0000000171fe9388, error=0x0000000171feb6e8) at error.rs:130:7
    frame #13: 0x0000000100192fa0 my_project`my_project_ext::bindings::op_sign_ecdsa::op_sign_ecdsa::v8_fn_ptr::h37d3f58f24cd364b [inlined] my_project_ext::bindings::op_sign_ecdsa::op_sign_ecdsa::slow_function_impl::hc80cd1f40c0a8958(info=0x0000000171fed778) at bindings.rs:255:1
    frame #14: 0x000000010018faf8 my_project`my_project_ext::bindings::op_sign_ecdsa::op_sign_ecdsa::v8_fn_ptr::h37d3f58f24cd364b(info=0x0000000171fed778) at bindings.rs:255:1
    frame #15: 0x0000000105342f98 my_project`Builtins_CallApiCallbackGeneric + 184
    frame #16: 0x0000000105340ef0 my_project`Builtins_InterpreterEntryTrampoline + 272
    frame #17: 0x0000000105340ef0 my_project`Builtins_InterpreterEntryTrampoline + 272
    frame #18: 0x0000000105340ef0 my_project`Builtins_InterpreterEntryTrampoline + 272
    frame #19: 0x0000000105340ef0 my_project`Builtins_InterpreterEntryTrampoline + 272
    frame #20: 0x000000010533ec0c my_project`Builtins_JSEntryTrampoline + 172
    frame #21: 0x000000010533e8f4 my_project`Builtins_JSEntry + 148
    frame #22: 0x000000010432e080 my_project`::Invoke() at execution.cc:418:22 [opt]
    frame #23: 0x000000010432e660 my_project`::CallScript() at execution.cc:515:10 [opt]
    frame #24: 0x00000001042236d8 my_project`::Run() at api.cc:2170:7 [opt]
    frame #25: 0x0000000103e50e84 my_project`v8::script::_$LT$impl$u20$v8..data..Script$GT$::run::_$u7b$$u7b$closure$u7d$$u7d$::hc55df96f0e9b0c0f((null)={closure_env#0} @ 0x0000000171fede38, sd=0x00006000021245b0) at script.rs:94:29
    frame #26: 0x0000000103db1654 my_project`v8::script::_$LT$impl$u20$v8..data..Script$GT$::run::ha661a3fa3c29b130 [inlined] v8::scope::HandleScope$LT$$LP$$RP$$GT$::cast_local::h2648d36164639866(self=0x0000000171fee3d8, f={closure_env#0} @ 0x0000000171fedf68) at scope.rs:239:21
    frame #27: 0x0000000103db1374 my_project`v8::script::_$LT$impl$u20$v8..data..Script$GT$::run::ha661a3fa3c29b130(self=0x000000015405da48, scope=0x0000000171fee3d8) at script.rs:94:7
    frame #28: 0x0000000103f25450 my_project`deno_core::runtime::jsrealm::JsRealm::execute_script::hc4f0f80d4eb346fa(self=0x0000000171ff4c98, isolate=0x00000001380b0000, name=FastString @ 0x0000000171fee950, source_code=FastString @ 0x0000000171fee968) at jsrealm.rs:289:11
    frame #29: 0x000000010273864c my_project`deno_core::runtime::jsruntime::JsRuntime::execute_script::haf61263922869892(self=0x0000000171ff4c80, name=(data_ptr = "<user_provided_script>Uncaught Error: execution terminatedError running worker.execute_script(): Your function exceeded the maximum runtime of ms and was terminated.Your function exceeded the maximum memory of  MB and was terminated.Event loop has completed: ", length = 22), source_code=String @ 0x0000000171fef2f0) at jsruntime.rs:1346:5
    frame #30: 0x0000000100130ff8 my_project`my_project_server::runtime::execute_js::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::h343ae0eabc6afa3b((null)=0x0000000171ff41c0) at runtime.rs:302:21
...

For context: We've implemented ops that communicate with another process via gRPC. If that other process disconnects for whatever reason, the panic is (sometimes?) triggered. In that case to_v8_error receives this error from tonic:

Error {
    context: "op_sign_ecdsa",
    source: Disconnected,
}

And the JsError inside the has_caught block looks like this:

JsError {
    name: None,
    message: None,
    stack: None,
    cause: None,
    exception_message: "Uncaught null",
    frames: [],
    source_line: None,
    source_line_frame_index: None,
    aggregated: None,
}

As you can see in the diff, I've added a workaround that works for us but is likely not a proper fix.

I honestly don't know what "custom error class" means here or why it only fails in this particular case.

This problem has existed since at least Deno v1.37, with the difference that the affected ops now require to be marked as reentrant.

I can provide more info if needed. ✌️

@mlafeldt mlafeldt force-pushed the fix/v8-error-panic branch 2 times, most recently from a0c5aab to 258da57 Compare May 11, 2024 09:32
@mlafeldt
Copy link
Contributor Author

Rebased

@bartlomieju
Copy link
Member

@mlafeldt this seems reasonable, could you add a unit test that exercises this code path?

Return the underlying error message instead of panicking with:
"Custom error class must have a builder registered: Uncaught null".
@mlafeldt
Copy link
Contributor Author

@mlafeldt this seems reasonable, could you add a unit test that exercises this code path?

@bartlomieju I'm trying. :)

However, as mentioned, I've yet to understand what's going on exactly.

The Uncaught null situation seems to occur when a reentrant op fails (in our case due to a closed flume channel) and the runtime is aborted by isolate.terminate_execution() at the same time. And it seems to happen for any error type (casting to generic_error did not help), so "Custom error class must have a builder registered" might not be the actual problem.

@bartlomieju
Copy link
Member

Interesting, any chance you could try to reproduce similar setup? It might be easier to do so in the checkin directory which is sort of a standalone runtime, than trying to do a unit test.

@mlafeldt
Copy link
Contributor Author

Thanks for the tip. Will take another look.

My main question right now is why maybe_exception ends up being None despite Uncaught null (even in case of a generic error), which triggers the panic.

deno_core/core/error.rs

Lines 145 to 157 in c106dd1

let maybe_exception = cb.call(tc_scope, this, &args);
match maybe_exception {
Some(exception) => exception,
None => {
let mut msg =
"Custom error class must have a builder registered".to_string();
if tc_scope.has_caught() {
let e = tc_scope.exception().unwrap();
let js_error = JsError::from_v8_exception(tc_scope, e);
msg = format!("{}: {}", msg, js_error.exception_message);
}
panic!("{}", msg);

@bartlomieju
Copy link
Member

I'm not entirely sure, it might be a bug in rusty_v8 that improperly casts null into None.

@bartlomieju
Copy link
Member

@mlafeldt sorry for a slow turn-around. Do you have a test case for this particular panic? If so, can you please link it and I'll look into the problem. I'm really not a fan of matching on string repr of the error to change behavior (even though we already do it in a few places 😬).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants