_os_unfair_lock_corruption_abort after fork on MacOS #660

Closed

SGrondin opened this issue Dec 18, 2023 · 15 comments


@SGrondin
Collaborator

On MacOS, dune exec stress/stress_proc.exe fails with

Fatal error: exception Multiple exceptions:
- Failure("Unexpected end-of-file at offset 0")
- Failure("Unexpected end-of-file at offset 0")

It usually completes a few rounds successfully before failing. It sometimes fails with just a single exception instead of 2+ like the example above.

It completes a lot more rounds (and sometimes even reaches 100 rounds successfully) on the pool-systhreads branch.

@talex5
Collaborator

talex5 commented Dec 20, 2023

Can you get a strace (or mac equivalent) log for it? Does it still happen with n_domains = 1?

Could also try with OCAMLRUNPARAM=b, though I guess the error happens while reading the process output. The main interesting questions are: is it reading from the correct FD, and did it close the FD too soon?

@SGrondin
Collaborator Author

SGrondin commented Dec 20, 2023

Does it still happen with n_domains = 1?

No, it does not fail with n_domains = 1.

Interestingly, setting OCAMLRUNPARAM=b makes the failures occur about 3x less frequently, similar to running the stress test on the pool-systhreads branch. It seems the extra overhead creates enough "spacing" to make the race condition less frequent?

Similarly, sudo dtruss _build/default/stress/stress_proc.exe (dtruss being the macOS strace equivalent) slows things down, or introduces enough synchronization, that the bug simply doesn't occur while under observation.

I've tried with more domains, fewer domains, more procs per domain, all kinds of combinations, etc. I can't get it to crash under dtruss. Do you think there's any value in looking at the dtruss output of successful runs?

Could also try with OCAMLRUNPARAM=b, though I guess the error is trying to read the process output.

On current main (a26c3ebccc5dcf1480aa72c39dd645b43ab32db2), the backtrace is:

Fatal error: exception Failure("Unexpected end-of-file at offset 0")
Raised at Stdlib.failwith in file "stdlib.ml", line 29, characters 17-33
Called from Eio__Process.parse_out.(fun) in file "lib_eio/process.ml", line 154, characters 17-61
Called from Eio__core__Switch.run_internal in file "lib_eio/core/switch.ml", line 134, characters 8-12
Re-raised at Eio__core__Switch.maybe_raise_exs in file "lib_eio/core/switch.ml", line 118, characters 21-56
Called from Eio__core__Switch.run_internal in file "lib_eio/core/switch.ml", line 150, characters 4-21
Called from Eio__core__Cancel.with_cc in file "lib_eio/core/cancel.ml", line 115, characters 8-12
Re-raised at Eio__core__Cancel.with_cc in file "lib_eio/core/cancel.ml", line 117, characters 32-40
Called from Dune__exe__Stress_proc.run_in_domain.(fun) in file "stress/stress_proc.ml", line 12, characters 21-27
Called from Eio__core__Fiber.fork.(fun) in file "lib_eio/core/fiber.ml", line 20, characters 10-14
Re-raised at Eio__core__Switch.maybe_raise_exs in file "lib_eio/core/switch.ml", line 118, characters 21-56
Called from Eio__core__Switch.run_internal in file "lib_eio/core/switch.ml", line 139, characters 4-21
Called from Eio__core__Cancel.with_cc in file "lib_eio/core/cancel.ml", line 115, characters 8-12
Re-raised at Eio__core__Cancel.with_cc in file "lib_eio/core/cancel.ml", line 117, characters 32-40
Called from Eio__core__Fiber.any.(fun).wrap in file "lib_eio/core/fiber.ml", line 97, characters 16-20
Re-raised at Eio__core__Fiber.any in file "lib_eio/core/fiber.ml", line 138, characters 26-61
Called from Eio__Domain_manager.run.(fun) in file "lib_eio/domain_manager.ml", line 29, characters 4-231
Re-raised at Eio__Domain_manager.run.(fun) in file "lib_eio/domain_manager.ml", line 43, characters 11-19
Called from Eio_posix__Domain_mgr.wrap_backtrace in file "lib_eio_posix/domain_mgr.ml", line 80, characters 8-12
Re-raised at Eio_posix__Domain_mgr.unwrap_backtrace in file "lib_eio_posix/domain_mgr.ml", line 88, characters 22-57
Called from Eio__core__Fiber.fork.(fun) in file "lib_eio/core/fiber.ml", line 20, characters 10-14
Re-raised at Eio__core__Switch.maybe_raise_exs in file "lib_eio/core/switch.ml", line 118, characters 21-56
Called from Eio__core__Switch.run_internal in file "lib_eio/core/switch.ml", line 139, characters 4-21
Called from Eio__core__Cancel.with_cc in file "lib_eio/core/cancel.ml", line 115, characters 8-12
Re-raised at Eio__core__Cancel.with_cc in file "lib_eio/core/cancel.ml", line 117, characters 32-40
Called from Dune__exe__Stress_proc.main in file "stress/stress_proc.ml", line 21, characters 4-232
Called from Eio_posix__Sched.with_op in file "lib_eio_posix/sched.ml", line 310, characters 8-12
Re-raised at Eio_posix__Sched.with_op in file "lib_eio_posix/sched.ml", line 316, characters 4-12
Called from Eio_posix__Sched.run.(fun) in file "lib_eio_posix/sched.ml", line 368, characters 25-43
Re-raised at Eio_posix__Sched.run.fork.(fun) in file "lib_eio_posix/sched.ml", line 331, characters 12-76
Called from Stdlib__Fun.protect in file "fun.ml", line 33, characters 8-15
Re-raised at Stdlib__Fun.protect in file "fun.ml", line 38, characters 6-52
Called from Eio_posix__Sched.run in file "lib_eio_posix/sched.ml", line 364, characters 4-217
Called from Eio_posix__Sched.with_sched in file "lib_eio_posix/sched.ml", line 250, characters 8-12
Re-raised at Eio_posix__Sched.with_sched in file "lib_eio_posix/sched.ml", line 255, characters 4-39
Called from Dune__exe__Stress_proc in file "stress/stress_proc.ml", line 34, characters 2-71

@talex5
Collaborator

talex5 commented Dec 22, 2023

It would be good to find a way of making it fail more often. For example, the stress test runs a shell; what happens if you call /bin/echo directly? Or, if making the process faster doesn't help, what about the reverse: putting a sleep in it?

I tried running it on Linux with eio_posix and tsan and that didn't report anything.

Another way to make it fail sooner might be to replace the use of the high-level Eio.Process API with Eio_posix.Low_level.Process. That would also narrow down where the error could be.

@SGrondin
Collaborator Author

SGrondin commented Dec 22, 2023

    i. If there is a correlation between the number of domains and the failure rate, then that correlation is a fairly weak one.
    ii. The fastest and most reliable (on average) way I've found to make it fail is ["sh"; "-c"; "sleep 0.25 && echo " ^ string_of_int n].
    iii. Calling /bin/echo directly is fast enough to avoid the race condition
        • but it fails almost as much as ii. if I have 3 cores nearly maxed out with 3 yes > /dev/null & running in the background.
    iv. I noticed that if I start recompiling stress_proc.exe while it's running, the "jitter" introduced by the sudden addition of a CPU-bound process (and probably the OS moving threads around between cores) is enough to make it instantly fail 50% of the time.
    v. Running it under TSAN often (but not always) seems to make it stall after about 10 seconds: it's still consuming CPU, but not visibly progressing through test rounds. Sometimes it also fails the same way as it does without TSAN.

Here's my version using Eio_posix.Low_level.Process; it fails with WSIGNALED SIGILL, sometimes after 20+ rounds, often in under 5 rounds.

ASAN fails almost immediately with

==35289==ERROR: AddressSanitizer failed to deallocate 0x20000 (131072) bytes at address 0x000104f31800
AddressSanitizer: CHECK failed: sanitizer_posix.cpp:61 "(("unable to unmap" && 0)) != (0)" (0x0, 0x0) (tid=2740376)
    <empty stack>

Illegal instruction: 4

@talex5
Collaborator

talex5 commented Dec 22, 2023

So the child process is dying with SIGILL and your failwith is getting called to display the error? Do you get a coredump with that?

It looks like it must be either the forked launcher (e.g. action_execve) crashing or sh itself (which seems unlikely).

What happens if you remove the pipe stuff (inherit_fds, read_all pipe1_r)?

@SGrondin
Collaborator Author

(I edited the previous message to add ASAN information; make sure to refresh the page.)

Yeah, child process dies with SIGILL and failwith is just used to display the error.

I do not get a coredump.

It fails the same way when I use bash instead of sh. Same thing when I remove everything pipe-related.

@talex5
Collaborator

talex5 commented Dec 22, 2023

Maybe coredumps need to be enabled: https://stackoverflow.com/a/2080938/50926

(not sure if OCaml is expected to work with ASAN - does it work with other programs?)

@SGrondin
Collaborator Author

SGrondin commented Dec 22, 2023

I've had ASAN successfully detect a race condition in C++ code invoked from OCaml before.

I totally forgot about ulimit -c unlimited! Once enabled, I got a 3.9 GB core dump. Using lldb (since gdb doesn't understand MacOS core dumps):
lldb _build/default/stress/stress_proc.exe -c /cores/core.91104
I got the following backtrace!

* thread #1
  * frame #0: 0x00007ff811be7f23 libsystem_platform.dylib`_os_unfair_lock_corruption_abort + 37
    frame #1: 0x00007ff811be22e4 libsystem_platform.dylib`_os_unfair_lock_lock_slow + 257
    frame #2: 0x00007ff811a9e013 libsystem_c.dylib`localeconv_l + 55
    frame #3: 0x00007ff811a99b93 libsystem_c.dylib`__vfprintf + 113
    frame #4: 0x00007ff811acc868 libsystem_c.dylib`_vsnprintf + 249
    frame #5: 0x00007ff811aa72d0 libsystem_c.dylib`vsnprintf_l + 41
    frame #6: 0x00007ff811aa7389 libsystem_c.dylib`snprintf + 156
    frame #7: 0x00007ff81c6948d2 libsystem_coreservices.dylib`_dirhelper_init + 204
    frame #8: 0x00007ff811be1726 libsystem_platform.dylib`_os_once_callout + 18
    frame #9: 0x00007ff81c9c7c7f libSystem.B.dylib`libSystem_atfork_child + 48
    frame #10: 0x00007ff811ac282d libsystem_c.dylib`fork + 84
    frame #11: 0x000000010fb2f6d1 stress_proc.exe`caml_eio_posix_spawn(v_errors=271, v_actions=<unavailable>) at eio_posix_stubs.c:390:15 [opt]
    frame #12: 0x000000010fb6c4d7 stress_proc.exe`caml_c_call + 27
    frame #13: 0x000000010fa6cbf7 stress_proc.exe`camlEio_posix__Low_level.fun_2486 + 23
    frame #14: 0x000000010fa74d3a stress_proc.exe`camlEio_unix__Rcfd.use_609 + 106
    frame #15: 0x000000010fa6ca4d stress_proc.exe`camlEio_posix__Low_level.fun_2474 + 189
    frame #16: 0x000000010faa3a31 stress_proc.exe`camlEio__core__Switch.run_internal_1059 + 49
    frame #17: 0x000000010faa1822 stress_proc.exe`camlEio__core__Cancel.with_cc_1009 + 210
    frame #18: 0x000000010fa66895 stress_proc.exe`camlDune__exe__Stress_proc._echo2_532 + 421
    frame #19: 0x000000010fa66a23 stress_proc.exe`camlDune__exe__Stress_proc.fun_1442 + 35
    frame #20: 0x000000010faa8843 stress_proc.exe`camlEio__core__Fiber.fun_1348 + 67
    frame #21: 0x000000010fb128cb stress_proc.exe`camlStdlib__Fun.protect_320 + 107
    frame #22: 0x000000010fb6c804 stress_proc.exe`caml_runstack + 68
    frame #23: 0x000000010fb128cb stress_proc.exe`camlStdlib__Fun.protect_320 + 107
    frame #24: 0x000000010fa693dc stress_proc.exe`camlEio_posix__Sched.run_1324 + 284
    frame #25: 0x000000010fa68aef stress_proc.exe`camlEio_posix__Sched.with_sched_1189 + 543
    frame #26: 0x000000010fa66f0a stress_proc.exe`camlDune__exe__Stress_proc.entry + 250
    frame #27: 0x000000010fa638cb stress_proc.exe`caml_program + 2059
    frame #28: 0x000000010fb6c590 stress_proc.exe`caml_start_program + 112
    frame #29: 0x000000010fb6bdcb stress_proc.exe`caml_startup_common(argv=0x00007ff7b049d1e8, pooling=<unavailable>) at startup_nat.c:132:9 [opt]
    frame #30: 0x000000010fb6be3b stress_proc.exe`caml_main [inlined] caml_startup_exn(argv=<unavailable>) at startup_nat.c:139:10 [opt]
    frame #31: 0x000000010fb6be34 stress_proc.exe`caml_main [inlined] caml_startup(argv=<unavailable>) at startup_nat.c:144:15 [opt]
    frame #32: 0x000000010fb6be34 stress_proc.exe`caml_main(argv=<unavailable>) at startup_nat.c:151:3 [opt]
    frame #33: 0x000000010fb592ac stress_proc.exe`main(argc=<unavailable>, argv=<unavailable>) at main.c:37:3 [opt]
    frame #34: 0x000000011bdc252e dyld`start + 462

So yeah, it's this line: the fork call in caml_eio_posix_spawn (eio_posix_stubs.c:390, frame #11 in the backtrace above).
I'm now trying to find out what can cause _os_unfair_lock_corruption_abort.

@SGrondin
Collaborator Author

SGrondin commented Dec 22, 2023

It's difficult to find information on cases where the _os_unfair_lock_corruption_abort error comes from fork, but I found this: https://developer.apple.com/forums/thread/737464

Also this: https://hacks.mozilla.org/2022/10/improving-firefox-responsiveness-on-macos/

beware that it doesn’t play well with fork(). That’s because the lock stores internally the mach thread IDs to ensure consistent acquisition and release.
These IDs change after a call to fork() as the thread creates new ones when copying your process’ threads. This can lead to potential crashes in the child process. If your application uses fork(), or your library needs to be fork()-safe you’ll need to register at-fork handlers using pthread_atfork() to acquire all the locks in the parent before the fork, then release them after the fork (also in the parent), and reset them in the child.

However:

/* Note: fork actions MUST NOT allocate (either on the OCaml heap or with C malloc).
* This is because e.g. we might have forked while another thread in the parent had a lock.
* In the child, we inherit a copy of the locked mutex, but no corresponding thread to
* release it.
*/

EDIT: This OCaml 4.14 issue contains information regarding the pitfalls of mixing fork and pthreads, and advocates for posix_spawn instead of fork: ocaml/ocaml#12636
More interesting discussions: ocaml/ocaml-lsp#1194 and ocaml/dune#8090

@SGrondin
Collaborator Author

SGrondin commented Jan 5, 2024

@talex5 this PR (less than 30 minutes old!) may be related: ocaml/ocaml#12886

Quoting myself:

but it fails almost as much as ii. if I have 3 cores nearly maxed out with 3 yes > /dev/null & running in the background

The PR mentions thread switches causing issues, and the extra load from yes could definitely explain that.

@TheNumbat

TheNumbat commented Jan 5, 2024

Yeah, this sounds related. I figured out that code was missing while debugging a forked process that aborted after failing to lock a locked pthreads mutex. The macos equivalent appears to be _os_unfair_lock_corruption_abort.

@SGrondin SGrondin changed the title MacOS fails stress_proc with "Unexpected end-of-file at offset 0" _os_unfair_lock_corruption_abort after fork on MacOS Jan 6, 2024
@SGrondin
Collaborator Author

SGrondin commented Jan 6, 2024

@TheNumbat I just tried your proposed fix (ocaml/ocaml#12886) and unfortunately it doesn't fix this issue, so maybe they are not the same issue after all?

@talex5
Collaborator

talex5 commented Jan 8, 2024

Does that only apply to Unix.fork? We're not making any use of OCaml after forking, so I think it shouldn't matter. But the stack trace shows macOS using localeconv within the call to fork, which seems bad.

https://stackoverflow.com/questions/58076064/forkexec-process-with-threads-broken-on-macos-children-process-stuck-in-atf says "So far it seems that fork in multi-threaded environment is just broken on macOS."

http://sealiesoftware.com/blog/archive/2017/6/5/Objective-C_and_fork_in_macOS_1013.html suggests that OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES might help.

@SGrondin
Collaborator Author

SGrondin commented Jan 8, 2024

OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES did not make a difference; I tried it both at build time and at run time.

Every solution to this problem seems to mention the need to use posix_spawn on MacOS instead of fork; I haven't looked at what that would entail yet.

@SGrondin
Collaborator Author

I cannot explain it with certainty, but the problem is gone. I've tried to replicate it on the Eio commit where I first encountered the issue. It's not happening anymore, whether I use the OCaml trunk branch or any of the numerous open OCaml PRs addressing various systhreads issues.

I recently installed a large OS update (same major version, though), and I'm going to tentatively assume it contained a fix or mitigation of some kind.

I'll reopen this issue if it reoccurs.
