Memory Leak when Recompiling Modules and Reinstantiating Runtimes #1534
I was able to reproduce that issue, so I ran some pprof profiles and took a look at the code. My suspicion was: "some mmap is leaking", and that seems to be the case, but it was odd, because wazero uses […]. However, the […].

Please apply that change and see if it fixes the issue: 7367c2e

Patch: https://github.com/tetratelabs/wazero/commit/7367c2ea24b3127280bf7210be71cb972b55aa00.patch

I noticed the following:

If you take a look at […]: each […].

That is not the perfect solution; someone needs to find out why […]. |
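To make the "some mmap is leaking" intuition concrete: memory obtained directly from mmap lives outside the Go heap, so the garbage collector alone never returns it; something has to call munmap, either explicitly on close or from a finalizer. The following is a minimal, Unix-only Go sketch of that idea, not wazero's actual code; compiledCode and newCompiledCode are made-up names, and the mmap flags are only illustrative:

```go
// Hypothetical sketch (not wazero's code): executable buffers obtained via
// mmap are outside the Go heap, so the GC never reclaims them. If nothing
// calls munmap when a compiled module is thrown away, every recompile leaks
// another mapping.
package main

import (
	"log"
	"syscall"
)

type compiledCode struct {
	buf []byte // mmap'ed region holding machine code
}

func newCompiledCode(size int) (*compiledCode, error) {
	b, err := syscall.Mmap(-1, 0, size,
		syscall.PROT_READ|syscall.PROT_WRITE|syscall.PROT_EXEC,
		syscall.MAP_PRIVATE|syscall.MAP_ANON)
	if err != nil {
		return nil, err
	}
	return &compiledCode{buf: b}, nil
}

// Close must run when the compiled module is discarded; relying only on a
// finalizer (or forgetting entirely) leaves the mapping alive until exit.
func (c *compiledCode) Close() error {
	if c.buf == nil {
		return nil
	}
	err := syscall.Munmap(c.buf)
	c.buf = nil
	return err
}

func main() {
	c, err := newCompiledCode(64 * 1024)
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close() // explicit release is what keeps repeated recompiles flat
}
```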
This is great! Thank you!! 🙏 Testing with our contrived test seems to have a stable memory profile and runtime! 🎉 |
Grrr, fat fingered "enter"... ANYWAY, I was going to say that we tried it in our environment and actually started getting panics for what seems like a related reference.
This only appears to happen when we're recreating the runtime (calling the Close methods), not on first instantiation, or during routine operation. |
on it |
There are a lot of errors possible if a file cache is in use and it is closed in the middle of a function call. @abraithwaite is this possible in your case? |
#1537 reproduces the panic |
We're not using a file cache to my knowledge, unless one is created/set by default internally by wazero for something. |
Thanks @abraithwaite ... hmm, I was unable to trigger the panic except by yanking the cache. Any chance you can share any code that may be custom about the below:
|
Okay, looking at the stack trace and the fix in #1535, I think I understand what's going on. I don't believe it has to do with any filesystem cache. Our setup has hardly changed at all from this example I made a little while back: […]

What I see in #1535 is a change so that, when closing an engine, the parent references of the functions are reset to nil. I believe that when we deployed this, it panicked because we still had a function running: specifically, the start routine of the Python interpreter, which doesn't and shouldn't return until close is called (at least, that's my understanding of how it's implemented now). When we're hitting the following code, the error is […]:

wazero/internal/engine/compiler/engine.go Line 1056 in 451a1b6

So the panic is triggered because the field that #1535 now clears is nil while the function is still running. After that panic triggers, we enter the deferred recover routine on line 739:

wazero/internal/engine/compiler/engine.go Lines 733 to 745 in 451a1b6
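To make that failure mode concrete, here is a tiny, self-contained Go sketch; none of these types are wazero's (engine, parent, and call are made up). One goroutine keeps dereferencing a field while Close sets it to nil, producing a nil-pointer panic that a deferred recover then catches, mirroring the sequence described above:

```go
// Hypothetical illustration of the described race; all names are made up.
package main

import (
	"fmt"
	"sync"
	"time"
)

type parent struct{ name string }

type engine struct {
	mu sync.Mutex
	p  *parent
}

// call simulates a long-running guest function that touches e.p repeatedly.
func (e *engine) call(done <-chan struct{}) {
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered:", r) // the deferred recover analogue
		}
	}()
	for {
		select {
		case <-done:
			return
		default:
			e.mu.Lock()
			p := e.p
			e.mu.Unlock()
			_ = p.name // nil dereference once Close has cleared the field
			time.Sleep(time.Millisecond)
		}
	}
}

// Close clears internal references, mimicking the reset done in #1535.
func (e *engine) Close() {
	e.mu.Lock()
	e.p = nil
	e.mu.Unlock()
}

func main() {
	e := &engine{p: &parent{name: "python start routine"}}
	done := make(chan struct{})
	go e.call(done)
	time.Sleep(5 * time.Millisecond)
	e.Close() // closing while the "function" still runs triggers the panic
	time.Sleep(5 * time.Millisecond)
	close(done)
}
```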
The […].

This is our run and our close code:

    func (w WASM) Run() error {
        var err error
        _, err = w.runtime.InstantiateModule(w.modCtx, w.module, w.config)
        if errors.Is(err, sys.NewExitError(sys.ExitCodeContextCanceled)) {
            return nil
        }
        return err
    }

    func (w *WASM) Close(ctx context.Context) error {
        w.modCancel()
        w.stdin.Close()
        w.stdout.Close()
        return w.runtime.Close(ctx)
    }

We're setting […]. Perhaps canceling the context and then closing the runtime is causing a double close unintentionally?

Also, something I don't understand is what's actually happening in the close function for the engine. Is setting everything to nil really all that's necessary to stop the engine?

wazero/internal/engine/compiler/engine.go Lines 494 to 508 in 451a1b6
Finally, I see that […] both here:

wazero/internal/engine/compiler/engine.go Lines 723 to 732 in 451a1b6

and here:

wazero/internal/engine/compiler/engine.go Lines 749 to 752 in 451a1b6

For my own understanding, why is that?

Thanks again for the great work! 🤩 Love what you all are doing and it's been working pretty great for us. This stuff is so complex, I can't fathom how hard it must be to keep all these stateful things in your head. Best of luck and let me know if I can provide anything else! |
Since I'm in the "blame" for parts of this, I can help explain. The […]. I'd have to dig back into this bit, but they could probably be consolidated by moving lines 750-751 into the default case of the select on line 730. Without revisiting the code, I can't remember if there was some side-effecty thing requiring the […]. |
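Purely to illustrate the consolidation being suggested (the closed channel and both helpers below are hypothetical stand-ins, not wazero's code): the trailing check after the select can be folded into the select's default branch, so the condition has a single entry point.

```go
package main

import (
	"errors"
	"fmt"
)

var errClosed = errors.New("module closed")

// checkTwice mirrors the current shape: the "closed" condition is tested both
// inside the select and again afterwards.
func checkTwice(closed <-chan struct{}, invoke func()) error {
	select {
	case <-closed:
		return errClosed
	default:
		invoke()
	}
	select { // trailing check, the second occurrence
	case <-closed:
		return errClosed
	default:
		return nil
	}
}

// checkOnce is the suggested consolidation: the trailing check moves into the
// default branch of the first select.
func checkOnce(closed <-chan struct{}, invoke func()) error {
	select {
	case <-closed:
		return errClosed
	default:
		invoke()
		select {
		case <-closed:
			return errClosed
		default:
			return nil
		}
	}
}

func main() {
	closed := make(chan struct{})
	close(closed)
	fmt.Println(checkTwice(closed, func() {}), checkOnce(closed, func() {}))
}
```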
I suspect the issue is that the function that is left running is never given a chance to verify that the module has closed. In fact, I could verify this with a quick and dirty change: if I modify […]
and then I add an extra check above: […]
then the reproducer at abraithwaite@ca10894d won't fail. However, this is not really a fix because it introduces an extra atomic load at every single host function call 🤔 |
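A rough sketch of the kind of guard being described, with made-up names rather than the actual diff: a closed flag stored on the module engine and read with an atomic load before every host function invocation, which is exactly the per-call cost mentioned above.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

var errModuleClosed = errors.New("module already closed")

// moduleEngine is a hypothetical stand-in for the per-module call engine.
type moduleEngine struct {
	closed uint32 // set to 1 by Close; read atomically on every host call
}

func (m *moduleEngine) Close() {
	atomic.StoreUint32(&m.closed, 1)
}

// callHostFunction guards each host call with the extra check described above.
func (m *moduleEngine) callHostFunction(fn func() error) error {
	if atomic.LoadUint32(&m.closed) == 1 { // the extra atomic load per call
		return errModuleClosed
	}
	return fn()
}

func main() {
	m := &moduleEngine{}
	fmt.Println(m.callHostFunction(func() error { return nil })) // <nil>
	m.Close()
	fmt.Println(m.callHostFunction(func() error { return nil })) // module already closed
}
```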
To be clear, the example I linked with the race is/was a different issue. It's not directly relevant to this issue, I merely included it to show the configuration and setup of our runtime environment. |
oh I see. If that's representative at all (it does panic), it may be enough to apply this small patch: evacchi@f319648. It appears that the function that is running exits as expected (people correct me if I am wrong, but the […]). |
I think this may be related to #1546 -- it is not the exact same issue, though, because evacchi@f319648 doesn't fix it |
We'll try this patch. Looks as if there shouldn't be any major side effects except maybe slightly reduced debuggability, is that right? |
Not really; the error you were seeing was a side effect of #1535: because the fields were cleared but the function was still running, in some specific cases (like yours, on the Python interpreter) the method […] would panic. |
@abraithwaite I think I figured out what the issue really is. You don't need that patch, but you do need to ensure that the module is shut down safely. What is happening, e.g. in the example abraithwaite@ca10894d, is that you have a blocking read going on on stdin, and you can't just close the runtime abruptly, otherwise you will be clearing some internal fields while a function is potentially still running. What you really want to do is: cancel the context, wait for it to be done, close the file handles, wait until the module terminates cleanly, and only then close the runtime.
so, I added a done channel to the WASM struct, and the run routine becomes:

    func (w *WASM) Run() error {
        var err error
        w.moduleInstance, err = w.runtime.InstantiateModule(w.modCtx, w.module, w.config)
        close(w.done)
        if errors.Is(err, sys.NewExitError(sys.ExitCodeContextCanceled)) {
            return nil
        }
        return err
    }

then, your close routine becomes:

    func (w *WASM) Close(ctx context.Context) error {
        w.modCancel()               // cancel the context
        <-w.modCtx.Done()           // wait for Done()
        w.stdin.Close()             // close the file handles
        w.stdout.Close()
        <-w.done                    // wait until the module terminates cleanly
        return w.runtime.Close(ctx) // close the runtime
    }

For an example, look here: https://github.com/evacchi/wazero/pull/24/files

Let me know if I understood everything correctly and this works :) |
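For context, the fields these snippets rely on might be declared roughly as follows. This is only a sketch: the field names are taken from the snippets above, while everything else (the constructor, using io.Closer for the pipes) is an assumption.

```go
// Sketch of the surrounding struct assumed by the Run/Close snippets above.
package sketch

import (
	"context"
	"io"

	"github.com/tetratelabs/wazero"
	"github.com/tetratelabs/wazero/api"
)

type WASM struct {
	runtime        wazero.Runtime
	module         wazero.CompiledModule
	config         wazero.ModuleConfig
	moduleInstance api.Module

	modCtx    context.Context
	modCancel context.CancelFunc

	stdin  io.Closer // guest stdin pipe (assumed)
	stdout io.Closer // guest stdout pipe (assumed)

	done chan struct{} // closed by Run once InstantiateModule returns
}

func newWASM(ctx context.Context, rt wazero.Runtime, mod wazero.CompiledModule,
	cfg wazero.ModuleConfig, stdin, stdout io.Closer) *WASM {
	modCtx, cancel := context.WithCancel(ctx)
	return &WASM{
		runtime:   rt,
		module:    mod,
		config:    cfg,
		modCtx:    modCtx,
		modCancel: cancel,
		stdin:     stdin,
		stdout:    stdout,
		done:      make(chan struct{}),
	}
}
```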
Hey @evacchi, really appreciate the feedback and advice! I'm curious if and how this will help with the memory leak mentioned in the OP. Does closing the module cleanly properly free the resources? I'm just trying to reconcile your advice with the original reproducing code I gave in the OP: https://github.com/runreveal/wazero_memleak/blob/main/main.go. I don't see how this much simpler example code has any issues, or I'm clearly missing something. I added context cancellation to match what you've given and it's still leaking memory. There are no file handles to close in the example I've given. |
wait, I thought that #1535 solved the memory issues, and now you had panics for that nil reference, so I was trying to address that part of the problem :P EDIT: to be more precise, the issue re: the nil reference should be specifically related to closing the runtime before the function has terminated; so you still have to wait on, say, a done channel before closing the runtime. |
Nope, apologies for the confusion. I try to stay pretty strict about staying on topic, but this one involved a lot of context in the code. I was talking about the panic not because the panic itself was a problem, but because I believed that the panic meant that some memory wasn't being freed properly. The panic only happened after #1535, so I mentioned it because I don't believe it's the right fix. |
Describe the bug
We are recompiling modules frequently because our users may change their code.
We're also reinstantiating runtimes as a way to easily clear the module cache and start with a "clean slate" (we think/thought) for isolation purposes.
We're using the python interpreter guest.
We're closing all references to all wazero structs before reinstantiating things.
The memory continues to grow between each close/create cycle.
To Reproduce
Reproducible; just run main.go:
https://github.com/runreveal/wazero_memleak
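Roughly, the loop has this shape (a simplified sketch, not the actual main.go in the repo; "guest.wasm" and the WASI setup are placeholders):

```go
// Sketch of the reproduction shape: compile a module, instantiate it, close
// every wazero object, repeat, and watch the process RSS grow externally
// (e.g. with top). Not the actual runreveal/wazero_memleak code.
package main

import (
	"context"
	"log"
	"os"

	"github.com/tetratelabs/wazero"
	"github.com/tetratelabs/wazero/imports/wasi_snapshot_preview1"
)

func main() {
	ctx := context.Background()
	wasm, err := os.ReadFile("guest.wasm") // placeholder guest binary
	if err != nil {
		log.Fatal(err)
	}

	for i := 0; ; i++ {
		rt := wazero.NewRuntime(ctx)
		wasi_snapshot_preview1.MustInstantiate(ctx, rt)

		compiled, err := rt.CompileModule(ctx, wasm)
		if err != nil {
			log.Fatal(err)
		}
		mod, err := rt.InstantiateModule(ctx, compiled, wazero.NewModuleConfig())
		if err == nil {
			_ = mod.Close(ctx)
		}
		_ = compiled.Close(ctx)
		_ = rt.Close(ctx) // no references kept past this point

		log.Printf("finished iteration %d", i)
	}
}
```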
Expected behavior
We expect memory to not grow when we're not holding any references to wazero objects any longer.
Environment (please complete the relevant information):