Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cohttp-eio.Server: Don't blow up in callback on client read timeout. #1021

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

mefyl
Copy link
Contributor

@mefyl mefyl commented Feb 8, 2024

Yet another escaping exception. We can't catch the specific Eio_unix.Unix_error (ETIMEDOUT, _, _) since this is backend agnostic, so I'm stopping all backend exceptions. Which I dislike, suggestions welcome.

Backtrace
routine-push-notifications: internal error, uncaught exception:
                            Eio.Io Unix_error (Operation timed out, "readv", "")
                            Raised at Eio_posix__Flow.Impl.single_read in file "lib_eio_posix/flow.ml", line 88, characters 55-85
                            Called from Eio__Flow.single_read in file "lib_eio/flow.ml", line 85, characters 12-31
                            Called from Eio__Buf_read.ensure_slow_path in file "lib_eio/buf_read.ml", line 122, characters 18-50
                            Called from Eio__Buf_read.ensure in file "lib_eio/buf_read.ml" (inlined), line 131, characters 20-40
                            Called from Cohttp_eio__Io.IO.refill in file "cohttp-eio/src/io.ml", line 17, characters 29-62
                            Called from Cohttp__Request.Make.read in file "cohttp/src/request.ml", line 183, characters 8-20
                            Called from Cohttp_eio__Server.read in file "cohttp-eio/src/server.ml", line 46, characters 8-29
                            Called from Cohttp_eio__Server.callback.handle in file "cohttp-eio/src/server.ml", line 98, characters 10-20

@talex5
Copy link
Contributor

talex5 commented Feb 8, 2024

Why do we want to stop exceptions escaping? Typically they will just be logged by the user's on_error handler. And this code is just logging them anyway!

I think it makes sense to drop connection-reset errors (or log them at debug level), since dropping connections is normal behaviour for a browser when you close a tab, but the others should be passed on I think.

@mefyl
Copy link
Contributor Author

mefyl commented Feb 8, 2024

I would understand if you'd prefer to have it just below in run instead of callback, but it's truly needed at least there lest the http server will shut down if one client times out.

Now regarding whether it belongs in callback, I think the argument is similar to connection_reset. The client could reset the connection mid-request and callback would finish silently, I think it should be the same for timeouts. Otherwise, I think we're setting up users of the library for failure as you pretty much always want to catch and ignore these. We'd need to document precisely that these can escape, and it is my opinion that we will only see servers written with cohttp-eio restart on such errors until all holes are plugged. And every single person writing a server with cohttp-eio will end up rewriting the same try/with.

Now what I would personally prefer would be for callback to return a Result.t, which has the best of both worlds: the user can't miss an error that will crash her server, but we also don't forcibly hide errors from her if she wants to specifically handle them. But that would require changing Cohttp.Generic.Server.S.

So my take is that 1/ it's mandatory to handle the exception in run 2/ not handling it in callback means setting up an endless stream of users for unexpected server shutdowns until they all write the same try/with.

@talex5
Copy link
Contributor

talex5 commented Feb 8, 2024

but it's truly needed at least there lest the http server will shut down if one client times out.

That would only happen if the user runs the server with ~on_error:raise. Which I see the example does. We should probably change that to e.g.

  Cohttp_eio.Server.run socket server
    ~on_error:(fun ex -> Logs.warn (fun f -> f "%a" Eio.Exn.pp ex))

@mefyl
Copy link
Contributor Author

mefyl commented Feb 8, 2024

You're right that it's because of the example's on_error; if we change that I suppose that as far as run goes, it's explicit enough for the user that exceptions must be handled, although we don't specify which ones.

Regarding callback, I personally dislike the fact a whole set of exception can escape, because a straightforward usage of the library creates servers that stop on networking issues. To me callback handles the interaction with a client, and that includes exiting properly on external conditions. Otherwise we're basically asking users to write the same try/with in every single server instance, which sounds like a catch 22 to me. Not missing any also isn't trivial, we handled Eio.Net.E, but missed Eio.Exn.X, and possibly others. In the end one ends up in a painful iterative process of deploying servers, monitoring the combination of exception that can escape and handling them one by one. That is, unless you flip the table before that and end up catching ALL exceptions because you just can't risk it in production, but then beware not to catch Cancelled exceptions, etc. All in all, so many ways to shoot oneself in the foot.

I think it's healthier for the library to do the most correct thing by default, which is handling interaction with the HTTP client including connection issues, or return a Result to force proper handling.

@talex5
Copy link
Contributor

talex5 commented Feb 8, 2024

because a straightforward usage of the library creates servers that stop on networking issues

It shouldn't do. Even if you call Net.accept_fork manually, you still need to provide an on_error handler.

Note that on_error isn't called for cancelled exceptions (see https://ocaml-multicore.github.io/eio/eio/Eio/Net/index.html#val-accept_fork), so we don't need to handle that here.

Otherwise we're basically asking users to write the same try/with in every single server instance

When would users use callback directly, though? I would think users would normally use the provided run function, starting with the logging example we provide and customising as desired (e.g. they could choose to crash on non-IO errors instead of logging them). It's not really clear what you should do with e.g. a Not_found exception handling a connection. I tend to favour just logging it, so occasional errors don't take the server down, and having a separate health-check restart the service if it stops responding.

I think having users log everything by default and adjusting it if the logs get filled with e.g. timeout spam would be reasonable. But in some cases seeing that timeouts keep happening would be useful. It's generally nicer for users if they can respond to problems by changing their code rather than having to file bug-reports.

@mefyl
Copy link
Contributor Author

mefyl commented Feb 29, 2024

Sorry for the delay, I was successively off and swamped these past weeks.

Agreed, accept_fork's on_error will force you to somehow handle these exceptions, I was still tainted by the ~on_error:raise, sorry about that.

I'm still a bit hesitant with the semantics of callback : it transparently handles all protocol level error and some transport level errors. If the client sends invalid headers, it will answer with bad_request, if the client closes the connection mid request, it will ignore it, etc. I'm not sure what the criteria is to determine what should be handled automatically and what should be forwarded to the user.

I'm also always worried about the possibly escaping exception. Trying to properly handle them individually is very tedious, as it's hard to determine the set of possible exceptions, and I personally end up watching what exception kill my servers, add it to the list, rinse and repeat until hopefully they are all handled. The alternative is the infamous catching and ignoring all exceptions, but then you're also muting "no space left on device" or "file descriptor exhausted" exceptions, in which case you truly wanted the container to shut down and be replaced. Hence my idea that the library should handle all errors related to individual connections, and let through only exceptions that actually impact the process as a whole.

My 2 cents; I can easily wrap the callbacks myself, feel free to close if you're unconvinced.

@talex5
Copy link
Contributor

talex5 commented Mar 5, 2024

The current situation (since #1023) is:

  • Without this PR: timeouts are logged at warning level, but the user can override if desired.
  • With this PR: timeouts are logged at info level, and the user can't override.

I don't mind too much either way, but the first is simpler.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants