cohttp-eio.Server: Don't blow up in `callback` on client read timeout. #1021

mefyl · 2024-02-08T08:57:57Z

Yet another escaping exception. We can't catch the specific Eio_unix.Unix_error (ETIMEDOUT, _, _) since this is backend agnostic, so I'm stopping all backend exceptions. Which I dislike, suggestions welcome.

Backtrace

routine-push-notifications: internal error, uncaught exception:
                            Eio.Io Unix_error (Operation timed out, "readv", "")
                            Raised at Eio_posix__Flow.Impl.single_read in file "lib_eio_posix/flow.ml", line 88, characters 55-85
                            Called from Eio__Flow.single_read in file "lib_eio/flow.ml", line 85, characters 12-31
                            Called from Eio__Buf_read.ensure_slow_path in file "lib_eio/buf_read.ml", line 122, characters 18-50
                            Called from Eio__Buf_read.ensure in file "lib_eio/buf_read.ml" (inlined), line 131, characters 20-40
                            Called from Cohttp_eio__Io.IO.refill in file "cohttp-eio/src/io.ml", line 17, characters 29-62
                            Called from Cohttp__Request.Make.read in file "cohttp/src/request.ml", line 183, characters 8-20
                            Called from Cohttp_eio__Server.read in file "cohttp-eio/src/server.ml", line 46, characters 8-29
                            Called from Cohttp_eio__Server.callback.handle in file "cohttp-eio/src/server.ml", line 98, characters 10-20

talex5 · 2024-02-08T09:55:29Z

Why do we want to stop exceptions escaping? Typically they will just be logged by the user's on_error handler. And this code is just logging them anyway!

I think it makes sense to drop connection-reset errors (or log them at debug level), since dropping connections is normal behaviour for a browser when you close a tab, but the others should be passed on I think.

mefyl · 2024-02-08T13:57:25Z

I would understand if you'd prefer to have it just below in run instead of callback, but it's truly needed at least there lest the http server will shut down if one client times out.

Now regarding whether it belongs in callback, I think the argument is similar to connection_reset. The client could reset the connection mid-request and callback would finish silently, I think it should be the same for timeouts. Otherwise, I think we're setting up users of the library for failure as you pretty much always want to catch and ignore these. We'd need to document precisely that these can escape, and it is my opinion that we will only see servers written with cohttp-eio restart on such errors until all holes are plugged. And every single person writing a server with cohttp-eio will end up rewriting the same try/with.

Now what I would personally prefer would be for callback to return a Result.t, which has the best of both worlds: the user can't miss an error that will crash her server, but we also don't forcibly hide errors from her if she wants to specifically handle them. But that would require changing Cohttp.Generic.Server.S.

So my take is that 1/ it's mandatory to handle the exception in run 2/ not handling it in callback means setting up an endless stream of users for unexpected server shutdowns until they all write the same try/with.

talex5 · 2024-02-08T14:11:30Z

but it's truly needed at least there lest the http server will shut down if one client times out.

That would only happen if the user runs the server with ~on_error:raise. Which I see the example does. We should probably change that to e.g.

  Cohttp_eio.Server.run socket server
    ~on_error:(fun ex -> Logs.warn (fun f -> f "%a" Eio.Exn.pp ex))

mefyl · 2024-02-08T15:34:08Z

You're right that it's because of the example's on_error; if we change that I suppose that as far as run goes, it's explicit enough for the user that exceptions must be handled, although we don't specify which ones.

Regarding callback, I personally dislike the fact a whole set of exception can escape, because a straightforward usage of the library creates servers that stop on networking issues. To me callback handles the interaction with a client, and that includes exiting properly on external conditions. Otherwise we're basically asking users to write the same try/with in every single server instance, which sounds like a catch 22 to me. Not missing any also isn't trivial, we handled Eio.Net.E, but missed Eio.Exn.X, and possibly others. In the end one ends up in a painful iterative process of deploying servers, monitoring the combination of exception that can escape and handling them one by one. That is, unless you flip the table before that and end up catching ALL exceptions because you just can't risk it in production, but then beware not to catch Cancelled exceptions, etc. All in all, so many ways to shoot oneself in the foot.

I think it's healthier for the library to do the most correct thing by default, which is handling interaction with the HTTP client including connection issues, or return a Result to force proper handling.

talex5 · 2024-02-08T16:26:53Z

because a straightforward usage of the library creates servers that stop on networking issues

It shouldn't do. Even if you call Net.accept_fork manually, you still need to provide an on_error handler.

Note that on_error isn't called for cancelled exceptions (see https://ocaml-multicore.github.io/eio/eio/Eio/Net/index.html#val-accept_fork), so we don't need to handle that here.

Otherwise we're basically asking users to write the same try/with in every single server instance

When would users use callback directly, though? I would think users would normally use the provided run function, starting with the logging example we provide and customising as desired (e.g. they could choose to crash on non-IO errors instead of logging them). It's not really clear what you should do with e.g. a Not_found exception handling a connection. I tend to favour just logging it, so occasional errors don't take the server down, and having a separate health-check restart the service if it stops responding.

I think having users log everything by default and adjusting it if the logs get filled with e.g. timeout spam would be reasonable. But in some cases seeing that timeouts keep happening would be useful. It's generally nicer for users if they can respond to problems by changing their code rather than having to file bug-reports.

mefyl · 2024-02-29T11:02:53Z

Sorry for the delay, I was successively off and swamped these past weeks.

Agreed, accept_fork's on_error will force you to somehow handle these exceptions, I was still tainted by the ~on_error:raise, sorry about that.

I'm still a bit hesitant with the semantics of callback : it transparently handles all protocol level error and some transport level errors. If the client sends invalid headers, it will answer with bad_request, if the client closes the connection mid request, it will ignore it, etc. I'm not sure what the criteria is to determine what should be handled automatically and what should be forwarded to the user.

I'm also always worried about the possibly escaping exception. Trying to properly handle them individually is very tedious, as it's hard to determine the set of possible exceptions, and I personally end up watching what exception kill my servers, add it to the list, rinse and repeat until hopefully they are all handled. The alternative is the infamous catching and ignoring all exceptions, but then you're also muting "no space left on device" or "file descriptor exhausted" exceptions, in which case you truly wanted the container to shut down and be replaced. Hence my idea that the library should handle all errors related to individual connections, and let through only exceptions that actually impact the process as a whole.

My 2 cents; I can easily wrap the callbacks myself, feel free to close if you're unconvinced.

talex5 · 2024-03-05T15:21:16Z

The current situation (since #1023) is:

Without this PR: timeouts are logged at warning level, but the user can override if desired.
With this PR: timeouts are logged at info level, and the user can't override.

I don't mind too much either way, but the first is simpler.

cohttp-eio.Server: Don't blow up in callback on client read timeout.

6399acc

mefyl force-pushed the master branch from 2297e64 to 6399acc Compare February 8, 2024 08:58

talex5 mentioned this pull request Feb 27, 2024

cohttp-eio: Improve error handling in example server #1023

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cohttp-eio.Server: Don't blow up in `callback` on client read timeout. #1021

cohttp-eio.Server: Don't blow up in `callback` on client read timeout. #1021

mefyl commented Feb 8, 2024

talex5 commented Feb 8, 2024

mefyl commented Feb 8, 2024 •

edited

Loading

talex5 commented Feb 8, 2024

mefyl commented Feb 8, 2024

talex5 commented Feb 8, 2024

mefyl commented Feb 29, 2024

talex5 commented Mar 5, 2024

cohttp-eio.Server: Don't blow up in callback on client read timeout. #1021

Are you sure you want to change the base?

cohttp-eio.Server: Don't blow up in callback on client read timeout. #1021

Conversation

mefyl commented Feb 8, 2024

talex5 commented Feb 8, 2024

mefyl commented Feb 8, 2024 • edited Loading

talex5 commented Feb 8, 2024

mefyl commented Feb 8, 2024

talex5 commented Feb 8, 2024

mefyl commented Feb 29, 2024

talex5 commented Mar 5, 2024

cohttp-eio.Server: Don't blow up in `callback` on client read timeout. #1021

cohttp-eio.Server: Don't blow up in `callback` on client read timeout. #1021

mefyl commented Feb 8, 2024 •

edited

Loading