Consider adding a config option to specify failure recovery policy #112

cam-schultz · 2023-12-14T22:02:12Z

cam-schultz
Dec 14, 2023
Maintainer

Context and scope
AWM Relayer spins up a listener goroutine for each source chain specified in the config. Currently, when an unrecoverable error occurs in any of these goroutines, the application is marked as unhealthy, which most often results in the application being killed and restarted.

If the cause of the unrecoverable error is isolated to a single chain (e.g. the configured API node for that chain becomes unreachable), the relayer as a whole is still marked unhealthy. This may be desirable for some use cases, but in others there may be flexibility to allow for downtime on one chain while still relaying between the others.

Discussion and alternatives
One possible solution would be to add a per-source-chain configuration option that specifies a failure policy, e.g. kill_on_error | allow_failure. For example, a user could set the Chain A config to kill_on_error so that the relayer ceases to function altogether if Chain A becomes unreachable (or otherwise produces critical relayer errors), while setting Chain B to allow_failure so that Chain B can fail without interrupting the rest of the relayer process.

For the allow_failure case, we should make it very obvious that the relayer is in a valid but not fully functional state.
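
To make the proposal concrete, here is a minimal Go sketch of how a per-source-chain failure policy might feed into the health status. Everything in it is an assumption for illustration: FailurePolicy, SourceChainConfig, Health, ReportFailure, Status, and the "degraded" status string are hypothetical names, not the relayer's existing types or API. Only the kill_on_error and allow_failure values come from the proposal above.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// FailurePolicy is a hypothetical per-source-chain setting. The two values
// mirror the kill_on_error | allow_failure options proposed above.
type FailurePolicy string

const (
	KillOnError  FailurePolicy = "kill_on_error"
	AllowFailure FailurePolicy = "allow_failure"
)

// SourceChainConfig is a hypothetical stand-in for a source chain's config.
type SourceChainConfig struct {
	ChainID       string
	FailurePolicy FailurePolicy
}

// ErrUnreachable simulates an unrecoverable, chain-specific error.
var ErrUnreachable = errors.New("source chain API node unreachable")

// Health aggregates failures reported by the listener goroutines.
type Health struct {
	mu       sync.Mutex
	killed   bool             // set once a kill_on_error chain has failed
	degraded map[string]error // failed allow_failure chains, keyed by chain ID
}

func NewHealth() *Health {
	return &Health{degraded: make(map[string]error)}
}

// ReportFailure records an unrecoverable error for one source chain.
// Any policy other than allow_failure (including an unset one) is treated
// as kill_on_error, which matches the current all-or-nothing behavior.
func (h *Health) ReportFailure(cfg SourceChainConfig, err error) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if cfg.FailurePolicy == AllowFailure {
		h.degraded[cfg.ChainID] = err
		return
	}
	h.killed = true
}

// Status returns "unhealthy", "degraded", or "healthy". The separate
// "degraded" value is one way to make the partially functional state obvious.
func (h *Health) Status() string {
	h.mu.Lock()
	defer h.mu.Unlock()
	switch {
	case h.killed:
		return "unhealthy"
	case len(h.degraded) > 0:
		return "degraded"
	default:
		return "healthy"
	}
}

func main() {
	h := NewHealth()
	chainA := SourceChainConfig{ChainID: "chain-a", FailurePolicy: KillOnError}
	chainB := SourceChainConfig{ChainID: "chain-b", FailurePolicy: AllowFailure}

	h.ReportFailure(chainB, ErrUnreachable) // Chain B fails, relayer keeps running
	fmt.Println(h.Status())

	h.ReportFailure(chainA, ErrUnreachable) // Chain A fails, relayer should stop
	fmt.Println(h.Status())
}
```

Defaulting unknown policies to kill_on_error is a design choice in this sketch: it keeps existing configs behaving as they do today, so allow_failure is strictly opt-in.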

geoff-vball · 2023-12-14T23:54:23Z

geoff-vball
Dec 14, 2023
Collaborator

What if you could specify a per-chain config value for reconnect_timeout? The relayer would try to reconnect for that duration (currently a static 10 seconds), then kill the goroutine and set the status to unhealthy. We could also change the health endpoint to give finer-grained detail, but I'm not sure what the lift would be like on the infra side.
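
A rough Go sketch of the reconnect_timeout idea follows, under stated assumptions: reconnectWithTimeout, its parameters, the retry interval, and the simulated connect function are all hypothetical and are not taken from the relayer's code. Only the 10 second default comes from the comment above.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// defaultReconnectTimeout reflects the static 10 seconds mentioned above.
const defaultReconnectTimeout = 10 * time.Second

// errSimulated stands in for a failed connection attempt.
var errSimulated = errors.New("simulated connection failure")

// reconnectWithTimeout is a hypothetical helper. It calls connect repeatedly,
// sleeping retryInterval between attempts, until connect succeeds or another
// retry would run past the timeout. On timeout it returns the last error
// wrapped, at which point the caller would stop the listener goroutine and
// mark the relayer unhealthy.
func reconnectWithTimeout(timeout, retryInterval time.Duration, connect func() error) error {
	deadline := time.Now().Add(timeout)
	for {
		err := connect()
		if err == nil {
			return nil
		}
		if time.Now().Add(retryInterval).After(deadline) {
			return fmt.Errorf("reconnect timed out after %s: %w", timeout, err)
		}
		time.Sleep(retryInterval)
	}
}

func main() {
	// Simulated endpoint that fails twice and then recovers.
	attempts := 0
	connect := func() error {
		attempts++
		if attempts < 3 {
			return errSimulated
		}
		return nil
	}

	err := reconnectWithTimeout(defaultReconnectTimeout, 100*time.Millisecond, connect)
	fmt.Println("attempts:", attempts, "err:", err)
}
```

In a per-chain setup, the timeout argument would come from that chain's reconnect_timeout value, falling back to the current 10 seconds when it is unset.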
