Event loop lag by slot second #6929

Open
nflaig opened this issue Jul 2, 2024 · 1 comment
Labels
scope-performance Performance issue and ideas to improve performance.

Comments

@nflaig
Member

nflaig commented Jul 2, 2024

I've been collecting some data to investigate delays in REST API responses, in addition to the data we get from metrics like #6691. This is from a Holesky beacon node running in a DVT setup with ~250 connected validators.

The data was collected by logging an event whenever event loop lag exceeded 1 second, running this branch: unstable...nflaig/event-loop-delay. All data points are collected on the main thread, so event loop lag in the network thread is not considered. Lag in that thread might still cause delays on APIs that interact with the network, such as getting the peer count or submitting attestations / blocks.
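For readers unfamiliar with this kind of measurement, here is a minimal sketch of how lag above a threshold could be detected and attributed to a slot second. It is not the code from the branch mentioned above. The function names, the 100 ms polling interval, and the choice to attribute lag to the moment the timer was due are all assumptions made for illustration; only the 12-second slot length and the 1-second threshold come from the setup described here.

```typescript
// Illustrative sketch only; names and polling interval are hypothetical.
const SECONDS_PER_SLOT = 12;

/** Second within the slot (0..11) for a wall-clock time, given the chain's genesis time. */
function slotSecond(nowMs: number, genesisTimeMs: number, secondsPerSlot = SECONDS_PER_SLOT): number {
  const elapsedSec = Math.floor((nowMs - genesisTimeMs) / 1000);
  // Double modulo keeps the result non-negative even for times before genesis.
  return ((elapsedSec % secondsPerSlot) + secondsPerSlot) % secondsPerSlot;
}

/** Lag is how much later than scheduled a timer callback actually ran. */
function computeLagMs(expectedMs: number, actualMs: number): number {
  return Math.max(0, actualMs - expectedMs);
}

/** Polls the event loop and reports lag above thresholdMs. Returns a stop function. */
function startLagDetector(
  genesisTimeMs: number,
  onLag: (event: {lagMs: number; slotSecond: number}) => void,
  intervalMs = 100,
  thresholdMs = 1000
): () => void {
  let expectedMs = Date.now() + intervalMs;
  const timer = setInterval(() => {
    const nowMs = Date.now();
    const lagMs = computeLagMs(expectedMs, nowMs);
    if (lagMs > thresholdMs) {
      // Assumption: attribute the lag to the slot second in which the timer was due.
      onLag({lagMs, slotSecond: slotSecond(expectedMs, genesisTimeMs)});
    }
    expectedMs = nowMs + intervalMs;
  }, intervalMs);
  return () => clearInterval(timer);
}
```

Because the detector runs on whichever thread calls `startLagDetector`, a version started on the main thread would not observe lag in a separate network thread, which matches the limitation noted above.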

I created the following diagrams using the data from event-loop-lag-detected.log.

Event Loop Lag: Slot Seconds vs. Delay

This clearly shows the expected lag at second 8 of the slot due to state / epoch transitions. Other seconds of the slot are mostly unaffected by event loop lag, so lag there should have only a marginal effect on API latency (see % distribution below).

image

Percentage of Event Loop Lags per Slot Second

Lags above 1 second also occur mostly at second 8 of the slot.

image

Percentage of Slots with Event Loop Lag > 1 second

Looking at the percentage of slots over the last few days, the number of slots with an event loop lag is relatively low, especially for slot seconds other than 8.

image
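The two percentages shown in the diagrams above could be derived from the logged events roughly as follows. This is a sketch under assumptions: the `LagEvent` record shape is hypothetical (the actual format of event-loop-lag-detected.log is not shown in this issue), and the events are assumed to be parsed already.

```typescript
// Hypothetical record shape for one logged lag event (> 1 second).
interface LagEvent {
  slot: number;
  slotSecond: number; // 0..11
  lagMs: number;
}

/** Percentage of all lag events that fall into each slot second. */
function lagShareBySlotSecond(events: LagEvent[], secondsPerSlot = 12): number[] {
  const counts: number[] = new Array<number>(secondsPerSlot).fill(0);
  for (const e of events) {
    counts[e.slotSecond] = (counts[e.slotSecond] ?? 0) + 1;
  }
  const total = events.length;
  return counts.map((c) => (total === 0 ? 0 : (100 * c) / total));
}

/** Percentage of observed slots that had at least one lag event. */
function percentSlotsWithLag(events: LagEvent[], totalSlots: number): number {
  const slotsWithLag = new Set(events.map((e) => e.slot));
  return totalSlots === 0 ? 0 : (100 * slotsWithLag.size) / totalSlots;
}
```

Note that the first function normalizes by the number of lag events while the second normalizes by the number of slots observed, so the two figures answer different questions and are not directly comparable.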

Conclusion

Based on this data, it seems unlikely that event loop lag has a significant impact on API latency. During slot seconds 8-9, the validator client does not send any requests, and the main tasks on the beacon node side are state and epoch transitions. Tasks like polling validator indices and getting duties happen at the beginning of the first slot of the epoch, where event loop lag is relatively low and should not cause request timeouts, even for really short timeouts like 2 seconds.

Next steps

It would be great if we could visualize similar data points in our metrics. One approach could be to look at event loop utilization (ELU) for certain slot seconds. This would also give us more data to look at if we improve state / epoch transition or block processing, as that should reduce the ELU during those slot seconds, see #6820 (comment).
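As a rough illustration of the ELU idea, the sketch below samples `performance.eventLoopUtilization()` from Node's `perf_hooks` module once per second and averages the result per slot second. The class and function names, the 1-second sampling window, and attributing each window to the slot second in which it started are assumptions for this example, not an existing Lodestar metric.

```typescript
import {performance} from "perf_hooks";

const SECONDS_PER_SLOT = 12;

/** Accumulates ELU samples (0..1) per slot second and reports the mean. */
class EluBySlotSecond {
  private readonly sum: number[] = new Array<number>(SECONDS_PER_SLOT).fill(0);
  private readonly count: number[] = new Array<number>(SECONDS_PER_SLOT).fill(0);

  record(slotSecond: number, utilization: number): void {
    this.sum[slotSecond] = (this.sum[slotSecond] ?? 0) + utilization;
    this.count[slotSecond] = (this.count[slotSecond] ?? 0) + 1;
  }

  averages(): number[] {
    return this.sum.map((s, i) => {
      const n = this.count[i] ?? 0;
      return n === 0 ? 0 : s / n;
    });
  }
}

/** Samples ELU roughly once per second. Returns a stop function. */
function startEluSampler(genesisTimeMs: number, stats: EluBySlotSecond): () => void {
  let last = performance.eventLoopUtilization();
  let windowStartMs = Date.now();
  const timer = setInterval(() => {
    const current = performance.eventLoopUtilization();
    // Passing two readings yields the utilization between them.
    const delta = performance.eventLoopUtilization(current, last);
    const elapsedSec = Math.floor((windowStartMs - genesisTimeMs) / 1000);
    // Assumption: attribute the window to the slot second in which it started.
    stats.record(elapsedSec % SECONDS_PER_SLOT, delta.utilization);
    last = current;
    windowStartMs = Date.now();
  }, 1000);
  return () => clearInterval(timer);
}
```

One limitation of this sketch is that a window spanning a long blocking task covers several slot seconds but is recorded under only one of them, so a real metric would likely need finer-grained or boundary-aligned sampling.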

@nflaig
Member Author

nflaig commented Aug 9, 2024

New data from the latest release (v1.21.0) looks quite a bit better 🎉

Compared to the previous data, event loop lag in the range of 3-4 seconds is less frequent.

image

The next one is interesting: while we improved the lag at second 8 of the slot, it looks like we also have far fewer lags > 1 second in other slots of the epoch.

image

We have ~2% fewer slots with event loop lag > 1 second. The percentage at slot second 8 went up, which is kinda strange, but the lag duration overall is lower, as shown above.

image
