[Bug]: Table locks / slow queries on 0.19.6 betas #4983
What's the time frame the screenshot data is from? Purely after the upgrade? How long did it run? Could just be due to the general DB overload though.
One more thing: after a 30min+ downtime the instance will get hammered by incoming federation queries trying to catch the instance up to the current state of the network. Since our incoming federation is not rate-limited, it is implicitly limited only by resource limits (CPU + DB), which might also appear as general load and cause everything to slow down. So if you're looking at perf issues, make sure that you only start measuring once the federation state is up to date for all incoming instances.
I turned this on ~ 20 minutes after the migrations finished, and startup, after I saw things were going slow. I let it run for maybe 5-10m.
I checked the federation queue using your site after this happened, and it was up to date. IIRC I also tried turning off that separate dedicated federation docker-container, and it was still slow, so it's probably not federation related. This is gonna be a tough one to solve, and we probably need to look at changes to the DB and the post list query function since
That container only handles outgoing federation; the incoming federation phiresky mentioned goes through the main container, which handles HTTP requests.
Yeah, it's a bit harder to check incoming federation state, which is the important part here. Outgoing federation will be idle after downtime. A query like
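The actual query wasn't captured in this thread. As a rough sketch of what such a check could look like (the `received_activity` table and `published` column are assumptions about Lemmy's schema and may differ between versions):

```sql
-- Sketch only: table/column names are assumptions and may not match
-- the schema of every Lemmy version.
-- Most recent incoming activity per originating instance domain;
-- stale timestamps suggest that instance is still catching up.
SELECT split_part(split_part(ap_id, '//', 2), '/', 1) AS instance,
       max(published) AS last_received
FROM received_activity
GROUP BY 1
ORDER BY last_received ASC;
```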
Is this issue still relevant? We've been running 0.19.6-beta1 on lemmy.ml for a while now, and I haven't noticed any problems. This is also the only issue remaining in the 0.19.6 milestone, so once it's closed we can publish the new release.
Yes, beta1 doesn't have any of the DB changes, only that one specific federation commit. So we still have to investigate which commit is causing the slowness. |
The changes to post_view.rs are really trivial, so that can't be the problem:
And for triggers there's only a single change, which also looks very simple. And migrations: image_details is really the only major change between these versions; everything else is minor bug fixes or dependency upgrades.
I wouldn't necessarily assume that a code change is the cause; it might just be lemmy in general handling recovery from downtime poorly (as in, if incoming federation gets hammered recovering from downtime, it might cause compounding slowness everywhere). To test, you could shut down the instance for 30min, or however long it was down before, and just start it again on the same version; I would tentatively expect the same extra load. One reason why I'm saying this is that people have been complaining about lemmy becoming "slower" after every upgrade for multiple releases, and often it seems to be just temporary, in the few hours after the upgrade.
I'm willing to try it again, as long as @Nutomic and someone else is available to help me test. I don't think it's federation, because I tried turning off federation, and it still wasn't usable. But when I say the site was unusable, I mean that it was inaccessible to apps, and the web UI would only intermittently work. 78702b5 (the apub_url trigger changes) is the only one that sticks out to me as a place where something could've gone wrong.
One problem with 0.19.6 is with the migration from #3205. It takes a long time to recalculate all controversial scores. Once Lemmy starts again, postgres runs auto vacuum on the post_aggregates table (probably to regenerate the index). This is quite slow as it also needs to handle API requests at the same time. So maybe we should run vacuum as part of the migration, so it can use the full server CPU. And it would be good if the migration could filter some rows, e.g. posts with one or zero votes. cc @dullbananas
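Running the vacuum manually as part of the upgrade, while the instance is still offline, is plain Postgres. A minimal sketch (the table name comes from the comment above; the exact options are a suggestion, not what Lemmy ships):

```sql
-- Run once right after the migration, before reopening the instance,
-- so autovacuum doesn't have to compete with API traffic later.
-- ANALYZE refreshes planner statistics for the rewritten table.
VACUUM (ANALYZE, VERBOSE) post_aggregates;
```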
But the main problem is that DB queries are still slowing down extremely, so the site becomes unusable within a few minutes after startup. The slow queries are all annotated as
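When queries slow down together like this, it's worth checking whether they are actually blocked on locks rather than just slow. A standard Postgres diagnostic (not Lemmy-specific; uses only built-in system views and functions):

```sql
-- Show sessions that are currently blocked, and which backend PIDs
-- are blocking them, using the built-in pg_blocking_pids() function.
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       state,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
```

An empty result here would point away from table locks and toward plain overload or bad plans.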
Some stats:
Diff to look at: 0.19.5...0.19.6-beta.9-no-triggers
I'm still not convinced it's related to any actual change, rather than just a combination of the migration rewriting a table, destroying the in-memory page cache, and then the downtime causing the server to get hammered with federation requests. Maybe just skip the controversial update? It's mostly eventually consistent anyways without the migration, no?
@phiresky Incoming federation would result in create and update queries, but the stats show only select queries at the top. Besides, if lemmy.ml is down for half an hour then I believe it would take at least half an hour more for other instances to send activities again. But what we saw was no server load on startup, quickly ramping up to 100% server load within a minute. We also would have seen similar problems during previous upgrades, but those were fine. Anyway, if there are no better ideas we can try to make a beta without any migrations so it's easy to revert. If that fails we need to bisect to find the problematic commit, or apply the commits with migrations one by one to see which one causes problems.
Requirements
Summary
Earlier tonight I tried to deploy 0.19.6-beta.6 to lemmy.ml, after having tested various versions of it on voyager.lemmy.ml for a few weeks. Post queries start stalling out pretty quickly, and it becomes unusable. I didn't think we changed anything major with the post queries, so this could be trigger related, or something to do with the site and community aggregates causing locks.
Also the controversial migration does take ~30m and locks things up, but I suppose that's unavoidable, and not too big a deal since it's only run once.
I turned on pg_stat_statements and got this:
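For anyone reproducing this: pg_stat_statements is a stock Postgres extension, and enabling it plus pulling the worst offenders looks roughly like this (column names are the PG 13+ ones; older versions use `mean_time`/`total_time`):

```sql
-- Requires shared_preload_libraries = 'pg_stat_statements' in
-- postgresql.conf, and a server restart, before this works:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top statements by cumulative execution time.
SELECT calls,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       round(total_exec_time::numeric, 2) AS total_ms,
       left(query, 100)                   AS query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- Reset counters, e.g. to start measuring only after
-- federation catch-up has settled down:
SELECT pg_stat_statements_reset();
```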
For now I restored lemmy.ml from the backup I took before.
cc @dullbananas @phiresky @Nutomic
Steps to Reproduce
NA
Technical Details
NA
Version
0.19.6
Lemmy Instance URL
voyager.lemmy.ml