Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alerting: Use alerting.GrafanaAlertmanager instead of initialising Alertmanager components directly #61230

Merged
merged 5 commits into from
Jan 13, 2023

Conversation

gotjosh
Copy link
Contributor

@gotjosh gotjosh commented Jan 10, 2023

In this PR we're introducing the replacement of Grafana's Alertmanager components with a unified struct directly from https://github.com/grafana/alerting.

Grafana's multi-tenancy and configuration CRUD (e.g. list configurations from the database) remain untouched as those are Grafana-specific operations that do not belong within grafana/alerting

Configuration is injected as part of the struct initialisation to ensure a 1:1 match with what the Alertmanager in here is currently doing. There should be no functional change as part of this PR.

@gotjosh gotjosh changed the title Remove Alertmanager dependency from Grafana Alerting: Remove Alertmanager dependency from Grafana Jan 10, 2023
@grafanabot grafanabot added area/backend pr/external This PR is from external contributor labels Jan 10, 2023
@gotjosh gotjosh force-pushed the alertmanager-to-grafana/alerting branch 4 times, most recently from b1355ee to 234c73e Compare January 11, 2023 17:52
@gotjosh gotjosh added no-backport Skip backport of PR no-changelog Skip including change in changelog/release notes labels Jan 11, 2023
@gotjosh gotjosh added this to the 10.0.0 milestone Jan 11, 2023
@gotjosh gotjosh force-pushed the alertmanager-to-grafana/alerting branch 2 times, most recently from e33ccb6 to 206e2a3 Compare January 12, 2023 18:52
@gotjosh gotjosh force-pushed the alertmanager-to-grafana/alerting branch from 206e2a3 to 339f3eb Compare January 12, 2023 19:39
@gotjosh gotjosh force-pushed the alertmanager-to-grafana/alerting branch from 339f3eb to 297ddc2 Compare January 12, 2023 20:05
@@ -54,6 +54,7 @@ func setupAMTest(t *testing.T) *Alertmanager {
}

func TestPutAlert(t *testing.T) {
t.SkipNow()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain why this no longer passes? Or are there just outdated references?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is asserting on the Alertmanager internals. It uses what the Alertmanager calls the "marker" to assert for alerts that have been received but not processed.

These components are private and are no longer accessible in any form from grafana/alerting (as a consumer shouldn't have a need to access them).

The test was ported 1:1 to grafana/alerting in https://github.com/grafana/alerting/blob/main/alerting/grafana_alertmanager_test.go#L34 so I reckon is safe to skip for now. There are several integration tests that asserts this functionality anyway.

@@ -1,3 +1,4 @@
//nolint:golint,unused
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remember to drop before merge

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't drop it - In order to keep the diff minimal, there are now a lot of attributes and functions that remain unused. My next PR shortly after this one will take care of removing this linter skip and delete all the dead code.

@gotjosh gotjosh changed the title Alerting: Remove Alertmanager dependency from Grafana Alerting: Use alerting.GrafanaAlertmanager instead of initialising Alertmanager components directly Jan 12, 2023
@@ -260,61 +200,48 @@ func (am *Alertmanager) Ready() bool {
// We consider AM as ready only when the config has been
// applied at least once successfully. Until then, some objects
// can still be nil.
am.reloadConfigMtx.RLock()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This no longer needs a lock as Base already has one.

}

// SaveAndApplyDefaultConfig saves the default configuration the database and applies the configuration to the Alertmanager.
// It rollbacks the save if we fail to apply the configuration.
func (am *Alertmanager) SaveAndApplyDefaultConfig(ctx context.Context) error {
am.reloadConfigMtx.Lock()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The diff here is ruined because we need to take Base's lock, which causes nesting - there's no functional change here. We just need to make sure that we lock (with the same lock) all the way down to calling applyConfig which does not use Base's lock.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It preserved it well enough to follow. These lock changes seem straightforward 👍

@@ -325,26 +252,27 @@ func (am *Alertmanager) SaveAndApplyConfig(ctx context.Context, cfg *apimodels.P
return fmt.Errorf("failed to serialize to the Alertmanager configuration: %w", err)
}

am.reloadConfigMtx.Lock()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The diff here is ruined because we need to take Base's lock, which causes nesting - there's no functional change here. We just need to make sure that we lock (with the same lock) all the way down to calling applyConfig which does not use Base's lock.

@@ -355,16 +283,18 @@ func (am *Alertmanager) ApplyConfig(dbCfg *ngmodels.AlertConfiguration) error {
return fmt.Errorf("failed to parse Alertmanager config: %w", err)
}

am.reloadConfigMtx.Lock()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not conflate ApplyConfig with applyConfig - this one also needs a lock because all public functions should be safe to be called concurrently.

if err != nil {
return err
}

// Finally, build the integrations map using the receiver configuration and templates.
integrationsMap, err := am.buildIntegrationsMap(cfg.AlertmanagerConfig.Receivers, tmpl)
err = am.Base.ApplyConfig(AlertingConfiguration{
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This diff is somewhat clear but the gist is that everything where we doing is now done by Base.ApplyConfig - whatever we need from here - we inject via AlertingConfiguration

@@ -179,28 +179,28 @@ func TestMultiOrgAlertmanager_SyncAlertmanagersForOrgsWithFailures(t *testing.T)
{
require.NoError(t, mam.LoadAndSyncAlertmanagersForOrgs(ctx))
require.Len(t, mam.alertmanagers, 3)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We no longer have ready as this was originally designed to be an internal non-safe version of Ready - this now lives in in grafana/alerting.

That aside, there's no reason why this test should rely on ready except to make it a tiny bit faster.

greceivers := make([]*alerting.GrafanaReceiver, 0, len(r.GrafanaManagedReceivers))
for _, gr := range r.PostableGrafanaReceivers.GrafanaManagedReceivers {
var settings map[string]string
//TODO: We shouldn't need to do this marshalling.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as grafana/alerting#38 but inverse.

@@ -80,3 +80,29 @@ func TestProcessNotifierError(t *testing.T) {
require.Equal(t, err, processNotifierError(r, err))
})
}

// TODO: Copied from Alerting, needs to be made public.
Copy link
Contributor Author

@gotjosh gotjosh Jan 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

apimodels "github.com/grafana/grafana/pkg/services/ngalert/api/tooling/definitions"
)

// TODO: We no longer do apimodels at this layer, move it to the API.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issue as is part of a very large, very obvious effort - don't have a concrete plan yet will create one once I do.

@gotjosh gotjosh marked this pull request as ready for review January 12, 2023 21:00
@gotjosh gotjosh requested a review from a team January 12, 2023 21:00
@gotjosh gotjosh removed the pr/external This PR is from external contributor label Jan 12, 2023
Copy link
Contributor

@alexweav alexweav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines 12 to 13
if am.Base.Ready() {
if err := json.Unmarshal(am.Base.GetStatus(), config); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need the lock here? there is a gap between am.Base.Ready() and am.Base.GetStatus(), the config could be concurrently updated there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good shout -- it was a bit confusing, so I've removed the am.Base.Ready() check.

If you look at Base.GetStatus it does hold a lock for getting the status from the configuration - so really all that we need is to execute the call against this check and safely remove the check to Base.Ready() as you correctly guessed it could change between those two calls so we don't gain anything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So to answer your question concretely - no, we don't need a lock and we also don't need the Ready check as we can rely on Base.GetStatus to give us the accurate information.

@gotjosh gotjosh requested a review from alexweav January 13, 2023 02:34
@gotjosh gotjosh enabled auto-merge (squash) January 13, 2023 02:34
@gotjosh gotjosh disabled auto-merge January 13, 2023 02:35
@gotjosh gotjosh merged commit e7cd6eb into main Jan 13, 2023
@gotjosh gotjosh deleted the alertmanager-to-grafana/alerting branch January 13, 2023 16:54
@zerok zerok modified the milestones: 10.0.0, 9.4.0 May 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/backend enterprise-ok no-backport Skip backport of PR no-changelog Skip including change in changelog/release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants