-
Notifications
You must be signed in to change notification settings - Fork 208
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding Statuscode Handler to Agent Health Extension #1423
base: main
Are you sure you want to change the base?
Conversation
…s from status codes
9465711
to
5dc40f4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you fix the linting issues ?
extension/agenthealth/config.go
Outdated
@@ -12,6 +12,7 @@ import ( | |||
type Config struct { | |||
IsUsageDataEnabled bool `mapstructure:"is_usage_data_enabled"` | |||
Stats agent.StatsConfig `mapstructure:"stats"` | |||
StatusCodeOnly *bool `mapstructure:"is_status_code_only,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick - can be renamed to is_only_status_code_enabled
also keep it consistent with IsUsageDataEnabled with IsOnlyStatusCodeEnabled
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also any reason to make it a pointer ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bool with omitempty
in the tag is enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed with mitali. Instead of a pointer, can we treat the default value as false?
Additionally, can you explain a bit more what this field is used for? It seems like a way to set the agent health extension such that it turns the other things off. I think that may be confusing because you could set IsUsageDataEnabled
to true and also set StatusCodeOnly
to true, so it's not clear what it would actually do (without looking at the implementation).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. I will rename it. The pointer is used so we can emit it even if it's empty (nil). As for its functionality, I can add a comment, but the main purpose is to ensure that we only use the statuscodehandler when calling the agenthealthextension for a processor, exporter, or receiver, without including other handlers like processstats, clientstats, etc.
extension/agenthealth/config.go
Outdated
@@ -12,6 +12,7 @@ import ( | |||
type Config struct { | |||
IsUsageDataEnabled bool `mapstructure:"is_usage_data_enabled"` | |||
Stats agent.StatsConfig `mapstructure:"stats"` | |||
StatusCodeOnly *bool `mapstructure:"is_status_code_only,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bool with omitempty
in the tag is enough.
@@ -104,6 +127,26 @@ func (of OperationsFilter) IsAllowed(operationName string) bool { | |||
return of.allowAll || of.operations.Contains(operationName) | |||
} | |||
|
|||
var StatusCodeOperations = []string{ // all the operations that are allowed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you're missing AssumeRole
here
@@ -59,6 +70,10 @@ func (t *translator) Translate(conf *confmap.Conf) (component.Config, error) { | |||
if usageData, ok := common.GetBool(conf, common.ConfigKey(common.AgentKey, usageDataKey)); ok { | |||
cfg.IsUsageDataEnabled = cfg.IsUsageDataEnabled && usageData | |||
} | |||
cfg.IsStatusCodeEnabled = t.statuscodeonly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this saying we only have status code when it's status code only? Why couldn't it do more than one thing? I don't know this package very well, but this behavior is slightly different than how "usage data" is enabled.
@@ -59,6 +70,10 @@ func (t *translator) Translate(conf *confmap.Conf) (component.Config, error) { | |||
if usageData, ok := common.GetBool(conf, common.ConfigKey(common.AgentKey, usageDataKey)); ok { | |||
cfg.IsUsageDataEnabled = cfg.IsUsageDataEnabled && usageData | |||
} | |||
cfg.IsStatusCodeEnabled = t.statuscodeonly | |||
if t.statuscodeonly != false && t.statuscodeonly { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
redundant conditions
if t.statuscodeonly != false && t.statuscodeonly { | |
if t.statuscodeonly { |
} | ||
|
||
if agentStatsEnabled { | ||
clientStats := client.NewHandler(agent.NewOperationsFilter()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This changes the behavior of the client stats handler. None of the operations are allowlisted in this case because nothing is passed in.
} | ||
} | ||
|
||
func NewStatusCodeAndOtherOperationsFilter(operations []string) OperationsFilter { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't see a use case for this.
UsageFlags map[Flag]any `mapstructure:"usage_flags,omitempty"` | ||
// NewStatusCodeOperationsFilter creates a new filter for allowed operations and status codes. | ||
func NewStatusCodeOperationsFilter() OperationsFilter { | ||
allowed := make(map[string]struct{}, len(StatusCodeOperations)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just an empty map that's the size of the StatusCodeOperations
.
@@ -77,6 +75,6 @@ func TestProcessStats(t *testing.T) { | |||
mock.mu.Unlock() | |||
provider.refresh() | |||
assert.Eventually(t, func() bool { | |||
return provider.getStats() == agent.Stats{} | |||
return len(provider.getStats().StatusCodes) == 0 // map isn't comparable, so we check the length |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this checking the status codes? The point of this is to check that the stats it gets have been reset. If we're only checking the status codes, which are never set in this test, then it's a false positive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test I think is already flaky and is skipped but I added a function that will check if all attributes were reset.
handler.filter = opsFilter | ||
} | ||
handler.startResetTimer() | ||
statusCodeSingleton = handler |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure we want to have the handler be a singleton? If we have locks, that means that every single API call the handler is attached to will be blocked until the locks are released. IMO the stats.Provider
is the only one that needs to be a singleton. The handler doesn't. They can be separate.
case "PutMetricData": | ||
return "pmd" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is PutMetricData
here?
Description of the Issue
The CloudWatch Agent lacked an
agenthealth
configuration to monitor API health by tracking HTTP status codes for specific responses. Monitoring the status codes (200
,400
,408
,419
, and429
) across all APIs is critical for diagnosing issues and ensuring comprehensive observability of API behaviors. Without this configuration, users were unable to quickly identify trends in API health metrics or correlate specific status codes with performance issues.Changes Made
Added
agenthealth
Configuration:200
: Success400
: Bad Request408
: Request Timeout419
: Authentication Timeout429
: Too Many RequestsUpdated CloudWatch Agent Configuration JSON:
agenthealth
metrics.Validated the Configuration:
Impact
Testing
200
,400
,408
,419
, and429
are captured in the header as you can see from the image below: