Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding Statuscode Handler to Agent Health Extension #1423

Open
wants to merge 49 commits into
base: main
Choose a base branch
from

Conversation

Paramadon
Copy link
Contributor

@Paramadon Paramadon commented Nov 12, 2024


Description of the Issue

The CloudWatch Agent lacked an agenthealth configuration to monitor API health by tracking HTTP status codes for specific responses. Monitoring the status codes (200, 400, 408, 419, and 429) across all APIs is critical for diagnosing issues and ensuring comprehensive observability of API behaviors. Without this configuration, users were unable to quickly identify trends in API health metrics or correlate specific status codes with performance issues.


Changes Made

  • Added agenthealth Configuration:

    • Configured the CloudWatch Agent to track the following status codes across all APIs:
      • 200: Success
      • 400: Bad Request
      • 408: Request Timeout
      • 419: Authentication Timeout
      • 429: Too Many Requests
    • This ensures that health metrics for these common response codes are captured and reported.
  • Updated CloudWatch Agent Configuration JSON:

    • Introduced a new section under the metrics collection configuration to specify agenthealth metrics.
    • Ensured compatibility with existing metrics and logs collection by integrating the new configuration seamlessly.
  • Validated the Configuration:

    • Tested the updated configuration with the CloudWatch Agent to ensure the new metrics are collected and reported correctly.
    • Confirmed that the metrics appear as expected in CloudWatch for all specified APIs.

Impact

  • Provides real-time visibility into API status codes, helping identify anomalies or patterns that indicate issues.
  • Enhances observability for critical services by adding detailed health metrics.

Testing

  • Deployed the updated CloudWatch Agent health configuraion
  • Verified that metrics for status codes 200, 400, 408, 419, and 429 are captured in the header as you can see from the image below:
Screenshot 2024-11-15 at 4 36 21 PM

@Paramadon Paramadon requested a review from a team as a code owner November 12, 2024 20:47
@Paramadon Paramadon changed the title fixing issue Adding Agent Health Extension Nov 15, 2024
@Paramadon Paramadon changed the title Adding Agent Health Extension Adding Agent Health Extension Statuscode handler Nov 19, 2024
@Paramadon Paramadon changed the title Adding Agent Health Extension Statuscode handler Adding Agent Health Extension Statuscode Handler Nov 19, 2024
@Paramadon Paramadon changed the title Adding Agent Health Extension Statuscode Handler Adding Statuscode Handler to Agent Health Extension Nov 19, 2024
Copy link
Contributor

@mitali-salvi mitali-salvi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you fix the linting issues ?

@@ -12,6 +12,7 @@ import (
type Config struct {
IsUsageDataEnabled bool `mapstructure:"is_usage_data_enabled"`
Stats agent.StatsConfig `mapstructure:"stats"`
StatusCodeOnly *bool `mapstructure:"is_status_code_only,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick - can be renamed to is_only_status_code_enabled
also keep it consistent with IsUsageDataEnabled with IsOnlyStatusCodeEnabled

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also any reason to make it a pointer ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bool with omitempty in the tag is enough.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed with mitali. Instead of a pointer, can we treat the default value as false?

Additionally, can you explain a bit more what this field is used for? It seems like a way to set the agent health extension such that it turns the other things off. I think that may be confusing because you could set IsUsageDataEnabled to true and also set StatusCodeOnly to true, so it's not clear what it would actually do (without looking at the implementation).

Copy link
Contributor Author

@Paramadon Paramadon Nov 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. I will rename it. The pointer is used so we can emit it even if it's empty (nil). As for its functionality, I can add a comment, but the main purpose is to ensure that we only use the statuscodehandler when calling the agenthealthextension for a processor, exporter, or receiver, without including other handlers like processstats, clientstats, etc.

extension/agenthealth/handler/stats/agent/agent.go Outdated Show resolved Hide resolved
extension/agenthealth/handler/stats/agent/agent.go Outdated Show resolved Hide resolved
extension/agenthealth/handler/stats/agent/agent.go Outdated Show resolved Hide resolved
extension/agenthealth/handler/stats/provider/interval.go Outdated Show resolved Hide resolved
extension/agenthealth/handler/stats/agent/agent_test.go Outdated Show resolved Hide resolved
extension/agenthealth/handler/stats/agent/agent.go Outdated Show resolved Hide resolved
extension/agenthealth/handler/stats/handler.go Outdated Show resolved Hide resolved
extension/agenthealth/handler/stats/provider/statuscode.go Outdated Show resolved Hide resolved
extension/agenthealth/handler/stats/provider/statuscode.go Outdated Show resolved Hide resolved
internal/ecsservicediscovery/containerinstanceprocessor.go Outdated Show resolved Hide resolved
extension/agenthealth/handler/stats/provider/statuscode.go Outdated Show resolved Hide resolved
@@ -12,6 +12,7 @@ import (
type Config struct {
IsUsageDataEnabled bool `mapstructure:"is_usage_data_enabled"`
Stats agent.StatsConfig `mapstructure:"stats"`
StatusCodeOnly *bool `mapstructure:"is_status_code_only,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bool with omitempty in the tag is enough.

cmd/amazon-cloudwatch-agent/amazon-cloudwatch-agent.go Outdated Show resolved Hide resolved
extension/agenthealth/handler/stats/provider/interval.go Outdated Show resolved Hide resolved
extension/agenthealth/handler/stats/provider/interval.go Outdated Show resolved Hide resolved
extension/agenthealth/handler/stats/agent/agent.go Outdated Show resolved Hide resolved
extension/agenthealth/config.go Show resolved Hide resolved
extension/agenthealth/handler/stats/agent/agent.go Outdated Show resolved Hide resolved
extension/agenthealth/handler/stats/provider/statuscode.go Outdated Show resolved Hide resolved
extension/agenthealth/handler/stats/provider/statuscode.go Outdated Show resolved Hide resolved
@@ -104,6 +127,26 @@ func (of OperationsFilter) IsAllowed(operationName string) bool {
return of.allowAll || of.operations.Contains(operationName)
}

var StatusCodeOperations = []string{ // all the operations that are allowed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're missing AssumeRole here

@@ -59,6 +70,10 @@ func (t *translator) Translate(conf *confmap.Conf) (component.Config, error) {
if usageData, ok := common.GetBool(conf, common.ConfigKey(common.AgentKey, usageDataKey)); ok {
cfg.IsUsageDataEnabled = cfg.IsUsageDataEnabled && usageData
}
cfg.IsStatusCodeEnabled = t.statuscodeonly
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this saying we only have status code when it's status code only? Why couldn't it do more than one thing? I don't know this package very well, but this behavior is slightly different than how "usage data" is enabled.

@@ -59,6 +70,10 @@ func (t *translator) Translate(conf *confmap.Conf) (component.Config, error) {
if usageData, ok := common.GetBool(conf, common.ConfigKey(common.AgentKey, usageDataKey)); ok {
cfg.IsUsageDataEnabled = cfg.IsUsageDataEnabled && usageData
}
cfg.IsStatusCodeEnabled = t.statuscodeonly
if t.statuscodeonly != false && t.statuscodeonly {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

redundant conditions

Suggested change
if t.statuscodeonly != false && t.statuscodeonly {
if t.statuscodeonly {

}

if agentStatsEnabled {
clientStats := client.NewHandler(agent.NewOperationsFilter())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes the behavior of the client stats handler. None of the operations are allowlisted in this case because nothing is passed in.

}
}

func NewStatusCodeAndOtherOperationsFilter(operations []string) OperationsFilter {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't see a use case for this.

UsageFlags map[Flag]any `mapstructure:"usage_flags,omitempty"`
// NewStatusCodeOperationsFilter creates a new filter for allowed operations and status codes.
func NewStatusCodeOperationsFilter() OperationsFilter {
allowed := make(map[string]struct{}, len(StatusCodeOperations))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just an empty map that's the size of the StatusCodeOperations.

extension/agenthealth/extension.go Show resolved Hide resolved
@@ -77,6 +75,6 @@ func TestProcessStats(t *testing.T) {
mock.mu.Unlock()
provider.refresh()
assert.Eventually(t, func() bool {
return provider.getStats() == agent.Stats{}
return len(provider.getStats().StatusCodes) == 0 // map isn't comparable, so we check the length
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this checking the status codes? The point of this is to check that the stats it gets have been reset. If we're only checking the status codes, which are never set in this test, then it's a false positive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test I think is already flaky and is skipped but I added a function that will check if all attributes were reset.

handler.filter = opsFilter
}
handler.startResetTimer()
statusCodeSingleton = handler
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure we want to have the handler be a singleton? If we have locks, that means that every single API call the handler is attached to will be blocked until the locks are released. IMO the stats.Provider is the only one that needs to be a singleton. The handler doesn't. They can be separate.

Comment on lines +114 to +115
case "PutMetricData":
return "pmd"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is PutMetricData here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants