
[k8s] Pod metrics is gone when using containerd as runtime #188

Open
Tracked by #5
pingleig opened this issue Mar 21, 2021 · 11 comments · Fixed by #189

pingleig commented Mar 21, 2021

This is exported from an internal ticket.

TL;DR

The latest image has been released. If you were using the temp image from this comment #188 (comment), please update to the latest tag.

If the error message W! No pod metric collected, metrics count is still 7 is containerd socket mounted? https://github.com/aws/amazon-cloudwatch-agent/issues/188 leads you to this issue:

  • make sure you have updated the YAML to mount the containerd socket into the cloudwatch-agent pod
  • the path for the containerd socket may not be in the standard location, e.g. Bottlerocket uses /run/dockershim.sock instead of /run/containerd/containerd.sock (a quick check is sketched right after this list)
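
A quick way to confirm which socket is actually visible from inside the cloudwatch-agent container is a small probe like the one below. This is only a sketch: the candidate paths are the ones discussed in this issue, and the program uses just the Go standard library, it is not part of the agent.

package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

// Candidate socket paths as seen from inside the container. Which one exists
// depends on how the volumes/volumeMounts sections of the DaemonSet map the
// host socket (see the manifests below).
var candidates = []string{
	"/run/containerd/containerd.sock",
	"/run/dockershim.sock",
}

func main() {
	for _, p := range candidates {
		if _, err := os.Stat(p); err != nil {
			fmt.Printf("%s: not present (%v)\n", p, err)
			continue
		}
		conn, err := net.DialTimeout("unix", p, 2*time.Second)
		if err != nil {
			fmt.Printf("%s: present but not connectable (%v)\n", p, err)
			continue
		}
		conn.Close()
		fmt.Printf("%s: mounted and connectable\n", p)
	}
}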

Background

We were relying on the pause container being named POD to detect pods, which is the case for Docker but not for containerd, see containerd/cri#922 (comment).

Users will not see pod metrics in the Container Insights dashboard, and they will find the following log message, which was introduced in #171:

log.Printf("W! No pod metric collected, metrics count is still %d", beforePod)

The root cause is that we expect containerName == 'POD' to mark a cgroup path as a pod:

if containerName != infraContainerName {
	tags[ContainerNamekey] = containerName
	tags[ContainerIdkey] = path.Base(info.Name)
	containerType = TypeContainer
} else {
	// NOTE: the pod here is only used by NetMetricExtractor,
	// other pod info like CPU, Mem are dealt with in processPod.
	containerType = TypePod
}

Fix
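
The change itself is in #189. As a rough illustration of the general idea only (my hedged sketch, not the code from that PR): since containerd has no container literally named POD, the pod level has to be inferred from the cgroup path itself rather than from the infra container name.

package main

import (
	"fmt"
	"regexp"
)

// Assumption for illustration: with the cgroupfs driver, pod-level cgroup
// paths look like /kubepods/<qos>/pod<uid>, while container-level paths have
// one extra component (the container id). The exact pattern used by the agent
// may differ (e.g. the systemd cgroup driver uses kubepods.slice).
var podPathRe = regexp.MustCompile(`^/kubepods(/[^/]+)?/pod[0-9a-fA-F_-]+$`)

func isPodPath(cgroupPath string) bool {
	return podPathRe.MatchString(cgroupPath)
}

func main() {
	fmt.Println(isPodPath("/kubepods/burstable/pod1c4f8a2e-0000-1111-2222-333344445555"))          // true  -> TypePod
	fmt.Println(isPodPath("/kubepods/burstable/pod1c4f8a2e-0000-1111-2222-333344445555/0a1b2c3d")) // false -> TypeContainer
}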

Release

The fix will be included in the next release; the release date is not determined yet.

@pingleig pingleig added bug Something isn't working component/container-insight area/k8s Kubernetes labels Mar 21, 2021
@pingleig pingleig self-assigned this Mar 21, 2021

pingleig commented Mar 22, 2021

Created a temp image based on #189: public.ecr.aws/p5m3p1a7/cwagent-k8s-containerd-pod:0.1 (the latest official release now contains this fix). The daemonset yaml needs to be updated to mount /run/containerd/containerd.sock.

NOTE: If you are using Bottlerocket on EKS, the socket path on the host is different due to bottlerocket-os/bottlerocket@91810c8. You need to (and only need to) replace the volumes part to pick the right socket on the host. (The full snippet is at the end of this comment.)

      volumes:
        # ...
        - name: containerdsock
          hostPath:
            # path: /run/containerd/containerd.sock
            # bottlerocket does not mount containerd sock at normal place
            # https://github.com/bottlerocket-os/bottlerocket/commit/91810c85b83ff4c3660b496e243ef8b55df0973b
            path: /run/dockershim.sock

Default containerd path

When the host (and kubelet) is using /run/containerd/containerd.sock:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cloudwatch-agent
  namespace: amazon-cloudwatch
spec:
  selector:
    matchLabels:
      name: cloudwatch-agent
  template:
    metadata:
      labels:
        name: cloudwatch-agent
    spec:
      containers:
        - name: cloudwatch-agent
          image: public.ecr.aws/p5m3p1a7/cwagent-k8s-containerd-pod:0.1
          imagePullPolicy: Always
          #ports:
          #  - containerPort: 8125
          #    hostPort: 8125
          #    protocol: UDP
          resources:
            limits:
              cpu: 200m
              memory: 200Mi
            requests:
              cpu: 200m
              memory: 200Mi
          # Please don't change below envs
          env:
            - name: HOST_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
            - name: HOST_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: K8S_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: CI_VERSION
              value: "k8s/1.3.0"
          # Please don't change the mountPath
          volumeMounts:
            - name: cwagentconfig
              mountPath: /etc/cwagentconfig
            - name: rootfs
              mountPath: /rootfs
              readOnly: true
            - name: dockersock
              mountPath: /var/run/docker.sock
              readOnly: true
            - name: varlibdocker
              mountPath: /var/lib/docker
              readOnly: true
            - name: containerdsock
              mountPath: /run/containerd/containerd.sock
              readOnly: true
            - name: sys
              mountPath: /sys
              readOnly: true
            - name: devdisk
              mountPath: /dev/disk
              readOnly: true
      volumes:
        - name: cwagentconfig
          configMap:
            name: cwagentconfig
        - name: rootfs
          hostPath:
            path: /
        - name: dockersock
          hostPath:
            path: /var/run/docker.sock
        - name: varlibdocker
          hostPath:
            path: /var/lib/docker
        - name: containerdsock
          hostPath:
            path: /run/containerd/containerd.sock
        - name: sys
          hostPath:
            path: /sys
        - name: devdisk
          hostPath:
            path: /dev/disk/
      terminationGracePeriodSeconds: 60
      serviceAccountName: cloudwatch-agent

Non-default containerd path

NOTE: You only need to change the volumes; when mounting into the cloudwatch-agent container, you should still mount it at the default path.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cloudwatch-agent
  namespace: amazon-cloudwatch
spec:
  selector:
    matchLabels:
      name: cloudwatch-agent
  template:
    metadata:
      labels:
        name: cloudwatch-agent
    spec:
      # aws eks update-kubeconfig --name eks-pod-metric-missing --region us-west-2
      containers:
        - name: cloudwatch-agent
          image: public.ecr.aws/p5m3p1a7/cwagent-k8s-containerd-pod:0.1
          imagePullPolicy: Always
          #ports:
          #  - containerPort: 8125
          #    hostPort: 8125
          #    protocol: UDP
          resources:
            limits:
              cpu: 200m
              memory: 200Mi
            requests:
              cpu: 200m
              memory: 200Mi
          # Please don't change below envs
          env:
            - name: HOST_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
            - name: HOST_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: K8S_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: CI_VERSION
              value: "k8s/1.3.0"
          # Please don't change the mountPath
          volumeMounts:
            - name: cwagentconfig
              mountPath: /etc/cwagentconfig
            - name: rootfs
              mountPath: /rootfs
              readOnly: true
            - name: dockersock
              mountPath: /var/run/docker.sock
              readOnly: true
            - name: varlibdocker
              mountPath: /var/lib/docker
              readOnly: true
            - name: containerdsock
              mountPath: /run/containerd/containerd.sock
              readOnly: true
            - name: sys
              mountPath: /sys
              readOnly: true
            - name: devdisk
              mountPath: /dev/disk
              readOnly: true
      volumes:
        - name: cwagentconfig
          configMap:
            name: cwagentconfig
        - name: rootfs
          hostPath:
            path: /
        - name: dockersock
          hostPath:
            path: /var/run/docker.sock
        - name: varlibdocker
          hostPath:
            path: /var/lib/docker
        - name: containerdsock
          hostPath:
            # path: /run/containerd/containerd.sock
            # Bottlerocket does not mount the containerd sock at the normal place
            # https://github.com/bottlerocket-os/bottlerocket/commit/91810c85b83ff4c3660b496e243ef8b55df0973b
            path: /run/dockershim.sock
        - name: sys
          hostPath:
            path: /sys
        - name: devdisk
          hostPath:
            path: /dev/disk/
      terminationGracePeriodSeconds: 60
      serviceAccountName: cloudwatch-agent

pingleig added a commit to pingleig/amazon-cloudwatch-agent that referenced this issue Mar 22, 2021
@pingleig

Another known issue: because we are using cadvisor, pod-level filesystem usage is ignored:

    "container_filesystem_available",
    "container_filesystem_capacity",
    "container_filesystem_usage",
    "container_filesystem_utilization"

https://github.com/google/cadvisor/blob/291c215c5ddc5216659b5e793a98a0ba9f104afb/container/containerd/handler.go#L163-L167

func (h *containerdContainerHandler) GetSpec() (info.ContainerSpec, error) {
	// TODO: Since we dont collect disk usage stats for containerd, we set hasFilesystem
	// to false. Revisit when we support disk usage stats for containerd
	hasFilesystem := false
	spec, err := common.GetSpec(h.cgroupPaths, h.machineInfoFactory, h.needNet(), hasFilesystem)
	spec.Labels = h.labels
	spec.Envs = h.envs
	spec.Image = h.image

	return spec, err
}

pingleig commented Mar 23, 2021

NOTE: container filesystem usage is not provided after switching to containerd (google/cadvisor#2785).

Created another issue to track the container filesystem metrics #192

pingleig commented Mar 31, 2021

Reopening this issue since we are still in the release process, and the official Container Insights public doc plus sample manifest have not been updated yet.

@fitchtech

This needs to be fixed in the official Helm chart for EKS as well:
https://github.com/aws/eks-charts/blob/master/stable/aws-cloudwatch-metrics/templates/daemonset.yaml

fitchtech commented Aug 21, 2021

@pingleig I have tried applying the fix listed above exactly as is on EKS with the containerd runtime enabled. However, I'm still getting the same error messages:

2021-08-21T00:08:59Z I! [processors.ec2tagger] ec2tagger: Initial retrieval of tags succeded
2021-08-21T00:08:59Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started, finished initial retrieval of tags and Volumes
2021-08-21T00:09:00Z W! No pod metric collected, metrics count is still 5 is containerd socket mounted? #188
2021-08-21T00:09:05Z W! [outputs.cloudwatchlogs] Invalid SequenceToken used, will use new token and retry: The given sequenceToken is invalid. The next expected sequenceToken is: 49605661750447750614958043896578931231172344896032866930
2021-08-21T00:09:05Z W! [outputs.cloudwatchlogs] Retried 0 time, going to sleep 105.761168ms before retrying.

Support for containerd runtime on EKS was added in July when EKS 1.21 was released.
https://aws.amazon.com/blogs/containers/amazon-eks-1-21-released/

@pingleig

@fitchtech The containerd socket on the host is at a different path (same as Bottlerocket). This is the PR for the EKS AMI: https://github.com/awslabs/amazon-eks-ami/pull/698/files, and the config file https://github.com/awslabs/amazon-eks-ami/blob/8450297eb2ef87fe5cbbce52a86ddcdc8b2e6716/files/containerd-config.toml#L1-L6 sets:

[grpc]
address = "/run/dockershim.sock"

You can follow the non-default path example in #188 (comment):

          hostPath:
            # path: /run/containerd/containerd.sock
            # Bottlerocket does not mount the containerd sock at the normal place
            # https://github.com/bottlerocket-os/bottlerocket/commit/91810c85b83ff4c3660b496e243ef8b55df0973b
            path: /run/dockershim.sock

cc @sethAmazon: since both EKS EC2 and Bottlerocket are using /run/dockershim.sock, we may change this to the default. Though I was testing with kops at the time, which uses /run/containerd/containerd.sock, so I am not sure if it's possible to have one example manifest that works for both. It should be doable for Helm, though.
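
As a sketch of how a chart could make this configurable (hypothetical value name, not the actual aws-cloudwatch-metrics chart schema):

# values.yaml
containerdSockHostPath: /run/containerd/containerd.sock   # /run/dockershim.sock on Bottlerocket and newer EKS AMIs

# templates/daemonset.yaml, volumes section
        - name: containerdsock
          hostPath:
            path: {{ .Values.containerdSockHostPath }}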

@pingleig pingleig reopened this Aug 21, 2021
@pingleig pingleig pinned this issue Aug 21, 2021

fitchtech commented Aug 21, 2021

@pingleig that worked, thank you. One additional change I had to make was to enable hostNetwork, because the EC2 instances in my EKS 1.21 node group have the Instance Metadata Service (IMDS) restricted per the EKS security best practices. You have to set hostNetwork: true for it to be able to start up. Once I did, everything loaded in the Container Insights console.
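
For reference, that is a one-field change on the pod spec of the DaemonSet manifest above (excerpt only, the rest of the manifest is assumed unchanged):

spec:
  template:
    spec:
      # Lets ec2tagger reach the instance metadata service via the host's
      # network namespace when pod access to IMDS is restricted on the nodes.
      hostNetwork: true
      containers:
        - name: cloudwatch-agent
          # ...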

With hostNetwork: false I get this

2021/08/21 07:23:59 I! Config has been translated into TOML /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml 
2021-08-21T07:23:59Z I! Starting AmazonCloudWatchAgent 1.247349.0
2021-08-21T07:23:59Z I! Loaded inputs: k8sapiserver cadvisor
2021-08-21T07:23:59Z I! Loaded aggregators: 
2021-08-21T07:23:59Z I! Loaded processors: ec2tagger k8sdecorator
2021-08-21T07:23:59Z I! Loaded outputs: cloudwatchlogs
2021-08-21T07:23:59Z I! Tags enabled: 
2021-08-21T07:23:59Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"ip-10-106-12-9.ec2.internal", Flush Interval:1s
2021-08-21T07:23:59Z I! [logagent] starting
2021-08-21T07:23:59Z I! [logagent] found plugin cloudwatchlogs is a log backend

With hostNetwork: true

2021/08/21 07:28:18 I! Config has been translated into TOML /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml 
2021-08-21T07:28:18Z I! Starting AmazonCloudWatchAgent 1.247349.0
2021-08-21T07:28:18Z I! Loaded inputs: cadvisor k8sapiserver
2021-08-21T07:28:18Z I! Loaded aggregators: 
2021-08-21T07:28:18Z I! Loaded processors: ec2tagger k8sdecorator
2021-08-21T07:28:18Z I! Loaded outputs: cloudwatchlogs
2021-08-21T07:28:18Z I! Tags enabled: 
2021-08-21T07:28:18Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"ip-10-106-12-9.ec2.internal", Flush Interval:1s
2021-08-21T07:28:18Z I! [logagent] starting
2021-08-21T07:28:18Z I! [logagent] found plugin cloudwatchlogs is a log backend
2021-08-21T07:28:18Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started initialization.
2021-08-21T07:28:18Z I! k8sapiserver Switch New Leader: ip-10-106-12-14.ec2.internal
2021-08-21T07:28:19Z I! [processors.ec2tagger] ec2tagger: Initial retrieval of tags succeded
2021-08-21T07:28:19Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started, finished initial retrieval of tags and Volumes
2021-08-21T07:28:26Z W! [outputs.cloudwatchlogs] Retried 0 time, going to sleep 137.608142ms before retrying.
2021-08-21T07:33:34Z I! [processors.ec2tagger] ec2tagger: Refresh is no longer needed, stop refreshTicker.

ec2tagger doesn't like not being able to access the instance metadata service, and the containers will restart. Once I set hostNetwork to true I started seeing metrics flow into Container Insights. This was even though the DaemonSet is set to a service account that uses IAM Roles for Service Accounts (IRSA) with a policy that gives it ec2:DescribeVolumes & ec2:DescribeTags.

Can an update be made that allows this to work without hostNetwork enabled on the DaemonSet?

@fitchtech

Also, the IAM policy document attached to the IRSA role needs to allow sts:AssumeRoleWithWebIdentity & sts:AssumeRole, resource-restricted to the IRSA role ARN, or it will throw access denied errors on the assume-role API call.

@fitchtech

> @fitchtech. The containerd socket on host is in a different path (same as bottlerocket). This is PR for EKS AMI https://github.com/awslabs/amazon-eks-ami/pull/698/files and the config file https://github.com/awslabs/amazon-eks-ami/blob/8450297eb2ef87fe5cbbce52a86ddcdc8b2e6716/files/containerd-config.toml#L1-L6
>
> [grpc]
> address = "/run/dockershim.sock"
>
> You can follow non default path in #188 (comment)
>
>           hostPath:
>             # path: /run/containerd/containerd.sock
>             # bottle rocket does not mount containerd sock at normal place
>             # https://github.com/bottlerocket-os/bottlerocket/commit/91810c85b83ff4c3660b496e243ef8b55df0973b
>             path: /run/dockershim.sock
>
> cc @sethAmazon since both EKS EC2 and Bottlerocket are using /run/dockershim.sock we may change this to =default. Though I was testing using kops at that time, which uses /run/containerd/containerd.sock. I am not sure if it's possible to have one manifest that works for both in our example manifest. Though it should doable for helm.

The official EKS Helm chart for CloudWatch metrics should be updated to do this instead of applying raw manifests, so that Helm templates can conditionally set those paths based on the values provided.
