add alerts (single check for pod restarts)
fix observability (don't remove all ServiceMonitor resources)

Signed-off-by: Michael Nairn <[email protected]>
mikenairn committed Dec 4, 2024
1 parent 1409e70 commit e2cb6b8
Showing 6 changed files with 97 additions and 6 deletions.
27 changes: 24 additions & 3 deletions config/observability/kustomization.yaml
@@ -1,16 +1,37 @@
 resources:
+  - ./metrics-server
   - github.com/kuadrant/kuadrant-operator/config/observability?ref=main
   - ./thanos
   - github.com/kuadrant/kuadrant-operator/examples/dashboards?ref=main
   - github.com/kuadrant/kuadrant-operator/examples/alerts?ref=main
 
 patches:
-  - target:
+  - patch: |
+      $patch: delete
+      apiVersion: monitoring.coreos.com/v1
+      kind: ServiceMonitor
+      metadata:
+        name: authorino-operator-metrics
+        namespace: kuadrant-system
+  - patch: |
+      $patch: delete
+      apiVersion: monitoring.coreos.com/v1
       kind: ServiceMonitor
-    patch: |
+      metadata:
+        name: dns-operator-metrics-monitor
+        namespace: kuadrant-system
+  - patch: |
+      $patch: delete
+      apiVersion: monitoring.coreos.com/v1
+      kind: ServiceMonitor
+      metadata:
+        name: kuadrant-operator-metrics
+        namespace: kuadrant-system
+  - patch: |
       $patch: delete
       apiVersion: monitoring.coreos.com/v1
       kind: ServiceMonitor
       metadata:
-        name: ANY
+        name: limitador-operator-metrics
+        namespace: kuadrant-system
   - path: k8s_prometheus_patch.yaml
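One way to check the effect of the named deletions is to render the overlay and list the ServiceMonitor resources that remain. The sketch below is not part of this commit: `names_of_kind` is a hypothetical helper, and it assumes rendered manifests where top-level keys start at column 0, `kind` appears before `metadata`, and `metadata.name` is indented by two spaces (the layout kustomize typically emits).

```shell
# Hypothetical helper, not part of the repository.
# Reads rendered manifests on stdin and prints metadata.name for every
# resource whose top-level kind equals the first argument.
names_of_kind() {
  awk -v want="$1" '
    /^---/      { kind = ""; section = "" }   # new YAML document
    /^[A-Za-z]/ { section = $1 }              # track the current top-level key
    /^kind: /   { kind = $2 }
    section == "metadata:" && kind == want && /^  name: / { print $2 }
  '
}
```

A possible invocation, assuming `kubectl` is installed and the remote `github.com/kuadrant/...` resources can be fetched, is `kubectl kustomize config/observability | names_of_kind ServiceMonitor`. The four operator monitors named in the patches should be absent from the output, while any other ServiceMonitor should still be listed.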
24 changes: 24 additions & 0 deletions config/observability/metrics-server/kustomization.yaml
@@ -0,0 +1,24 @@
+resources:
+  - https://github.com/kubernetes-sigs/metrics-server/releases/download/v0.7.1/components.yaml
+patches:
+  - patch: |-
+      - op: add
+        path: /spec/template/spec/containers/0/args/-
+        value: --kubelet-insecure-tls
+    target:
+      version: v1
+      kind: Deployment
+      name: metrics-server
+      namespace: kube-system
+  - patch: |
+      $patch: delete
+      apiVersion: apiregistration.k8s.io/v1
+      kind: APIService
+      metadata:
+        name: v1beta1.metrics.k8s.io
+  - patch: |
+      $patch: delete
+      apiVersion: rbac.authorization.k8s.io/v1
+      kind: ClusterRole
+      metadata:
+        name: system:aggregated-metrics-reader
44 changes: 41 additions & 3 deletions test/scale/README.md
@@ -1,4 +1,6 @@
+
 ## Setup local environment (kind)
+
 Create a kind cluster with prometheus/thanos installed and configured
 ```shell
 make local-setup
@@ -11,7 +13,7 @@ Forward port for prometheus
 kubectl -n monitoring port-forward service/thanos-query 9090:9090
 ```
 
-Forward port for graphana (Optional)
+Forward port for grafana (Optional)
 ```shell
 kubectl -n monitoring port-forward service/grafana 3000:3000
 ```
@@ -22,7 +24,43 @@ Tail all operator logs (Optional)
 kubectl stern -l control-plane=dns-operator-controller-manager -A
 ```
 
-Run default scale test(1 iteration using the inmemory provider)
+## Run scale test
+
+Export Environment variables:
+```shell
+#All
+export PROMETHEUS_URL=http://127.0.0.1:9090
+export PROMETHEUS_TOKEN=""
+#AWS
+export KUADRANT_AWS_ACCESS_KEY_ID=<my aws access key id>
+export KUADRANT_AWS_SECRET_ACCESS_KEY=<my aws secret access key>
+export KUADRANT_AWS_REGION=""
+#GCP
+export KUADRANT_GCP_GOOGLE_CREDENTIALS=<my gcp credentials json>
+export KUADRANT_GCP_PROJECT_ID=<my gcp project id>
+#Azure
+export KUADRANT_AZURE_CREDENTIALS=<my azure credentials json>
+```
+
+### inmemory
+
+```shell
+make test-scale
+```
+### aws
+
+```shell
+make test-scale DNS_PROVIDER=aws KUADRANT_ZONE_ROOT_DOMAIN=<my aws hosted domain>
+```
+
+### gcp
+
 ```shell
-PROMETHEUS_URL=http://127.0.0.1:9090 PROMETHEUS_TOKEN="" make test-scale
+make test-scale DNS_PROVIDER=gcp KUADRANT_ZONE_ROOT_DOMAIN=<my gcp hosted domain>
 ```
+
+### azure
+
+```shell
+make test-scale DNS_PROVIDER=azure KUADRANT_ZONE_ROOT_DOMAIN=<my azure hosted domain>
+```
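The README lists the variables to export but leaves checking them to the reader. A small preflight check could look like the sketch below. `missing_scale_env` is a hypothetical helper, not part of this commit, and the mapping of variables to providers is an assumption read off the `#All`, `#AWS`, `#GCP` and `#Azure` comments in the README. It only checks that a variable is exported, so empty values such as `PROMETHEUS_TOKEN=""` still pass.

```shell
# Hypothetical helper, not part of the repository.
# Prints every variable that is not exported for the given provider and
# returns 1 if any is missing, 2 if the provider name is unknown.
missing_scale_env() {
  provider="${1:-inmemory}"
  required="PROMETHEUS_URL PROMETHEUS_TOKEN"
  case "$provider" in
    inmemory) ;;
    aws)   required="$required KUADRANT_AWS_ACCESS_KEY_ID KUADRANT_AWS_SECRET_ACCESS_KEY KUADRANT_AWS_REGION" ;;
    gcp)   required="$required KUADRANT_GCP_GOOGLE_CREDENTIALS KUADRANT_GCP_PROJECT_ID" ;;
    azure) required="$required KUADRANT_AZURE_CREDENTIALS" ;;
    *) echo "unknown provider: $provider" >&2; return 2 ;;
  esac
  status=0
  for name in $required; do
    # printenv exits non-zero when the variable is not in the environment
    if ! printenv "$name" >/dev/null; then
      echo "not exported: $name"
      status=1
    fi
  done
  return "$status"
}
```

It could be chained in front of a run, for example `missing_scale_env aws && make test-scale DNS_PROVIDER=aws KUADRANT_ZONE_ROOT_DOMAIN=<my aws hosted domain>`.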
3 changes: 3 additions & 0 deletions test/scale/alerts.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
+- expr: increase(kube_pod_container_status_restarts_total{container="manager", namespace=~"kuadrant-system|kuadrant-dns-operator-.*"}[5m]) > 0
+  description: manager pod restarts
+  severity: error
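The alert expression can also be evaluated by hand against the port-forwarded Prometheus endpoint from the README. The sketch below uses Prometheus' instant-query HTTP API (`/api/v1/query`). `prom_query` and `RESTART_ALERT_EXPR` are hypothetical names, not part of this commit, and the helper assumes an unauthenticated endpoint, which matches the empty `PROMETHEUS_TOKEN` used for the local setup.

```shell
# Hypothetical helper, not part of the repository.
# Evaluates a PromQL expression with the Prometheus instant-query API.
prom_query() {
  curl -sS -G "${PROMETHEUS_URL:-http://127.0.0.1:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}

# The expression from alerts.yaml; any returned series means a manager
# container restarted within the last 5 minutes.
RESTART_ALERT_EXPR='increase(kube_pod_container_status_restarts_total{container="manager", namespace=~"kuadrant-system|kuadrant-dns-operator-.*"}[5m]) > 0'
```

Running `prom_query "$RESTART_ALERT_EXPR"` returns a JSON document. An empty `data.result` array means the condition did not hold at query time.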
2 changes: 2 additions & 0 deletions test/scale/config.yaml
@@ -3,6 +3,8 @@ metricsEndpoints:
     token: {{ .PROMETHEUS_TOKEN }}
     metrics:
       - ./metrics.yaml
+    alerts:
+      - ./alerts.yaml
     indexer:
       type: local
       metricsDirectory: ./metrics
3 changes: 3 additions & 0 deletions test/scale/metrics.yaml
@@ -1,2 +1,5 @@
 - query: sum(rate(container_cpu_usage_seconds_total{container="",namespace=~"kuadrant-system|kuadrant-dns-operator-*|scale-test-.*"}[5m])) by(namespace)
   metricName: namespaceCPU
+
+- query: sum(rate(kube_pod_container_status_restarts_total{container="manager", namespace=~"kuadrant-system|kuadrant-dns-operator-.*"}[5m])) by(namespace)
+  metricName: managerPodRestarts
