container-startup-autoscaler (CSA) is a Kubernetes controller that modifies the CPU and/or memory resources of containers depending on whether they're starting up, according to the startup/post-startup settings you supply. CSA works at the pod level and is agnostic to how the pod is managed; it works with deployments, statefulsets, daemonsets and other workload management APIs.
CSA is implemented using controller-runtime.
CSA is built around Kube's In-place Update of Pod Resources feature, which is in alpha as of Kubernetes 1.31 and therefore requires the `InPlacePodVerticalScaling` feature gate to be enabled. Beta/stable targets are indicated here.
The feature implementation (along with the corresponding implementation of CSA) is likely to change until it reaches
stable status. See CHANGELOG.md for details of CSA versions and Kubernetes version compatibility.
- container-startup-autoscaler 🚀
- Navigation
- Demo Video
- Docker Images
- Helm Chart
- Motivation
- How it Works
- Limitations
- Restrictions
- Scale Configuration
- Probes
- Status
- Events
- Logging
- Metrics
- Retry
- Encountering Unknown Resources
- CSA Configuration
- Pod Admission Considerations
- Container Scaling Considerations
- Best Practices
- Tests
- Running Locally
A local sandbox is provided for previewing CSA. A demo video (csa.mp4) shows fundamental CSA operation using the sandbox scripts.
Versioned multi-arch Docker images are available via Docker Hub.
A CSA Helm chart is available - please see its README.md for more information.
The release of Kubernetes 1.27.0 introduced a new, long-awaited alpha feature: In-place Update of Pod Resources. This feature allows pod container resources (`requests` and `limits`) to be updated in-place, without the need to restart the pod. Prior to this, any changes made to container resources required a pod restart to apply.
A historical concern of running workloads within Kubernetes is how to tune container resources for workloads that have very different resource utilization characteristics during two core phases: startup and post-startup. Given the previous lack of ability to change container resources in-place, there was generally a tradeoff for startup-heavy workloads between obtaining good (and consistent) startup times and overall resource wastage, post-startup:
- Set `limits` greater than `requests` in the hope that resources beyond `requests` are actually scavengeable during startup.
  - Startup time is unpredictable since it's dependent on cluster node loading conditions.
  - Post-startup performance may also be unpredictable as additional scavengeable resources are volatile in nature (particularly with cluster consolidation mechanisms).
- Set `limits` the same as `requests`, with startup time as the primary factor in determining the value.
  - Startup time and post-startup performance are predictable but wastage may occur, particularly if the pod replica count is generally more than it needs to be.
- Set `limits` the same as `requests`, with normal workload servicing performance as the primary factor in determining the value.
  - Post-startup performance is predictable and acceptable, but startup time is slower - this negatively affects desirable operational characteristics, for example by lengthening deployment durations and horizontal scaling reaction times.
The core motivation of CSA is to leverage the new In-place Update of Pod Resources
Kube feature to provide workload
owners with the ability to configure container resources for startup (in a guaranteed fashion) separately from normal
post-startup workload resources. In doing so, the tradeoffs listed above are eliminated and the foundations are laid
for:
- Reducing resource wastage by facilitating separate settings for two fundamental workload phases.
- Faster and more predictable workload startup times, promoting desirable operational characteristics.
CSA is able to target a single non-init/ephemeral container within a pod. Configuration such as the target container name and desired startup/post-startup resource settings are contained within a number of pod annotations.
CSA watches for changes in pods that are marked as eligible for scaling (via a label). Upon processing an eligible pod's changes, CSA examines the current state of the target container and takes one of several actions based on that state:
- Startup resource settings are commanded (the target container currently has its post-startup settings applied and isn't started).
- Post-startup resource settings are commanded (the target container currently has its startup settings applied and is started).
- The status of a previously commanded scale is determined and appropriately reported upon. If the commanded scale was successful, the scale is considered to be enacted.
CSA will react when the target container is initially created (by its pod) and if Kube restarts the target container.
CSA will not perform any scaling action if it doesn't need to - for example, if the target container repeatedly fails to start prior to it becoming ready (with Kube reacting with restarts in a `CrashLoopBackOff` manner), CSA will only apply startup resources once.
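
The decision described above can be sketched roughly as follows. This is a simplified illustration in Python, not CSA's actual Go implementation: the `decide_action` function, the `Action` names and the two-valued `resources` argument are hypothetical, and error states (such as unknown resources) are ignored.

```python
from enum import Enum


class Action(Enum):
    COMMAND_STARTUP = "command startup resources"
    COMMAND_POST_STARTUP = "command post-startup resources"
    CHECK_SCALE_STATUS = "determine status of previously commanded scale"


def decide_action(started: bool, resources: str) -> Action:
    """Pick an action from the container's started signal and applied resources.

    `resources` is "startup" or "poststartup" (a hypothetical simplification).
    """
    if not started and resources == "poststartup":
        return Action.COMMAND_STARTUP
    if started and resources == "startup":
        return Action.COMMAND_POST_STARTUP
    # Applied resources already suit the current phase, so no new scale is
    # commanded; only the status of any earlier scale is reported.
    return Action.CHECK_SCALE_STATUS
```

Under this sketch, a container that keeps restarting while startup resources are already applied always falls through to the last branch, which mirrors the behavior of applying startup resources only once.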
CSA generates metrics and pod Kube events, along with a detailed status that's included within an annotation of the scaled pod.
The following limitations are currently in place:
- Originally admitted target container resources must be guaranteed (`requests` == `limits`) to match the guaranteed nature of startup resources - Kube API currently rejects any change in resource QoS. This should be addressed as the In-place Update of Pod Resources feature matures.
- Post-startup resources must be guaranteed (`requests` == `limits`) to match the guaranteed nature of startup resources per above.
- Failed target container scales are not re-attempted.
The following restrictions are currently in place and enforced where applicable:
- Only a single container of a pod can be targeted for scaling.
- The target pod must not be controlled by a VPA.
- The target container post-startup `requests` must be lower than startup resources.
- The target container must specify `requests` for both CPU and memory.
- The target container must specify the `NotRequired` resize policy for both CPU and memory.
- The target container must specify a startup or readiness probe (or both).
The following labels must be present in the pod that includes your target container:
Name | Value | Description |
---|---|---|
`csa.expediagroup.com/enabled` | `"true"` | Indicates a container in the pod is eligible for scaling - must be `"true"`. |
The following annotations must be present in the pod that includes your target container:
Name | Example Value | Description |
---|---|---|
`csa.expediagroup.com/target-container-name` | `"mycontainer"` | The name of the container to target. |
`csa.expediagroup.com/cpu-startup` | `"500m"` * | Startup CPU (applied to both `requests` and `limits`). |
`csa.expediagroup.com/cpu-post-startup-requests` | `"250m"` * | Post-startup CPU `requests`. |
`csa.expediagroup.com/cpu-post-startup-limits` | `"250m"` * | Post-startup CPU `limits`. |
`csa.expediagroup.com/memory-startup` | `"500M"` * | Startup memory (applied to both `requests` and `limits`). |
`csa.expediagroup.com/memory-post-startup-requests` | `"250M"` * | Post-startup memory `requests`. |
`csa.expediagroup.com/memory-post-startup-limits` | `"250M"` * | Post-startup memory `limits`. |

\* Any CPU/memory form listed here can be used.
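
To make the relationship between these annotations and the limitations/restrictions concrete, here is a minimal pre-flight check you could run against a manifest before applying it. It is only a sketch: `parse_quantity` handles a small subset of Kubernetes quantity suffixes, `validate` and its messages are hypothetical, and CSA's own validation may differ (for example, whether equal startup and post-startup values are accepted is not stated here, so this sketch rejects them).

```python
from decimal import Decimal

PREFIX = "csa.expediagroup.com/"

# Subset of Kubernetes quantity suffixes; exponent forms are not handled.
_SUFFIXES = {
    "m": Decimal("0.001"),
    "k": Decimal(10) ** 3, "M": Decimal(10) ** 6, "G": Decimal(10) ** 9,
    "Ki": Decimal(2) ** 10, "Mi": Decimal(2) ** 20, "Gi": Decimal(2) ** 30,
}


def parse_quantity(value: str) -> Decimal:
    """Convert a quantity such as "500m" or "250Mi" to a plain number."""
    # Check two-character suffixes before one-character ones.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if value.endswith(suffix):
            return Decimal(value[: -len(suffix)]) * _SUFFIXES[suffix]
    return Decimal(value)


def validate(annotations: dict) -> list:
    """Return a list of problems found in the scale annotations (empty if none)."""
    problems = []
    for resource in ("cpu", "memory"):
        startup = parse_quantity(annotations[f"{PREFIX}{resource}-startup"])
        requests = parse_quantity(annotations[f"{PREFIX}{resource}-post-startup-requests"])
        limits = parse_quantity(annotations[f"{PREFIX}{resource}-post-startup-limits"])
        if requests >= startup:
            problems.append(f"{resource} post-startup requests must be lower than startup value")
        if requests != limits:
            problems.append(f"{resource} post-startup requests must equal limits")
    return problems
```

Calling `validate` with the example values from the table returns an empty list.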
CSA needs to know when the target container is starting up and therefore requires you to specify an appropriately configured startup or readiness probe (or both).
If the target container specifies a startup probe, CSA always uses Kube's `started` signal of the container's status to determine whether the container is started. Otherwise, if only a readiness probe is specified, CSA primarily uses the `ready` signal of the container's status to determine whether the container is started.
It's preferable to have a startup probe defined since this unambiguously indicates whether a container is started whereas only a readiness probe may indicate other conditions that will cause unnecessary scaling (e.g. the readiness probe transiently failing post-startup).
Kube's container status `started` and `ready` signal behavior is as follows:

When only a startup probe is present:
- `started` is `false` when the container is (re)started and `true` when the startup probe succeeds.
- `ready` is `false` when the container is (re)started and `true` when `started` is `true`.

When only a readiness probe is present:
- `started` is `false` when the container is (re)started and `true` when the container is running and has passed the `postStart` lifecycle hook.
- `ready` is `false` when the container is (re)started and `true` when the readiness probe succeeds.

When both startup and readiness probes are present:
- `started` is `false` when the container is (re)started and `true` when the startup probe succeeds.
- `ready` is `false` when the container is (re)started and `true` when the readiness probe succeeds.
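
The probe-to-signal selection can be summarized with a small sketch. The `is_started` function is hypothetical and deliberately simplified: it models only the primary signal for each probe configuration and ignores any secondary checks CSA may perform.

```python
def is_started(has_startup_probe: bool, has_readiness_probe: bool,
               started: bool, ready: bool) -> bool:
    """Choose which Kube container status signal indicates 'started'."""
    if has_startup_probe:
        # With a startup probe (alone or with a readiness probe), use `started`.
        return started
    if has_readiness_probe:
        # With only a readiness probe, fall back to `ready`.
        return ready
    raise ValueError("a startup or readiness probe (or both) is required")
```

With only a readiness probe, a transient readiness failure after startup makes this function return `False` again, which is why a startup probe is the preferred configuration.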
CSA reports its status in JSON via the `csa.expediagroup.com/status` annotation. You can retrieve and format the status using `kubectl` and `jq` as follows:
kubectl get pod <name> -n <namespace> -o=jsonpath='{.metadata.annotations.csa\.expediagroup\.com\/status}' | jq
Example output:
{
"status": "Post-startup resources enacted",
"states": {
"startupProbe": "true",
"readinessProbe": "true",
"container": "running",
"started": "true",
"ready": "false",
"resources": "poststartup",
"allocatedResources": "containerrequestsmatch",
"statusResources": "containerresourcesmatch"
},
"scale": {
"lastCommanded": "2023-09-14T08:18:44.174+0000",
"lastEnacted": "2023-09-14T08:18:45.382+0000",
"lastFailed": ""
},
"lastUpdated": "2023-09-14T08:18:45+0000"
}
Explanation of status items:
Item | Sub Item | Description |
---|---|---|
`status` | - | Human-readable status. Any validation errors are indicated here. |
`states` | - | The states of the target container. |
`states` | `startupProbe` | Whether a startup probe exists. |
`states` | `readinessProbe` | Whether a readiness probe exists. |
`states` | `container` | The container status e.g. `waiting`, `running`. |
`states` | `started` | Whether the container is signalled as started by Kube. |
`states` | `ready` | Whether the container is signalled as ready by Kube. |
`states` | `resources` | The type of resources (startup/post-startup) that are currently applied (but not necessarily enacted). |
`states` | `allocatedResources` | How the reported container allocated resources relate to container requests. |
`states` | `statusResources` | How the reported currently allocated resources relate to container resources. |
`scale` | - | Information around scaling activity. |
`scale` | `lastCommanded` | The last time a scale was commanded (UTC). |
`scale` | `lastEnacted` | The last time a scale was enacted (UTC; empty if failed). |
`scale` | `lastFailed` | The last time a scale failed (UTC; empty if enacted). |
`lastUpdated` | - | The last time this status was updated. |
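
As an illustration of consuming the status programmatically, the following sketch parses the annotation value and computes how long the last scale took. The `scale_duration_seconds` helper is hypothetical, and the timestamp format is inferred from the example output above rather than from a documented contract.

```python
import json
from datetime import datetime

# Format inferred from the example, e.g. "2023-09-14T08:18:44.174+0000".
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def scale_duration_seconds(status_json: str):
    """Return seconds between lastCommanded and lastEnacted, or None if unset."""
    scale = json.loads(status_json)["scale"]
    if not scale["lastCommanded"] or not scale["lastEnacted"]:
        return None
    commanded = datetime.strptime(scale["lastCommanded"], TIME_FORMAT)
    enacted = datetime.strptime(scale["lastEnacted"], TIME_FORMAT)
    return (enacted - commanded).total_seconds()
```

The input is the raw string stored in the `csa.expediagroup.com/status` annotation.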
The following Kube events for the pod that houses the target container are generated:
Trigger | Reason |
---|---|
Startup resources are commanded. | Scaling |
Startup resources are enacted. | Scaling |
Post-startup resources are commanded. | Scaling |
Post-startup resources are enacted. | Scaling |
Trigger | Reason |
---|---|
Validation failure. | Validation |
Failed to scale commanded startup resources. | Scaling |
Failed to scale commanded post-startup resources. | Scaling |
CSA uses the logr API with zerologr to log JSON-based `error`-, `info`-, `debug`- and `trace`-level messages.
When configuring verbosity, `info`-level messages have a verbosity (`v`) of 0, `debug`-level messages have a `v` of 1, and `trace`-level messages have a `v` of 2 - this is mapped via zerologr.
Regardless of configured logging verbosity, `error`-level messages are always emitted.
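
The effect of the configured verbosity can be expressed as a short sketch. The `is_emitted` helper and the `LEVEL_VERBOSITY` table are hypothetical names used only to illustrate the level-to-verbosity mapping (info 0, debug 1, trace 2); they are not part of CSA or zerologr.

```python
# Verbosity (v) associated with each non-error level.
LEVEL_VERBOSITY = {"info": 0, "debug": 1, "trace": 2}


def is_emitted(level: str, configured_v: int) -> bool:
    """Return whether a message at `level` is logged for a given --log-v value."""
    if level == "error":
        # Error-level messages are emitted regardless of verbosity.
        return True
    return LEVEL_VERBOSITY[level] <= configured_v
```

For example, with `--log-v` set to 1, `info` and `debug` messages are emitted but `trace` messages are not.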
Example `info`-level log:
{
"level": "info",
"controller": "container-startup-autoscaler",
"namespace": "echoserver",
"name": "echoserver-5f65d8f65d-mvqt8",
"reconcileID": "6157dd49-7aa9-4cac-bbaf-a739fa48cc61",
"targetname": "echoserver",
"targetstates": {
"startupProbe": "true",
"readinessProbe": "true",
"container": "running",
"started": "true",
"ready": "false",
"resources": "poststartup",
"allocatedResources": "containerrequestsmatch",
"statusResources": "containerresourcesmatch"
},
"caller": "container-startup-autoscaler/internal/pod/targetcontaineraction.go:472",
"time": 1694681974425,
"message": "post-startup resources enacted"
}
Each message includes a number of keys that originate from controller-runtime and zerologr. CSA-added values include:
- `targetname`: the name of the container to target.
- `targetstates`: the states of the target container, per status.
Additional CSA-specific metrics are registered to the Prometheus registry provided by controller-runtime and exposed on port 8080 and path `/metrics` e.g. `http://localhost:8080/metrics`. CSA metrics are not pre-initialized with `0` values.

Prefixed with `csa_reconciler_`:
Metric Name | Type | Labels | Description |
---|---|---|---|
`skipped_only_status_change` | Counter | `controller` | Number of reconciles that were skipped because only the scaler controller status changed. |
`existing_in_progress` | Counter | `controller` | Number of attempted reconciles where one was already in progress for the same namespace/name (results in a requeue). |
`failure_unable_to_get_pod` | Counter | `controller` | Number of reconciles where there was a failure to get the pod (results in a requeue). |
`failure_pod_doesnt_exist` | Counter | `controller` | Number of reconciles where the pod was found not to exist (results in failure). |
`failure_validation` | Counter | `controller` | Number of reconciles where there was a failure to validate (results in failure). |
`failure_states_determination` | Counter | `controller` | Number of reconciles where there was a failure to determine states (results in failure). |
`failure_states_action` | Counter | `controller` | Number of reconciles where there was a failure to action the determined states (results in failure). |

Labels:
- `controller`: the CSA controller name.
Prefixed with `csa_scale_`:

Metric Name | Type | Labels | Description |
---|---|---|---|
`failure` | Counter | `controller`, `direction`, `reason` | Number of scale failures. |
`commanded_unknown_resources` | Counter | `controller` | Number of scales commanded upon encountering unknown resources (see here). |
`duration_seconds` | Histogram | `controller`, `direction`, `outcome` | Scale duration (from commanded to enacted). |

Labels:
- `controller`: the CSA controller name.
- `direction`: the direction of the scale - `up`/`down`.
- `reason`: the reason why the scale failed.
- `outcome`: the outcome of the scale - `success`/`failure`.
Prefixed with `csa_retrykubeapi_`:

Metric Name | Type | Labels | Description |
---|---|---|---|
`retry` | Counter | `controller`, `reason` | Number of Kube API retries. |

Labels:
- `controller`: the CSA controller name.
- `reason`: the Kube API response that caused a retry to occur.
See below for more information on retries.
Unless Kube API reports that a pod is not found upon trying to retrieve it, all Kube API interactions are subject to retry according to CSA retry configuration.
CSA handles situations where Kube API reports a conflict upon a pod update. In this case, CSA retrieves the latest version of the pod and reapplies the update, before trying again (subject to retry configuration).
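
The retry and conflict behavior described above can be pictured with the following sketch. It is an illustration under stated assumptions, not CSA's Go code: `Conflict`, `update_with_retry` and the callback parameters are hypothetical, and the default attempt count and delay simply mirror the documented `--standard-retry-attempts` and `--standard-retry-delay-secs` defaults.

```python
import time


class Conflict(Exception):
    """Stand-in for a Kube API conflict response on update (hypothetical)."""


def update_with_retry(pod, get_latest, mutate, update,
                      attempts=3, delay_secs=1.0, sleep=time.sleep):
    """Apply `mutate` to `pod` and call `update`, retrying on failure.

    On a conflict, the latest pod is fetched and the mutation is reapplied
    to it before the next attempt. Errors raised by `get_latest` (such as
    the pod not being found) propagate without retry.
    """
    for attempt in range(1, attempts + 1):
        try:
            mutate(pod)
            return update(pod)
        except Conflict:
            if attempt == attempts:
                raise
            pod = get_latest()
        except Exception:
            if attempt == attempts:
                raise
        sleep(delay_secs)
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without real delays.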
By default, CSA will yield an error if it encounters resources applied to a target container that it doesn't recognize i.e. resources other than those specified within the pod startup or post-startup resource annotations. This may occur if resources are updated by an actor other than CSA. To allow corrective scaling upon encountering such a condition, set the `--scale-when-unknown-resources` configuration flag to `true`.
When enabled and upon encountering such conditions, CSA will:
- Command startup/post-startup resources according to whether the container is started.
- Append the Kube startup/post-startup resources commanded event reason and log message with `(unknown resources applied)`.
- Increment the `commanded_unknown_resources` metric.
- Treat enacted startup resources as directionally scaled `up` within the `failure` and `duration_seconds` (as applicable) metrics.
- Treat enacted post-startup resources as directionally scaled `down` within the `failure` and `duration_seconds` (as applicable) metrics.
CSA uses the Cobra CLI library and exposes a number of optional configuration flags. All configuration flags are always logged upon CSA start.
Flag | Type | Default Value | Description |
---|---|---|---|
`--kubeconfig` | String | - | Absolute path to the cluster kubeconfig file (uses in-cluster configuration if not supplied). |
`--leader-election-enabled` | Boolean | `true` | Whether to enable leader election. |
`--leader-election-resource-namespace` | String | - | The namespace to create resources in if leader election is enabled (uses current namespace if not supplied). |
`--cache-sync-period-mins` | Integer | `60` | How frequently the informer should re-sync. |
`--graceful-shutdown-timeout-secs` | Integer | `10` | How long to allow busy workers to complete upon shutdown. |
`--requeue-duration-secs` | Integer | `3` | How long to wait before requeuing a reconcile. |
`--max-concurrent-reconciles` | Integer | `10` | The maximum number of concurrent reconciles. |
`--scale-when-unknown-resources` | Boolean | `false` | Whether to scale when unknown resources are encountered. |

Flag | Type | Default Value | Description |
---|---|---|---|
`--standard-retry-attempts` | Integer | `3` | The maximum number of attempts for a standard retry. |
`--standard-retry-delay-secs` | Integer | `1` | The number of seconds to wait between standard retry attempts. |

Flag | Type | Default Value | Description |
---|---|---|---|
`--log-v` | Integer | `0` | Log verbosity level (0: info, 1: debug, 2: trace) - 2 used if invalid. |
`--log-add-caller` | Boolean | `false` | Whether to include the caller within logging output. |
Upon pod cluster admission, CSA will attempt to upscale the target container to its startup configuration. Upscaling success depends on node loading conditions - it's therefore possible that the scale is delayed or fails altogether, particularly if a cluster consolidation mechanism is employed.
In order to mitigate the effects of initial startup upscaling, it's recommended to admit pods with the target container startup configuration already applied - CSA will not need to initially upscale in this case. Once startup has completed, the subsequent downscale to apply post-startup resources is significantly less likely to fail since it's not subject to node loading conditions. In addition, any failure mode results in overall resource over-provisioning rather than startup under-provisioning.
It's important to note that in either case, CSA will need to upscale if Kube restarts the target container.
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    metadata:
      labels:
        csa.expediagroup.com/enabled: "true"
      annotations:
        csa.expediagroup.com/target-container-name: target-container
        csa.expediagroup.com/cpu-startup: 500m
        csa.expediagroup.com/cpu-post-startup-requests: 100m
        csa.expediagroup.com/cpu-post-startup-limits: 100m
        csa.expediagroup.com/memory-startup: 500M
        csa.expediagroup.com/memory-post-startup-requests: 100M
        csa.expediagroup.com/memory-post-startup-limits: 100M
    spec:
      containers:
        - name: target-container
          resources:
            limits:
              cpu: 500m # Admitted with csa.expediagroup.com/cpu-startup value
              memory: 500M # Admitted with csa.expediagroup.com/memory-startup value
            requests:
              cpu: 500m # Admitted with csa.expediagroup.com/cpu-startup value
              memory: 500M # Admitted with csa.expediagroup.com/memory-startup value
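
If manifests are generated by tooling, the admitted resources can be derived from the startup annotations so the two never drift apart. The helper below is a hypothetical sketch (the `startup_resources` name is an assumption for illustration) that builds a container `resources` mapping from the annotation values.

```python
PREFIX = "csa.expediagroup.com/"


def startup_resources(annotations: dict) -> dict:
    """Build a container `resources` mapping from the startup annotations.

    Requests and limits are set to the same values so the container is
    admitted as guaranteed with its startup configuration already applied.
    """
    values = {
        "cpu": annotations[PREFIX + "cpu-startup"],
        "memory": annotations[PREFIX + "memory-startup"],
    }
    # Separate copies so later edits to one mapping do not affect the other.
    return {"requests": dict(values), "limits": dict(values)}
```

The returned mapping corresponds to the `resources` section of the target container in the manifest above.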
Please consider carefully whether it's appropriate to scale memory during execution of your container. Memory management differs between runtimes, and it's not necessarily possible to change any runtime configuration (e.g. limits) set at the point of admission without restarting the container. Some runtimes may also default memory management settings based on available resources, which may no longer be optimal when memory is scaled.
In addition, some languages/frameworks may default configuration of concurrency mechanisms (e.g. thread pools) based on available CPU resources - this should be taken into consideration if applicable.
- Define a startup probe since this unambiguously indicates whether a container is started.
- Admit pods with target container startup resources specified.
- Try to minimize restarts of target containers for causes within your control.
- Try to minimize the startup time of your workload through profiling and optimization where possible.
- Try to minimize the difference between startup resources and post-startup resources - in general, the bigger the difference, the less likely an upscale is to succeed (particularly when a cluster consolidation mechanism is employed).
Unit tests can be run by executing `make test-run-unit` from the root directory.

Integration tests can be run by executing `make test-run-int` or `make test-run-int-verbose` (verbose logging) from the root directory. Please ensure you're using a version of Go that's at least the version indicated at the top of go.mod.

Integration tests are implemented as Go tests and located in `test/integration`. During initialization of the tests, a kind cluster is created (with a specific name); CSA is built via Docker and run via Helm. Tools are not bundled with the tests, so you must have the following installed locally:
- Docker
- Helm
- kind (at least 0.24.0)
- kubectl
The integration tests use echo-server for containers. Note: the very first execution might take some time to complete.
A number of environment variable-based configuration options are available:
Name | Default | Description |
---|---|---|
`KUBE_VERSION` | - | The `major.minor` version of Kube to run tests against e.g. `1.31`. |
`MAX_PARALLELISM` | `4` | The maximum number of tests that can run in parallel. |
`REUSE_CLUSTER` | `false` | Whether to reuse an existing CSA kind cluster (if it already exists). `KUBE_VERSION` has no effect if an existing cluster is reused. |
`INSTALL_METRICS_SERVER` | `false` | Whether to install metrics-server. |
`KEEP_CSA` | `false` | Whether to keep the CSA installation after tests finish. |
`KEEP_CLUSTER` | `false` | Whether to keep the CSA kind cluster after tests finish. |
`DELETE_NS_AFTER_TEST` | `true` | Whether to delete namespaces created by tests after they conclude. |
Integration tests are executed in parallel due to their long-running nature. Each test operates within a separate Kube namespace (but using the same single CSA installation). If local resources are limited, reduce `MAX_PARALLELISM` accordingly and ensure `DELETE_NS_AFTER_TEST` is `true`. Each test typically spins up 2 pods, each with 2 containers; see source for resource allocations.
A number of Bash scripts are supplied in the `scripts/sandbox` directory that allow you to try out CSA using echo-server. The scripts are similar in nature to the setup/teardown work performed in the integration tests and have the same local tool requirements. Please ensure you're using a version of Go that's at least the version indicated at the top of go.mod. Note: the kind cluster created by the scripts is named differently to the integration tests such that both can exist in parallel, if desired.
Executing `csa-install.sh`:

- Removes any pre-existing CSA kind cluster.
- Installs a CSA kind cluster with the latest version of Kubernetes certified as compatible with CSA.
- Creates a new, separate CSA kind cluster kubeconfig file under `$HOME/.kube/`.
- Pulls metrics-server, loads the image into the CSA kind cluster and installs.
- Pulls echo-server and loads the image into the CSA kind cluster.
- Builds CSA and loads the image into the CSA kind cluster.
- Runs CSA via the Helm chart.
  - Leader election is enabled; 2 pods are created.
  - Log verbosity level is `2` (trace).

Note: the very first execution might take some time to complete.
Executing `csa-tail-logs.sh` tails logs from the current CSA leader pod.

Executing `csa-get-metrics.sh` gets metrics from the current CSA leader pod.

Executing `echo-watch.sh` `watch`es the CSA status for the pod created below along with the target container's enacted resources.

Execute `echo-reinstall.sh` to (re)install echo-server with a specific probe configuration contained within the `echo` directory structure:
Admit with post-startup resources (initial upscale required):
- `echo-reinstall.sh echo/post-startup-resources/startup-probe.yaml`: single replica/container deployment with startup probe only.
- `echo-reinstall.sh echo/post-startup-resources/readiness-probe.yaml`: single replica/container deployment with readiness probe only.
- `echo-reinstall.sh echo/post-startup-resources/both-probes.yaml`: single replica/container deployment with both startup and readiness probes.

Admit with startup resources (initial upscale not required):
- `echo-reinstall.sh echo/startup-resources/startup-probe.yaml`: single replica/container deployment with startup probe only.
- `echo-reinstall.sh echo/startup-resources/readiness-probe.yaml`: single replica/container deployment with readiness probe only.
- `echo-reinstall.sh echo/startup-resources/both-probes.yaml`: single replica/container deployment with both startup and readiness probes.
To simulate workload startup/readiness, `initialDelaySeconds` is set as follows in all configurations:

Configuration | Startup Probe | Readiness Probe |
---|---|---|
`startup-probe.yaml` | `15` | N/A |
`readiness-probe.yaml` | N/A | `15` |
`both-probes.yaml` | `15` | `30` |
You can also cause a validation failure by executing `echo-reinstall.sh echo/validation-failure/cpu-config.yaml`. This will yield the `cpu post-startup requests (...) is greater than startup value (...)` status message.

Execute `echo-cause-container-restart.sh` to cause the echo-server container to restart. Note: `CrashLoopBackOff` may be triggered upon executing this multiple times in succession.
Executing `echo-delete.sh` deletes the echo-server namespace (including pod).

Executing `csa-uninstall.sh` uninstalls the CSA kind cluster.
First establish a watch on CSA status and enacted container resources and optionally tail CSA logs. You may also want to observe CSA metrics.
1. Install echo-server with `echo/post-startup-resources/startup-probe.yaml` and watch as CSA upscales the container for startup, then downscales once the container is started.
2. Install echo-server with `echo/startup-resources/startup-probe.yaml` and watch as CSA only downscales once the container is started - note the CSA `lastCommanded` and `lastEnacted` status is not populated until downscale.
3. Repeat 1) and 2) above with a readiness probe only (`echo/*/readiness-probe.yaml`) and watch as CSA only reacts to the container's `ready` status i.e. not `started`.
4. Repeat 1) and 2) above with both probes (`echo/*/both-probes.yaml`) and watch as CSA only reacts to the container's `started` status i.e. not `ready`.
5. Cause a container restart after post-startup resources are enacted and watch as CSA (re)upscales the container for startup, then downscales once the container is started.
6. Cause a container restart repeatedly after startup resources are enacted and watch as CSA doesn't take any action until downscaling after the container is started.
7. Install echo-server with `echo/validation-failure/cpu-config.yaml` and observe CSA status when a validation failure occurs.