Proposal: Add Error JobConditionType to reflect controller error (Resource Quota Error) into the status #47

CJZheng91 · 2019-10-02T12:30:37Z

Problem: Currently JobConditionType constants, there doesn't exist a a status that indicates error happening in the controller. This cause problems, e.g. when pod creation fails because it exceeds resource quota. Take TFJob as an example (https://github.com/kubeflow/tf-operator/blob/5adee6f30c86484897db33188af591d5976d1cd2/pkg/control/pod_control.go#L138), if pod creation here exceeds resource quota, the Create func will return an error. This information is only exposed to the higher-level jobs as an event.

Solution: Add an Error JobConditionType in kubeflow common api. Also show pod creation error (or other types of error if necessary) into the JobStatus, e.g. in the updateStatusSingle func for TFJob (https://github.com/kubeflow/tf-operator/blob/5adee6f30c86484897db33188af591d5976d1cd2/pkg/controller.v1/tensorflow/status.go#L61). This Error status will just indicate a retriable error that happened in the controller.

The text was updated successfully, but these errors were encountered:

issue-label-bot · 2019-10-02T12:30:40Z

Issue-Label Bot is automatically applying the label feature_request to this issue, with a confidence of 0.89. Please mark this comment with 👍 or 👎 to give our bot feedback!

Links: app homepage, dashboard and code for this bot.

czheng94 · 2020-04-23T14:20:41Z

Any thoughts on this issue?

Some updates supporting the necessity of the Error ConditionType:

Similar "Error" conditions do exist in other Kubernetes components, e.g. in Deployment and ReplicaSet.

ReplicaSet has ReplicaSetReplicaFailure as a condition type:
https://github.com/kubernetes/kubernetes/blob/abe6321296123aaba8e83978f7d17951ab1b64fd/pkg/apis/apps/types.go#L788-L793

const (
	// ReplicaSetReplicaFailure is added in a replica set when one of its pods fails to be created
	// due to insufficient quota, limit ranges, pod security policy, node selectors, etc. or deleted
	// due to kubelet being down or finalizers are failing.
	ReplicaSetReplicaFailure ReplicaSetConditionType = "ReplicaFailure"
)

Deployment has DeploymentReplicaFailure as a condition type:
https://github.com/kubernetes/kubernetes/blob/abe6321296123aaba8e83978f7d17951ab1b64fd/pkg/apis/apps/types.go#L460-L473

const (
	// Available means the deployment is available, ie. at least the minimum available
	// replicas required are up and running for at least minReadySeconds.
	DeploymentAvailable DeploymentConditionType = "Available"
	// Progressing means the deployment is progressing. Progress for a deployment is
	// considered when a new replica set is created or adopted, and when new pods scale
	// up or old pods scale down. Progress is not estimated for paused deployments or
	// when progressDeadlineSeconds is not specified.
	DeploymentProgressing DeploymentConditionType = "Progressing"
	// ReplicaFailure is added in a deployment when one of its pods fails to be created
	// or deleted.
	DeploymentReplicaFailure DeploymentConditionType = "ReplicaFailure"
)

Deployment controller will check for underlying ReplicaSet condition type and propagate ReplicaSetReplicaFailure up into Deployment status:
https://github.com/kubernetes/kubernetes/blob/abe6321296123aaba8e83978f7d17951ab1b64fd/pkg/controller/deployment/progress.go#L101-L106

	// Move failure conditions of all replica sets in deployment conditions. For now,
	// only one failure condition is returned from getReplicaFailures.
	if replicaFailureCond := dc.getReplicaFailures(allRSs, newRS); len(replicaFailureCond) > 0 {
		// There will be only one ReplicaFailure condition on the replica set.
		util.SetDeploymentCondition(&newStatus, replicaFailureCond[0])
	} else {
		util.RemoveDeploymentCondition(&newStatus, apps.DeploymentReplicaFailure)
	}

Right now, pod creation errors due to resource quota are only exposed as events in KubeFlow jobs, which is not a very good channel to expose and especially to propagate errors. Enabling this Error condition in KubeFlow job CR status will enable much easier error propagation for users who are building custom CR on top of KubeFlow jobs.

terrytangyuan · 2020-04-23T15:05:16Z

cc @Jeffwan @gaocegege @johnugeorge @ywskycn @merlintang WDYT? Any objections on bringing this type of errors to Kubeflow CR level? It would be convenient to surface this at CR level status so users don't have to check each individual subresources but my main concern is that it's hard to define the types of errors that's retriable because this depends on the specific controller, use cases, as well as the lower-level mechanism to handle/propagate errors in different clusters.

czheng94 · 2020-04-23T16:02:34Z

@terrytangyuan

but my main concern is that it's hard to define the types of errors that's retriable because this depends on the specific controller,

I don't think we need to define these specific error types in kubeflow/common. They can just all be the "Error" JobConditionType, then in each specific controller, we can use JobCondition.Reason and JobCondition.Message to expose the specific error.

Co-authored-by: depfu[bot] <23717796+depfu[bot]@users.noreply.github.com> Co-authored-by: Alexander Graf <[email protected]>

issue-label-bot bot added the feature_request label Oct 2, 2019

georgkaleido pushed a commit to georgkaleido/common that referenced this issue Jun 9, 2022

Update Node.js to version 14.18.1 (kubeflow#47)

4e039d5

Co-authored-by: depfu[bot] <23717796+depfu[bot]@users.noreply.github.com> Co-authored-by: Alexander Graf <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal: Add Error JobConditionType to reflect controller error (Resource Quota Error) into the status #47

Proposal: Add Error JobConditionType to reflect controller error (Resource Quota Error) into the status #47

CJZheng91 commented Oct 2, 2019

issue-label-bot bot commented Oct 2, 2019

czheng94 commented Apr 23, 2020 •

edited

Loading

terrytangyuan commented Apr 23, 2020

czheng94 commented Apr 23, 2020

Proposal: Add Error JobConditionType to reflect controller error (Resource Quota Error) into the status #47

Proposal: Add Error JobConditionType to reflect controller error (Resource Quota Error) into the status #47

Comments

CJZheng91 commented Oct 2, 2019

issue-label-bot bot commented Oct 2, 2019

czheng94 commented Apr 23, 2020 • edited Loading

terrytangyuan commented Apr 23, 2020

czheng94 commented Apr 23, 2020

czheng94 commented Apr 23, 2020 •

edited

Loading