Proposal: Add Error JobConditionType to reflect controller error (Resource Quota Error) into the status #47
Comments
Any thoughts on this issue? Some updates supporting the necessity of the Error ConditionType: Similar "Error" conditions do exist in other Kubernetes components, e.g. in Deployment and ReplicaSet.
Right now, pod creation errors due to resource quota are only exposed as events in Kubeflow jobs. Events are a poor channel for exposing errors, and an even poorer one for propagating them. Adding this Error condition to the Kubeflow job CR status would make error propagation much easier for users who build custom CRs on top of Kubeflow jobs.
cc @Jeffwan @gaocegege @johnugeorge @ywskycn @merlintang WDYT? Any objections to bringing this type of error to the Kubeflow CR level? It would be convenient to surface this in the CR-level status so users don't have to check each individual subresource. My main concern is that it's hard to define which types of errors are retriable, because this depends on the specific controller, the use case, and the lower-level mechanism used to handle and propagate errors in different clusters.
I don't think we need to define these specific error types in kubeflow/common. They can just all be the "Error" JobConditionType, then in each specific controller, we can use |
Problem: Currently, none of the JobConditionType constants indicates that an error happened in the controller. This causes problems, e.g. when pod creation fails because it exceeds the resource quota. Take TFJob as an example (https://github.com/kubeflow/tf-operator/blob/5adee6f30c86484897db33188af591d5976d1cd2/pkg/control/pod_control.go#L138): if pod creation here exceeds the resource quota, the `Create` func returns an error. This information is only exposed to the higher-level jobs as an event.

Solution: Add an Error JobConditionType to the kubeflow common API. Also surface the pod creation error (or other types of error if necessary) in the JobStatus, e.g. in the `updateStatusSingle` func for TFJob (https://github.com/kubeflow/tf-operator/blob/5adee6f30c86484897db33188af591d5976d1cd2/pkg/controller.v1/tensorflow/status.go#L61). This Error status would only indicate a retriable error that happened in the controller.
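
To make the proposal concrete, here is a minimal, self-contained sketch of how an Error condition could be recorded when pod creation fails. It does not use the real kubeflow/common or tf-operator packages: the types are simplified local stand-ins, and the names `JobError`, `setCondition`, `createPod`, `reconcilePod`, and the reason string `PodCreationFailed` are hypothetical choices made for illustration only.

```go
package main

import (
	"errors"
	"fmt"
)

// JobConditionType is a simplified stand-in for the condition type in kubeflow/common.
type JobConditionType string

const (
	JobCreated JobConditionType = "Created"
	JobRunning JobConditionType = "Running"
	JobFailed  JobConditionType = "Failed"
	// JobError is the proposed addition (hypothetical name): a retriable
	// error that happened in the controller.
	JobError JobConditionType = "Error"
)

// JobCondition and JobStatus are reduced to the fields this sketch needs.
type JobCondition struct {
	Type    JobConditionType
	Status  bool
	Reason  string
	Message string
}

type JobStatus struct {
	Conditions []JobCondition
}

// setCondition replaces the condition of the same type, or appends it.
func setCondition(status *JobStatus, c JobCondition) {
	for i := range status.Conditions {
		if status.Conditions[i].Type == c.Type {
			status.Conditions[i] = c
			return
		}
	}
	status.Conditions = append(status.Conditions, c)
}

// createPod is a hypothetical stand-in for the pod control's Create call.
// It fails when the (simulated) resource quota is exceeded.
func createPod(quotaExceeded bool) error {
	if quotaExceeded {
		return errors.New("pod creation forbidden: exceeded quota")
	}
	return nil
}

// reconcilePod records a creation failure as an Error condition instead of
// leaving it visible only as an event. On success it marks any existing
// Error condition as no longer true, since the error is retriable.
func reconcilePod(status *JobStatus, quotaExceeded bool) error {
	if err := createPod(quotaExceeded); err != nil {
		setCondition(status, JobCondition{
			Type:    JobError,
			Status:  true,
			Reason:  "PodCreationFailed", // assumed reason string
			Message: err.Error(),
		})
		return err
	}
	for i := range status.Conditions {
		if status.Conditions[i].Type == JobError {
			status.Conditions[i].Status = false
		}
	}
	return nil
}

func main() {
	status := &JobStatus{}
	_ = reconcilePod(status, true) // simulate a quota failure
	fmt.Printf("%+v\n", status.Conditions)
}
```

With this shape, a controller built on top of a Kubeflow job could read the job's status conditions to detect the failure, rather than watching events on subresources.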