This repository has been archived by the owner on Sep 12, 2023. It is now read-only.

minor errors in pod.go #100

Open · wants to merge 1 commit into base: master
8 changes: 4 additions & 4 deletions pkg/controller.v1/common/pod.go
@@ -24,7 +24,7 @@ import (
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promauto"
log "github.com/sirupsen/logrus"
"k8s.io/api/core/v1"
v1 "k8s.io/api/core/v1"
Member

This change is unnecessary

Member

I think it is created by gofmt, so maybe we can keep it.

"k8s.io/apimachinery/pkg/api/errors"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/labels"
@@ -86,7 +86,7 @@ func (jc *JobController) AddPod(obj interface{}) {
logger := commonutil.LoggerForPod(pod, jc.Controller.GetAPIGroupVersionKind().Kind)

if job == nil {
- if pod.Labels[apiv1.GroupNameLabel] == jc.Controller.GetGroupNameLabelValue() {
+ if pod.Labels[apiv1.GroupNameLabel] != jc.Controller.GetGroupNameLabelValue() {
Contributor

Good catch.

Member
@Jeffwan, Aug 10, 2020

@xfate123

Please check the logic here.

func (jc *JobController) resolveControllerRef(namespace string, controllerRef *metav1.OwnerReference) metav1.Object {
	// We can't look up by UID, so look up by Name and then verify UID.
	// Don't even try to look up by Name if it's the wrong Kind.
	if controllerRef.Kind != jc.Controller.GetAPIGroupVersionKind().Kind {
		return nil
	}
	job, err := jc.Controller.GetJobFromInformerCache(namespace, controllerRef.Name)
	if err != nil {
		return nil
	}
	if job.GetUID() != controllerRef.UID {
		// The controller we found with this Name is not the same one that the
		// ControllerRef points to.
		return nil
	}
	return job
}

If we cannot find the job, we end the reconcile loop and return directly. The only thing that matters is whether we want to have a meaningful log.

There are several reasons job == nil here:

  1. Kind doesn't match
  2. GroupNameLabel doesn't match
  3. Job doesn't exist. In this case, the pod is an orphan pod.

The 3rd one is the only case where we want to persist the log. Does that make sense?
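
For reference, here is a minimal sketch of how the nil-job branch in AddPod relates to those cases. It is illustrative only, assuming the surrounding structure shown in the diff above, not the exact upstream code:

// Illustrative sketch of the nil-job branch in AddPod; not the exact upstream code.
// resolveControllerRef returns nil for a wrong Kind, a job missing from the
// informer cache, or a UID mismatch; the caller cannot tell these apart.
if job == nil {
	// The group-name label is the only extra signal available here: it says
	// whether the pod was ever meant for this operator's group at all.
	if pod.Labels[apiv1.GroupNameLabel] == jc.Controller.GetGroupNameLabelValue() {
		// The pod claims to belong to us but its job cannot be found,
		// so this is the case worth logging.
		logger.Info("This pod's job does not exist")
	}
	return
}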

Author

@Jeffwan
Thank you so much, Jeff. It's much clearer after your patient explanation.
But I am still a little confused. I think the three cases should be:

  1. Kind doesn't match
  2. Name unmatched
  3. Name matched but UID unmatched

My understanding is that the third case cannot prove this pod is an orphan pod. It can only prove that the controllerRef points to a different job.
Thank you again. Really appreciate your help.

Member
@Jeffwan, Aug 12, 2020

@xfate123 You are right. I didn't follow the logic in resolveControllerRef. UID mismatch is one of the cases. We check GroupNameLabel again on the caller side, which I think is unnecessary. We use the fixed value kubeflow.org for most of the operators.

My understanding is that the third case cannot prove this pod is an orphan pod. It can only prove that the controllerRef points to a different job.

I may not have explained this clearly. You are right. It could be an orphan pod (the job has been deleted, case 2) or a pod pointing to a different job (UID mismatch, case 3). Case 1 seems to match the criterion as well.

Author

@Jeffwan Thank you so much for your clear explanation.

logger.Info("This pod's job does not exist")
}
return
@@ -392,9 +392,9 @@ func (jc *JobController) ReconcilePods(
}
// Check if the pod is retryable.
if spec.RestartPolicy == apiv1.RestartPolicyExitCode {
- if pod.Status.Phase == v1.PodFailed && trainutil.IsRetryableExitCode(exitCode) {
+ if pod.Status.Phase == v1.PodFailed && !trainutil.IsRetryableExitCode(exitCode) {
Member
@terrytangyuan, Aug 4, 2020

I think this was intended and correct, but we probably need to improve the log message here.

Member

The original logic here is to provide the flexibility to define your own retryable exit codes. This should not be changed.

Since Kubernetes doesn't have a restart method, we delete the pod and wait for the next reconcile loop to create a new one. It's asynchronous, so "restart" here may be confusing to users.
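
As a rough illustration of what such a retryable-exit-code check might look like (an assumption for this thread; the real trainutil.IsRetryableExitCode may differ):

// Illustrative sketch only; the real trainutil.IsRetryableExitCode may differ.
// Exit codes of 128+ usually mean the container was killed by a signal
// (e.g. SIGKILL from the OOM killer), so a retry can reasonably succeed,
// while codes 1-127 are treated as permanent application errors.
func IsRetryableExitCode(exitCode int32) bool {
	return exitCode >= 128
}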

failedPodsCount.Inc()
logger.Infof("Need to restart the pod: %v.%v", pod.Namespace, pod.Name)
logger.Infof("Need to delete the pod: %v.%v", pod.Namespace, pod.Name)
if err := jc.PodControl.DeletePod(pod.Namespace, pod.Name, runtimeObject); err != nil {
return err
}