docs: overcommit-plugin enhancements #3634

Open
wants to merge 1 commit into
base: master
from

Conversation

googs1025
Member

@googs1025 googs1025 commented Jul 27, 2024

Fix: #3635

overcommit-plugin

@googs1025; Jul. 29, 2024

Background:

The overcommit plugin amplifies node resources so that more jobs can pass the enqueue check than the nodes' raw allocatable capacity would allow.

Objective:

Use different amplification factors for different resource types.

Introduction

Currently, the overcommit plugin multiplies a node's Allocatable resources by a single overcommit-factor and uses the result in its AddJobEnqueuedFn check. Different resources warrant different factors, however, so applying the same overcommit-factor to all of them is not appropriate.

  • For example:

The Binpack plugin assigns different weights to different resources as well.

actions: "enqueue, reclaim, allocate, backfill, preempt"
tiers:
- plugins:
  - name: binpack
    arguments:
      binpack.weight: 10
      binpack.cpu: 5
      binpack.memory: 1
      binpack.resources: nvidia.com/gpu, example.com/foo
      binpack.resources.nvidia.com/gpu: 2
      binpack.resources.example.com/foo: 3

Solution

We can further break down the overcommit-factor into more granular components: overcommit-factor.<resource name>.

For example: overcommit-factor.cpu, overcommit-factor.memory, overcommit-factor.pods, overcommit-factor.ephemeral-storage, overcommit-factor.nvidia.com/gpu

To maintain compatibility with the existing approach, we will keep the original overcommit-factor field and introduce an optional overcommit-factor.<resource name> field.
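As a rough sketch of how the two kinds of keys could be told apart, the snippet below splits an argument key into its prefix and resource name. The helper name `splitFactorKey` and the constant `factorPrefix` are hypothetical and not part of the proposal. Resource names such as nvidia.com/gpu contain dots and slashes themselves, so the sketch strips only the leading prefix instead of splitting on every dot.

```go
package main

import (
	"fmt"
	"strings"
)

// factorPrefix is the assumed prefix shared by all per-resource keys.
const factorPrefix = "overcommit-factor."

// splitFactorKey reports whether key is a per-resource factor argument and,
// if so, returns the resource name that follows the prefix.
// (Hypothetical helper, for illustration only.)
func splitFactorKey(key string) (resource string, ok bool) {
	if !strings.HasPrefix(key, factorPrefix) {
		return "", false
	}
	resource = strings.TrimPrefix(key, factorPrefix)
	// A bare "overcommit-factor." with nothing after it is not a valid key.
	return resource, resource != ""
}

func main() {
	keys := []string{
		"overcommit-factor.cpu",
		"overcommit-factor.nvidia.com/gpu",
		"overcommit-factor", // the original, global field
	}
	for _, k := range keys {
		r, ok := splitFactorKey(k)
		fmt.Printf("%q -> resource=%q perResource=%v\n", k, r, ok)
	}
}
```

The plain overcommit-factor key is reported as not per-resource, so it can keep its existing meaning.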

factors

The priority of these fields, from low to high, is:

defaultOverCommitFactor -> overcommit-factor -> overcommit-factor.<resource name>
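The priority order can be sketched as a simple lookup with two fallbacks. This is only an illustration: `resolveFactor`, the flat `map[string]float64` argument type, and the value 1.2 for `defaultOverCommitFactor` are assumptions made here, not something this proposal specifies.

```go
package main

import "fmt"

const (
	// defaultOverCommitFactor is assumed to be 1.2 for illustration;
	// the value is not stated in this proposal.
	defaultOverCommitFactor = 1.2
	// globalKey is the original, resource-independent field.
	globalKey = "overcommit-factor"
)

// resolveFactor returns the factor for one resource, checking the
// highest-priority source first:
//  1. overcommit-factor.<resource name>
//  2. overcommit-factor
//  3. defaultOverCommitFactor
func resolveFactor(args map[string]float64, resource string) float64 {
	if f, ok := args[globalKey+"."+resource]; ok {
		return f
	}
	if f, ok := args[globalKey]; ok {
		return f
	}
	return defaultOverCommitFactor
}

func main() {
	args := map[string]float64{
		"overcommit-factor.cpu":            1.2,
		"overcommit-factor.nvidia.com/gpu": 1.3,
		"overcommit-factor":                1.0,
	}
	for _, r := range []string{"cpu", "memory", "nvidia.com/gpu"} {
		fmt.Println(r, resolveFactor(args, r))
	}
}
```

With these arguments, cpu and nvidia.com/gpu resolve to their own values, while memory falls back to overcommit-factor.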

  • overcommitPlugin struct
// overcommitFactors defines the resource overcommit factors.
type overcommitFactors struct {
	// factorMaps maps a resource name to its overcommit factor.
	// key: resource name, e.g. "cpu", "memory", "ephemeral-storage", "nvidia.com/gpu"
	// value: overcommit factor for that resource
	factorMaps map[string]float64
}

type overcommitPlugin struct {
	// Arguments given for the plugin
	pluginArguments framework.Arguments
	totalResource   *api.Resource
	idleResource    *api.Resource
	inqueueResource *api.Resource
	// overCommitFactors holds the per-resource overcommit factors
	overCommitFactors *overcommitFactors
}
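
To illustrate how factorMaps might be applied, here is a minimal sketch that scales a set of allocatable quantities resource by resource. It is a simplification under stated assumptions: resources are modeled as a plain `map[string]float64` rather than `*api.Resource`, and the `amplify` method and its `fallback` parameter are hypothetical names introduced for this example.

```go
package main

import "fmt"

// overcommitFactors mirrors the struct proposed above.
type overcommitFactors struct {
	factorMaps map[string]float64
}

// amplify returns a copy of allocatable with every quantity multiplied by
// its resource-specific factor. Resources without an entry in factorMaps
// are multiplied by fallback. (Hypothetical method, for illustration only.)
func (f *overcommitFactors) amplify(allocatable map[string]float64, fallback float64) map[string]float64 {
	out := make(map[string]float64, len(allocatable))
	for name, qty := range allocatable {
		factor, ok := f.factorMaps[name]
		if !ok {
			factor = fallback
		}
		out[name] = qty * factor
	}
	return out
}

func main() {
	factors := &overcommitFactors{factorMaps: map[string]float64{
		"cpu":            1.5,
		"nvidia.com/gpu": 1.0, // a factor of 1.0 leaves the resource unamplified
	}}
	allocatable := map[string]float64{"cpu": 8, "memory": 32, "nvidia.com/gpu": 4}
	fmt.Println(factors.amplify(allocatable, 1.0))
}
```

Returning a new map instead of mutating the input keeps the original allocatable values available for other checks.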

Example

Example 1:
Explicitly specify all the overcommit factors

actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: overcommit
    arguments:
      overcommit-factor.cpu: 1.2
      overcommit-factor.memory: 1.0
      overcommit-factor.ephemeral-storage: 1.2
      overcommit-factor.pods: 1.2
      overcommit-factor.nvidia.com/gpu: 1.2

Example 2:
Specifying only overcommit-factor applies the same factor to all resources.

actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: overcommit
    arguments:
      overcommit-factor: 1.3

Example 3:
When overcommit-factor.cpu and overcommit-factor.nvidia.com/gpu are set together with overcommit-factor, those two resources use their specific values and all other resources use the overcommit-factor value.

actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: overcommit
    arguments:
      overcommit-factor.cpu: 1.2
      overcommit-factor.nvidia.com/gpu: 1.3
      overcommit-factor: 1.0

Example 4:
When only a resource-specific factor such as overcommit-factor.cpu is set, that resource uses the specified value and all other resources use the defaultOverCommitFactor value.

actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: overcommit
    arguments:
      overcommit-factor.cpu: 1.2

Example 5:
When no factor is specified, all resources use the defaultOverCommitFactor value.

actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: overcommit
    arguments:

@volcano-sh-bot volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 27, 2024
@googs1025 googs1025 force-pushed the overcommit-plugin_enhancements branch from 87eccf9 to 5a36cdd Compare July 27, 2024 13:54
@hwdef
Member

hwdef commented Jul 28, 2024

Why not put it in the overcommit document, but create a new document?

@googs1025
Member Author

> Why not put it in the overcommit document, but create a new document?

I would like to, but I can't seem to find any design documentation related to the overcommit plugin.

@hwdef
Member

hwdef commented Jul 28, 2024

You can name your document overcommit-plugin.md

@googs1025
Member Author

> enhancements

I can modify it. I named it overcommit-plugin-enhancements because the overcommit plugin itself was not designed by me and I may not know much of its background.

@googs1025
Member Author

/kind docs
/kind feature

@volcano-sh-bot volcano-sh-bot added kind/docs kind/feature Categorizes issue or PR as related to a new feature. labels Jul 29, 2024
@hwdef
Member

hwdef commented Jul 29, 2024

> > enhancements
>
> I can modify it. I named it overcommit-plugin-enhancements because the overcommit plugin itself was not designed by me and I may not know much of its background.

Never mind, overcommit plugin is simple, and I believe you can understand it completely. This is also a supplement to the missing documentation in the community.

@googs1025 googs1025 force-pushed the overcommit-plugin_enhancements branch from 5a36cdd to 3fe8b2e Compare July 29, 2024 11:42
@googs1025
Member Author

> > > enhancements
> >
> > I can modify it. I named it overcommit-plugin-enhancements because the overcommit plugin itself was not designed by me and I may not know much of its background.
>
> Never mind, overcommit plugin is simple, and I believe you can understand it completely. This is also a supplement to the missing documentation in the community.

thanks! done

@googs1025
Member Author

If you have time, please take a look at this issue. @lowang-bh @Monokaix Thanks a lot.

arguments:
cpu-overcommit-factor: 1.2
mem-overcommit-factor: 1.0
other-overcommit-factor: 1.2
Member

In which scenario would other resources need to be overcommitted? Does GPU support overcommit?

Member Author

In my understanding, GPU resources should not be overcommitted. (Please correct me if I am wrong.) Originally, this plugin multiplied all resources by a fixed overcommit factor. This proposal enhances that: not every resource should share the same configuration, and the overcommit factor should be configurable more flexibly.

Member Author

This effectively separates incompressible resources from compressible resources and sets them differently.

Member Author

The overcommit plugin is activated when a job is enqueued, and it allows more jobs to enter the Inqueue state by amplifying the factor.
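
To make the enqueue behavior concrete, here is a simplified model of such an admission check with per-resource factors. It is a sketch under assumptions, not the plugin's actual code: the function `canEnqueue`, the map-based resource representation, and the parameters `used`, `inqueue`, and `minReq` are all hypothetical names chosen for this example.

```go
package main

import "fmt"

// canEnqueue reports whether a job requesting minReq may enter the Inqueue
// state. For every requested resource, the job is admitted only if
//
//	inqueue + minReq <= total*factor - used
//
// where factor is the resource's overcommit factor, or fallback if the
// resource has no specific factor. (Simplified model, for illustration only.)
func canEnqueue(total, used, inqueue, minReq, factors map[string]float64, fallback float64) bool {
	for name, req := range minReq {
		factor, ok := factors[name]
		if !ok {
			factor = fallback
		}
		idle := total[name]*factor - used[name]
		if inqueue[name]+req > idle {
			return false
		}
	}
	return true
}

func main() {
	total := map[string]float64{"cpu": 10, "nvidia.com/gpu": 4}
	used := map[string]float64{"cpu": 8, "nvidia.com/gpu": 3}
	inqueue := map[string]float64{"cpu": 2}
	factors := map[string]float64{"cpu": 1.5, "nvidia.com/gpu": 1.0}

	cpuJob := map[string]float64{"cpu": 5}
	gpuJob := map[string]float64{"nvidia.com/gpu": 2}
	fmt.Println("cpu job admitted:", canEnqueue(total, used, inqueue, cpuJob, factors, 1.0))
	fmt.Println("gpu job admitted:", canEnqueue(total, used, inqueue, gpuJob, factors, 1.0))
}
```

In this model a larger factor for a compressible resource such as CPU lets more jobs through, while a factor of 1.0 for GPU keeps admission bounded by the real capacity.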

@lowang-bh
Member

Please refer to the predicate.Proportional function in the predicate plugin.

@googs1025 googs1025 force-pushed the overcommit-plugin_enhancements branch from 3fe8b2e to 9f1c3c1 Compare August 2, 2024 03:52
@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign shinytang6
You can assign the PR to them by writing /assign @shinytang6 in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Labels
kind/docs kind/feature Categorizes issue or PR as related to a new feature. retest-not-required-docs-only size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

docs: overcommit-plugin enhancements proposal
4 participants