Releases
v1.2.0
Highlights
RayCluster CRD status observability improvement: design doc
Support retry in RayJob: #2192
Coding style improvement
RayCluster
[RayCluster][Fix] evicted head-pod can be recreated or restarted (#2217, @JasonChen86899)
[Test][RayCluster] Add tests for RestartPolicyOnFailure for eviction (#2302, @MortalHappiness)
kuberay autoscaler pod use same command and args as ray head container (#2268, @cswangzheng)
Updated default timeout seconds for probes (#2265, @HarshAgarwal11)
Buildkite autoscaler e2e (#2199, @rueian)
[Test][Autoscaler][2/n] Add Ray Autoscaler e2e tests for GPU workers (#2181, @rueian)
[Test][Autoscaler][1/n] Add Ray Autoscaler e2e tests (#2168, @kevin85421)
[Bug] Fix RayCluster with an overridden app.kubernetes.io/name (#2147) (#2166, @rueian)
[Feat][RayCluster] Make the Head service headless (#2117, @rueian)
[Refactor][RayCluster] Make ray.io/group=headgroup be constant (#1970, @rueian)
[Feature][autoscaler v2] Set RAY_NODE_TYPE_NAME when starting ray node (#1973, @kevin85421)
feat: add RayCluster.status.readyWorkerReplicas (#1930, @davidxia)
[Chore][Samples] Rename ray-cluster.mini.yaml and add workerGroupSpecs (#2100, @MortalHappiness)
[Chore] Delete redundant pod existence checking (#2113, @MortalHappiness)
[Autoscaler V2] Polish Autoscaler V2 YAML (#2064, @kevin85421)
[Refactor] Use RayClusterHeadPodsAssociationOptions to replace MatchingLabels (#2056, @evalaiyc98)
[Sample][autoscaler v2] Add sample yaml for autoscaler v2 (#1974, @rickyyx)
Allow configuration of restartPolicy (#2197, @c0dearm)
[Chore][Log] Delete error loggings right before returned errors (#2103, @MortalHappiness)
[Refactor] Follow-up for PR 1930 (#2124, @MortalHappiness)
[Test] Move StateTransitionTimes envtest to a better place (#2111, @kevin85421)
support using proxy subresources when connecting to Ray head node (#1980, @andrewsykim)
[Bug] All worker Pods are deleted if using KubeRay v1.0.0 CRD with KubeRay operator v1.1.0 image (#2087, @kevin85421)
[Bug] KubeRay operator failed to watch endpoint (#2080, @kevin85421)
[Refactor] Remove cleanupInvalidVolumeMounts (#2104, @kevin85421)
[Chore] Run operator outside the cluster (#2090, @MortalHappiness)
[Feat] Deprecate ForcedClusterUpgrade (#2075, @MortalHappiness)
[Bug] Ray operator crashes when specifying RayCluster with resources.limits but no resources.requests (#2077, @kevin85421)
RayCluster CRD status improvement
RayClusterProvisioned status should be set while cluster is being provisioned for the first time (#2304, @andrewsykim)
Add RayClusterProvisioned Condition Type (#2301, @Yicheng-Lu-llll)
[Test][RayCluster] Add envtests for RayCluster conditions (#2283, @MortalHappiness)
[Fix][RayCluster] Make the RayClusterReplicaFailureReason to capture the correct reason (#2282, @rueian)
Add RayClusterReady Condition Type (#2271, @Yicheng-Lu-llll)
[Feature][RayCluster]: Implement the HeadReady condition (#2261, @cchen777)
[Feature] REP 54: Add PodName to the HeadInfo (#2266, @rueian)
[Feat][RayCluster] Use a new RayClusterReplicaFailure condition to reflect the result of reconcilePods (#2259, @rueian)
Don’t assign the rayv1.Failed to the State field (#2258, @Yicheng-Lu-llll)
[Refactor][RayCluster] Unify status update to single place (#2249, @MortalHappiness)
[Feat][RayCluster] Introduce the RayClusterStatus.Conditions field (#2214, @rueian)
[Test][Autoscaling] Add custom resource test (#2193, @MortalHappiness)
feat: record last state transition times (#2053, @davidxia)
[RayCluster] Add serviceName to status.headInfo (#2089, @andrewsykim)
[RayCluster][Status][1/n] Remove ClusterState Unhealthy (#2068, @kevin85421)
Coding style improvement
[Style] Fix golangci-lint rule: govet (#2144, @MortalHappiness)
[Chore] Fix golangci-lint rule: gosec (#2163, @MortalHappiness)
[Style] Fix golangci-lint rule: nolintlint (#2196, @MortalHappiness)
[Style] Fix golangci-lint rule: unparam (#2195, @MortalHappiness)
[Fix][CI] Fix revive error (#2183, @MortalHappiness)
[Style] Fix golangci-lint rule: revive (#2167, @MortalHappiness)
[Style] Fix golangci-lint rule: ginkgolinter (#2164, @MortalHappiness)
[Style] Fix golangci-lint rule: errorlint (#2141, @MortalHappiness)
[Chore] Use new golangci-lint rules only for ray-operator (#2152, @MortalHappiness)
[Docs][Development] Delete linting docs (#2145, @MortalHappiness)
[Style] Fix golangci-lint rule: unconvert (#2143, @MortalHappiness)
[Style] Fix golangci-lint rule: noctx (#2142, @MortalHappiness)
[Fix][precommit] Fix pre-commit golangci-lint always succeed (#2140, @MortalHappiness)
[N/N][Chore] Add golangci-lint rules (#2128, @MortalHappiness)
[Chore] Turn off no-commit-to-branch rule (#2139, @MortalHappiness)
[5/N][Refactor] Run golangci-lint for all files (only autofix rules) (#2133, @MortalHappiness)
[4/N][Chore] Turn off golangci-lint rules except ray-operator (#2138, @MortalHappiness)
[3/N][CI] Replace lint CI with pre-commit (#2129, @MortalHappiness)
[2/N][Refactor] Run pre-commit for all files (without golangci-lint) (#2130, @MortalHappiness)
[1/N][Chore] Add pre-commit hooks (#2127, @MortalHappiness)
RayJob
[RayJob] allow create verb for services/proxy, which is required for HTTPMode (#2321, @andrewsykim)
[Fix][Sample-Yaml] Increase ray head CPU resource for pytorch mnist (#2330, @MortalHappiness)
Support Apache YuniKorn as one batch scheduler option (#2184, @yangwwei)
[RayJob] add RayJob pass Deadline e2e-test with retry (#2241, @karta1502545)
add feature gate mechanism to ray-operator (#2219, @andrewsykim)
[RayJob] add Failing RayJob in HTTPMode e2e test for rayjob with retry (#2242, @tinaxfwu)
[Feat][RayJob] Delete RayJob CR after job termination (#2225, @MortalHappiness)
reconcile concurrency flag should apply for RayJob and RayService controllers (#2228, @andrewsykim)
[RayJob] add Failing submitter K8s Job e2e test for rayjob with retry (#2226, @cchen777)
[Chore] Create example Modin RayJob (#2221, @MortalHappiness)
[RayJob] Unified checkBackoffLimitAndUpdateStatusIfNeeded codepath and add an e2e test for retry (#2215, @kevin85421)
[RayJob] Add spec.backoffLimit for retrying RayJobs with new clusters (#2192, @andrewsykim)
fix: update ray-job.pytorch-image-classifier.yaml (#2178, @davidxia)
Add test for configurable k8s job backoff limit (#2134, @jjyao)
Init dashboardClientFunc and httpProxyClientFunc by the config arg (#2092, @bugorz)
[RayJob] Add Tests for Atomic Suspend Operation (#2050, @Yicheng-Lu-llll)
Show cluster name in kubectl get rayjob (#2065, @jjyao)
[RayJob] Add Cluster Name For Rayjob. (#2046, @slfan1989)
Make k8s job backoff limit configurable for RayJob (#2091, @jjyao)
[Refactor] Renaming RayHttpProxyClient attribute UseProxy #1980 (#2093, @Xiao75896453)
RayService
Generate RayCluster Hash on KubeRay Version Change (#2320, @ryanaoleary)
[release blocker][v1.2.0] Fix HA OOM issue (#2313, @kevin85421)
add vLLM + RayService sample (#2289, @andrewsykim)
Fix logging issue for FetchHeadServiceURL (#2216, @tinaxfwu)
Add RayService Manifests for Stable Diffusion TPU Examples (#2198, @ryanaoleary)
[Test][HA] RayService high-availability test without autoscaling enabled (#2176, @kevin85421)
RayService: Omits Min and Max replicas from hash calculation (#2172, @kpitzen)
Ray serve gke gateway ingress (#1978, @ravishtiwari)
[RayService] Add RayService High Availability Test Doc (#1986, @Yicheng-Lu-llll)
[Refactor][RayService] Add GetRayClusterWithRayServiceAssociationOptions (#2070, @evalaiyc98)
[Hotfix] Increase the timeout of the ProxyActor health check (#2082, @kevin85421)
KubeRay kubectl plugin
Helm
[Helm] Enable leader election when leaderElectionEnabled is not set (#2284, @kevin85421)
Support disable leader election for manager go binary via Values.yaml to mitigate kuberay restarts (#2262, @aviadshimoni)
Change the rules in role.yaml and multiple_namespaces_role.yaml to use the same template in _helpers.tpl to ensure consistency. (#2244, @LeoLiao123)
fix: Add deletecollection for multi-namespace role to helm charts (#2231, @spencer-barton-klaviyo)
[Chore] Use safe YAML for helm-chart-verify-rbac (#2230, @spencer-p)
Properly set env field based on containerEnv values (#2175, @arueth)
add priority and priorityClassName to ray-cluster template (#2171, @walterddr)
Added Pod securityContext value to Helm charts (#2160, @arueth)
[Fix][Helm chart] Move service.headService -> head.headService in values.yaml (#1998, @jjaniec)
Add NumOfHosts to RayCluster helm-chart template (#1969, @ryanaoleary)
Benchmark
[perf-tests] make the bucket name and prefix configurable for ray data image resize job (#2156, @andrewsykim)
[perf-test] update 100 RayJob perf tests to use PyTorch trainer and Ray Data examples (#2149, @andrewsykim)
Add RayJob training example using pytorch resnet image classifier (#2107, @andrewsykim)
[Perf] Add a CPU-based image resizing workload using Ray Data (#2135, @kevin85421)
[Perf] Add NUM_WORKERS and CPUS_PER_WORKER env to the mnist workload (#2126, @rueian)
[Perf] Add a CPU-based training workload (#2116, @kevin85421)
[Perf] Improve perf-test YAMLs and README (#2110, @kevin85421)
add initial perf tests for 100 RayCluster and 100 RayJob (#2102, @andrewsykim)
Others
[release v1.2.0] Update tags and versions (#2342, @kevin85421)
[release v1.2.0-rc.1] Update tags and versions (#2341, @kevin85421)
Bump go to 1.22.4 to fix ray-operator vulnerabilities (#2325, @ryanaoleary)
[Telemetry][v1.2.0] Update KUBERAY_VERSION (#2309, @kevin85421)
[release v1.2.0-rc.0] Update tags and versions (#2308, @kevin85421)
[release] Update Ray image to 2.34.0 (#2303, @kevin85421)
[Minor][Chore] Fix wrong comment (#2294, @MortalHappiness)
[Fix][Envtest] Decorate container nodes with Ordered (#2285, @MortalHappiness)
[Minor] Remove redundant variable (#2281, @MortalHappiness)
[Bug] Issue with glibc version GLIBC_2.34 and GLIBC_2.32 not found in earlier operator tags (#2272, @kevin85421)
Bump google.golang.org/grpc from 1.64.0 to 1.64.1 in /experimental (#2248, @dependabot[bot])
Bump google.golang.org/grpc from 1.64.0 to 1.64.1 in /cli (#2229, @dependabot[bot])
[core] Bump Go Dependencies (#2205, @thomasdesr)
[Fix] Use go 1.22 on Buildkite autoscaler e2e tests (#2211, @rueian)
fix invalid link in docs/index.md (#2179, @yf4n)
added autoscaling support to Python APIs (#2159, @blublinsky)
[CI] Remove unnecessary sample YAML symbolic links (#2118, @kevin85421)
[release][v1.1.0] Improve release doc and update KubeRay API server chart's repository (#1960, @kevin85421)
Post release 1.1.0 (#2040, @kevin85421)
[Post v1.1.0] Run the sample YAML tests with KubeRay v1.1.0 (#2039, @kevin85421)
[Release] Update release Makefile (#2037, @kevin85421)
[Hotfix][CI] Pin setup-envtest dep (#2038, @kevin85421)
[Grafana] Update Grafana dashboard (#2106, @kevin85421)
[CI] Pin kustomize to v5.3.0 (#2067, @kevin85421)
[Doc] Fix Doc Typos (#2060, @slfan1989)
[Doc] Fix Yaml Typos (#2049, @slfan1989)
Remove extraneous arguments from examples (#2051, @thomasdesr)
Support for Image pull policy (#2101, @blublinsky)
CVE fix - Upgrade golang.org/x/net (#2081, @ChristianZaccaria)
[Refactor] Rename raycluster_controller_fake_test.go to XXX_unit_test.go (#2074, @MortalHappiness)
[Bug] Change image repository for make deploy (#2059, @kevin85421)
Include KUBERAY_VERSION in the user-agent (#2042, @andrewsykim)
ray-operator coding style change (#2096, @LeoLiao123)