[Feature] RayService HA test - GCS fault tolerance + kill GCS process #2590

Open: CheyuWu wants to merge 3 commits into master


CheyuWu (Contributor) commented Dec 2, 2024

Why are these changes needed?

Description

  • Create a RayService with GCS FT enabled. No Ray Serve replica should be deployed on the head Pod.
  • Kill the GCS process on the head Pod by running pkill gcs_server.
  • Wait until the head Pod is removed from the K8s serve service.
  • Use Locust to keep submitting requests until the new Ray head Pod has been running and ready for 30 seconds.
  • No request should be dropped.
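
The kill-and-wait steps above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the helper names (kill_gcs, head_pod_in_endpoints, wait_until_stable) and the injectable run, clock, and sleep parameters are hypothetical, and only the pkill gcs_server command comes from the description. The Pod check assumes the serve service is backed by a core/v1 Endpoints object.

```python
import json
import subprocess
import time


def kubectl(args, run=subprocess.run, check=True):
    """Run a kubectl command and return its stdout."""
    result = run(["kubectl", *args], capture_output=True, text=True, check=check)
    return result.stdout


def kill_gcs(head_pod, namespace="default", run=subprocess.run):
    """Kill the GCS process inside the head Pod.

    Assumption: the exec session may be cut off if the container exits,
    so a non-zero exit code is not treated as fatal here.
    """
    kubectl(["exec", head_pod, "-n", namespace, "--", "pkill", "gcs_server"],
            run=run, check=False)


def head_pod_in_endpoints(service, head_pod, namespace="default", run=subprocess.run):
    """Return True if the head Pod is still a ready address of the service."""
    out = kubectl(["get", "endpoints", service, "-n", namespace, "-o", "json"], run=run)
    endpoints = json.loads(out)
    for subset in endpoints.get("subsets") or []:
        for addr in subset.get("addresses") or []:
            target = addr.get("targetRef") or {}
            if target.get("name") == head_pod:
                return True
    return False


def wait_until_stable(check, stable_seconds, timeout, interval=1.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Return True once check() has been continuously True for stable_seconds.

    Returns False if that does not happen before the timeout expires.
    """
    deadline = clock() + timeout
    true_since = None
    while clock() < deadline:
        if check():
            now = clock()
            if true_since is None:
                true_since = now
            if now - true_since >= stable_seconds:
                return True
        else:
            # Any failed check resets the stability window.
            true_since = None
        sleep(interval)
    return False
```

A test might call kill_gcs on the head Pod, wait with wait_until_stable(lambda: not head_pod_in_endpoints(...), 0, timeout) for the Pod to leave the serve service, and then, while Locust keeps sending traffic, call wait_until_stable with a readiness check for the new head Pod and stable_seconds=30.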

Related issue number

Closes #2577

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

CheyuWu (Contributor, Author) commented Dec 2, 2024

@kevin85421 @MortalHappiness PTAL

Some requests are currently failing: 73 of 1421 (5.14%) in the Locust run below. I'm not sure whether this is a problem with the test code or an actual bug.

Type     Name          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST     /               1421    73(5.14%) |     58       0    3369     14 |    9.70        0.00
--------|------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated      1421    73(5.14%) |     58       0    3369     14 |    9.70        0.00
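
One way to turn the "no request should be dropped" criterion into an automatic check is to have Locust write its statistics to CSV (the --csv option) and inspect the Aggregated row. The following is a minimal sketch, assuming the Name, Request Count, and Failure Count columns of Locust's stats CSV file; the function name is hypothetical and the sample data simply mirrors the numbers reported above.

```python
import csv
import io


def aggregated_failures(stats_csv_text):
    """Return (request_count, failure_count) from the Aggregated row."""
    reader = csv.DictReader(io.StringIO(stats_csv_text))
    for row in reader:
        if row["Name"] == "Aggregated":
            return int(row["Request Count"]), int(row["Failure Count"])
    raise ValueError("no Aggregated row found")


# Sample data reduced to the columns used here (illustrative only).
sample = (
    "Type,Name,Request Count,Failure Count\n"
    "POST,/,1421,73\n"
    ",Aggregated,1421,73\n"
)
requests, failures = aggregated_failures(sample)
# prints: 73/1421 requests failed (5.14%)
print(f"{failures}/{requests} requests failed ({failures / requests:.2%})")
```

A test following the description would then assert that the failure count is 0, which the run above would not pass.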

CheyuWu changed the title from "feat: ray serve ha test - fault tolerance" to "[Feature] RayService HA test - GCS fault tolerance + kill GCS process" on Dec 2, 2024

MortalHappiness (Member) commented

Possibly related to #2593
