-
Notifications
You must be signed in to change notification settings - Fork 160
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat/evict req on client disconnect streaming case #223
base: main
Are you sure you want to change the base?
Feat/evict req on client disconnect streaming case #223
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #223 +/- ##
===================================
- Coverage 95% 95% -0%
===================================
Files 18 18
Lines 1173 1185 +12
===================================
+ Hits 1112 1122 +10
- Misses 61 63 +2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work @bhimrazy! looking good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will run some perf test for streaming before merging this PR!
for more information, see https://pre-commit.ci
Hi @aniketmaurya, I’ve made some modifications to check at specific intervals rather than on each output, hoping this approach might minimize any impact. However, if this PR is affecting performance, I’m more than willing to close it. We can then explore alternative solutions that might be more effective. |
Hey @bhimrazy thanks for the patience, we're going to benchmark soon! |
for more information, see https://pre-commit.ci
…improve code readability
check interval 10 is passing. Increasing to 50.
check_interval = 50 | ||
for index, y_enc in enumerate(y_enc_gen): | ||
if index % check_interval == 0 and request_evicted_status.get(uid): | ||
request_evicted_status.pop(uid) | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checking the request_evicted_status
for each token appears to have a significant impact, reducing performance from 3600 to around 3100. However, it may not be necessary to perform this check on every token.
While adding a check interval helps reduce the overhead and brings the performance closer to that of the main branch, but it still doesn't feel like an ideal solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for your patience with the PR and checking the speed issue @bhimrazy 🙌 .
yeah, and in case when the time-to-first-token is large but rest of the token stream speed is fast, it doesn't help much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is just single worker. with multiple workers it might impact even more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the overall design is correct, we are just way too aggressive checking the distributed dict and we get into contention problems.
One alternative that could reduce contention is getting a snapshot of the disconnected dictionary in every worker loop: so not use a managed dict but a shared value that the server publishes and that gets read as a whole by each worker periodically (every N seconds - we don't need a thread, we just check the time at every loop). This way every worker has a semi-up to date local dictionary that it can check as often as we want.
Having semi-up to date info on who disconnected every N seconds is totally fine, we don't need to react immediately.
This design also helps with ignoring items in the queue that come from clients that have been disconnected. For those we necessarily have to check at every request. If the local dictionary is not up to date we'll run some requests for nothing, but that's ok. One caveat is making sure the responses don't accumulate in the response dictionary on the webserver process, in this case (let's remember about this).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, @lantiga, for the valuable insights. This approach seems promising. I'll take some time to study the concept and work on the implementation shortly.
Before submitting
How does this PR impact the user?
As a user, I want the server to stop processing requests that disconnect from the client before finishing. This PR focuses on tracking disconnected requests (specifically for non-batched streaming mode) and stops those running tasks, saving resources and freeing up space for handling other requests.
What does this PR do?
Partially fixes #165.
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃