feat: include tokens usage for streamed output #4282
Conversation
…to get the proper usage data in reply streaming mode at the last [DONE] frame
@@ -39,11 +39,15 @@ func ChatEndpoint(cl *config.BackendConfigLoader, ml *model.ModelLoader, startup
	responses <- initialMessage

	ComputeChoices(req, s, config, startupOptions, loader, func(s string, c *[]schema.Choice) {}, func(s string, usage backend.TokenUsage) bool {
		choices := []schema.Choice{}
		if s != "" {
I'm not entirely sure about this, as I didn't check what OpenAI does in this case (and it's hard to replicate).
If the token is empty, does it send back an empty choice, or just skip it? It probably doesn't matter either way; I just want to make sure we keep an eye on not breaking any user flow.
I don't really know, but that token callback function is only used in streaming reply mode.
In that mode we chop the token into runes, and only if there are some do we call that function, so before this commit the tokenCallback was never called with an empty first argument in that specific case, which is exactly what I check for here.
The problem is that the real usage is sent by llama in the final reply, where the message bytes field is empty,
and in that specific case I explicitly call tokenCallback just to capture the usage part.
Next, about choices: before this change they were always strictly set to a slice with exactly one element,
so I used that as an opportunity to avoid rewriting much of the code and to flag that everything except the usage part should be ignored, so the real final reply can still be constructed after receiving from the backend, depending on the tools handling etc.
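To make that convention concrete, here is a minimal, self-contained sketch of the flow described above. It assumes the callback shape shown in the diff; `Choice` and `TokenUsage` are simplified stand-ins for LocalAI's `schema.Choice` and `backend.TokenUsage`, and the callback body is illustrative rather than the actual PR code:

```go
package main

import "fmt"

// Simplified stand-ins for LocalAI's schema.Choice and backend.TokenUsage;
// the real types live in the project's own packages.
type Choice struct {
	Delta string
}

type TokenUsage struct {
	Prompt     int
	Completion int
}

// tokenCallback mirrors the convention discussed above (a sketch, not the PR code):
// during streaming it is called once per decoded token chunk, plus one extra call with
// an empty string when the backend sends its final reply, which carries only usage counters.
func tokenCallback(s string, usage TokenUsage, emit func([]Choice, TokenUsage)) bool {
	choices := []Choice{}
	if s != "" {
		// Normal streamed token: forward it as a single choice, exactly as before.
		choices = append(choices, Choice{Delta: s})
	}
	// An empty choices slice flags that everything except the usage part
	// should be ignored by the caller.
	emit(choices, usage)
	return true
}

func main() {
	emit := func(c []Choice, u TokenUsage) {
		fmt.Printf("choices=%v usage=%+v\n", c, u)
	}
	tokenCallback("Hello", TokenUsage{}, emit)                      // regular streamed chunk
	tokenCallback("", TokenUsage{Prompt: 12, Completion: 34}, emit) // final, usage-only call
}
```

The empty-choices convention keeps the existing ComputeChoices signature intact while still letting the final, usage-only call reach the streaming handler.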
Description
This PR fixes: no existing issue was found for this.
I found out that the usage part of the response body is always zeroed when LocalAI is used through the WebUI, i.e. when the output is streamed.
The issue is that the gRPC server is supposed to return a stream of Reply structures, but for some reason they were only used to retrieve the message bytes of each chunk and nothing more.
However, the logic for holding (so far always empty) usage data for every chunk is present, even though the real values are only sent in the last closing chunk; that's why I had to resort to the hack with empty choices to retrieve just the usage data through the tokenCallback, as sketched below.
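For illustration, a rough sketch of consuming such a reply stream while capturing the usage from the final chunk; the `Reply` struct here is a stand-in, and its field names are an assumption for illustration rather than the actual proto definition:

```go
package main

import "fmt"

// A rough stand-in for the backend's gRPC Reply message; the real definition lives in
// LocalAI's proto files, and these field names are assumed for the sake of the example.
type Reply struct {
	Message      []byte
	Tokens       int32 // completion tokens, only populated on the final reply
	PromptTokens int32 // prompt tokens, only populated on the final reply
}

// drainStream shows the idea: previously only Message bytes were consumed per chunk; to
// surface usage, the counters of the last reply (whose Message is empty) must be kept.
func drainStream(replies <-chan Reply, onToken func(string)) (prompt, completion int32) {
	for r := range replies {
		if len(r.Message) > 0 {
			onToken(string(r.Message))
		}
		// The backend reports the real usage on its final reply,
		// so simply remember whatever the latest chunk said.
		prompt, completion = r.PromptTokens, r.Tokens
	}
	return
}

func main() {
	replies := make(chan Reply, 3)
	replies <- Reply{Message: []byte("Hel")}
	replies <- Reply{Message: []byte("lo")}
	replies <- Reply{Message: nil, PromptTokens: 12, Tokens: 34} // final, usage-only reply
	close(replies)

	p, c := drainStream(replies, func(t string) { fmt.Print(t) })
	fmt.Printf("\nprompt tokens=%d completion tokens=%d\n", p, c)
}
```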
I don't really know how to test the tools part (or what it actually is, lol), but since it uses the usage data returned from the ComputeChoices function, I have verified that it indeed returns a non-empty usage result.
At the same time, the plain Predict (non-streaming) path uses protocol.Reply all the way down through the layers of abstraction and correctly shows the token usage data.