feat(vLLM): support async generator #746
base: dev
Conversation
- Please note that vLLM cannot and will never replace the original infer code.
- You should open a new PR for the changes to examples/api rather than merging them into this one.
Please do not change the default behavior unless you must. If you have a reason for changing it, please explain.
Note: you shouldn't refer to the content written in the vLLM code and port it into main, because the code in vLLM is contributed by the community and hasn't been fully checked.
This version is incredibly powerful. I tested it on a 4090: the first chunk from the streaming API arrives in 65 ms, and the voice timbre stays fixed. It's so fast I suspect my computer is broken.
Thanks for sharing. Can this version keep the same voice timbre as the original? That was mentioned in issue #640.
It is fully compatible in principle.
On a 4090, text that takes 3.3 s with compile=True takes about 1.6 s with the main-branch vLLM acceleration, but the timbre can't be adjusted; with this PR's version the timbre can be adjusted, though it takes about 2.6 s.
Let me say it again: your modification to examples/api concerns a different feature, so you SHOULD move those changes to ANOTHER PR. Thanks for your understanding.
# Filter both rows and columns using slicing
yield new_wavs[:][:, keep_cols]
# Hack: check whether there are any silent segments; if so, take the last segment. Otherwise, wait for another loop.
import librosa
We don't want to introduce librosa. Modifying the original code makes more sense, e.g. replace
keep_cols = np.sum(new_wavs != 0, axis=0) > 0
with
# pseudo code without testing, just a hint
keep_cols = np.sum(abs(new_wavs) > 1e-6, axis=0) > 0
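For reference, a minimal self-contained sketch of that hint (untested against this PR; the function name, the (channels, samples) layout, and the 1e-6 threshold are assumptions):

```python
import numpy as np

def drop_silent_columns(new_wavs: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Keep only the columns (samples) that carry signal in at least one row (channel).

    An amplitude threshold is used instead of an exact-zero comparison, so tiny
    float rounding noise is still treated as silence and no librosa call is needed.
    """
    keep_cols = np.sum(np.abs(new_wavs) > eps, axis=0) > 0
    return new_wavs[:, keep_cols]

# Example: a 2-channel buffer whose last three samples are (near-)silent.
wavs = np.array([[0.1, 0.2, 0.0, 1e-9, 0.0],
                 [0.0, 0.3, 0.0, 0.0,  0.0]])
print(drop_silent_columns(wavs).shape)  # -> (2, 2)
```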
emb = self.embed(input_ids, text_mask)
If you move a line up, don't rename the variable when keeping the original name is fine. Renaming makes it harder to see your changes.
I noticed that you had some trouble moving your modifications of examples/api and tools/audio into another PR 😂. Here are brief instructions.
- Add a new commit in this PR, removing examples/api/main.py and tools/audio/np.py.
- Add a new branch based on the dev branch of THIS MAIN REPO.
- Apply your modifications of examples/api/main.py and tools/audio/np.py on this new branch.
- Create a new PR from this branch to dev.
I am curious why we chose to move the vLLM code into this project instead of supporting the LLM parts in vLLM. Also pinging @fumiama for this question.
This part of the code was contributed by the community, so you can ask @ylzz1997 about that. I'm sorry, but I'm not familiar with vLLM 😂.
Because vLLM doesn't support some features: in order for the sampler to support vLLM features such as continuous batching and streaming generation, it is necessary to package the post model (lm_head and others) into the vLLM scheduler. This requires rewriting the vLLM part of the code. That's all.
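As a rough, stand-alone illustration of the streaming half of that goal (this is not the PR's actual vLLM integration; every name and the fake per-step latency below are made up), an async generator lets the API start emitting audio as soon as the first decoded chunk exists instead of waiting for the full waveform:

```python
import asyncio
from typing import AsyncGenerator

async def fake_decode_steps(n_chunks: int = 4) -> AsyncGenerator[bytes, None]:
    """Stand-in for incremental model decoding: each step produces one audio chunk."""
    for i in range(n_chunks):
        await asyncio.sleep(0.05)  # pretend per-step decoding latency
        yield f"chunk-{i}".encode()

async def stream_tts(text: str) -> AsyncGenerator[bytes, None]:
    """Yield audio chunks as they are produced, so the caller can stream them out."""
    async for chunk in fake_decode_steps():
        yield chunk

async def main() -> None:
    async for chunk in stream_tts("hello"):
        print("got", chunk)  # the first chunk arrives after one step, not after the whole text

asyncio.run(main())
```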
Thanks for your explanation.
Yes, we cannot use vLLM directly, as it requires some code changes. I have implemented a solution here. I am new to Torch and Python, so feel free to leave any comments. Regarding the issues you listed above:
One of the main challenges is that vLLM assumes each model output is a single token, i.e. just an int value. However, a TTS system, whether ChatTTS or FishTTS, generates multi-head tokens in one decoding step. This means the model output is a token list, which breaks that fundamental design, and I had to use many if/else statements to keep the whole pipeline working. Overall, compared with moving the vLLM code here, implementing the model inside vLLM saves effort on features other than core model inference, such as sampling, scheduling, continuous batching, etc.
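To make that single-token assumption concrete, here is a toy illustration (not vLLM's real types or API; the head count, the ids, and the is_eos helper are invented): a plain LLM step yields one int, while a multi-head TTS step yields one id per head, so every place that compares or stores "the" token needs an extra branch:

```python
from typing import List, Union

# A standard LLM decoding step: the sampler returns one vocabulary id.
llm_step: int = 1234

# A multi-head TTS decoding step (e.g. four codebook heads): one id per head.
tts_step: List[int] = [1234, 87, 456, 9]

def is_eos(step: Union[int, List[int]], eos_id: int = 0) -> bool:
    """Code written for int tokens can compare `step == eos_id` directly;
    with multi-head output every head has to be checked instead."""
    if isinstance(step, int):
        return step == eos_id
    return all(tok == eos_id for tok in step)

print(is_eos(llm_step), is_eos(tts_step))  # False False
```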
Thanks for your great effort. We welcome you to contribute your code to this repo if you like.
I have provided a draft version of the vLLM integration, with the following key changes:
These changes may have a significant impact. Feel free to leave comments to guide me in further improvements.