Add Keep-Alive Functionality for GPU Resource Optimization in LitServe #304
Comments
hi @skyking363, thank you for your interest in LitServe and for suggesting a new feature. LitServe is designed for high-throughput serving at scale, while Ollama is intended to run LLMs on personal devices.
Tagging @lantiga @williamFalcon to hear their thoughts. |
I am looking for this feature too. I tried deleting the model and emptying the GPU memory, but it doesn't work. Maybe this could be an option. That would be great. Thank you for reading and replying to this |
hi @aceliuchanghong, thank you for adding to the discussion. A few questions:
|
@skyking363 @aceliuchanghong thanks for your requests! can you explain the motivation a bit more clearly with a concrete example?
etc…. basically i have about a million questions here haha. So, it would be better to understand concretely based on a real-world example that shows what problem you want to solve and how this would solve it (a lifecycle diagram might help too). |
It happens when I use a visual model to OCR some complicated images, but I only use it occasionally, so I want the GPU to be free most of the time when I'm not using it. Thank you for replying~ To add: I only have one machine with 4 L20 GPUs, so there are many services running on it all the time... xd |
@aniketmaurya Would it make sense to introduce some kind of model unloading if the server has a certain amount of time without any request? And then lazy-loading it back to memory - similar to some idle-state let's say? I did not think of an implementation scenario yet, but like @aceliuchanghong mentions, some people run many services on one machine.. |
I think the main question here is: do you do this in a production environment? |
Yeah, we use LitServe in a production environment. That's why I don't use FastAPI or something else, because it supports LLMs (etc.) very well |
Thank you for your reply. I currently choose to use LitServe instead of Ollama for two main reasons: LitServe offers more flexibility compared to Ollama, such as the ability to return both sparse and dense embedding vectors during the embedding process, something that Ollama cannot do. Thank you again for your suggestions and support! |
Thank you for your reply, @williamFalcon! To provide a more concrete example of my use case:

I am running multiple services on a machine with 8 A100 GPUs. These services involve running multiple LLMs simultaneously (e.g., Llama 3.1 405B, Llama 3.2 90B, etc.), which are either used for user chat interactions or periodic tasks (such as ingesting data into a database). Additionally, I have some long-running API services that utilize multiple models, including visual models for tasks like optical character recognition (OCR). However, these models are not always in use; there are often long idle periods between requests.

My goal is to release GPU memory during these idle periods so that other services can utilize the resources without shutting down the API service itself. Ideally, the models would automatically load when a request comes in and unload after a prolonged period of inactivity. This way, we can more efficiently utilize GPU resources without manual management or service restarts.

This mechanism would allow us to manage limited GPU resources more flexibly and efficiently, especially when running services involving RAG or multi-model combinations. Of course, I understand that in more complex production environments, automatic unloading may not always be appropriate, but in scenarios where models are only used at specific times, this feature could be extremely beneficial.

Thank you again for your detailed response and suggestions! I will consider using a lifecycle diagram to further clarify how this functionality could be implemented. |
This feature is already available in Lightning Studio (scale to zero). Can you add a simpler version of it to LitServe? |
thank you for the detailed response @skyking363!! We will take up this feature request and keep you updated. |
🚀 Feature
I would like to propose adding a feature to LitServe that enables models to be deployed with a keep-alive functionality, similar to what Ollama provides. This feature would allow the model to be unloaded from GPU memory when not in use and automatically loaded back when required.
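To make the requested behavior concrete, here is a minimal sketch of lazy loading plus idle unloading in plain Python. It is not LitServe's API: the class `IdleUnloadingModel`, its `loader` callable, and the `keep_alive` parameter are all hypothetical names used only for illustration, and the "model" can be any Python object.

```python
import threading
import time
from typing import Any, Callable, Optional


class IdleUnloadingModel:
    """Hypothetical wrapper: load on first use, drop after `keep_alive` idle seconds."""

    def __init__(self, loader: Callable[[], Any], keep_alive: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader        # assumed to build the model and place it on the GPU
        self._keep_alive = keep_alive
        self._clock = clock          # injectable so the idle logic can be tested
        self._lock = threading.Lock()
        self._model: Optional[Any] = None
        self._last_used = clock()

    @property
    def loaded(self) -> bool:
        return self._model is not None

    def predict(self, fn: Callable[[Any], Any]) -> Any:
        """Run `fn(model)`, loading the model first if it is currently unloaded."""
        with self._lock:
            if self._model is None:
                self._model = self._loader()
            try:
                return fn(self._model)
            finally:
                self._last_used = self._clock()

    def unload_if_idle(self) -> bool:
        """Drop the model if it has been idle for at least `keep_alive` seconds."""
        with self._lock:
            idle = self._clock() - self._last_used
            if self._model is not None and idle >= self._keep_alive:
                # Dropping the last reference lets the memory be reclaimed; with
                # PyTorch one would typically also call torch.cuda.empty_cache().
                self._model = None
                return True
            return False
```

A background thread or periodic task would call `unload_if_idle()` every few seconds. Holding the lock for the whole of `predict` serializes requests, which keeps the sketch simple but would need rethinking for a high-throughput server.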
Motivation
This feature would be helpful for users working with limited GPU resources. Currently, the GPU can become a bottleneck when multiple models are deployed. By releasing the GPU resources when a model is idle and reloading them on demand, we could improve efficiency and free up resources for other tasks.
Pitch
The main objective is to add a mechanism, perhaps through environment variables, that allows the system to automatically unload models when idle and reload them when needed, similar to Ollama's keep-alive functionality.
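As a rough illustration of the environment-variable idea, the sketch below parses a keep-alive setting into a number of idle seconds. The variable name `LITSERVE_KEEP_ALIVE` and the five-minute default are assumptions made up for this example, not existing LitServe settings. The accepted forms loosely follow Ollama's documented convention: a duration such as `5m` or `24h`, a bare number of seconds, `0` to unload immediately, and a negative value to never unload.

```python
import os
import re
from typing import Mapping, Optional

_UNITS = {"s": 1.0, "m": 60.0, "h": 3600.0}
_DURATION = re.compile(r"^(-?\d+(?:\.\d+)?)([smh]?)$")


def parse_keep_alive(value: str) -> Optional[float]:
    """Return idle seconds before unloading, or None to never unload."""
    match = _DURATION.match(value.strip().lower())
    if match is None:
        raise ValueError(f"invalid keep-alive value: {value!r}")
    number, unit = match.groups()
    # A bare number (empty unit) is interpreted as seconds.
    seconds = float(number) * _UNITS.get(unit, 1.0)
    return None if seconds < 0 else seconds


def keep_alive_from_env(env: Mapping[str, str] = os.environ,
                        name: str = "LITSERVE_KEEP_ALIVE",  # hypothetical variable name
                        default: str = "5m") -> Optional[float]:
    """Read the (hypothetical) keep-alive variable, falling back to `default`."""
    return parse_keep_alive(env.get(name, default))
```

For example, starting a server with `LITSERVE_KEEP_ALIVE=30s` would yield `30.0`, while `-1` would yield `None`, meaning the model stays loaded.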
Alternatives
An alternative solution could involve manually managing GPU resources at the deployment level, but this can be cumbersome and error-prone. Automation via LitServe would streamline this process.
Additional context
This idea is inspired by a similar feature discussed in the Ollama repository: Ollama keep-alive environment variables. It could significantly optimize resource usage in environments where GPUs are scarce.