[Issue]: Global Search takes a long time to respond for large datasets #928
Comments
Can you share some more info about your dataset and the time it took to ingest? I am struggling a bit with large datasets too.
I am working with a dataset of 40k emails, which I added to a CSV file with one email per row. Ingestion took around 65 hours: the vast majority of the time was spent on LLM calls, and 1-2 hours were needed to create the graph and the other parquet files.
I don't have any great suggestions for you at the moment. Global search is time-consuming because it summarizes every community to find the best answers to your question. You are already using level 0, which will be the fewest communities summarized. We are investigating ways to rank the communities so you can set a threshold, but I don't have a concrete timeline for a real implementation. If you think your domain/content can be filtered, you could filter down the create_final_communities.parquet rows based on criteria in there. You could do this as a post-indexing step if it can be static, or at runtime by implementing your own search context builder (see this comment: #917 (comment))
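For example, a minimal sketch of the static post-indexing approach might look like this. The file path and the `title` column used for the keyword match are assumptions, not GraphRAG guarantees; inspect your own output schema and adapt the filter criteria accordingly:

```python
# Hypothetical post-indexing filter: keep only communities whose title
# matches a domain keyword, so global search has fewer reports to map over.
# Path and column name ("title") are assumptions -- check your parquet schema.
import pandas as pd

INPUT = "output/create_final_communities.parquet"           # assumed layout
OUTPUT = "output/create_final_communities_filtered.parquet"

df = pd.read_parquet(INPUT)
keep = df["title"].str.contains("politic", case=False, na=False)
df[keep].to_parquet(OUTPUT, index=False)
print(f"Kept {keep.sum()} of {len(df)} communities")
```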
Can I ask about methods that can help shorten the time cost of the search task?
I have the same problem. Is it possible to apply some filters on the communities before the map stage? (Just my personal suggestion; hope the authors of this great project can provide better solutions.)
Yes, those could be helpful. There is a balance between the high-level thematic capabilities of global search and specific questions. For example, if you ask a very broad question such as "what are the top themes in the dataset", it is very difficult to filter because semantic search would not get good hits. However, if you add small qualifiers such as "what are the top political themes in the dataset", you have now added a keyword that would allow semantic search to return ranked community summaries that discuss politics. We are investigating exactly this right now, and hope to have some solutions soon. Again, no firm timeline, but we're looking very closely at this.
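To illustrate the idea, here is a rough sketch of ranking community summaries against a keyword-qualified query before the map stage. The embedding model, file path, column names, and the cutoff of 50 are all assumptions for demonstration, not part of GraphRAG itself:

```python
# Illustrative only: rank community report summaries by embedding similarity
# to a keyword-qualified query, then keep the top matches for the map stage.
import pandas as pd
from sentence_transformers import SentenceTransformer, util

reports = pd.read_parquet("output/create_final_community_reports.parquet")
model = SentenceTransformer("all-MiniLM-L6-v2")  # model choice is arbitrary

query = "what are the top political themes in the dataset"
query_emb = model.encode(query, convert_to_tensor=True)
summary_embs = model.encode(reports["summary"].tolist(), convert_to_tensor=True)

# Cosine-similarity scores, highest first; the cutoff of 50 is arbitrary.
scores = util.cos_sim(query_emb, summary_embs)[0]
top_idx = scores.argsort(descending=True)[:50]
filtered = reports.iloc[top_idx.tolist()]
```

A broad query with no qualifier would score all summaries roughly equally, which is exactly why this only helps once a keyword narrows the semantic target.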
I changed the code to implement a top-K strategy, which sorts the community reports by occurrence_weight and selects the top K reports from the pool. This helps speed up the search process.
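A minimal sketch of that top-K idea, assuming each report carries an `occurrence_weight` attribute as in the commenter's modified code (the value of K and the dict representation are assumptions):

```python
# Keep only the K reports with the highest occurrence_weight before the
# map stage, trading some coverage for a much smaller set of LLM calls.
from typing import List

K = 100  # tune per dataset; this default is an assumption

def top_k_reports(reports: List[dict], k: int = K) -> List[dict]:
    """Sort reports by occurrence_weight (descending) and keep the top k."""
    return sorted(
        reports,
        key=lambda r: r.get("occurrence_weight", 0.0),
        reverse=True,
    )[:k]
```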
Is there an existing issue for this?
Describe the issue
Dear Team, I am working with a CSV dataset that has 40k rows, with an average of 250 tokens per row. After indexing the dataset, I used the Global Search example and it took almost 5 minutes to get an answer back. I have also tried using the CLI for global querying, and it answered in a bit less than 2 minutes, which is better but still slow. In both cases I used community level 0, as any higher level would just take ages.
I know that the larger the dataset, the longer it takes to iterate through the communities. However, the current response time is too long. Is there any way to speed up the Global Search query?
Steps to reproduce
No response
GraphRAG Config Used
No response
Logs and screenshots
No response
Additional Information
No response