[Issue]: Global Search takes a long time to respond for large datasets #928

Open

andreiionut1411 opened this issue Aug 14, 2024 · 7 comments

Labels: awaiting_response (maintainers or community have suggested solutions or requested info; awaiting filer response)

@andreiionut1411

Is there an existing issue for this?

  • I have searched the existing issues
  • I have checked #657 to validate if my issue is covered by community support

Describe the issue

Dear Team, I am working with a CSV dataset that has 40k rows, with an average of 250 tokens per row. After indexing the dataset, I used the Global Search example and it took almost 5 minutes to get an answer back. I have also tried using the CLI for global querying, and it answered in a bit under 2 minutes, which is better but still slow. In both cases I used community level 0, as any higher level would just take ages.
I know that the larger the dataset, the longer it takes to iterate through the communities. However, the current response time is too long. Is there any way to speed up the Global Search query?

Steps to reproduce

No response

GraphRAG Config Used

No response

Logs and screenshots

No response

Additional Information

No response

@andreiionut1411 added the triage label Aug 14, 2024
@COPILOT-WDP

Can you share some more info about your dataset and the time it took to ingest? I am struggling a bit with large datasets too.

@andreiionut1411
Author

andreiionut1411 commented Aug 19, 2024

> Can you share some more info about your dataset and the time it took to ingest? I am struggling a bit with large datasets too.

I am working with a dataset of 40k emails, which I added to a CSV file with one email per row. Ingestion took around 65 hours, with the vast majority of the time spent on LLM calls; creating the graph and the other required parquet files took only 1-2 hours.

@natoverse
Collaborator

natoverse commented Aug 19, 2024

I don't have any great suggestions for you at the moment. Global search is time-consuming because it summarizes every community to find the best answers to your question. You are already using level 0, which summarizes the fewest communities.

We are investigating ways to rank the communities so you can set a threshold, but I don't have a concrete timeline for a real implementation.

If you think your domain/content can be filtered, you could filter down the create_final_communities.parquet rows based on criteria in there. You could do this as a post-indexing step if it can be static, or at runtime by implementing your own search context builder (see this comment: #917 (comment))
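
As a rough illustration of the post-indexing route, a pandas sketch could look like the following. The `level` and `title` columns and the keyword predicate are placeholders for illustration; check the actual schema of your create_final_communities.parquet before relying on them.

```python
import pandas as pd

# Load the communities table produced by indexing.
communities = pd.read_parquet("output/create_final_communities.parquet")

# Hypothetical filter: keep level-0 communities matching a domain keyword.
# Substitute whatever static criteria fit your content.
keyword = "finance"
filtered = communities[
    (communities["level"] == 0)
    & communities["title"].str.contains(keyword, case=False, na=False)
]

# Write to a new file so the original index output stays intact.
filtered.to_parquet("output/create_final_communities.filtered.parquet")
```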

@natoverse added the awaiting_response label and removed the triage label Aug 19, 2024
@yuan-head

> I don't have any great suggestions for you at the moment. Global search is time-consuming because it summarizes every community to find the best answers to your question. You are already using level 0, which summarizes the fewest communities.
>
> We are investigating ways to rank the communities so you can set a threshold, but I don't have a concrete timeline for a real implementation.
>
> If you think your domain/content can be filtered, you could filter down the create_final_communities.parquet rows based on criteria in there. You could do this as a post-indexing step if it can be static, or at runtime by implementing your own search context builder (see this comment: #917 (comment))

Can I ask what methods could help shorten the time taken by the search task?

@mzh1996

mzh1996 commented Aug 23, 2024

I have the same problem.

Is it possible to use some filters on the communities before the map stage?
For example, keep only the communities that share entities with the query.
Or, use only the top-K communities most similar to the query in the map stage, where similarity is measured by the cosine similarity between the report embedding and the query embedding.

(Just my personal suggestions. Hope the authors of this great project can provide better solutions)
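
A minimal sketch of the top-K idea above, assuming the report embeddings have already been computed. The `top_k_reports` helper is hypothetical, not an existing GraphRAG API:

```python
import numpy as np

def top_k_reports(report_embeddings: np.ndarray, query_embedding: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k community reports most similar to the query."""
    # Normalize rows so dot products equal cosine similarities.
    reports = report_embeddings / np.linalg.norm(report_embeddings, axis=1, keepdims=True)
    query = query_embedding / np.linalg.norm(query_embedding)
    scores = reports @ query              # cosine similarity per report
    return np.argsort(scores)[-k:][::-1]  # indices of the k best matches
```

The returned indices could then be used to subset the community reports before they are handed to the map stage.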

@natoverse
Collaborator

> I have the same problem.
>
> Is it possible to use some filters on the communities before the map stage? For example, keep only the communities that share entities with the query. Or, use only the top-K communities most similar to the query in the map stage, where similarity is measured by the cosine similarity between the report embedding and the query embedding.
>
> (Just my personal suggestions. Hope the authors of this great project can provide better solutions)

Yes, those could be helpful. There is a balance between the high-level thematic capabilities of global search and specific questions. For example, if you ask a very broad question such as "what are the top themes in the dataset", it is very difficult to filter, because semantic search is unlikely to return good hits. However, if you add a small qualifier such as "what are the top political themes in the dataset", you have now added a keyword that allows semantic search to return ranked community summaries that discuss politics. We are investigating exactly this right now and hope to have some solutions soon. Again, no firm timeline, but we're looking very closely at this.

@disperaller

> I have the same problem.
>
> Is it possible to use some filters on the communities before the map stage? For example, keep only the communities that share entities with the query. Or, use only the top-K communities most similar to the query in the map stage, where similarity is measured by the cosine similarity between the report embedding and the query embedding.
>
> (Just my personal suggestions. Hope the authors of this great project can provide better solutions)

I changed the code to implement the top-K strategy, which sorts the community reports by occurrence_weight and selects the top K reports from the pool. This helps speed up the search process.
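
For anyone curious, a rough sketch of that strategy, assuming the reports live in a pandas DataFrame with an occurrence_weight column (verify the column name against your own reports table):

```python
import pandas as pd

def select_top_k_reports(reports: pd.DataFrame, k: int = 50) -> pd.DataFrame:
    """Keep only the k highest-weighted community reports for the map stage."""
    return reports.sort_values("occurrence_weight", ascending=False).head(k)
```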
