[Issue]: Global Search takes a long time to respond for large datasets #928

Open

andreiionut1411 opened this issue Aug 14, 2024 · 7 comments

Labels: awaiting_response (maintainers or community have suggested solutions or requested info; awaiting filer response)

@andreiionut1411

Is there an existing issue for this?

  • I have searched the existing issues
  • I have checked #657 to validate if my issue is covered by community support

Describe the issue

Dear Team, I am working with a CSV dataset that has 40k rows, with an average of 250 tokens per row. After indexing the dataset, I used the Global Search example and it took almost 5 minutes to get an answer back. I have also tried using the CLI for global querying, and it answered in a bit under 2 minutes, which is better but still slow. In both cases I used community level 0, as any higher level would just take ages.
I know that the larger the dataset, the longer it takes to iterate through the communities. However, the current response time is too long. Is there any way to speed up the Global Search query?

Steps to reproduce

No response

GraphRAG Config Used

No response

Logs and screenshots

No response

Additional Information

No response

@andreiionut1411 added the triage label Aug 14, 2024
@COPILOT-WDP

Can you share some more info about your dataset and the time it took to ingest? I am struggling a bit with large datasets too.

@andreiionut1411
Author

andreiionut1411 commented Aug 19, 2024

> Can you share some more info about your dataset and the time it took to ingest? I am struggling a bit with large datasets too.

I am working with a dataset of 40k emails, which I added to a CSV file with one email per row. Ingestion took around 65 hours, with the vast majority of the time spent on LLM calls; creating the graph and the other required parquet files took only 1-2 hours.

@natoverse
Collaborator

natoverse commented Aug 19, 2024

I don't have any great suggestions for you at the moment. Global search is time-consuming because it summarizes every community to find the best answers to your question. You are already using level 0, which summarizes the fewest communities.

We are investigating ways to rank the communities so you can set a threshold, but I don't have a concrete timeline for a real implementation.

If you think your domain/content can be filtered, you could filter down the create_final_communities.parquet rows based on criteria in there. You could do this as a post-indexing step if it can be static, or at runtime by implementing your own search context builder (see this comment: #917 (comment))
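
As a rough illustration of the post-indexing route, a pandas sketch could look like the following. The `level` and `title` columns and the keyword predicate are placeholders for illustration; check the actual schema of your create_final_communities.parquet before relying on them.

```python
import pandas as pd

# Load the communities table produced by indexing.
communities = pd.read_parquet("output/create_final_communities.parquet")

# Hypothetical filter: keep level-0 communities matching a domain keyword.
# Substitute whatever static criteria fit your content.
keyword = "finance"
filtered = communities[
    (communities["level"] == 0)
    & communities["title"].str.contains(keyword, case=False, na=False)
]

# Write to a new file so the original index output stays intact.
filtered.to_parquet("output/create_final_communities.filtered.parquet")
```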

@natoverse added the awaiting_response label and removed the triage label Aug 19, 2024
@yuan-head

> I don't have any great suggestions for you at the moment. Global search is time-consuming because it summarizes every community to find the best answers to your question. You are already using level 0, which summarizes the fewest communities.
>
> We are investigating ways to rank the communities so you can set a threshold, but I don't have a concrete timeline for a real implementation.
>
> If you think your domain/content can be filtered, you could filter down the create_final_communities.parquet rows based on criteria in there. You could do this as a post-indexing step if it can be static, or at runtime by implementing your own search context builder (see this comment: #917 (comment))

Can I ask what methods could help shorten the time taken by the search task?

@mzh1996

mzh1996 commented Aug 23, 2024

I have the same problem.

Is it possible to use some filters on the communities before the map stage?
For example, keep only the communities that share entities with the query.
Or, use only the top-K communities most similar to the query in the map stage, where similarity is measured by the cosine similarity between the report embedding and the query embedding.

(Just my personal suggestions. Hope the authors of this great project can provide better solutions)
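
A minimal sketch of the top-K idea above, assuming the report embeddings have already been computed. The `top_k_reports` helper is hypothetical, not an existing GraphRAG API:

```python
import numpy as np

def top_k_reports(report_embeddings: np.ndarray, query_embedding: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k community reports most similar to the query."""
    # Normalize rows so dot products equal cosine similarities.
    reports = report_embeddings / np.linalg.norm(report_embeddings, axis=1, keepdims=True)
    query = query_embedding / np.linalg.norm(query_embedding)
    scores = reports @ query              # cosine similarity per report
    return np.argsort(scores)[-k:][::-1]  # indices of the k best matches
```

The returned indices could then be used to subset the community reports before they are handed to the map stage.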

@natoverse
Collaborator

> I have the same problem.
>
> Is it possible to use some filters on the communities before the map stage? For example, keep only the communities that share entities with the query. Or, use only the top-K communities most similar to the query in the map stage, where similarity is measured by the cosine similarity between the report embedding and the query embedding.
>
> (Just my personal suggestions. Hope the authors of this great project can provide better solutions)

Yes, those could be helpful. There is a balance between the high-level thematic capabilities of global search and specific questions. For example, if you ask a very broad question such as "what are the top themes in the dataset", it is very difficult to filter, because semantic search is unlikely to return good hits. However, if you add a small qualifier such as "what are the top political themes in the dataset", you have now added a keyword that allows semantic search to return ranked community summaries that discuss politics. We are investigating exactly this right now and hope to have some solutions soon. Again, no firm timeline, but we're looking very closely at this.

@disperaller

> I have the same problem.
>
> Is it possible to use some filters on the communities before the map stage? For example, keep only the communities that share entities with the query. Or, use only the top-K communities most similar to the query in the map stage, where similarity is measured by the cosine similarity between the report embedding and the query embedding.
>
> (Just my personal suggestions. Hope the authors of this great project can provide better solutions)

I changed the code to implement the top-K strategy, which sorts the community reports by occurrence_weight and selects the top K reports from the pool. This helps speed up the search process.
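
For anyone curious, a rough sketch of that strategy, assuming the reports live in a pandas DataFrame with an occurrence_weight column (verify the column name against your own reports table):

```python
import pandas as pd

def select_top_k_reports(reports: pd.DataFrame, k: int = 50) -> pd.DataFrame:
    """Keep only the k highest-weighted community reports for the map stage."""
    return reports.sort_values("occurrence_weight", ascending=False).head(k)
```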
