The create_final_community_reports.parquet file disappears. #940
Comments
I've been experiencing this bug for about a week now, and this error appears in indexing-engine.log:
Traceback (most recent call last):
I've also re-run the indexing on version 0.3.0 and still get errors.
I found out that my problem might be a slight formatting error in the template.
Thanks. I found that my report prompt and the official report prompt also have some formatting differences. I'm trying it out.
Same problem. How do I resolve it?
I tried modifying the prompt, but it still reports an error and I don't know what the problem is. The bug does not occur on a small dataset, but indexing fails when I switch to a large one.
File "/home/lile/microsoft/graphrag/graphrag/index/emit/parquet_table_emitter.py", line 40, in emit
I'm planning to write the data to a CSV at this line of code to see what the problem is.
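For anyone debugging at that line, here is a minimal sketch of one way to look at the data before it is written. It counts the Python types found in each object-dtype column, since a column that mixes types (for example lists and plain strings) is a common reason for Arrow to reject a DataFrame. The helper name `summarize_object_columns` and the sample data are made up for illustration; they are not part of graphrag.

```python
import pandas as pd


def summarize_object_columns(df: pd.DataFrame) -> dict:
    """Count the Python types found in each object-dtype column."""
    summary = {}
    for col in df.columns:
        if df[col].dtype == object:
            counts = df[col].map(lambda v: type(v).__name__).value_counts()
            summary[col] = {str(k): int(v) for k, v in counts.items()}
    return summary


# Hypothetical sample standing in for the community reports table.
sample = pd.DataFrame({
    "community": ["0", 1, "2"],
    "findings": [[{"summary": "a"}], "not a list", None],
})

type_summary = summarize_object_columns(sample)

# Dumping to CSV text stringifies nested values, which makes them easy to eyeball.
csv_text = sample.to_csv(index=False)
```

A column that reports more than one type in `type_summary` is a good first suspect.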
I added error catching and fallback logic; the code is below. After this change graphrag works fine.

# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License

"""ParquetTableEmitter module."""

import logging
import traceback

import numpy as np
import pandas as pd
from pyarrow.lib import ArrowInvalid, ArrowTypeError

from graphrag.index.storage import PipelineStorage
from graphrag.index.typing import ErrorHandlerFn

from .table_emitter import TableEmitter

log = logging.getLogger(__name__)


class ParquetTableEmitter(TableEmitter):
    """ParquetTableEmitter class."""

    _storage: PipelineStorage
    _on_error: ErrorHandlerFn

    def __init__(
        self,
        storage: PipelineStorage,
        on_error: ErrorHandlerFn,
    ):
        """Create a new Parquet Table Emitter."""
        self._storage = storage
        self._on_error = on_error

    async def preprocess_and_emit(self, filename: str, data: pd.DataFrame) -> None:
        """Preprocess data and emit to storage."""

        def preprocess_findings_column(df):
            def ensure_struct(x):
                if isinstance(x, dict):
                    return x
                # np.any handles both scalars and array-likes; calling
                # .any() on the result of pd.isnull fails for a scalar.
                elif np.any(pd.isnull(x)):
                    return None
                else:
                    return {'value': x}

            df['findings'] = df['findings'].apply(ensure_struct)
            return df

        # Apply preprocessing
        data = preprocess_findings_column(data)
        await self._storage.set(filename, data.to_parquet())

    async def emit(self, name: str, data: pd.DataFrame) -> None:
        """Emit a dataframe to storage."""
        filename = f"{name}.parquet"
        log.info("Emitting parquet table %s", filename)
        try:
            await self._storage.set(filename, data.to_parquet())
        except (ArrowTypeError, ArrowInvalid) as e:
            log.warning(
                "Initial parquet save failed, preprocessing data and retrying due to error: %s",
                e,
            )
            try:
                # Retry once with the findings column normalized.
                await self.preprocess_and_emit(filename, data)
            except Exception as ex:
                log.exception("Error while emitting parquet table after retry")
                self._on_error(
                    ex,
                    traceback.format_exc(),
                    None,
                )
        except Exception as e:
            log.exception("Unexpected error while emitting parquet table")
            self._on_error(
                e,
                traceback.format_exc(),
                None,
            )
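To see what the preprocessing step in the code above does in isolation, here is a small sketch on made-up data. The sample values are assumptions, not real graphrag output, and this version checks for nulls with `np.any(pd.isnull(x))` so that scalar values such as `None` are handled as well as lists.

```python
import numpy as np
import pandas as pd


def ensure_struct(x):
    """Return dicts unchanged, map nulls to None, wrap anything else in a dict."""
    if isinstance(x, dict):
        return x
    if np.any(pd.isnull(x)):
        return None
    return {"value": x}


# Hypothetical mix of shapes that a findings column might contain.
findings = pd.Series(
    [{"summary": "ok"}, None, "free text", [{"summary": "a"}]],
    dtype=object,
)

normalized = findings.apply(ensure_struct)
```

After this step every entry is either a dict or `None`, so the column has a single consistent shape when the retry calls `to_parquet()`.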
Thanks for the reply. I also solved the problem earlier by converting the data to CSV first. I'll try your method.
Thanks everyone for your insights. Converting the pandas DataFrame to CSV and then converting that to Parquet worked for me. However, I am getting a new issue:
Update: solved the issue by casting the community column to string in pandas.
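A rough sketch of the two workarounds described in this comment, on made-up data. The helper names `roundtrip_via_csv` and `community_as_string` are hypothetical, and the thread does not say what the follow-up issue was, so the cast is shown only as the step the commenter reports. Note that a CSV round trip turns nested values into plain strings, so any list or dict structure in the original columns is lost.

```python
import io

import pandas as pd


def roundtrip_via_csv(df: pd.DataFrame) -> pd.DataFrame:
    """Write the frame to CSV text and read it back, flattening nested values."""
    buffer = io.StringIO()
    df.to_csv(buffer, index=False)
    buffer.seek(0)
    return pd.read_csv(buffer)


def community_as_string(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with the community column cast to string."""
    out = df.copy()
    out["community"] = out["community"].astype(str)
    return out


# Hypothetical stand-in for the community reports table.
reports = pd.DataFrame({"community": [0, 1, 2], "title": ["a", "b", "c"]})

flat = roundtrip_via_csv(reports)
fixed = community_as_string(flat)
# fixed.to_parquet(...) could then be called; it needs pyarrow or fastparquet installed.
```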
Thanks @xgl0626 and @therealcyberlord. I ran into the same problem and changed the code as follows, and now everything runs fine.
Thanks for the help. The create_final_community_reports.parquet file has now been created.
File "/home/lile/graphrag/graphrag/query/api.py", line 272, in local_search_streaming
I have tried this out and it fixes the problem. Many thanks. I hope a pull request will be opened to fix this.
Is there an existing issue for this?
Describe the issue
The create_final_community_reports.parquet file is not generated after all processes of the index are executed.
Steps to reproduce
No response
GraphRAG Config Used
# Paste your config here
Logs and screenshots
No response
Additional Information