
[Issue]: <title> Prompts tuning issue #924

Open
wolfhawkld opened this issue Aug 14, 2024 · 2 comments
Labels
awaiting_response Maintainers or community have suggested solutions or requested info, awaiting filer response

Comments

@wolfhawkld

Is there an existing issue for this?

  • I have searched the existing issues
  • I have checked #657 to validate if my issue is covered by community support

Describe the issue

When I create the graph data with GraphRAG, no matter where I add a general response role description (in the entity_extraction and summarize_descriptions prompts) and then re-generate the whole graph data, the general role description still does not work. Is there any location or way to add a general response role description that restricts the query results? Thanks in advance.
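
A note on where that role description takes effect: the entity_extraction and summarize_descriptions prompts only steer how entities and summaries are extracted while the graph is built; they do not constrain answers at query time, so a response-restricting role is better placed in the query-time system prompt. Below is a minimal sketch using the Python query API, assuming graphrag around 0.2.x; LocalSearch and LOCAL_SEARCH_SYSTEM_PROMPT come from the library, while the restriction wording and the llm/context_builder setup (built as in the official local-search example notebook) are placeholders:

from graphrag.query.structured_search.local_search.system_prompt import (
    LOCAL_SEARCH_SYSTEM_PROMPT,
)

# Prepend a general response role to the default local-search system prompt.
# The wording here is illustrative, not part of GraphRAG.
role_restriction = (
    "You are an assistant for a list of prohibited cosmetic ingredients. "
    "Answer only from the provided data tables and refuse unrelated questions.\n\n"
)
restricted_prompt = role_restriction + LOCAL_SEARCH_SYSTEM_PROMPT

# Pass it when constructing the search engine; llm and context_builder are
# set up as in the GraphRAG local-search example notebook (omitted here):
# from graphrag.query.structured_search.local_search.search import LocalSearch
# search_engine = LocalSearch(
#     llm=llm,
#     context_builder=context_builder,
#     system_prompt=restricted_prompt,
# )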

Steps to reproduce

No response

GraphRAG Config Used

encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: azure_openai_chat # or openai_chat
  model: gpt-4o
  model_supports_json: true # recommended if this is available for your model.
  # max_tokens: 4000
  # request_timeout: 180.0
  api_base: https://api.nlp.dev.uptimize.merckgroup.com
  api_version: 2023-09-01-preview
  # organization: <organization_id>
  deployment_name: gpt-4o
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made
  # temperature: 0 # temperature for sampling
  # top_p: 1 # top-p sampling
  # n: 1 # Number of completions to generate

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: azure_openai_embedding # or openai_embedding
    model: text-embedding-3-large
    api_base: https://api.nlp.dev.uptimize.merckgroup.com
    api_version: 2023-09-01-preview
    # organization: <organization_id>
    deployment_name: text-embedding-3-large
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "custom_prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 1

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "custom_prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "custom_prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000

global_search:
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32

Logs and screenshots

No response

Additional Information

  • GraphRAG Version:
  • Operating System:
  • Python Version:
  • Related Issues:
@AlonsoGuevara
Contributor

Hi!

As I understand it, the issue is that GraphRAG is not picking up your custom prompts, right?

Could you please share the prompt_tune command you're using?

Thanks!

@wolfhawkld
Author

Hi Alonso,

Thank you for getting back to me. This is my command:
python -m graphrag.prompt_tune --root .\playground\langchain\knowledge_base\graph_base --domain 'a List of Prohibited Ingredients for Cosmetics' --method random --limit 2 --chunk-size 1200 --output custom_prompts

And I've found that after tuning new custom prompts, I need to re-generate the whole graph data again for the prompts to take effect. Am I right?
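
That is the expected flow: prompt tuning only writes the prompt files into custom_prompts, and the indexing pipeline reads them on its next run, so the graph data has to be rebuilt before the tuned prompts apply. A typical re-run over the same root, assuming the graphrag CLI as of mid-2024:

python -m graphrag.index --root .\playground\langchain\knowledge_base\graph_base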
