Describe the issue
When you use an open source model and generate prompts with the auto prompt tuner:
python -m graphrag.prompt_tune --root /path/to/project --domain "environmental news" --method random --limit 10 --max-tokens 2048 --chunk-size 256 --no-entity-types --output /path/to/output
it triggers an issue caused by tiktoken, the BPE tokenizer for OpenAI models.
Steps to reproduce
When you change the llm section of the settings.yaml file to an open source model, for example Llama 3 70B:
encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GROQ_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: llama3-70b-8192
  model_supports_json: true # recommended if this is available for your model.
  max_tokens: 4000
  # request_timeout: 180.0
  api_base: https://api.groq.com/openai/v1
  # api_version: 2024-02-15-preview
a tokenizer error is raised, because tiktoken does not recognize the model name.
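For reference, a minimal sketch of the underlying tiktoken behavior (this is not GraphRAG code; the model name is just the one from the config above):

```python
import tiktoken

# tiktoken only maps known OpenAI model names to encodings;
# any other model name raises a KeyError.
try:
    enc = tiktoken.encoding_for_model("llama3-70b-8192")
except KeyError:
    # Falling back to an explicit encoding avoids the crash, but the model's
    # real tokenizer (the Llama 3 tokenizer) is still not being used.
    enc = tiktoken.get_encoding("cl100k_base")

print(len(enc.encode("sample text")))
```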
GraphRAG Config Used
encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GROQ_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: llama3-70b-8192
  model_supports_json: true # recommended if this is available for your model.
  max_tokens: 4000
  # request_timeout: 180.0
  api_base: https://api.groq.com/openai/v1
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  tokens_per_minute: 3000 # set a leaky bucket throttle
  requests_per_minute: 2 # set a leaky bucket throttle
  max_retries: 1
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${TOGETHER_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: togethercomputer/m2-bert-80M-8k-retrieval
    api_base: https://api.together.xyz/v1
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    tokens_per_minute: 3000 # set a leaky bucket throttle
    requests_per_minute: 2 # set a leaky bucket throttle
    max_retries: 1
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
  batch_size: 16 # the number of documents to send in a single request
  batch_max_tokens: 3000 # the maximum number of tokens to send in a single request
  # target: required # or optional

chunks:
  size: 300
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 0

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_report:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # max_tokens: 12000

global_search:
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32
Logs and screenshots
To work around this, I built a small adaptive wrapper class in utils/tokens.py; the model name still needs to be passed in as a variable (or the class adjusted) afterwards:
# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License

"""Utilities for working with tokens."""

import tiktoken
from transformers import AutoTokenizer
from huggingface_hub import login

# Placeholder: replace with your own Hugging Face access token
login(token="token")


class TokenizerWrapper:
    """Dispatch to tiktoken for OpenAI models and to Hugging Face otherwise."""

    def __init__(self, model: str):
        self.model = model
        if model.startswith("gpt-") or model in ["cl100k_base"]:
            self.tokenizer = self._init_tiktoken(model)
        else:
            self.tokenizer = self._init_huggingface(model)

    def _init_tiktoken(self, model: str):
        try:
            return tiktoken.encoding_for_model(model)
        except KeyError:
            print(f"Warning: Model {model} not found. Using cl100k_base encoding.")
            return tiktoken.get_encoding("cl100k_base")

    def _init_huggingface(self, model: str):
        # Hard-coded for now; the model argument should be used here instead
        return AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct")

    def encode(self, text: str) -> list:
        if isinstance(self.tokenizer, tiktoken.Encoding):
            return self.tokenizer.encode(text)
        # Hugging Face tokenizer
        return self.tokenizer.encode(text, add_special_tokens=False)

    def decode(self, tokens: list) -> str:
        if isinstance(self.tokenizer, tiktoken.Encoding):
            return self.tokenizer.decode(tokens)
        # Hugging Face tokenizer
        return self.tokenizer.decode(tokens)


def num_tokens_from_string(string: str, model: str) -> int:
    """Return the number of tokens in a text string."""
    tokenizer = TokenizerWrapper(model)
    return len(tokenizer.encode(string))


def string_from_tokens(tokens: list, model: str) -> str:
    """Return a text string from a list of tokens."""
    tokenizer = TokenizerWrapper(model)
    return tokenizer.decode(tokens)
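A quick usage sketch (the Hugging Face path assumes the logged-in token has access to the Llama 3 repo):

```python
# OpenAI-style model name -> tiktoken encoding
print(num_tokens_from_string("GraphRAG prompt tuning", "gpt-4"))

# Any other model name -> Hugging Face tokenizer (Llama 3 in this wrapper)
print(num_tokens_from_string("GraphRAG prompt tuning", "llama3-70b-8192"))
```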
Additional Information
GraphRAG Version:
Operating System:
Python Version:
Related Issues: