
[Open source tokenizer support] #467

Closed
jeaneigsi opened this issue Jul 9, 2024 · 2 comments
Labels
community_support Issue handled by community members

Comments

@jeaneigsi

Describe the issue

When you use an open-source model and generate prompts with the auto prompt-tuning command, for example:
python -m graphrag.prompt_tune --root /path/to/project --domain "environmental news" --method random --limit 10 --max-tokens 2048 --chunk-size 256 --no-entity-types --output /path/to/output
it raises an error caused by tiktoken, the BPE tokenizer used for OpenAI models.

Steps to reproduce

When you change the llm section of the settings.yaml file to an open-source model, such as Llama 3 70B:

encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GROQ_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: llama3-70b-8192
  model_supports_json: true # recommended if this is available for your model.
  max_tokens: 4000
  # request_timeout: 180.0
  api_base: https://api.groq.com/openai/v1
  # api_version: 2024-02-15-preview

A tokenizer error is raised.
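For context, the failure can be reproduced with tiktoken alone, since it only maps OpenAI model names to encodings; a minimal sketch (the model string is just the one from the config above):

import tiktoken

# tiktoken cannot map a non-OpenAI model name to an encoding and raises KeyError.
try:
    encoding = tiktoken.encoding_for_model("llama3-70b-8192")
except KeyError as err:
    print(f"tiktoken cannot map this model: {err}")
    # One workaround is to fall back to the cl100k_base encoding.
    encoding = tiktoken.get_encoding("cl100k_base")

print(len(encoding.encode("environmental news")))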

GraphRAG Config Used


encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GROQ_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: llama3-70b-8192
  model_supports_json: true # recommended if this is available for your model.
  max_tokens: 4000
  # request_timeout: 180.0
  api_base: https://api.groq.com/openai/v1
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  tokens_per_minute: 3000 # set a leaky bucket throttle
  requests_per_minute: 2 # set a leaky bucket throttle
  max_retries: 1
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${TOGETHER_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: togethercomputer/m2-bert-80M-8k-retrieval
    api_base:  https://api.together.xyz/v1
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    tokens_per_minute: 3000 # set a leaky bucket throttle
    requests_per_minute: 2 # set a leaky bucket throttle
    max_retries: 1
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    batch_size: 16 # the number of documents to send in a single request
    batch_max_tokens: 3000 # the maximum number of tokens to send in a single request
    # target: required # or optional
  


chunks:
  size: 300
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents
    
input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 0

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_report:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # max_tokens: 12000

global_search:
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32

Logs and screenshots

To work around this, I built a small adaptive wrapper class in utils/tokens.py; the model name still needs to be passed in as a variable or adjusted:

# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License

"""Utilities for working with tokens."""

import tiktoken
from huggingface_hub import login
from transformers import AutoTokenizer

# Log in to the Hugging Face Hub (required for gated models such as Llama 3).
login(token="token")  # placeholder token

class TokenizerWrapper:
    def __init__(self, model: str):
        self.model = model
        if model.startswith("gpt-") or model in ["cl100k_base"]:
            self.tokenizer = self._init_tiktoken(model)
        else:
            self.tokenizer = self._init_huggingface(model)

    def _init_tiktoken(self, model: str):
        try:
            return tiktoken.encoding_for_model(model)
        except KeyError:
            print(f"Warning: Model {model} not found. Using cl100k_base encoding.")
            return tiktoken.get_encoding("cl100k_base")

    def _init_huggingface(self, model: str):
        # NOTE: the Hugging Face repo id is hard-coded for now; ideally `model`
        # would be mapped to the matching tokenizer instead.
        return AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct")

    def encode(self, text: str) -> list:
        if isinstance(self.tokenizer, tiktoken.Encoding):
            return self.tokenizer.encode(text)
        else:  # HuggingFace tokenizer
            return self.tokenizer.encode(text, add_special_tokens=False)

    def decode(self, tokens: list) -> str:
        if isinstance(self.tokenizer, tiktoken.Encoding):
            return self.tokenizer.decode(tokens)
        else:  # HuggingFace tokenizer
            return self.tokenizer.decode(tokens)

def num_tokens_from_string(string: str, model: str) -> int:
    """Return the number of tokens in a text string."""
    tokenizer = TokenizerWrapper(model)
    return len(tokenizer.encode(string))

def string_from_tokens(tokens: list, model: str) -> str:
    """Return a text string from a list of tokens."""
    tokenizer = TokenizerWrapper(model)
    return tokenizer.decode(tokens)
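
For reference, a quick usage sketch of the wrapper above (model names are only illustrative; loading the Llama tokenizer requires access to the gated meta-llama repo on Hugging Face):

# OpenAI-style model name: handled by tiktoken.
print(num_tokens_from_string("The quick brown fox", model="gpt-4"))

# Open-source model name: handled by the Hugging Face tokenizer.
print(num_tokens_from_string("The quick brown fox", model="llama3-70b-8192"))

# Round-trip encode/decode through the wrapper.
wrapper = TokenizerWrapper("llama3-70b-8192")
tokens = wrapper.encode("environmental news")
print(string_from_tokens(tokens, model="llama3-70b-8192"))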

Additional Information

  • GraphRAG Version:
  • Operating System:
  • Python Version:
  • Related Issues:
@jeaneigsi jeaneigsi added the triage Default label assignment, indicates new issue needs reviewed by a maintainer label Jul 9, 2024
@AlonsoGuevara AlonsoGuevara self-assigned this Jul 9, 2024
@AlonsoGuevara AlonsoGuevara added bug Something isn't working chore and removed triage Default label assignment, indicates new issue needs reviewed by a maintainer labels Jul 9, 2024
@KylinMountain
Contributor

Now you can try prompt tune; it will fall back to the default cl100k_base encoding for open-source LLMs.

@natoverse natoverse added community_support Issue handled by community members and removed bug Something isn't working chore labels Jul 22, 2024
@natoverse
Collaborator

Consolidating alternate model issues here: #657

@natoverse natoverse closed this as not planned (won't fix, can't repro, duplicate, stale) Jul 22, 2024