Benchmarking Lexer Against Hugging Face Transformer #12

Open · 7 tasks
p3nGu1nZz opened this issue Mar 30, 2024 · 2 comments
Labels: documentation, good first issue, help wanted
@p3nGu1nZz (Collaborator)

Benchmarking Lexer Against Hugging Face Transformer

Objective:

To evaluate the performance and effectiveness of our custom Lexer against a Hugging Face transformer, we will create a benchmark that measures speed, memory usage, and the quality of context representation.

Tasks:

  • Set up a Python environment with the Hugging Face transformers library and its dependencies.
  • Develop a Python script to tokenize and vectorize text using a Hugging Face transformer (a rough sketch follows this list).
  • Include additional context calculations in the Python script, such as entropy, whitespace ratio, and variance (also sketched below).
  • Create a mechanism within Unity to call the Python script and capture its output (the Python side of this bridge is sketched below).
  • Design the benchmark to measure processing time, output size, and context quality for both systems.
  • Ensure the benchmark tests are repeatable and consistent across multiple runs.
  • Document the benchmark process, including setup, execution, and result interpretation.
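
A rough sketch of the tokenize-and-vectorize step, assuming the `transformers` and `torch` packages, with `bert-base-uncased` as a placeholder checkpoint (the model choice is not fixed by this ticket):

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # placeholder; swap for whatever we settle on

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def vectorize(text: str) -> torch.Tensor:
    """Tokenize `text` and return one contextual embedding per token."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # last_hidden_state has shape (batch, tokens, hidden); drop the batch dim.
    return outputs.last_hidden_state.squeeze(0)

if __name__ == "__main__":
    vectors = vectorize("The quick brown fox jumps over the lazy dog.")
    print(vectors.shape)  # e.g. torch.Size([12, 768]) for BERT-base
```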
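
For the extra context calculations, something like the following could work. The exact metric definitions here (character-level entropy, whitespace ratio, token-length variance) are assumptions and should be matched to whatever the Lexer computes, so the comparison is fair:

```python
import math
from collections import Counter

def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy of `text`, in bits."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def whitespace_ratio(text: str) -> float:
    """Fraction of characters in `text` that are whitespace."""
    return sum(ch.isspace() for ch in text) / len(text)

def token_length_variance(tokens: list[str]) -> float:
    """Population variance of token lengths."""
    lengths = [len(t) for t in tokens]
    mean = sum(lengths) / len(lengths)
    return sum((n - mean) ** 2 for n in lengths) / len(lengths)

if __name__ == "__main__":
    sample = "The quick brown fox jumps over the lazy dog."
    print(shannon_entropy(sample), whitespace_ratio(sample))
    print(token_length_variance(sample.split()))
```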
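
For the Unity bridge, one option is to have Unity launch the script via System.Diagnostics.Process and parse its stdout. The Python side could then look roughly like this (the JSON field names are placeholders, not a settled schema):

```python
import json
import sys
import time

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder model

def main() -> None:
    # Read the text to process from stdin (Unity writes it when launching us).
    text = sys.stdin.read()
    start = time.perf_counter()
    encoding = tokenizer(text)
    elapsed = time.perf_counter() - start
    # Emit a single JSON object on stdout for Unity to capture and parse.
    json.dump(
        {
            "token_count": len(encoding["input_ids"]),
            "elapsed_seconds": elapsed,
        },
        sys.stdout,
    )

if __name__ == "__main__":
    main()
```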

Acceptance Criteria:

  • The benchmark should accurately measure and compare the performance of our Lexer and the Hugging Face transformer.
  • Results should highlight the strengths and weaknesses of each approach in terms of speed, efficiency, and context representation.
  • The benchmarking process should be well-documented and easily reproducible for future testing and development.

This ticket will guide the development of a comprehensive benchmarking suite that will inform our decision-making process regarding text processing tools within our project.

p3nGu1nZz added the help wanted label on Mar 30, 2024
p3nGu1nZz self-assigned this on Mar 30, 2024
@Josephrp

I want to follow along with this but don't know how much I can help ^^

@p3nGu1nZz (Collaborator, Author)


> I want to follow along with this but don't know how much I can help ^^

You could write a simple Python script that tokenizes a string of words (using Hugging Face transformers) of no more than 1,000 characters, and track as accurately as possible how long it takes to tokenize that string.
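
A minimal sketch of such a script, assuming the `transformers` package is installed and using `bert-base-uncased` as a stand-in model:

```python
import statistics
import time

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# 45-char sentence repeated, then clipped to stay under 1,000 characters.
TEXT = ("The quick brown fox jumps over the lazy dog. " * 22)[:1000]

# Warm-up run so one-time setup cost doesn't pollute the measurement,
# then repeat the timing to get a stable estimate across runs.
tokenizer(TEXT)
samples = []
for _ in range(100):
    start = time.perf_counter()
    tokenizer(TEXT)
    samples.append(time.perf_counter() - start)

print(f"median: {statistics.median(samples) * 1e3:.3f} ms")
print(f"min:    {min(samples) * 1e3:.3f} ms")
```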

p3nGu1nZz added the documentation and good first issue labels on Apr 8, 2024