Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add upper and lower max chunk size limits to ChunkingSettings #115130

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

dan-rubinstein
Copy link
Contributor

@dan-rubinstein dan-rubinstein commented Oct 18, 2024

Description

We currently do not have an upper limit the max chunk size provided by the user in either word or sentence boundary chunking settings. The lower limit is currently 1 which is not a useful chunk size to set. To provide more reasonable limits to the users this change introduces a lower limit of 10 words for word based strategies and 20 words for sentence based strategies as well as an upper limit of 300 for word and sentence based strategies.

Testing

  • Unit tests
  • Ensured that creating embedding endpoints with valid chunking strategies succeeds and endpoints successfully chunk large documents.
  • Ensured that attempting to create embedding endpoints with max chunk sizes that are too low or too high fails.

Note: There is still an outstanding question of whether we want to allow 0 overlap on word based chunking settings. We currently force the user to provide a positive integer so 0 is not an option. We should hold off on merging this PR until we decide on this.

@elasticsearchmachine elasticsearchmachine added v9.0.0 external-contributor Pull request authored by a developer outside the Elasticsearch team labels Oct 18, 2024
@dan-rubinstein dan-rubinstein added >non-issue :ml Machine learning Team:ML Meta label for the ML team v8.16.0 v8.17.0 and removed external-contributor Pull request authored by a developer outside the Elasticsearch team labels Oct 18, 2024
@dan-rubinstein
Copy link
Contributor Author

@elasticmachine merge upstream

@dan-rubinstein dan-rubinstein marked this pull request as ready for review October 18, 2024 19:06
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:ml Machine learning >non-issue Team:ML Meta label for the ML team v8.16.0 v8.17.0 v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants