πŸ•΅οΈβ€β™‚οΈ chunk-match

A NodeJS library that semantically chunks text and matches it against a user query using cosine similarity for precise and relevant text retrieval.

Maintained by eQuill Labs

Features

  • Semantic text chunking with configurable options
  • Query matching using cosine similarity
  • Configurable similarity thresholds and chunk sizes
  • Returns chunks sorted by relevance with similarity scores
  • Built on top of semantic-chunking for robust text processing
  • Support for various ONNX embedding models
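The cosine-similarity scoring at the heart of the matching step can be sketched in a few lines. This is illustrative only, not the library's internal code:

```javascript
// Illustrative only: cosine similarity between two embedding vectors,
// the metric chunk-match uses to score chunks against the query.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([1, 0], [1, 0])); // 1 (identical direction)
console.log(cosineSimilarity([1, 0], [0, 1])); // 0 (orthogonal)
```

In practice the vectors are the embedding-model outputs for a chunk and for the query; scores closer to 1 mean a closer semantic match.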

Installation

npm install chunk-match

Usage

import { matchChunks } from 'chunk-match';

const documents = [
    {
        document_name: "doc1.txt",
        document_text: "Your document text here..."
    },
    {
        document_name: "doc2.txt",
        document_text: "Another document text..."
    }
];

const query = "What are the key points?";

const options = {
    maxResults: 5,
    minSimilarity: 0.5,
    chunkingOptions: {
        maxTokenSize: 500,
        similarityThreshold: 0.5,
        dynamicThresholdLowerBound: 0.4,
        dynamicThresholdUpperBound: 0.8,
        numSimilaritySentencesLookahead: 3,
        combineChunks: true,
        combineChunksSimilarityThreshold: 0.8,
        onnxEmbeddingModel: "nomic-ai/nomic-embed-text-v1.5",
        dtype: 'q8',
        chunkPrefixDocument: "search_document",
        chunkPrefixQuery: "search_query"
    }
};

const results = await matchChunks(documents, query, options);
console.log(results);
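Conceptually, minSimilarity and maxResults act as a filter-then-truncate step over the scored chunks. A minimal sketch of that behavior (not the library's actual internals; the sample data is hypothetical):

```javascript
// Sketch: drop chunks below the similarity threshold, sort the rest by
// similarity descending, and keep at most maxResults of them.
function selectMatches(scoredChunks, { maxResults = 10, minSimilarity = 0.475 } = {}) {
  return scoredChunks
    .filter(c => c.similarity >= minSimilarity)
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, maxResults);
}

const scored = [
  { chunk: "A", similarity: 0.91 },
  { chunk: "B", similarity: 0.30 },
  { chunk: "C", similarity: 0.62 },
];
console.log(selectMatches(scored, { maxResults: 2, minSimilarity: 0.5 }));
// chunks "A" then "C"; "B" falls below the threshold
```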

API

matchChunks(documents, query, options)

Parameters

  • documents required (Array): Array of document objects with properties:

    • document_name (string): Name/identifier of the document
    • document_text (string): Text content to be chunked and matched
  • query required (string): The search query to match against documents

  • options optional (Object): Configuration options

    • maxResults (number): Maximum number of results to return (default: 10)
    • minSimilarity (number): Minimum similarity threshold for matches (default: 0.475)
    • chunkingOptions (Object): Options for text chunking
      • maxTokenSize (number): Maximum token size for chunks (default: 500)
      • similarityThreshold (number): Threshold for semantic similarity (default: 0.5)
      • dynamicThresholdLowerBound (number): Lower bound for dynamic thresholding (default: 0.475)
      • dynamicThresholdUpperBound (number): Upper bound for dynamic thresholding (default: 0.8)
      • numSimilaritySentencesLookahead (number): Number of sentences to look ahead (default: 2)
      • combineChunks (boolean): Whether to combine similar chunks (default: true)
      • combineChunksSimilarityThreshold (number): Threshold for combining chunks (default: 0.6)
      • onnxEmbeddingModel (string): ONNX model to use for embeddings (see Models section below) (default: Xenova/all-MiniLM-L6-v2)
      • dtype (string): Precision of the embedding model (options: fp32, fp16, q8, q4) (default: fp32)
      • chunkPrefixDocument (string): Prefix for document chunks (for embedding models that support task prefixes) (default: null)
      • chunkPrefixQuery (string): Prefix for query chunk (for embedding models that support task prefixes) (default: null)

πŸ“— For more details on the chunking options, see the semantic-chunking documentation

🚨 Note on Model Loading 🚨

The first time you use a specific embedding model, processing will take longer while the model is downloaded and cached locally, so please be patient. Subsequent runs will be much faster because the cached model is reused.

Returns

Array of match results, each containing:

  • chunk (string): The matched text chunk
  • document_name (string): Source document name
  • document_id (number): Document identifier
  • chunk_number (number): Chunk sequence number
  • token_length (number): Length in tokens
  • similarity (number): Similarity score (0-1)
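Because several chunks can come from the same document, a common post-processing step is to keep only the best match per source. A small sketch over the result shape above (field values are hypothetical):

```javascript
// Illustrative: keep the highest-similarity chunk per source document.
// Assumes the result shape documented above.
function bestMatchPerDocument(results) {
  const best = new Map();
  for (const r of results) {
    const current = best.get(r.document_name);
    if (!current || r.similarity > current.similarity) {
      best.set(r.document_name, r);
    }
  }
  return [...best.values()];
}

const sample = [
  { chunk: "...", document_name: "doc1.txt", similarity: 0.81 },
  { chunk: "...", document_name: "doc1.txt", similarity: 0.64 },
  { chunk: "...", document_name: "doc2.txt", similarity: 0.72 },
];
console.log(bestMatchPerDocument(sample).map(r => r.similarity)); // [0.81, 0.72]
```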

Embedding Models

This library supports various ONNX embedding models through the semantic-chunking package. Most models have quantized versions available (selected via the dtype option, e.g. q8), which offer better performance with minimal impact on accuracy.

For a complete list of supported models and their characteristics, see the semantic-chunking embedding models documentation.

onnxEmbeddingModel

  • Type: String
  • Default: Xenova/all-MiniLM-L6-v2
  • Description: Specifies the model used to generate sentence embeddings. Different models may yield different qualities of embeddings, affecting the chunking quality, especially in multilingual contexts.
  • Resource Link: ONNX Embedding Models
    Link to a filtered list of embedding models converted to ONNX library format by Xenova.
    Refer to the Model table below for a list of suggested models and their sizes (choose a multilingual model if you need to chunk text other than English).

dtype

  • Type: String
  • Default: fp32
  • Description: Indicates the precision of the embedding model. Options are fp32, fp16, q8, q4. fp32 is the highest precision but also the largest size and slowest to load. q8 is a good compromise between size and speed if the model supports it. All models support fp32, but only some support fp16, q8, and q4.

Curated ONNX Embedding Models

| Model | Precision (dtype) | Link | Size |
| --- | --- | --- | --- |
| nomic-ai/nomic-embed-text-v1.5 | fp32, q8 | https://huggingface.co/nomic-ai/nomic-embed-text-v1.5 | 548 MB, 138 MB |
| thenlper/gte-base | fp32 | https://huggingface.co/thenlper/gte-base | 436 MB |
| Xenova/all-MiniLM-L6-v2 | fp32, fp16, q8 | https://huggingface.co/Xenova/all-MiniLM-L6-v2 | 90 MB, 45 MB, 23 MB |
| Xenova/paraphrase-multilingual-MiniLM-L12-v2 | fp32, fp16, q8 | https://huggingface.co/Xenova/paraphrase-multilingual-MiniLM-L12-v2 | 470 MB, 235 MB, 118 MB |
| Xenova/all-distilroberta-v1 | fp32, fp16, q8 | https://huggingface.co/Xenova/all-distilroberta-v1 | 326 MB, 163 MB, 82 MB |
| BAAI/bge-base-en-v1.5 | fp32 | https://huggingface.co/BAAI/bge-base-en-v1.5 | 436 MB |
| BAAI/bge-small-en-v1.5 | fp32 | https://huggingface.co/BAAI/bge-small-en-v1.5 | 133 MB |
| yashvardhan7/snowflake-arctic-embed-m-onnx | fp32 | https://huggingface.co/yashvardhan7/snowflake-arctic-embed-m-onnx | 436 MB |
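Since not every model supports every precision, it can help to encode the table above as data and pick the most compact dtype available. A hypothetical helper (the modelDtypes map mirrors a few rows of the table; it is not part of the library's API):

```javascript
// Supported dtypes per model, mirroring a few rows of the curated table above.
const modelDtypes = {
  "nomic-ai/nomic-embed-text-v1.5": ["fp32", "q8"],
  "Xenova/all-MiniLM-L6-v2": ["fp32", "fp16", "q8"],
  "BAAI/bge-small-en-v1.5": ["fp32"],
};

// Pick the most compact precision the model supports; fall back to fp32,
// which every model supports.
function pickDtype(model, preferred = ["q8", "fp16", "fp32"]) {
  const supported = modelDtypes[model] ?? ["fp32"];
  return preferred.find(d => supported.includes(d)) ?? "fp32";
}

console.log(pickDtype("BAAI/bge-small-en-v1.5"));  // "fp32"
console.log(pickDtype("Xenova/all-MiniLM-L6-v2")); // "q8"
```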

Each of these parameters lets you tailor matchChunks to the text size, content complexity, and performance requirements of your application.

Web UI

Check out the webui folder for a web-based interface for experimenting with and tuning chunk-match settings. It provides a visual way to test and configure the library's semantic text matching so you can find the settings that work best for your use case; once you have, it can generate the corresponding code for your project.

(Screenshot: chunk-match Web UI)

License

This project is licensed under the MIT License - see the LICENSE file for details.

Appreciation

If you enjoy this library please consider sending me a tip to support my work πŸ˜€
