πŸ•΅οΈβ€β™‚οΈ chunk-match

A NodeJS library that semantically chunks text and matches it against a user query using cosine similarity for precise and relevant text retrieval.

Maintained by eQuill Labs

Features

  • Semantic text chunking with configurable options
  • Query matching using cosine similarity
  • Configurable similarity thresholds and chunk sizes
  • Returns chunks sorted by relevance with similarity scores
  • Built on top of semantic-chunking for robust text processing
  • Support for various ONNX embedding models
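The cosine-similarity scoring at the heart of the matching step can be sketched in a few lines. This is illustrative only, not the library's internal code:

```javascript
// Illustrative only: cosine similarity between two embedding vectors,
// the metric chunk-match uses to score chunks against the query.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([1, 0], [1, 0])); // 1 (identical direction)
console.log(cosineSimilarity([1, 0], [0, 1])); // 0 (orthogonal)
```

In practice the vectors are the embedding-model outputs for a chunk and for the query; scores closer to 1 mean a closer semantic match.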

Installation

npm install chunk-match

Usage

import { matchChunks } from 'chunk-match';

const documents = [
    {
        document_name: "doc1.txt",
        document_text: "Your document text here..."
    },
    {
        document_name: "doc2.txt",
        document_text: "Another document text..."
    }
];

const query = "What are the key points?";

const options = {
    maxResults: 5,
    minSimilarity: 0.5,
    chunkingOptions: {
        maxTokenSize: 500,
        similarityThreshold: 0.5,
        dynamicThresholdLowerBound: 0.4,
        dynamicThresholdUpperBound: 0.8,
        numSimilaritySentencesLookahead: 3,
        combineChunks: true,
        combineChunksSimilarityThreshold: 0.8,
        onnxEmbeddingModel: "nomic-ai/nomic-embed-text-v1.5",
        dtype: 'q8',
        chunkPrefixDocument: "search_document",
        chunkPrefixQuery: "search_query"
    }
};

const results = await matchChunks(documents, query, options);
console.log(results);
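Conceptually, minSimilarity and maxResults act as a filter-then-truncate step over the scored chunks. A minimal sketch of that behavior (not the library's actual internals; the sample data is hypothetical):

```javascript
// Sketch: drop chunks below the similarity threshold, sort the rest by
// similarity descending, and keep at most maxResults of them.
function selectMatches(scoredChunks, { maxResults = 10, minSimilarity = 0.475 } = {}) {
  return scoredChunks
    .filter(c => c.similarity >= minSimilarity)
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, maxResults);
}

const scored = [
  { chunk: "A", similarity: 0.91 },
  { chunk: "B", similarity: 0.30 },
  { chunk: "C", similarity: 0.62 },
];
console.log(selectMatches(scored, { maxResults: 2, minSimilarity: 0.5 }));
// chunks "A" then "C"; "B" falls below the threshold
```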

API

matchChunks(documents, query, options)

Parameters

  • documents required (Array): Array of document objects with properties:

    • document_name (string): Name/identifier of the document
    • document_text (string): Text content to be chunked and matched
  • query required (string): The search query to match against documents

  • options optional (Object): Configuration options

    • maxResults (number): Maximum number of results to return (default: 10)
    • minSimilarity (number): Minimum similarity threshold for matches (default: 0.475)
    • chunkingOptions (Object): Options for text chunking
      • maxTokenSize (number): Maximum token size for chunks (default: 500)
      • similarityThreshold (number): Threshold for semantic similarity (default: 0.5)
      • dynamicThresholdLowerBound (number): Lower bound for dynamic thresholding (default: 0.475)
      • dynamicThresholdUpperBound (number): Upper bound for dynamic thresholding (default: 0.8)
      • numSimilaritySentencesLookahead (number): Number of sentences to look ahead (default: 2)
      • combineChunks (boolean): Whether to combine similar chunks (default: true)
      • combineChunksSimilarityThreshold (number): Threshold for combining chunks (default: 0.6)
      • onnxEmbeddingModel (string): ONNX model to use for embeddings (see Models section below) (default: Xenova/all-MiniLM-L6-v2)
      • dtype (string): Precision of the embedding model (options: fp32, fp16, q8, q4) (default: fp32)
      • chunkPrefixDocument (string): Prefix for document chunks (for embedding models that support task prefixes) (default: null)
      • chunkPrefixQuery (string): Prefix for query chunk (for embedding models that support task prefixes) (default: null)

πŸ“— For more details on the chunking options, see the semantic-chunking documentation

🚨 Note on Model Loading 🚨

The first time you use a specific embedding model, processing will take longer while the model is downloaded and cached locally, so please be patient. Subsequent runs will be much faster because the cached model is reused.

Returns

Array of match results, each containing:

  • chunk (string): The matched text chunk
  • document_name (string): Source document name
  • document_id (number): Document identifier
  • chunk_number (number): Chunk sequence number
  • token_length (number): Length in tokens
  • similarity (number): Similarity score (0-1)
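Because several chunks can come from the same document, a common post-processing step is to keep only the best match per source. A small sketch over the result shape above (field values are hypothetical):

```javascript
// Illustrative: keep the highest-similarity chunk per source document.
// Assumes the result shape documented above.
function bestMatchPerDocument(results) {
  const best = new Map();
  for (const r of results) {
    const current = best.get(r.document_name);
    if (!current || r.similarity > current.similarity) {
      best.set(r.document_name, r);
    }
  }
  return [...best.values()];
}

const sample = [
  { chunk: "...", document_name: "doc1.txt", similarity: 0.81 },
  { chunk: "...", document_name: "doc1.txt", similarity: 0.64 },
  { chunk: "...", document_name: "doc2.txt", similarity: 0.72 },
];
console.log(bestMatchPerDocument(sample).map(r => r.similarity)); // [0.81, 0.72]
```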

Embedding Models

This library supports various ONNX embedding models through the semantic-chunking package. Most models have quantized versions available (selected via the dtype option, e.g. q8), which offer better performance with minimal impact on accuracy.

For a complete list of supported models and their characteristics, see the semantic-chunking embedding models documentation.

onnxEmbeddingModel

  • Type: String
  • Default: Xenova/all-MiniLM-L6-v2
  • Description: Specifies the model used to generate sentence embeddings. Different models may yield different qualities of embeddings, affecting the chunking quality, especially in multilingual contexts.
  • Resource Link: ONNX Embedding Models
    Link to a filtered list of embedding models converted to ONNX library format by Xenova.
    Refer to the Model table below for a list of suggested models and their sizes (choose a multilingual model if you need to chunk text other than English).

dtype

  • Type: String
  • Default: fp32
  • Description: Indicates the precision of the embedding model. Options are fp32, fp16, q8, q4. fp32 is the highest precision but also the largest size and slowest to load. q8 is a good compromise between size and speed if the model supports it. All models support fp32, but only some support fp16, q8, and q4.

Curated ONNX Embedding Models

| Model | Precision (dtype) | Link | Size |
| --- | --- | --- | --- |
| nomic-ai/nomic-embed-text-v1.5 | fp32, q8 | https://huggingface.co/nomic-ai/nomic-embed-text-v1.5 | 548 MB, 138 MB |
| thenlper/gte-base | fp32 | https://huggingface.co/thenlper/gte-base | 436 MB |
| Xenova/all-MiniLM-L6-v2 | fp32, fp16, q8 | https://huggingface.co/Xenova/all-MiniLM-L6-v2 | 90 MB, 45 MB, 23 MB |
| Xenova/paraphrase-multilingual-MiniLM-L12-v2 | fp32, fp16, q8 | https://huggingface.co/Xenova/paraphrase-multilingual-MiniLM-L12-v2 | 470 MB, 235 MB, 118 MB |
| Xenova/all-distilroberta-v1 | fp32, fp16, q8 | https://huggingface.co/Xenova/all-distilroberta-v1 | 326 MB, 163 MB, 82 MB |
| BAAI/bge-base-en-v1.5 | fp32 | https://huggingface.co/BAAI/bge-base-en-v1.5 | 436 MB |
| BAAI/bge-small-en-v1.5 | fp32 | https://huggingface.co/BAAI/bge-small-en-v1.5 | 133 MB |
| yashvardhan7/snowflake-arctic-embed-m-onnx | fp32 | https://huggingface.co/yashvardhan7/snowflake-arctic-embed-m-onnx | 436 MB |
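Since not every model supports every precision, it can help to encode the table above as data and pick the most compact dtype available. A hypothetical helper (the modelDtypes map mirrors a few rows of the table; it is not part of the library's API):

```javascript
// Supported dtypes per model, mirroring a few rows of the curated table above.
const modelDtypes = {
  "nomic-ai/nomic-embed-text-v1.5": ["fp32", "q8"],
  "Xenova/all-MiniLM-L6-v2": ["fp32", "fp16", "q8"],
  "BAAI/bge-small-en-v1.5": ["fp32"],
};

// Pick the most compact precision the model supports; fall back to fp32,
// which every model supports.
function pickDtype(model, preferred = ["q8", "fp16", "fp32"]) {
  const supported = modelDtypes[model] ?? ["fp32"];
  return preferred.find(d => supported.includes(d)) ?? "fp32";
}

console.log(pickDtype("BAAI/bge-small-en-v1.5"));  // "fp32"
console.log(pickDtype("Xenova/all-MiniLM-L6-v2")); // "q8"
```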

Each of these parameters lets you tailor matchChunks to the text size, content complexity, and performance requirements of your application.

Web UI

Check out the webui folder for a web-based interface for experimenting with and tuning chunk-match settings. It provides a visual way to test and configure the library's semantic text matching so you can find the settings that work best for your use case; once you have, it can generate the corresponding code for your project.

(Screenshot: chunk-match Web UI)

License

This project is licensed under the MIT License - see the LICENSE file for details.

Appreciation

If you enjoy this library please consider sending me a tip to support my work πŸ˜€
