Discussion on `AutoMergingRetriever` and `HierarchicalDocumentSplitter` #78

TuanaCelik · 2024-09-05T12:43:35Z

This is the discussion board for the experimental AutoMergingRetriever and HierarchicalDocumentSplitter components.

These components split documents into a hierarchy in which each chunk keeps a reference to its 'parent' document. At retrieval time, based on a threshold setting, the parent document is returned instead of its 'child' documents when enough of those children are retrieved.

The rationale: a paragraph is split into multiple chunks, represented as leaf documents. If several of those chunks match a given query, the whole paragraph might be more informative than the individual chunks alone.
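
To make the threshold idea concrete, here is a minimal plain-Python sketch. It is not the AutoMergingRetriever implementation: the function name `auto_merge`, the input shapes, the single level of merging, and the strict `>` comparison are all assumptions made for illustration.

```python
from collections import defaultdict


def auto_merge(matched_leaves, parents, threshold=0.5):
    """Swap matched leaf chunks for their parent when enough siblings matched.

    Hypothetical helper, not part of any library.
    matched_leaves: list of (leaf_id, parent_id) pairs returned by a retriever.
    parents: dict mapping parent_id -> list of all of that parent's leaf ids.
    threshold: fraction of a parent's leaves that must be exceeded to merge.
    """
    # Group the retrieved leaves by the parent they belong to.
    by_parent = defaultdict(list)
    for leaf_id, parent_id in matched_leaves:
        by_parent[parent_id].append(leaf_id)

    results = []
    for parent_id, leaves in by_parent.items():
        ratio = len(leaves) / len(parents[parent_id])
        if ratio > threshold:  # assumption: strict comparison
            results.append(parent_id)  # enough children matched: return the parent
        else:
            results.extend(leaves)  # too few matched: keep the individual chunks
    return results


# Two paragraphs, each split into four leaf chunks (made-up ids).
parents = {"p1": ["a", "b", "c", "d"], "p2": ["e", "f", "g", "h"]}
matched = [("a", "p1"), ("b", "p1"), ("c", "p1"), ("e", "p2")]
merged = auto_merge(matched, parents, threshold=0.5)
print(merged)  # ['p1', 'e']
```

Three of the four chunks under `p1` matched (0.75), so the parent is returned in their place; only one of four matched under `p2` (0.25), so that chunk is returned as-is.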

📚Full Documentation of the AutoMergingRetriever
📚Full Documentation of the HierarchicalDocumentSplitter

🧑‍🍳 Try the Cookbook here

Note: Experimental features live in this repository for a fixed period of time. We don't guarantee that we will continue maintaining experimental features. But if they are successful and stable, and if you like them, we will move them to the core Haystack package.

PS: The first version of this experiment was implemented by @davidsbatista 🚀

TuanaCelik · 2024-09-06T10:10:36Z

Author

Ok y'all, I have been looking at the draft cookbook we are about to share for this component so I decided, why don't I add the first comment already:

Imo, for the HierarchicalDocumentSplitter to be usable in pipelines, we should also have an output called parent_documents that outputs the __level = 1 documents. Otherwise, to be able to use the component in a pipeline, we probably have to use the ConditionalRouter, which becomes unnecessarily complex. The common use case will probably just be indexing parent documents, as shown by David already 🙏

1 reply

davidsbatista Sep 11, 2024
Maintainer

I don't think it makes much sense to hard-code the HierarchicalDocumentSplitter to have an extra output parent_documents that outputs the __level = 1 documents. The HierarchicalDocumentSplitter aims to build a hierarchy based on the input documents and the number of levels/blocks the user specifies. It's then up to the user to decide which documents from which level they want and what to do with them.
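
As a rough illustration of choosing a level after the split, the sketch below filters on the `__level` value mentioned above. It uses plain dicts rather than real document objects, and the helper name `documents_at_level`, the `meta` layout, and the sample ids are all hypothetical.

```python
def documents_at_level(docs, level):
    """Return only the documents whose metadata marks them as the given level.

    Hypothetical helper: assumes each document is a dict with a "meta" dict
    that stores its hierarchy depth under the "__level" key.
    """
    return [doc for doc in docs if doc["meta"].get("__level") == level]


# Made-up splitter output: one root, one level-1 parent, two level-2 leaves.
split_docs = [
    {"id": "root", "meta": {"__level": 0}},
    {"id": "p1", "meta": {"__level": 1}},
    {"id": "a", "meta": {"__level": 2}},
    {"id": "b", "meta": {"__level": 2}},
]

parent_docs = documents_at_level(split_docs, 1)  # e.g. what gets indexed as parents
leaf_docs = documents_at_level(split_docs, 2)
```

Inside a pipeline, a filter like this would have to be wrapped in its own component or handled by a router, which is the trade-off being discussed here.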
