Let document retrieval be more flexible #206

Open
lgabs opened this issue Jun 7, 2024 · 2 comments
Labels
embedding enhancement New feature or request

Comments

@lgabs
Collaborator

lgabs commented Jun 7, 2024

Currently, the LCEL retriever in dialog-lib always builds each document's content by joining the question and the content together:

https://github.com/talkdai/dialog-lib/blob/4e8de796be1a21c877eb393066a78235e6a193ac/dialog_lib/embeddings/retrievers.py#L31-L39

However, the user already defines which fields should be embedded in load_csv.py, so this retriever should respect that choice with a simple return like

        return [
            Document(
                page_content=content.content,
                metadata={
                    "title": content.question,
                    "category": content.category,
                    "subcategory": content.subcategory,
                    "dataset": content.dataset,
                    "link": content.link,
                },
            )
            for content in relevant_contents
        ]

Moreover, langchain's CSVLoader by default already embeds each field name prefixed to its value, e.g. category: cat1\nsubcategory: subcat1\ncontent: content1 (see this test), so it achieves the same idea as the current implementation, but in a generic way.
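To make that layout concrete, here is a minimal sketch that only mimics the "field: value" formatting described above using the standard library. It is not the CSVLoader code itself; the sample CSV text and the helper name row_to_page_content are hypothetical.

```python
import csv
import io

# Hypothetical sample data; the column names follow the example above.
CSV_TEXT = "category,subcategory,content\ncat1,subcat1,content1\n"


def row_to_page_content(row: dict) -> str:
    # Emulates the "field: value" layout, one field per line.
    return "\n".join(f"{key.strip()}: {value.strip()}" for key, value in row.items())


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
page_contents = [row_to_page_content(row) for row in rows]
print(page_contents[0])
```

Each row becomes one text blob that already carries its field names, which is why joining question and content again in the retriever looks redundant.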

This proposal works as-is with the default project chains, while giving flexibility to users who implement their own prompt design. For example, the project's default RAG chain has this format_docs:

def format_docs(docs):
    return "\n\n".join([d.page_content for d in docs])

and users can customize this as they wish to achieve their ideas. Later, when we implement metadata saving to the vectorstore, we could even return other metadata dynamically as well.
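As an illustration of that kind of customization, here is a rough sketch of a format_docs variant that prefixes each chunk with its title taken from metadata. The Document dataclass is a stand-in so the snippet runs without langchain installed, and format_docs_with_title is a hypothetical name, not something in the project.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    # Stand-in for langchain's Document so the sketch has no dependencies.
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs_with_title(docs):
    # Prefix each chunk with its title when the metadata provides one.
    blocks = []
    for d in docs:
        title = d.metadata.get("title")
        blocks.append(f"{title}\n{d.page_content}" if title else d.page_content)
    return "\n\n".join(blocks)


docs = [
    Document("content1", {"title": "question1"}),
    Document("content2"),
]
formatted = format_docs_with_title(docs)
print(formatted)
```

With the proposed retriever, the question would live in metadata["title"], so a prompt author can decide whether to include it instead of having it baked into page_content.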

@vmesel
Member

vmesel commented Jun 13, 2024

@lgabs want to handle this change?

@lgabs
Collaborator Author

lgabs commented Jun 13, 2024

Sure, I can do it
