Dataset Viber

Avoid the hype, check the vibe!

I've cooked up Dataset Viber, a cool set of tools to make your life easier when dealing with data for AI models. Dataset Viber is all about making your data prep journey smooth and fun. It's not for team collaboration or production, nor trying to be all fancy and formal - just a bunch of cool tools to help you collect feedback and do vibe-checks as an AI engineer or lover. Want to see it in action? Just plug it in and start vibing with your data. It's that easy!

CollectorInterface: Lazily collect data of model interactions without human annotation.
AnnotatorInterface: Walk through your data and annotate it with models in the loop.
Synthesizer: Synthesize data with distilabel in the loop.
BulkInterface: Explore your data distribution and annotate in bulk.

Need any tweaks or want to hear more about a specific tool? Just open an issue or give me a shout!

Note

Data is logged to a local CSV or directly to the Hugging Face Hub.
All tools also run in .ipynb notebooks.
Models in the loop through fn_model.
Input data streamers through fn_next_input.
It supports various tasks for text, chat and image modalities.
Import and export from the Hugging Face Hub or CSV files.

Tip

Code examples: src/dataset_viber/examples.
Hub examples: https://huggingface.co/dataset-viber.

Installation

You can install the package via pip:

pip install dataset-viber

Or install Synthesizer dependencies:

pip install dataset-viber[synthesizer]

Or install BulkInterface dependencies:

pip install dataset-viber[bulk]

How are we vibing?

CollectorInterface

Built on top of the gr.Interface and gr.ChatInterface to lazily collect data for interactions automatically.

collector.mp4

Hub dataset

CollectorInterface

import gradio as gr
from dataset_viber import CollectorInterface

def calculator(num1, operation, num2):
    if operation == "add":
        return num1 + num2
    elif operation == "subtract":
        return num1 - num2
    elif operation == "multiply":
        return num1 * num2
    elif operation == "divide":
        return num1 / num2

inputs = ["number", gr.Radio(["add", "subtract", "multiply", "divide"]), "number"]
outputs = "number"

interface = CollectorInterface(
    fn=calculator,
    inputs=inputs,
    outputs=outputs,
    csv_logger=False, # True if you want to log to a CSV
    dataset_name="<my_hf_org>/<my_dataset>"
)
interface.launch()

CollectorInterface.from_interface

interface = gr.Interface(
    fn=calculator,
    inputs=inputs,
    outputs=outputs
)
interface = CollectorInterface.from_interface(
   interface=interface,
   csv_logger=False, # True if you want to log to a CSV
   dataset_name="<my_hf_org>/<my_dataset>"
)
interface.launch()

CollectorInterface.from_pipeline

from transformers import pipeline
from dataset_viber import CollectorInterface

pipeline = pipeline("text-classification", model="mrm8488/bert-tiny-finetuned-sms-spam-detection")
interface = CollectorInterface.from_pipeline(
    pipeline=pipeline,
    csv_logger=False, # True if you want to log to a CSV
    dataset_name="<my_hf_org>/<my_dataset>"
)
interface.launch()

AnnotatorInterface

Built on top of the CollectorInterface to collect and annotate data and log it to the Hub.

Text

text-annotator.mp4

Hub dataset

text-classification/multi-label-text-classification

from dataset_viber import AnnotatorInterFace

texts = [
    "Anthony Bourdain was an amazing chef!",
    "Anthony Bourdain was a terrible tv persona!"
]
labels = ["positive", "negative"]

interface = AnnotatorInterFace.for_text_classification(
    texts=texts,
    labels=labels,
    multi_label=False, # True if you have multi-label data
    fn_model=None, # a callable e.g. (function or transformers pipelines) that returns `str`
    fn_next_input=None, # a function that feeds gradio components actively with the next input
    csv_logger=False, # True if you want to log to a CSV
    dataset_name=None # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()

token-classification

from dataset_viber import AnnotatorInterFace

texts = ["Anthony Bourdain was an amazing chef in New York."]
labels = ["NAME", "LOC"]

interface = AnnotatorInterFace.for_token_classification(
    texts=texts,
    labels=labels,
    fn_model=None, # a callable e.g. (function or transformers pipelines) that returns `str`
    fn_next_input=None, # a function that feeds gradio components actively with the next input
    csv_logger=False, # True if you want to log to a CSV
    dataset_name=None # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()

extractive-question-answering

from dataset_viber import AnnotatorInterFace

questions = ["Where was Anthony Bourdain located?"]
contexts = ["Anthony Bourdain was an amazing chef in New York."]

interface = AnnotatorInterFace.for_question_answering(
    questions=questions,
    contexts=contexts,
    fn_model=None, # a callable e.g. (function or transformers pipelines) that returns `str`
    fn_next_input=None, # a function that feeds gradio components actively with the next input
    csv_logger=False, # True if you want to log to a CSV
    dataset_name=None # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()

text-generation/translation/completion

from dataset_viber import AnnotatorInterFace

prompts = ["Tell me something about Anthony Bourdain."]
completions = ["Anthony Michael Bourdain was an American celebrity chef, author, and travel documentarian."]

interface = AnnotatorInterFace.for_text_generation(
    prompts=prompts, # source
    completions=completions, # optional to show initial completion / target
    fn_model=None, # a callable e.g. (function or transformers pipelines) that returns `str`
    fn_next_input=None, # a function that feeds gradio components actively with the next input
    csv_logger=False, # True if you want to log to a CSV
    dataset_name=None # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()

text-generation-preference

from dataset_viber import AnnotatorInterFace

prompts = ["Tell me something about Anthony Bourdain."]
completions_a = ["Anthony Michael Bourdain was an American celebrity chef, author, and travel documentarian."]
completions_b = ["Anthony Michael Bourdain was an cool guy that knew how to cook."]

interface = AnnotatorInterFace.for_text_generation_preference(
    prompts=prompts,
    completions_a=completions_a,
    completions_b=completions_b,
    fn_model=None, # a callable e.g. (function or transformers pipelines) that returns `str`
    fn_next_input=None, # a function that feeds gradio components actively with the next input
    csv_logger=False, # True if you want to log to a CSV
    dataset_name=None # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()

Chat and multi-modal chat

annotator-chat.mp4

Hub dataset

Tip

I recommend uploading the files files to a cloud storage and using the remote URL to avoid any issues. This can be done using Hugging Face Datasets. As shown in utils. Additionally GradioChatbot shows how to use the chatbot interface for multi-modal.

chat-classification

from dataset_viber import AnnotatorInterFace

prompts = [
    [
        {
            "role": "user",
            "content": "Tell me something about Anthony Bourdain."
        },
        {
            "role": "assistant",
            "content": "Anthony Michael Bourdain was an American celebrity chef, author, and travel documentarian."
        }
    ]
]

interface = AnnotatorInterFace.for_chat_classification(
    prompts=prompts,
    labels=["toxic", "non-toxic"],
    multi_label=False, # True if you have multi-label data
    fn_model=None, # a callable e.g. (function or transformers pipelines) that returns `str`
    fn_next_input=None, # a function that feeds gradio components actively with the next input
    csv_logger=False, # True if you want to log to a CSV
    dataset_name=None # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()

chat-generation

from dataset_viber import AnnotatorInterFace

prompts = [
    [
        {
            "role": "user",
            "content": "Tell me something about Anthony Bourdain."
        }
    ]
]

completions = [
    "Anthony Michael Bourdain was an American celebrity chef, author, and travel documentarian.",
]

interface = AnnotatorInterFace.for_chat_generation(
    prompts=prompts,
    completions=completions,
    fn_model=None, # a callable e.g. (function or transformers pipelines) that returns `str`
    fn_next_input=None, # a function that feeds gradio components actively with the next input
    csv_logger=False, # True if you want to log to a CSV
    dataset_name=None # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()

chat-generation-preference

from dataset_viber import AnnotatorInterFace

prompts = [
    [
        {
            "role": "user",
            "content": "Tell me something about Anthony Bourdain."
        }
    ]
]
completions_a = [
    "Anthony Michael Bourdain was an American celebrity chef, author, and travel documentarian.",
]
completions_b = [
    "Anthony Michael Bourdain was an cool guy that knew how to cook."
]

interface = AnnotatorInterFace.for_chat_generation_preference(
    prompts=prompts,
    completions_a=completions_a,
    completions_b=completions_b,
    fn_model=None, # a callable e.g. (function or transformers pipelines) that returns `str`
    fn_next_input=None, # a function that feeds gradio components actively with the next input
    csv_logger=False, # True if you want to log to a CSV
    dataset_name=None # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()

Image and multi-modal

annotator.mp4

Hub dataset

Tip

I recommend uploading the files files to a cloud storage and using the remote URL to avoid any issues. This can be done using Hugging Face Datasets. As shown in utils.

image-classification/multi-label-image-classification

from dataset_viber import AnnotatorInterFace

images = [
    "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Anthony_Bourdain_Peabody_2014b.jpg/440px-Anthony_Bourdain_Peabody_2014b.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/8/85/David_Chang_David_Shankbone_2010.jpg"
]
labels = ["anthony-bourdain", "not-anthony-bourdain"]

interface = AnnotatorInterFace.for_image_classification(
    images=images,
    labels=labels,
    multi_label=False, # True if you have multi-label data
    fn_model=None, # a callable e.g. (function or transformers pipelines) that returns `str`
    fn_next_input=None, # a function that feeds gradio components actively with the next input
    csv_logger=False, # True if you want to log to a CSV
    dataset_name=None # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()

image-generation

from dataset_viber import AnnotatorInterFace

prompts = [
    "Anthony Bourdain laughing",
    "David Chang wearing a suit"
]
images = [
    "https://upload.wikimedia.org/wikipedia/commons/8/85/David_Chang_David_Shankbone_2010.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Anthony_Bourdain_Peabody_2014b.jpg/440px-Anthony_Bourdain_Peabody_2014b.jpg",
]

interface = AnnotatorInterFace.for_image_generation(
    prompts=prompts,
    completions=images,
    fn_model=None, # a callable e.g. (function or transformers pipelines) that returns `str`
    fn_next_input=None, # a function that feeds gradio components actively with the next input
    csv_logger=False, # True if you want to log to a CSV
    dataset_name=None # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)

interface.launch()

image-description

from dataset_viber import AnnotatorInterFace

images = [
    "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Anthony_Bourdain_Peabody_2014b.jpg/440px-Anthony_Bourdain_Peabody_2014b.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/8/85/David_Chang_David_Shankbone_2010.jpg"
]
descriptions = ["Anthony Bourdain laughing", "David Chang wearing a suit"]

interface = AnnotatorInterFace.for_image_description(
    images=images,
    descriptions=descriptions, # optional to show initial descriptions
    fn_model=None, # a callable e.g. (function or transformers pipelines) that returns `str`
    fn_next_input=None, # a function that feeds gradio components actively with the next input
    csv_logger=False, # True if you want to log to a CSV
    dataset_name=None # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()

image-question-answering/visual-question-answering

from dataset_viber import AnnotatorInterFace

images = [
    "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Anthony_Bourdain_Peabody_2014b.jpg/440px-Anthony_Bourdain_Peabody_2014b.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/8/85/David_Chang_David_Shankbone_2010.jpg"
]
questions = ["Who is this?", "What is he wearing?"]
answers = ["Anthony Bourdain", "a suit"]

interface = AnnotatorInterFace.for_image_question_answering(
    images=images,
    questions=questions, # optional to show initial questions
    answers=answers, # optional to show initial answers
    fn_model=None, # a callable e.g. (function or transformers pipelines) that returns `str`
    fn_next_input=None, # a function that feeds gradio components actively with the next input
    csv_logger=False, # True if you want to log to a CSV
    dataset_name=None # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()

image-generation-preference

from dataset_viber import AnnotatorInterFace

prompts = [
    "Anthony Bourdain laughing",
    "David Chang wearing a suit"
]

images_a = [
    "https://upload.wikimedia.org/wikipedia/commons/8/85/David_Chang_David_Shankbone_2010.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Anthony_Bourdain_Peabody_2014b.jpg/440px-Anthony_Bourdain_Peabody_2014b.jpg",
]

images_b = [
    "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Anthony_Bourdain_Peabody_2014b.jpg/440px-Anthony_Bourdain_Peabody_2014b.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/8/85/David_Chang_David_Shankbone_2010.jpg"
]

interface = AnnotatorInterFace.for_image_generation_preference(
    prompts=prompts,
    completions_a=images_a,
    completions_b=images_b,
    fn_model=None, # a callable e.g. (function or transformers pipelines) that returns `str`
    fn_next_input=None, # a function that feeds gradio components actively with the next input
    csv_logger=False, # True if you want to log to a CSV
    dataset_name=None # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()

Synthesizer

Built on top of the distilabel to synthesize data with models in the loop.

text-classification

from dataset_viber import AnnotatorInterFace
from dataset_viber.synthesizer import Synthesizer

synthesizer = Synthesizer.for_text_classification(task_description="IMDB movie reviews")

interface = AnnotatorInterFace.for_text_classification(
    fn_next_input=synthesizer,
    labels=["positive", "negative"]
)
interface.launch()

text-generation

from dataset_viber import AnnotatorInterFace
from dataset_viber.synthesizer import Synthesizer

synthesizer = Synthesizer.for_text_generation(task_description="A phone company customer support expert")

interface = AnnotatorInterFace.for_text_generation(
    fn_next_input=synthesizer
)
interface.launch()

chat-classification

from dataset_viber import AnnotatorInterFace
from dataset_viber.synthesizer import Synthesizer

synthesizer = Synthesizer.for_chat_classification(task_description="A phone company customer support expert")

interface = AnnotatorInterFace.for_chat_classification(
    fn_next_input=synthesizer,
    labels=["positive", "negative"]
)
interface.launch()

chat-generation

from dataset_viber import AnnotatorInterFace
from dataset_viber.synthesizer import Synthesizer

synthesizer = Synthesizer.for_chat_generation(task_description="A phone company customer support expert")

interface = AnnotatorInterFace.for_chat_generation(
    fn_next_input=synthesizer
)
interface.launch()

chat-generation-preference

from dataset_viber import AnnotatorInterFace
from dataset_viber.synthesizer import Synthesizer

synthesizer = Synthesizer.for_chat_generation_preference(task_description="A phone company customer support expert")

interface = AnnotatorInterFace.for_chat_generation_preference(
    fn_next_input=synthesizer
)
interface.launch()

BulkInterface

Built on top of the Dash, plotly-express, umap-learn, and fast-sentence-transformers to embed and understand your distribution and annotate your data.

explorer.mp4

Hub dataset

text-visualization

from dataset_viber import BulkInterface
from datasets import load_dataset

ds = load_dataset("SetFit/ag_news", split="train[:2000]")

interface: BulkInterface = BulkInterface.for_text_visualization(
    ds.to_pandas()[["text", "label_text"]],
    content_column='text',
    label_column='label_text',
)
interface.launch()

text-classification

from dataset_viber import BulkInterface
from datasets import load_dataset

ds = load_dataset("SetFit/ag_news", split="train[:2000]")
df = ds.to_pandas()[["text", "label_text"]]

interface = BulkInterface.for_text_classification(
    dataframe=df,
    content_column='text',
    label_column='label_text',
    labels=df['label_text'].unique().tolist()
)
interface.launch()

chat-visualization

from dataset_viber.bulk import BulkInterface
from datasets import load_dataset

ds = load_dataset("argilla/distilabel-capybara-dpo-7k-binarized", split="train[:1000]")
df = ds.to_pandas()[["chosen"]]

interface = BulkInterface.for_chat_visualization(
    dataframe=df,
    chat_column='chosen',
)
interface.launch()

chat-classification

from dataset_viber.bulk import BulkInterface
from datasets import load_dataset

ds = load_dataset("argilla/distilabel-capybara-dpo-7k-binarized", split="train[:1000]")
df = ds.to_pandas()[["chosen"]]

interface = BulkInterface.for_chat_classification(
    dataframe=df,
    chat_column='chosen',
    labels=["math", "science", "history", "question seeking"],
)
interface.launch()

Utils

Shuffle inputs in the same order

When working with multiple inputs, you might want to shuffle them in the same order.

def shuffle_lists(*lists):
    if not lists:
        return []

    # Get the length of the first list
    length = len(lists[0])

    # Check if all lists have the same length
    if not all(len(lst) == length for lst in lists):
        raise ValueError("All input lists must have the same length")

    # Create a list of indices and shuffle it
    indices = list(range(length))
    random.shuffle(indices)

    # Reorder each list based on the shuffled indices
    return [
        [lst[i] for i in indices]
        for lst in lists
    ]

Random swap to randomize completions

When working with multiple completions, you might want to swap out the completions at the same index, where each completion index x is swapped with a random completion at the same index. This is useful for preference learning.

def swap_completions(*lists):
    # Assuming all lists are of the same length
    length = len(lists[0])

    # Check if all lists have the same length
    if not all(len(lst) == length for lst in lists):
        raise ValueError("All input lists must have the same length")

    # Convert the input lists (which are tuples) to a list of lists
    lists = [list(lst) for lst in lists]

    # Iterate over each index
    for i in range(length):
        # Get the elements at index i from all lists
        elements = [lst[i] for lst in lists]

        # Randomly shuffle the elements
        random.shuffle(elements)

        # Assign the shuffled elements back to the lists
        for j, lst in enumerate(lists):
            lst[i] = elements[j]

    return lists

Load remote image URLs from Hugging Face Hub

When working with images, you might want to load remote URLs from the Hugging Face Hub.

from datasets import Dataset, Image, load_dataset

dataset = load_dataset(
    "my_hf_org/my_image_dataset"
).cast_column("my_image_column", Image(decode=False))
dataset[0]["my_image_column"]
# {'bytes': None, 'path': 'path_to_image.jpg'}

Contribute and development setup

First, install PDM.

Then, install the environment, this will automatically create a .venv virtual env and install the dev environment.

pdm install

Lastly, run pre-commit for formatting on commit.

pre-commit install

Name		Name	Last commit message	Last commit date
Latest commit History 166 Commits
.github/workflows		.github/workflows
src/dataset_viber		src/dataset_viber
tests		tests
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
LICENSE		LICENSE
LICENSE_HEADER		LICENSE_HEADER
README.md		README.md
pdm.lock		pdm.lock
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Dataset Viber

Avoid the hype, check the vibe!

Installation

How are we vibing?

CollectorInterface

AnnotatorInterface

Text

Chat and multi-modal chat

Image and multi-modal

Synthesizer

BulkInterface

Utils

Contribute and development setup

References

Logo

Inspirations

About

Releases

Packages

Languages

License

plurigrid/dataset-viber

Folders and files

Latest commit

History

Repository files navigation

Dataset Viber

Avoid the hype, check the vibe!

Installation

How are we vibing?

CollectorInterface

AnnotatorInterface

Text

Chat and multi-modal chat

Image and multi-modal

Synthesizer

BulkInterface

Utils

Contribute and development setup

References

Logo

Inspirations

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages