Branch detection for a specific cluster or cut_distance #660

Open
m-macias opened this issue Nov 21, 2024 · 2 comments

Comments

@m-macias

Hello,
I am trying to combine HDBSCAN with BranchDetector and am not sure whether the current implementation supports my use case. I use the HDBSCAN single_linkage_tree to find all lambda values and then select a smaller subset of these values as candidate thresholds for the clustering cut_distance parameter. This works well and I can get a list of labels for a given lambda value, but I would also like to find the branching clusters that exist at the given cut_distance.
Both BranchDetector and detect_branches_in_clusters seem to operate on the whole HDBSCAN model, which lacks the flexibility described above, and I would like to avoid recomputing HDBSCAN from scratch for each cut_distance. Alternatively, I would like to find branches for a specific cluster in a subtree rather than passing the whole model, but I guess that might not be possible due to the nature of the algorithm.

Please let me know how I can do this; maybe I am missing a crucial parameter in the documentation. Thanks!

@lmcinnes
Collaborator

What you want is certainly not available by default. I would suggest you reach out to @JelmerBot, the author of the BranchDetector, to see if you can work out how best to implement what you need.

@JelmerBot
Contributor

JelmerBot commented Dec 2, 2024

What you could do now is clone and partially overwrite the hdbscan object with labels from a cut_distance. Such an object would contain everything the branch detection stage needs to function and can be made using the snippet below. To analyse a single subtree, manipulate the labels so that points in the other subtrees have the noise label.

import numpy as np
from copy import copy


def clone_with_labels(clusterer, labels, probabilities=None):
    """Shallow-copy a fitted hdbscan object and override its labelling."""
    clone = copy(clusterer)  # shallow copy: other attributes stay shared with the original
    # used to know which point is in which cluster
    clone.labels_ = np.asarray(labels).astype(clusterer.labels_.dtype)
    # used for weighted centroid computation
    if probabilities is not None:
        clone.probabilities_ = np.asarray(probabilities).astype(clusterer.probabilities_.dtype)
    else:
        clone.probabilities_ = np.ones_like(clusterer.probabilities_)
    # used to read the number of clusters
    clone.cluster_persistence_ = [None for _ in range(clone.labels_.max() + 1)]
    return clone

Assuming clusterer is the original hdbscan object and cut_labels contains the labels at a cut_distance, it can be used as:

branch_detector = BranchDetector(
    # parameters
).fit(clone_with_labels(clusterer, cut_labels))

or

result = detect_branches_in_clusters(
    clone_with_labels(clusterer, cut_labels), 
    # other parameters
)
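
To analyse a single subtree as suggested above, the labels from the cut can be masked before cloning. A minimal sketch, assuming cut_labels is an integer NumPy array in which -1 marks noise; keep_only_cluster is a hypothetical helper, not part of hdbscan. The surviving cluster is relabelled to 0 so that labels_.max() + 1 in clone_with_labels counts exactly one cluster; whether the branch detector tolerates empty cluster ids otherwise is not something this sketch relies on.

```python
import numpy as np


def keep_only_cluster(cut_labels, cluster_id):
    """Return labels where only `cluster_id` survives (relabelled to 0).

    Hypothetical helper: every other point gets the noise label -1.
    """
    cut_labels = np.asarray(cut_labels)
    return np.where(cut_labels == cluster_id, 0, -1).astype(cut_labels.dtype)


# Example: three clusters plus noise; keep only cluster 2.
cut_labels = np.array([0, 0, 1, 2, 2, -1, 1, 2])
single = keep_only_cluster(cut_labels, 2)
```

The masked array could then be passed on as clone_with_labels(clusterer, single).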

This approach is not very clean or robust. The clone is a shallow copy, so every attribute that is not overwritten still references the same object as the original hdbscan object. Modifying such a shared object in place through the clone also changes the original, which can lead to subtle bugs if one is not careful.
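
The sharing can be illustrated without hdbscan. A minimal sketch using a stand-in class (FakeClusterer is hypothetical and only mimics attribute storage on a fitted estimator):

```python
from copy import copy, deepcopy

import numpy as np


class FakeClusterer:
    """Hypothetical stand-in for a fitted estimator holding NumPy arrays."""

    def __init__(self):
        self.labels_ = np.array([0, 0, 1, 1])
        self.probabilities_ = np.ones(4)


original = FakeClusterer()
clone = copy(original)  # shallow copy: attributes point at the same arrays

# Rebinding an attribute on the clone leaves the original untouched.
clone.labels_ = np.array([0, 1, 1, -1])
rebinding_is_safe = original.labels_.tolist() == [0, 0, 1, 1]

# Mutating a shared array in place leaks into the original.
clone.probabilities_[0] = 0.5
mutation_leaked = original.probabilities_[0] == 0.5

# A deep copy avoids the sharing, at the cost of duplicating every array.
independent = deepcopy(original)
independent.probabilities_[1] = 0.25
deep_copy_is_isolated = original.probabilities_[1] == 1.0
```

Overwriting attributes by assignment is the safe case; in-place edits are the risky one. For a real clusterer a deep copy would presumably also duplicate the stored trees, so it may be costly on large data sets.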

Facilitating this use case properly in the branch detection API could be fairly straightforward. We could add a cut_distance parameter and evaluate dbscan_clustering to obtain a labelling when the parameter is given. However, it seems wasteful to recompute the dbscan clustering if one has already done that. Alternatively, we could add a more flexible override_labels parameter. Then, the implementation has to check that the labelling and minimum spanning tree match (i.e., MST edges within a cluster must form a single connected component). This has a similar downside when cut_distance labels are provided, because they do not need this additional check. Which approach would fit best?
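
The consistency check that override_labels would need could look roughly like the sketch below. It assumes the MST is available as an array of (from, to, distance) rows that forms a spanning tree over all points, and that -1 marks noise; labels_match_mst is a hypothetical name, not an existing function. Because any subgraph of a tree is a forest, the points of a cluster are connected through within-cluster edges exactly when those edges number one fewer than the points.

```python
import numpy as np


def labels_match_mst(mst_edges, labels):
    """Check that each cluster is a single connected component of the MST.

    Assumes `mst_edges` has rows (from, to, distance) and forms a spanning
    tree over all points; `labels` uses -1 for noise. Hypothetical helper.
    """
    labels = np.asarray(labels)
    mst_edges = np.asarray(mst_edges)
    src = mst_edges[:, 0].astype(np.intp)
    dst = mst_edges[:, 1].astype(np.intp)

    n_clusters = labels.max() + 1
    if n_clusters <= 0:
        return True  # only noise, nothing to check

    # Keep only edges whose endpoints share the same non-noise label.
    within = (labels[src] == labels[dst]) & (labels[src] >= 0)
    edge_counts = np.bincount(labels[src][within], minlength=n_clusters)
    point_counts = np.bincount(labels[labels >= 0], minlength=n_clusters)

    # A forest on k >= 1 points is connected iff it has k - 1 edges.
    occupied = point_counts > 0
    return bool(np.all(edge_counts[occupied] == point_counts[occupied] - 1))


# Path graph 0-1-2-3-4 as a toy spanning tree.
edges = np.array([[0, 1, 0.1], [1, 2, 0.2], [2, 3, 0.9], [3, 4, 0.3]])
ok = labels_match_mst(edges, [0, 0, 0, 1, 1])      # contiguous clusters
broken = labels_match_mst(edges, [0, 0, 1, 0, 1])  # cluster 0 is split
```

The check is linear in the number of edges, so it should be cheap relative to the clustering itself, but it is still redundant for labels that come straight from a cut_distance.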
