Branch detection for a specific cluster or cut_distance #660

m-macias · 2024-11-21T07:17:10Z

Hello,
I am currently trying to combine HDBSCAN with BranchDetector and I am not sure if the current implementation supports my use case. I use HDBSCAN single_linkage_tree to find all lambda values and then I select a smaller subset of these values as possible thresholds for the clustering cut_distance parameter. It works great and I am able to get a list of labels for a given lambda value, but I would like to find the branching clusters that are available for the given cut_distance as well.
Both BranchDetector and detect_branches_in_clusters seem to work on the whole HDBSCAN model, which lacks the described flexibility, and I would like to avoid recomputing HDBSCAN from scratch for each cut_distance. Alternatively, I would like to find branches for a specific cluster in a subtree rather than passing the whole model, but I guess that might not be possible due to the nature of the algorithm.

Please let me know how I can do this; I may be missing a crucial parameter in the documentation. Thanks!


lmcinnes · 2024-11-21T16:06:12Z

What you want is certainly not available by default. I would suggest you reach out to @JelmerBot, the author of the BranchDetector, to see if you can work out how best to implement what you need.

JelmerBot · 2024-12-02T14:12:08Z

What you could do now is clone and partially overwrite the hdbscan object with labels from a cut_distance. Such an object would contain everything the branch detection stage needs to function and can be made using the snippet below. To analyse a single subtree, manipulate the labels so that points in the other subtrees have the noise label.

import numpy as np
from copy import copy


def clone_with_labels(clusterer, labels, probabilities=None):
    """Shallow-copy a fitted clusterer and overwrite its labelling."""
    clone = copy(clusterer)  # Shallow copy: untouched attributes are shared with the original
    # used to know which point is in which cluster
    clone.labels_ = np.asarray(labels).astype(clusterer.labels_.dtype)
    # used for weighted centroid computation
    if probabilities is not None:
        clone.probabilities_ = np.asarray(probabilities).astype(
            clusterer.probabilities_.dtype
        )
    else:
        clone.probabilities_ = np.ones_like(clusterer.probabilities_)
    # used to read the number of clusters
    clone.cluster_persistence_ = [None for _ in range(clone.labels_.max() + 1)]
    return clone

Assuming clusterer is the original hdbscan object and cut_labels contains the labels at a cut_distance, it can be used as:

branch_detector = BranchDetector(
    #parameters
).fit(clone_with_labels(clusterer, cut_labels))

or

result = detect_branches_in_clusters(
    clone_with_labels(clusterer, cut_labels), 
    # other parameters
)
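The single-subtree case mentioned above (give every point outside the cluster of interest the noise label) could look roughly like the sketch below. It assumes cut_labels is an integer NumPy array that uses -1 for noise; keep_only_cluster is a hypothetical helper name and not part of hdbscan.

```python
import numpy as np


def keep_only_cluster(labels, target):
    """Return labels where `target` becomes cluster 0 and everything else is noise (-1)."""
    labels = np.asarray(labels)
    return np.where(labels == target, 0, -1).astype(labels.dtype)


# Toy labels standing in for the output of a cut_distance clustering.
cut_labels = np.array([0, 0, 1, 1, 2, -1, 1])
single = keep_only_cluster(cut_labels, target=1)
```

The kept cluster is relabelled to 0 because clone_with_labels derives the number of clusters from labels.max() + 1; keeping the original label would leave empty cluster slots below it. The result can then be passed as clone_with_labels(clusterer, single).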

This approach is not very clean or robust. The cloned object has references into the original hdbscan object. Changing the clone can change the original object. That can lead to subtle bugs if one is not careful.
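To make the aliasing risk concrete, here is a small sketch using a stand-in class (Model is hypothetical and is not the real HDBSCAN class). It shows that rebinding an attribute on a shallow copy is safe, while mutating a shared attribute in place also changes the original.

```python
from copy import copy, deepcopy


class Model:
    """Stand-in for a fitted estimator; not the real HDBSCAN class."""

    def __init__(self):
        self.labels_ = [0, 0, 1]
        self.params = {"min_cluster_size": 5}


original = Model()
clone = copy(original)

# Rebinding an attribute on the clone leaves the original untouched.
clone.labels_ = [0, 1, 1]

# Mutating a shared object in place is visible through both references.
clone.params["min_cluster_size"] = 10

# A deep copy does not share state, at the cost of duplicating all data.
independent = deepcopy(original)
independent.params["min_cluster_size"] = 20
```

clone_with_labels only rebinds labels_, probabilities_ and cluster_persistence_, so it falls in the safe category as long as nothing mutates the clone's other attributes in place afterwards.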

Facilitating this use case properly in the branch detection API could be fairly straightforward. We could add a cut_distance parameter and evaluate dbscan_clustering to obtain a labelling when the parameter is given. However, it seems wasteful to recompute the dbscan clustering if one has already done that. Alternatively, we could add a more flexible override_labels parameter. Then, the implementation has to check that the labelling and minimum spanning tree match (i.e., MST edges within a cluster must form a single connected component). This has a similar downside when cut_distance labels are provided, because they do not need this additional check. Which approach would fit best?
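For illustration, the consistency check described for override_labels could be sketched with a union-find over the MST edges. This is not the library's implementation: labels_match_mst is a hypothetical name, and the (point_a, point_b, distance) edge layout is an assumption about how the MST is stored.

```python
def labels_match_mst(mst_edges, labels):
    """Check that each cluster's points are connected by intra-cluster MST edges.

    mst_edges: iterable of (point_a, point_b, distance) rows (assumed layout).
    labels: per-point cluster labels, with -1 marking noise.
    """
    parent = list(range(len(labels)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Merge the endpoints of every edge that stays inside one cluster.
    for a, b, _ in mst_edges:
        a, b = int(a), int(b)
        if labels[a] != -1 and labels[a] == labels[b]:
            parent[find(a)] = find(b)

    # A consistent cluster ends up with exactly one union-find root.
    roots = {}
    for point, label in enumerate(labels):
        if label == -1:
            continue
        roots.setdefault(label, set()).add(find(point))
    return all(len(r) == 1 for r in roots.values())


# Toy MST: the chain 0-1-2-3-4.
edges = [(0, 1, 0.1), (1, 2, 0.2), (2, 3, 0.9), (3, 4, 0.3)]
ok = labels_match_mst(edges, [0, 0, 0, 1, 1])
bad = labels_match_mst(edges, [0, 0, 1, 0, 1])
```

Here ok is True because each cluster is a contiguous piece of the chain, while bad is False because cluster 0 contains point 3, which no intra-cluster edge connects to points 0 and 1. The check takes roughly linear time in the number of edges, which is the extra cost that labels from a cut_distance would pay unnecessarily.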
