
RFC: add bincount #812

Open
ogrisel opened this issue Jun 5, 2024 · 5 comments
Labels: API extension (Adds new functions or objects to the API) · Needs Discussion (Needs further discussion) · RFC (Request for comments; feature requests and proposed changes)

Comments


ogrisel commented Jun 5, 2024

It seems quite widely adopted, although I have not checked how compatible the various implementations are.

JAX in particular requires an extra length argument to statically define the length of the output array so that JIT compilation is possible, and Dask uses a (nan,) shape to represent the data-dependent output shape.
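For concreteness, here is a minimal sketch of what that looks like with jax.numpy.bincount's keyword-only length argument (the array values are just an illustration):

import jax
import jax.numpy as jnp

x = jnp.array([0, 1, 1, 3])

# Without `length`, the output size would depend on max(x), which jit cannot trace.
# A static `length` pins the output shape to (5,) regardless of the data.
counts = jnp.bincount(x, length=5)  # [1, 2, 0, 1, 0]

@jax.jit
def fixed_length_bincount(values):
    return jnp.bincount(values, length=5)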

kgryte added the RFC, Needs Discussion, and API extension labels on Jun 5, 2024
rgommers (Member) commented

Thanks for the suggestion @ogrisel!

bincount is indeed a heavily used function. I did a quick tally: scikit-learn currently has 117 usages of np.bincount, and SciPy 38. It's also used as a primitive for histogram & co, and it's not easy to implement efficiently with other existing functions in the standard. So from that perspective, I think it meets the bar for inclusion fairly easily.

The signatures do all seem compatible. There is at least one behavioral difference I spotted: the accepted input is non-negative integer values, but negative values are not handled the same way everywhere: NumPy raises, while JAX clips.
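A rough sketch of that difference, based on the behavior described above (the exact NumPy error message is not guaranteed, and the JAX clipping behavior is as reported in this thread):

import numpy as np
import jax.numpy as jnp

x = [1, 2, -1]

# NumPy rejects negative input values.
try:
    np.bincount(x)
except ValueError as exc:
    print("NumPy raises:", exc)

# JAX, per the comparison above, clips negative values instead of raising.
print("JAX returns:", jnp.bincount(jnp.array(x), length=3))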

The NumPy issue tracker contains a lot of open issues about bincount, so the constraints on values, weights etc. seem to need careful specification.

The PyTorch docs also contain a warning about non-deterministic gradients, probably related to the data-dependent behavior.

JAX in particular requires an extra length argument to statically define the length of the output array so that JIT compilation is possible, and Dask uses a (nan,) shape to represent the data-dependent output shape.

Yes, that's non-ideal. CuPy also adds a warning about bincount possibly causing synchronization.

The number of APIs that use data-dependent shapes is still quite low. As of now, I think this is the full list:

  • boolean indexing
  • the four unique_* functions
  • repeat
  • nonzero
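For reference, a small sketch of why bincount would otherwise join that list: its output length depends on the values in the input (NumPy shown here; minlength only sets a lower bound):

import numpy as np

# The output shape depends on the data: len(result) == max(x) + 1,
# or `minlength` if that is larger.
print(np.bincount(np.array([0, 1, 1])).shape)                  # (2,)
print(np.bincount(np.array([0, 1, 7])).shape)                  # (8,)
print(np.bincount(np.array([0, 1, 1]), minlength=10).shape)    # (10,)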


NeilGirdhar commented Jun 15, 2024

Just curious, but are you sure you want to keep using the name bincount? It seems redundant: "binning" means counting. What about bin_integers, bin, bucket, or integer_histogram?


ogrisel commented Jun 16, 2024

To me, 'binning' means assigning to a bin, e.g. an array of continuous (floating point) values is transformed into an array of discrete values (bin indices), typically represented as integers. This is a synonym for discretization.

Then you can aggregate those values as a histogram of per bin counts if you wish, but that is not the only thing you can do. In machine learning, binned values are used in many different ways; only some of them involve histograms, and they often operate only on filtered subsets of the original binned array.
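A small sketch of that distinction, using np.digitize purely as one example of the binning step (the bin edges here are arbitrary):

import numpy as np

values = np.array([0.1, 0.4, 0.35, 0.8])
edges = np.array([0.0, 0.25, 0.5, 0.75, 1.0])

# Binning in the sense above: map continuous values to discrete bin indices.
bin_indices = np.digitize(values, edges) - 1     # [0, 1, 1, 3]

# Counting is a separate, optional aggregation on top of the binned values.
counts = np.bincount(bin_indices, minlength=len(edges) - 1)   # [1, 2, 0, 1]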


NeilGirdhar commented Jun 16, 2024

To me, 'binning' means assigning to a bin, e.g. an array of continuous (floating point) values is transformed into an array of discrete values (bin indices), typically represented as integers. This is a synonym for discretization.

Right, I agree that that's what happens when you "bin" floating-point values. This function (bincount) is just the integer analogue of binning floating-point values. In the integer case, the bins can be discrete values (or ranges, even though bincount doesn't support ranges).

Then you can aggregate those values as a histogram of per bin counts if you wish,

I understand what you're saying. But it's also a fact that the result of bincount is by definition already a histogram. This is clear because you can convert any call to bincount into a similar call to histogram (possibly massaging the output to get the types and shapes right).
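For example, a minimal sketch of that equivalence for non-negative integer data with unit-width bins (the values here are just for illustration):

import numpy as np

x = np.array([0, 1, 1, 3])

counts = np.bincount(x)                                   # [1, 2, 0, 1]
hist, _ = np.histogram(x, bins=np.arange(x.max() + 2))    # [1, 2, 0, 1]

assert np.array_equal(counts, hist)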

I still think that "bincount" is a bad name. It is just "binning", and its name should reflect the technical term that people use. No one calls it the "bincount" or "bincounting".

If I had to guess, I think its name reflects a very human tendency to combine two names that would have each been fine:

  • numpy.bin would have been fine, or
  • numpy.count might have been okay.

I think someone just glued these two names. This is like when people say "step foot" when "set foot" or "step" would each be fine, but the combination just isn't.


NeilGirdhar commented Jun 16, 2024

An alternative to choosing a different name would be to replace bincount with a true integer_histogram:

def integer_histogram(x: Array,
                      /,
                      bins: int | Array,
                      range: None | tuple[int, int] = None,
                      density: bool = False,
                      weights: Array | None = None,
                      ) -> Array:
    """Compute the histogram of an integer dataset.

    Args:
        x: The input data.  The histogram is computed over the flattened array.
        bins: If bins is an int, it defines the number of equal-width bins in the given range.
            If bins is an integer array, it defines a monotonically increasing array of bin edges,
            including the rightmost edge, which allows for non-uniform bin widths.
        range: The lower and upper range of the bins.  range[1] - range[0] + 1 must be divisible by
            the integer bins.
        density: If False, the result will contain the number of samples in each bin.  If True, the
            result is the value of the probability density function at the bin, normalized such that
            the integral over the range is 1.  Note that the sum of the histogram values will not be
            equal to 1 unless bins of unity width are chosen; it is not a probability mass function.
        weights: An array of weights, of the same shape as x.  Each value in x only contributes its
            associated weight towards the bin count (instead of 1).  If density is True, the weights
            are normalized so that the integral of the density over the range remains 1.
    """
    ...

It's just a generalization that supports normalization and general-width bins.

Scanning how bincount is actually used, it seems that normalization is very common. So even if we don't add range and a general bins argument, I think it might be good to add density to cover this common case.
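A sketch of the manual pattern that density=True would replace; for unit-width integer bins, the density is simply the counts divided by their sum (values here are illustrative):

import numpy as np

x = np.array([0, 1, 1, 3])

counts = np.bincount(x)
density = counts / counts.sum()   # what density=True would compute for unit-width bins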
