
Histogram estimator correction #53

Merged · 19 commits · Sep 3, 2024
Conversation

@Mattehub (Collaborator) commented Jul 8, 2024

Thanks for contributing a pull request!

Please be aware that we are a loose team of volunteers so patience is
necessary. Assistance handling other issues is very welcome. We value
all user contributions, no matter how minor they are. If we are slow to
review, either the pull request needs some benchmarking, tinkering,
convincing, etc. or more likely the reviewers are simply busy. In either
case, we ask for your understanding during the review process.

Again, thanks for contributing!

Reference issue

Example: Fixes #1234.

What does this implement/fix?

Explain your changes.

Additional information

Any additional information you think is important.

@EtienneCmb (Collaborator)

Ciao @Mattehub, can you please describe your pull request?

@Mattehub (Collaborator, Author) commented Jul 8, 2024

Sorry for not describing it, @EtienneCmb. I see now that a few tests are failing. I found that a possible way to estimate the entropy of a continuous variable by binning it is the histogram estimator. The way we were doing it before has a bias, and this estimator is designed to avoid that bias. So I implemented the possibility of estimating the entropy with this estimator. It is now possible to do that automatically by giving the bin_siz as input to the entropy_bin function. Concretely, what changed is:

  • the entropy_bin function allows one more input, bin_siz
  • the digitize function, when use_sklearn = False, gives one more output, bin_siz

Important point: do you want me to add one more estimator that computes the entropy with the histogram estimator directly from the continuous data? In that case, "binning" would give the entropy of a discrete variable, while the "histogram estimator" would give the entropy of a continuous variable, using the estimator described above.

PS: I sent some messages about this on Frama.
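
For readers following the thread, here is a minimal numpy sketch of the correction being discussed (illustrative only; the function name and details are not the PR's actual code). Binning gives the entropy of the discretized variable; adding the log of the bin size turns it into an estimate of the differential entropy of the continuous variable:

import numpy as np

def entropy_histogram(x, n_bins=8):
    # Illustrative histogram estimator of differential entropy.
    counts, edges = np.histogram(x, bins=n_bins)
    bin_size = edges[1] - edges[0]
    p = counts[counts > 0] / counts.sum()
    # Plug-in entropy of the binned (discrete) variable, in nats...
    h_binned = -np.sum(p * np.log(p))
    # ...plus log(bin_size), which removes the dependence on the
    # discretization and approximates the differential entropy.
    return h_binned + np.log(bin_size)

The plug-in term alone changes with the bin width; that dependence is the bias the log(bin_size) correction removes, to first order.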

@EtienneCmb (Collaborator)

Ok, I propose a few modifications:

  • the tests are failing because the digitize function returns two outputs. If you never use the second one, then I think it's not needed. We can leave the function as it was, and the same for the tests
  • can you rename it to bin_size? It's more explicit

Can you please provide the resources you used?

@Mattehub (Collaborator, Author) commented Jul 8, 2024

Thanks for the answer! I changed the test so that it passes. The thing is that, without changing the functions as I did, the histogram estimator cannot easily be implemented by users. The aim of my pull request is that users, by combining the functions we provide, can perform histogram estimation (continuous data -> binning, bin size -> entropy with correction). Another possibility would be to provide another entropy method in which all three steps are done automatically; in that case we could call it the "histogram" estimator. We would then have, among the possible methods: "binning" for discrete data, and "histogram" for continuous data, which discretizes it first. What do you think about it?

Here are two references from which I took the information: document, and Wikipedia.

For bin_size, I completely agree! I am changing it.
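
To make the proposed pipeline concrete, here is a hypothetical usage sketch (the import paths and signatures below are assumed from this conversation, not taken from the final API):

import numpy as np
from hoi.core import get_entropy   # assumed import path
from hoi.utils import digitize     # assumed import path

x = np.random.randn(1, 1000)  # one continuous feature

# Manual three-step pipeline: bin the data, keep the bin size,
# then compute the bias-corrected binning entropy.
x_binned, bin_size = digitize(x, n_bins=8, use_sklearn=False)
h_manual = get_entropy("binning", bin_size=bin_size)(x_binned)

# Proposed shortcut: a single "histogram" method that performs the
# three steps automatically on continuous data.
h_hist = get_entropy("histogram", n_bins=8)(x)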

@Mattehub (Collaborator, Author) commented Jul 17, 2024

Hi @EtienneCmb, sorry to bother you here. I have a little issue with the changes to the functions. To use digitize within entropy_hist, I had to substitute the functions within digitize from numpy to jax.numpy. I think this is one of the things making the tests fail. Before changing more lines, I wanted to make sure it is OK to have jnp instead of np in the functions within digitize. You can see the details in the changes to the file utils.stats.

Do you think it is ok?
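
For context, a minimal sketch of what a jax.numpy based digitize_1d could look like (purely illustrative; this is not the actual code in utils.stats), showing that jnp is essentially a drop-in replacement for np here:

import jax.numpy as jnp

def digitize_1d(x_1d, n_bins):
    # Uniform binning of a 1D array; jnp mirrors the numpy API.
    edges = jnp.linspace(x_1d.min(), x_1d.max(), n_bins + 1)
    # Bin index of each sample, in 0 .. n_bins - 1.
    return jnp.clip(jnp.digitize(x_1d, edges[1:-1]), 0, n_bins - 1)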

  if not use_sklearn:
-     return np.apply_along_axis(digitize_1d, axis, x, n_bins)
+     return np.apply_along_axis(digitize_1d, axis, x, n_bins), b_siz
Collaborator:

It's a bit weird to return two arguments in one case and only one in the other. It's confusing; could we improve it?

Collaborator:

@Mattehub, a few points about this function:

  1. Currently, digitize_1d is applied separately to each dimension of x. Maybe the binning should be applied to the entire matrix. Can you test?
  2. One way to have digitize return two outputs is to define an output dict that we can fill in both branches. Something like:
# prepare parameter dict
parameters = dict()

if ...:
    # sklearn stuff
else:
    # manual stuff
    parameters["bin_size"] = bin_size

return x, parameters
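
A slightly more complete sketch of that suggestion (the sklearn branch, the digitize_1d helper, and the parameter names are assumptions, not the final implementation):

import numpy as np

def digitize_1d(x_1d, n_bins):
    # Uniform bins between the min and max of this 1D slice.
    edges = np.linspace(x_1d.min(), x_1d.max(), n_bins + 1)
    return np.clip(np.digitize(x_1d, edges[1:-1]), 0, n_bins - 1)

def digitize(x, n_bins=8, axis=0, use_sklearn=False):
    # Always return the same (x_binned, parameters) pair.
    parameters = dict()
    if use_sklearn:
        from sklearn.preprocessing import KBinsDiscretizer
        # KBinsDiscretizer expects (n_samples, n_features), hence the transposes.
        est = KBinsDiscretizer(n_bins=n_bins, encode="ordinal", strategy="uniform")
        x_binned = est.fit_transform(x.T).T
    else:
        # Manual binning; keep the bin width for the entropy correction.
        parameters["bin_size"] = np.ptp(x, axis=axis) / n_bins
        x_binned = np.apply_along_axis(digitize_1d, axis, x, n_bins)
    return x_binned, parameters

This way callers always unpack two values, and the bin size is simply absent when sklearn did the binning.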

"KNN-3": get_entropy("knn", k=3),
"Gaussian": get_entropy(method="gauss"),
"Kernel": get_entropy("kernel"),
Collaborator:

Can you add an example using binning in the plot_entropies example? Something like in the plot_mi example:

def ent_binning(x):
    x, bin_size = digitize(x)
    fcn_binning = get_entropy("binning", bin_size=bin_size)
    return fcn_binning(x)

And then, can you please paste the plot comparing the estimators in this conversation?

n_features, n_samples = x.shape
print(n_features, n_samples)
Collaborator:

remove print

@EtienneCmb merged commit 06e0a15 into brainets:main on Sep 3, 2024. 8 checks passed.