About training and prediction #66
I don't have access to the dataset at the moment, but it was not the RWC dataset itself; it consisted of re-synthesized vocal tracks as described in the pYIN paper, in a similar manner to the MDB-stem-synth dataset. We obtained the resynthesized files from the authors, and their labels contained continuous frequency annotations.
Thanks. Also, since you are taking just one pitch output per frame, why are you using a 'sigmoid' activation in the output layer?
It's one of the tricks used in this approach, and it is not quite orthodox for classification tasks in ML: it also uses binary cross entropy with soft labels, whereas labels are usually one-hot in classification models. We found that this combination (binary cross entropy with soft labels) worked more robustly for pitch estimation, combined with the decoding heuristic of taking the weighted average of the activations near the argmax.
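The decoding heuristic described above could be sketched roughly as follows. This is an illustrative implementation, not the repository's exact code: the bin layout (`cents_mapping`) and the averaging radius are assumptions.

```python
import numpy as np

def weighted_average_cents(activation, cents_mapping, radius=4):
    """Decode one frame's pitch (in cents) by averaging the bin centers,
    weighted by the activations within `radius` bins of the argmax.

    activation    : per-bin sigmoid outputs for one frame, shape (n_bins,)
    cents_mapping : center pitch of each bin in cents, shape (n_bins,)
    radius        : half-width of the local window (an assumed value)
    """
    center = int(np.argmax(activation))
    lo = max(0, center - radius)
    hi = min(len(activation), center + radius + 1)
    weights = activation[lo:hi]
    return float(np.sum(weights * cents_mapping[lo:hi]) / np.sum(weights))
```

Because the average is taken over a small window around the peak rather than the whole vector, a broad activation pattern can still resolve a pitch between two bin centers, which is what allows sub-bin resolution.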
Thanks for the info. May I ask how you were able to obtain soft labels? Was the data itself labelled that way? I have a similar dataset that has hard pitch frequency labels. The only way I can think of producing soft labels is to place a Gaussian around each pitch frequency, with a standard deviation of 5-10 cents.
The labels I had contained Hz values (which don't necessarily align with semitone intervals), from which I calculated the soft labels using a Gaussian-shaped curve with a standard deviation of 25 cents. You can find example code in the comments of this issue.
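A minimal sketch of that soft-label construction, assuming a CREPE-style bin layout where each output bin's center is expressed in cents relative to a 10 Hz reference; the exact mapping here is an assumption, not the repository's code:

```python
import numpy as np

def gaussian_soft_label(freq_hz, cents_mapping, std_cents=25.0):
    """Turn a hard frequency label (Hz) into a soft target vector:
    a Gaussian over the output bins, centered on the true pitch in
    cents, with a standard deviation of 25 cents.

    cents_mapping : center pitch of each bin in cents relative to 10 Hz
                    (assumed layout), shape (n_bins,)
    """
    true_cents = 1200.0 * np.log2(freq_hz / 10.0)
    return np.exp(-((cents_mapping - true_cents) ** 2) / (2.0 * std_cents ** 2))
```

Note the targets are not normalized to sum to one; with binary cross entropy, each bin is treated as an independent target in [0, 1], so a peak value near 1 at the closest bin is what matters.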
Hello,
First of all, thanks for the amazing paper and the repo!
I have a basic question: the RWC dataset's documentation says the annotations are at semitone intervals, i.e. pitch is quantized to the nearest semitone (so up to 50 cents off).
How is CREPE able to predict at 10- or 20-cent resolution?