Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
setup basic FR preprocessing #87
base: master
Are you sure you want to change the base?
setup basic FR preprocessing #87
Changes from 4 commits
3933cd0
2cd6595
f4156ff
600715d
7c34361
40fe406
fe6d5b9
934ec38
5b1a6fa
168a555
7f37c78
5090d1c
149e960
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nicolaspanel Maybe we shoud include potential spaces? I saw data like "1 er".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lissyx since their is no such case in
clips.tsv
I suggest to waitThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lissyx ok for you ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this exhaustive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the goal here is to fix existing issues
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this contain all acronyms in the current set of French texts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, the one from clips.tsv for
fr
languageThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this contains all the acronyms in fr for the current clips.tsv, then we're fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this exhaustive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as before
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fine here too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is already accounted for here. Are you running clips.tsv through CorporaCreator before suggesting these changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, I found those in clips.tsv so I had the rules.
Currently it is not clear which case are handle and where due to missing unit tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is already accounted for here. Are you running clips.tsv through CorporaCreator before suggesting these changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is already accounted for here. Are you running clips.tsv through CorporaCreator before suggesting these changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like there are a few leftover digits being searched for here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We ideally want to not have digits. That said I'm not sure I understand the motivation for this change.
For example "123 000" might have been read "one hundred twenty three zero zero zero". However, now its changed to "123000" which I doubt would be read as "one hundred twenty three zero zero zero". So we'd introduce a miss-match between the audio and the text.
Are you assuming that
replace_numbers()
fixes this?If so, how can
replace_numbers()
do this accurately as it does no know about splits like "123 000". All it would see is "123000", which, due to the split, may have been pronounced "one hundred twenty three zero zero zero".There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Part of that is my code from CommonVoice-fr repo, so it was accurate enough on a dataset like the one from French parliament.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
I assume it is not the case and user said
cent vingt trois mille
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @lissyx
It seems that some case were still not properly handled. for example, clips.tsv contains sentences like:
les trois,000 valeur du trésor de Loretto
à ma fille, et dix.000 fr.
Loretto contenait un trésor à peu près de trois,000 liv.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nicolaspanel Yep, those are actually part of another dataset, that was much less well formatted and some error slipped throuh :/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is not the perfect place to discuss this, but....
I'm wondering if we could save a lot of time by simply having
common.py
mark as invalid any sentences with digits in them.It's Draconian, but I think there are many problems like the ones we are thinking about here in multiple languages and I don't think they will be solved soon in all the languages, and we want to get the data out the door as soon as possible.
I'd be interested in your opinions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kdavis-mozilla @nicolaspanel
I guess we can skip numbers for now, fix it in the CV dataset, and hand-craft the current recording, 366 is not impossible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like there are a few leftover digits being searched for here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like there are a few leftover digits being searched for here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like there are a few leftover digits being searched for here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the "/" in the first capture group be thrown away?
What about "30 euros/m2 " used for example for the cost per square meter of an apartment? As far as I can tell this would turn into "30 euros mètre carré " which I don't thinks makes sense. (At least in English it does not make sense.) I'd guess it should turn in to "30 euros par mètre carré ".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed, see 40fe406
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like there are a few leftover digits being searched for here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like there are a few leftover digits being searched for here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like there are a few leftover digits being searched for here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like there are a few leftover digits being searched for here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wondering if we should do that now or later: as you shown, my heuristics were not perfect, so maybe it'd be best if I listened to recording and adjust with
client_id
, and fix Common Voice dataset at the same time ?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I can tell, it work just fine as is (I am also using it in trainingspeech projet).
It is a good idea to pick / listen a few examples to check but checking all the examples will take a lot of time...
Personally, I won't have such bandwidth anytime soon...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's why I was suggesting that I do it :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lissyx why not doing it in another PR ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nicolaspanel That's what was implied :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lissyx @nicolaspanel This is related to my "invalidate all sentences with digits" comment above.
I'd be interested in your take on the Draconian idea to have
common.py
mark as invalid any sentences with digits in them.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kdavis-mozilla I guess it's not such a bad idea, with a logging mode to help identify and fix the dataset as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has no effect as multiple white spaces are already removed here. So it seems like it should be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In parallel with removal of the above commented out code, I'd ask that this is also removed as it's not used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can this work in all cases?
For example, "Room 456" can validly be read as "Room four five six" or as "Room four hundred and fifty six" . This code can't catch that.
It is for reasons exactly like this that the
client_id
is passed tofr()
so you can hear what the person said and provide the correct transcript.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here we assume
value
is not ambiguous. situation like "Room four five six" should have already been handle bymaybe_normalize
step to produce "Room 4 5 6" instead of original "Room 456"There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kdavis-mozilla is it ok for you ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we allowed to assume we are in a non-ambiguous case?
I don't see how we can assume such without hearing the audio.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kdavis-mozilla I think in French we should be fine regarding ambigous cases unless for numbers > 1099 <= 9999. Those might (and are often in the case of dates) be spelled by hundreds. But as I said to @nicolaspanel if it's too much work for him and too tricky / edge cases to risk polluting the dataset, then I can dig into clips and listen, later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest removing tests that no longer make sense in like of digits being banned. For example
Some of the tests here, for example
have nothing to do with digits and can actually be run independent of this comment.