
setup basic FR preprocessing #87

Open · wants to merge 13 commits into master
.gitignore (1 change: 1 addition & 0 deletions)

@@ -47,3 +47,4 @@ MANIFEST

# Per-project virtualenvs
.virtualenv/
.python-version
setup.cfg (1 change: 1 addition & 0 deletions)

@@ -29,6 +29,7 @@ setup_requires = pyscaffold>=3.1a0,<3.2a0
# Add here dependencies of your project (semicolon/line-separated), e.g.
install_requires = pandas
    swifter
    num2words==0.5.9
nicolaspanel marked this conversation as resolved.
# The usage of test_requires is discouraged, see `Dependency Management` docs
# tests_require = pytest; pytest-cov
# Require a specific Python version, e.g. Python 2.7 or >= 3.4
src/corporacreator/preprocessors/fr.py (32 changes: 30 additions & 2 deletions)

@@ -1,3 +1,27 @@
import re

from corporacreator.utils import maybe_normalize, replace_numbers, FIND_PUNCTUATIONS_REG, FIND_MULTIPLE_SPACES_REG

FIND_ORDINAL_REG = re.compile(r"(\d+)([ème|éme|ieme|ier|iere]+)")
Comment:

@nicolaspanel Maybe we should include potential spaces? I saw data like "1 er".

Author:

@lissyx Since there is no such case in clips.tsv, I suggest we wait.

Author:

@lissyx OK for you?

Comment:

Yep.
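A minimal sketch of the variant discussed in this thread: an ordinal regex that tolerates an optional space before the suffix, as in "1 er". The name FIND_ORDINAL_WITH_SPACE_REG and the explicit alternation of suffixes are illustrative, not part of this PR.

```python
import re

# Hypothetical variant of FIND_ORDINAL_REG: allow an optional space
# between the digits and the ordinal suffix, e.g. "1 er" or "2éme".
# Longer suffixes come first in the alternation so they win over "er".
FIND_ORDINAL_WITH_SPACE_REG = re.compile(r"(\d+)\s?(ieme|iere|ème|éme|ier|ere|er)")

m = FIND_ORDINAL_WITH_SPACE_REG.match("1 er")
# m.group(1) holds the digits, m.group(2) the suffix
```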



FR_NORMALIZATIONS = [
Contributor:

Is this exhaustive?

Author:

Same as before.

Contributor:

Fine here too.

    [re.compile(r'(^|\s)(\d+)\s(0{3})(\s|\.|,|\?|!|$)'), r'\1\2\3\4'],  # "123 000 …" => "123000 …"
Contributor:

We ideally want to not have digits. That said, I'm not sure I understand the motivation for this change.

For example, "123 000" might have been read "one hundred twenty three zero zero zero". However, now it's changed to "123000", which I doubt would be read as "one hundred twenty three zero zero zero". So we'd introduce a mismatch between the audio and the text.

Are you assuming that replace_numbers() fixes this?

If so, how can replace_numbers() do this accurately? It does not know about splits like "123 000". All it would see is "123000", which, due to the split, may have been pronounced "one hundred twenty three zero zero zero".

Comment:

Part of that is my code from the CommonVoice-fr repo, so it was accurate enough on a dataset like the one from the French parliament.

Author:

> Are you assuming that replace_numbers() fixes this?

Yes.

> For example "123 000" might have been read "one hundred twenty three zero zero zero".

I assume that is not the case and the user said "cent vingt trois mille".

Author (nicolaspanel, Feb 20, 2019):

> Part of that is my code from the CommonVoice-fr repo, so it was accurate enough on a dataset like the one from the French parliament.

Thanks @lissyx. It seems that some cases were still not properly handled. For example, clips.tsv contains sentences like:

  • les trois,000 valeur du trésor de Loretto
  • à ma fille, et dix.000 fr.
  • Loretto contenait un trésor à peu près de trois,000 liv.

Comment:

@nicolaspanel Yep, those are actually part of another dataset that was much less well formatted, and some errors slipped through :/

Contributor:

I know this is not the perfect place to discuss this, but...

I'm wondering if we could save a lot of time by simply having common.py mark as invalid any sentences with digits in them.

It's Draconian, but I think there are many problems like the ones we are thinking about here in multiple languages, and I don't think they will be solved soon in all the languages; we want to get the data out the door as soon as possible.

I'd be interested in your opinions.

Comment:

@kdavis-mozilla @nicolaspanel

$ cut -f3 source/fr/validated.tsv | grep '[[:digit:]]' | wc -l
366

I guess we can skip numbers for now, fix it in the CV dataset, and hand-craft the current recordings; 366 is not impossible.
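A minimal sketch of the Draconian option discussed here, assuming common.py follows the project convention that returning None (or whitespace) flags a sentence as invalid. The helper name and the print-based logging are illustrative, not from this PR.

```python
import re

HAS_DIGIT_REG = re.compile(r"\d")

def invalidate_digits(sentence):
    # Hypothetical check for common.py: flag any sentence that still
    # contains a digit as invalid by returning None, and log it so the
    # dataset can be fixed later.
    if HAS_DIGIT_REG.search(sentence):
        print("invalid (contains digits):", sentence)
        return None
    return sentence
```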

Contributor:

Looks like there are a few leftover digits being searched for here.

    [re.compile(r'(^|\s)km(\s|\.|,|\?|!|$)'), r'\1 kilomètres \2'],
    [re.compile(r'(^|\s)0(\d)(\s|\.|,|\?|!|$)'), r'\1zéro \2 \3'],
Contributor:

Looks like there are a few leftover digits being searched for here.

    ['%', ' pourcent'],
    [re.compile(r'(^|\s)\+(\s|\.|,|\?|!|$)'), r'\1 plus \2'],
nicolaspanel marked this conversation as resolved.
    [re.compile(r'(\d+)\s?m(?:2|²)(\s|\.|,|\?|!|$)'), r'\1 mètre carré\2'],
Contributor:

Looks like there are a few leftover digits being searched for here.

    [re.compile(r'(^|\s|/)m(?:2|²)(\s|\.|,|\?|!|$)'), r' mètre carré\2'],
Contributor:

Should the "/" in the first capture group be thrown away?

What about "30 euros/m2", used for example for the cost per square meter of an apartment? As far as I can tell this would turn into "30 euros mètre carré", which I don't think makes sense. (At least in English it does not make sense.) I'd guess it should turn into "30 euros par mètre carré".

Author:

Fixed, see 40fe406.
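The actual fix in 40fe406 is not shown in this excerpt; as a hypothetical sketch of the behaviour the reviewer asked for, a rule keyed on the slash could read:

```python
import re

# Hypothetical rule: when "m2"/"m²" follows a slash, read it as
# "par mètre carré" ("per square meter") rather than just "mètre carré".
PER_SQM_REG = re.compile(r'/\s?m(?:2|²)(\s|\.|,|\?|!|$)')

def per_square_meter(text):
    return PER_SQM_REG.sub(r' par mètre carré\1', text)
```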

    [re.compile(r'(^|\s)(\d+),(\d{2})\s?€(\s|\.|,|\?|!|$)'), r'\1\2 euros \3 \4'],
Contributor:

Looks like there are a few leftover digits being searched for here.

    [re.compile(r'\s?€(.+)'), r' euros\1'],
    [re.compile(r'\s?€$'), r' euros'],
    [re.compile(r'(^| )(n)(?:°|º|°)(\s)?', flags=re.IGNORECASE), r'\1\2uméro '],
    [re.compile(r'(^|\s)(\d+)h(\d*)(\s|\.|,|$)'), r'\1\2 heure \3\4'],
Contributor:

Looks like there are a few leftover digits being searched for here.

    [re.compile(r'(^|\s)(\d+)h(\s|\.|,|$)'), r'\1\2 heure \3'],
Contributor:

Looks like there are a few leftover digits being searched for here.

]
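Two of the rules above, applied directly to the kinds of inputs exercised in the tests further down. This is a standalone restatement for illustration, not code from the PR.

```python
import re

# Restating two FR_NORMALIZATIONS entries so the snippet runs standalone.
NUMERO_RULE = (re.compile(r'(^| )(n)(?:°|º|°)(\s)?', flags=re.IGNORECASE), r'\1\2uméro ')
HOUR_RULE = (re.compile(r'(^|\s)(\d+)h(\s|\.|,|$)'), r'\1\2 heure \3')

def apply_rule(rule, text):
    # Apply a single (pattern, replacement) normalization rule.
    pattern, replacement = rule
    return pattern.sub(replacement, text)
```

Note that the hour rule leaves a trailing space when "h" ends the sentence; the multiple-space and strip() steps in fr() clean that up.

```python
# (continued) example inputs drawn from the test file below:
#   apply_rule(NUMERO_RULE, "rue Coq-Héron au nº13")
#   apply_rule(HOUR_RULE, "ok pour 18h")
```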


def fr(client_id, sentence):
    """Cleans up the passed sentence, removing or reformatting invalid data.

@@ -8,5 +32,9 @@ def fr(client_id, sentence):
    Returns:
        (str): Cleaned up sentence. Returning None or a `str` of whitespace flags the sentence as invalid.
    """
    # TODO: Clean up fr data
    return sentence
    text = maybe_normalize(sentence, mapping=FR_NORMALIZATIONS)
Contributor:

Do these always make sense? (See comments above on FR_NORMALIZATIONS.)

Author:

I think so. If not, then special cases should be handled using client_id. @kdavis-mozilla, can you think of an example?

    text = replace_numbers(text, locale='fr', ordinal_regex=FIND_ORDINAL_REG)
Comment:

Just wondering if we should do that now or later: as you showed, my heuristics were not perfect, so maybe it'd be best if I listened to the recordings, adjusted with client_id, and fixed the Common Voice dataset at the same time?

Author:

As far as I can tell, it works just fine as is (I am also using it in the trainingspeech project).
It is a good idea to pick and listen to a few examples to check, but checking all the examples will take a lot of time...
Personally, I won't have such bandwidth anytime soon...

Comment:

That's why I was suggesting that I do it :)

Author:

@lissyx why not do it in another PR?

Comment:

@nicolaspanel That's what was implied :)

Contributor:

@lissyx @nicolaspanel This is related to my "invalidate all sentences with digits" comment above.

I'd be interested in your take on the Draconian idea to have common.py mark as invalid any sentences with digits in them.

Comment:

@kdavis-mozilla I guess it's not such a bad idea, with a logging mode to help identify and fix the dataset as well.

    text = text.replace('’', "'").replace('\u00A0', ' ')
nicolaspanel marked this conversation as resolved.
    text = FIND_PUNCTUATIONS_REG.sub(' ', text)
nicolaspanel marked this conversation as resolved.
    text = FIND_MULTIPLE_SPACES_REG.sub(' ', text)
Contributor:

This has no effect, as multiple white spaces are already removed here. So it seems like it should be removed.

    return text.strip().lower()
nicolaspanel marked this conversation as resolved.
src/corporacreator/utils.py (51 changes: 51 additions & 0 deletions)

@@ -0,0 +1,51 @@
import re
from typing import Pattern

from num2words import num2words


NUMS_REGEX = re.compile(r"(\d+,?\u00A0?\d+)|(\d+\w+)|(\d)+")
FIND_MULTIPLE_SPACES_REG = re.compile(r'\s{2,}')
FIND_PUNCTUATIONS_REG = re.compile(r"[/°\-,;!?.()\[\]*…—]")


def get_numbers(text):
    return NUMS_REGEX.split(text)


def replace_numbers(inp: str, locale: str, ordinal_regex: Pattern = None):
    finalinp = ''
    for e in get_numbers(inp):
        if not e:
            continue
        newinp = e
        try:
            ee = ''.join(e.split())
            if int(e) >= 0:
                newinp = num2words(int(ee), lang=locale)
Contributor:

How can this work in all cases?

For example, "Room 456" can validly be read as "Room four five six" or as "Room four hundred and fifty six". This code can't catch that.

It is for reasons exactly like this that the client_id is passed to fr(), so you can hear what the person said and provide the correct transcript.

Author:

Here we assume the value is not ambiguous. A situation like "Room four five six" should have already been handled by the maybe_normalize step to produce "Room 4 5 6" instead of the original "Room 456".

Author:

@kdavis-mozilla is it OK for you?

Contributor:

Are we allowed to assume we are in a non-ambiguous case?

I don't see how we can assume such without hearing the audio.

Comment:

@kdavis-mozilla I think in French we should be fine regarding ambiguous cases, except for numbers > 1099 and <= 9999. Those might be (and, in the case of dates, often are) spelled by hundreds. But as I said to @nicolaspanel, if it's too much work for him and too tricky, with edge cases that risk polluting the dataset, then I can dig into the clips and listen, later.

        except ValueError:
            try:
                ee = ''.join(e.replace(',', '.').split())
                if float(ee):
                    newinp = num2words(float(ee), lang=locale)
            except ValueError:
                if ordinal_regex:
                    matches = ordinal_regex.match(e)
                    if matches:
                        newinp = num2words(int(matches.group(1)), ordinal=True, lang=locale)

        finalinp += newinp

    return finalinp


def maybe_normalize(value: str, mapping):
    for norm in mapping:
        if type(norm[0]) == str:
            value = value.replace(norm[0], norm[1])
        elif isinstance(norm[0], Pattern):
            value = norm[0].sub(norm[1], value)
        else:
            print('UNEXPECTED', type(norm[0]), norm[0])

    return value

nicolaspanel marked this conversation as resolved.
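A standalone usage sketch of maybe_normalize with a two-entry mapping in the style of FR_NORMALIZATIONS. The mapping is a cut-down illustration, not the full list, and the function is restated here (with re.Pattern instead of typing.Pattern) so the snippet runs on its own.

```python
import re

def maybe_normalize(value, mapping):
    # Same logic as the function above, restated so the snippet runs
    # standalone; re.Pattern is used instead of typing.Pattern.
    for old, new in mapping:
        if isinstance(old, str):
            value = value.replace(old, new)
        elif isinstance(old, re.Pattern):
            value = old.sub(new, value)
    return value

DEMO_MAPPING = [
    ['%', ' pourcent'],
    # "123 000 ..." => "123000 ..." so replace_numbers later sees one number.
    [re.compile(r'(^|\s)(\d+)\s(0{3})(\s|\.|,|\?|!|$)'), r'\1\2\3\4'],
]
```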
tests/test_preprocessors.py (21 changes: 21 additions & 0 deletions)

@@ -0,0 +1,21 @@
import pytest
Comment:

👍


from corporacreator import preprocessors


@pytest.mark.parametrize('locale, client_id, sentence, expected', [
Contributor:

I'd suggest removing tests that no longer make sense in light of digits being banned. For example:

('fr', '*', "donc, ce sera 299 € + 99 €", "donc, ce sera deux cent quatre-vingt-dix-neuf euros plus quatre-vingt-dix-neuf euros"),

Some of the tests here, for example:

('fr', '*', "Jean-Paul II.", "Jean-Paul deux.")

have nothing to do with digits and can actually be run independently of this comment.

    ('fr', '*', 'Faisons donc attention à utiliser les bons mots.', 'faisons donc attention à utiliser les bons mots'),
    ('fr', '*', "bah 98%", "bah quatre vingt dix huit pourcent"),
    ('fr', '*', "prix au m2", "prix au mètre carré"),
    ('fr', '*', "prix au m²", "prix au mètre carré"),
    ('fr', '*', "10 m²", "dix mètre carré"),
    ('fr', '*', "2éme page", "deuxième page"),
    ('fr', '*', "donc, ce sera 299 € + 99 €", "donc ce sera deux cent quatre vingt dix neuf euros plus quatre vingt dix neuf euros"),
    ('fr', '*', "ok pour 18h", "ok pour dix huit heure"),
    ('fr', '*', '2 0 200', "deux zéro deux cents"),
    ('fr', '*', 'rue Coq-Héron au nº13', "rue coq héron au numéro treize"),
    ('fr', '*', "En comparaison, la Lune orbite en moyenne à 390 000 km de la Terre", "en comparaison la lune orbite en moyenne à trois cent quatre vingt dix mille kilomètres de la terre"),
])
def test_preprocessor(locale, client_id, sentence, expected):
    preprocessor = getattr(preprocessors, locale.replace('-', ''))
    assert expected == preprocessor(client_id, preprocessors.common(sentence))