Change ticks --> apostrophes #101
base: master
Many tick marks in English Common Voice should be apostrophes. This is part of a larger problem that involves quotation marks / double quotes.
This is also an issue with respect to hyphens: should hyphens and dashes be collapsed?
There are 160 utterances with "C++" spoken in Common Voice English. This user is in the `test.csv` file after `test.tsv` is passed through `import_cv2.py`, and I verified that it is spoken this way.
I listened to this myself
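For anyone who wants to reproduce a count like this, here is a minimal sketch. It assumes the TSV has a header row with a `sentence` column; the helper name `count_sentences_containing` is hypothetical and is not part of `import_cv2.py`.

```python
import csv
import io


def count_sentences_containing(tsv_file, needle, column="sentence"):
    """Count rows whose `column` text contains `needle`.

    Hypothetical helper: the `sentence` column name is an assumption
    about the TSV layout, not something taken from import_cv2.py.
    """
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return sum(1 for row in reader if needle in (row.get(column) or ""))


# Tiny in-memory sample standing in for a real file such as test.tsv.
sample = io.StringIO(
    "path\tsentence\n"
    "a.mp3\tI write C++ every day.\n"
    "b.mp3\tNelly, come here.\n"
)
print(count_sentences_containing(sample, "C++"))  # prints 1
```

With a real file, the same function can be called on `open("test.tsv", encoding="utf-8", newline="")`.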
There are a few things that may need clarification; see the comments.
For pure Deep Speech use, everything you do makes sense. But Common Voice is used outside of Deep Speech.
sentence = sentence.replace("C++", "C plus plus")
## collapse all apostrophe-like marks
## e.g. common_voice_en_18441344.mp3 ‘I’m not a serpent!’ --> 'I'm not a serpent!'
sentence = sentence.replace("’","'") # right-ticks --> apostrophes
Is this always the right thing to do? For instance see here.
## collapse all apostrophe-like marks
## e.g. common_voice_en_18441344.mp3 ‘I’m not a serpent!’ --> 'I'm not a serpent!'
sentence = sentence.replace("’","'") # right-ticks --> apostrophes
sentence = sentence.replace("‘","'") # left-ticks --> apostrophes
I guess I'd pose a similar question here.
sentence = sentence.replace("‘","'") # left-ticks --> apostrophes
## Change em-dash to dash
## e.g. common_voice_en_18607891.mp3 Nelly, come here — is it morning? --> Nelly, come here – is it morning?
sentence = sentence.replace("—","–")
Also here; em-dash is used in cases where dash is not used and vice versa. For instance see here.
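One possible middle ground for the apostrophe question, offered only as a sketch: rewrite a right tick only when it sits between two letters, where it is most likely an apostrophe, and leave ticks that act as quotation marks untouched. The helper name `normalize_apostrophes` is hypothetical and is not part of this PR.

```python
import re

# Hypothetical pattern: a right tick with a letter on both sides.
# [^\W\d_] matches any Unicode letter (word character minus digits and "_").
_WORD_INTERNAL_TICK = re.compile(r"(?<=[^\W\d_])’(?=[^\W\d_])")


def normalize_apostrophes(sentence: str) -> str:
    """Replace word-internal right ticks with ASCII apostrophes.

    Ticks at word boundaries (likely quotation marks) are left as-is.
    """
    return _WORD_INTERNAL_TICK.sub("'", sentence)


print(normalize_apostrophes("‘I’m not a serpent!’"))
```

This is deliberately conservative: a trailing possessive such as "dogs’" is not changed, so whether that trade-off is acceptable depends on how the data is used downstream.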