Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

foldfilter breaks translation from language without spaces to language with spaces #21

Open
kpu opened this issue Dec 3, 2020 · 4 comments

Comments

@kpu
Copy link
Owner

kpu commented Dec 3, 2020

@jelmervdl Problem with foldfilter: if we translate from e.g. ko (without spaces) to en then the output concatenates the last English word of a preceding sentence with the first English word of the following sentence, without an intervening space.

This impacts quality of zh and ko and probably other paracrawls.

@jelmervdl
Copy link
Contributor

Oh crap… that is not something I had thought about.

By default it just uses the delimiter it chopped out as glue. If there is no delimiter, there is no glue. So no space. It makes sense in the foldfilter cat scenario, but not for much for MT…

I'd suggest adding a space between words when the separator was empty (i.e. at a mid-word break instead of some soft line wrapping) except when the -s option is given. That option is supposed to pass the delimiters it chops also to the wrapped program. When translating to a language that doesn't do spaces and commas etc. it would make sense to just let the MT system decide to drop them. And foldfilter -s cat would still do what you expect.

This should be a simple fix, just an extra if-statement inside https://github.com/kpu/preprocess/blob/master/preprocess/foldfilter_main.cc#L132

@kpu
Copy link
Owner Author

kpu commented Dec 3, 2020

it's worse than first thought. Even languages with spaces are losing them.

Let's make a cat that strips leading and trailing whitespace, like most MT systems will.

#!/usr/bin/env python3
import sys
for l in sys.stdin:
  print(l.strip())
build/bin/foldfilter -w 4 ./remove_space.py <<<"hello hi how are you"
hellohihowareyou

@jelmervdl
Copy link
Contributor

I misinterpreted my own documentation…

The passing spaces to the wrapped command is the default, but paracrawl using it with the -s option, which does not pass the whitespace to the wrapped program.

With -s:

foldfilter -s -w 4 ./remove_space.py <<<"hello hi how are you"
hello hi how are you

So damage is not that bad (I would have noticed it earlier!) but the original case (input without delimiters) is still a problem. Also maybe the defaults are bad.

@kpu
Copy link
Owner Author

kpu commented Dec 3, 2020

Yeah ok we should have used -s for the fa->en part where I noticed the issue, then thought fa didn't have spaces, then looked at corpora and realized it does have spaces, then blamed ko and zh.
Default should probably preserve space across system. Period should probably stick to the sentence before and go to the MT system though, lest MT will hallucinate a new period then we end up with two.
Or maybe we should just be patching Marian to do this with BPE tokens...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants