Originally opened in the Firefox Translations repo as issue 499
Describe the bug
In most cases, Markdown formatting does not survive the translation process: it is mangled, closed improperly, mapped to other characters, or omitted outright from the final translation.
This is mostly a nice-to-have. Some parts of the syntax will never be properly translatable (such as code blocks and quotes). I also fully realize that this is a rather niche use case, and that supporting it might degrade overall translation performance.
Potential solution
Include Markdown syntax as part of the training pipeline, similar to what was mentioned in mozilla/firefox-translations#486.
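One way to picture this is a small augmentation step over parallel sentence pairs. The following is only a minimal sketch under stated assumptions, not the actual training pipeline: the `add_markdown` function, the `MARKDOWN_WRAPPERS` list, and the `rate` parameter are all hypothetical names invented for illustration.

```python
import random

# Hypothetical (prefix, suffix) pairs, applied identically to source and target.
MARKDOWN_WRAPPERS = [
    ("# ", ""),    # heading
    ("- ", ""),    # bullet item
    ("1. ", ""),   # numbered item
    ("> ", ""),    # block quote
    ("**", "**"),  # strong emphasis around the whole sentence
    ("_", "_"),    # emphasis around the whole sentence
]


def add_markdown(src: str, tgt: str, rng: random.Random, rate: float = 0.1):
    """Return the sentence pair, occasionally wrapped in the same Markdown marker.

    With probability `rate`, one wrapper is chosen and applied to both sides,
    so the model sees the marker in the same position in source and target.
    """
    if rng.random() >= rate:
        return src, tgt  # leave most pairs untouched
    prefix, suffix = rng.choice(MARKDOWN_WRAPPERS)
    return f"{prefix}{src}{suffix}", f"{prefix}{tgt}{suffix}"
```

This only covers markers that wrap a whole sentence. Inline markers around individual words (as in the test snippet below) would additionally need word alignments between source and target, which this sketch does not attempt.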
Example
Test environment
Translations were run with a simplified BergamotWorker script and WASM, without postprocessing.
Firefox Translation Models were used as translation models, version 0.3.3.
Example
Markdown snippet used for testing, covering most aspects of the syntax
---
# Markdown test
This is a *test* to see how **well** Bergamot _handles_ the [Markdown](https://www.markdownguide.org/) syntax.
1. The **bergamot orange**, is a fragrant citrus fruit the size of an orange
2. Has a *yellow* or *green* color similar to a lime, depending on ripeness
- The word bergamot is derived from the Italian word _bergamotto_
- It is a small tree that blossoms during the winter
\```js
variable = 10
if (variable == "10")
variable = "10" + 1
\```
> “Beware of bugs in the above code; I have only proved it correct, not tried it.”
> — Donald E. Knuth.
---
Some failing examples
French:
# turns into -
** disappears or gets changed into a single quote
Dutch:
--- turns into -- ---
# turns into •
Numbering gets repeated
German:
# turns into -
_ turns into "
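Failures like these could be spotted automatically by counting markers before and after translation. The following is a rough sketch, not part of the test setup described above: `MARKER_PATTERNS`, `count_markers`, and `changed_markers` are hypothetical names, and the regular expressions cover only a few of the markers from the test snippet rather than the full Markdown syntax.

```python
import re

# Approximate patterns for a few markers from the test snippet.
# This is not a Markdown parser; for example, the underscore pattern
# also counts underscores that are not emphasis markers.
MARKER_PATTERNS = {
    "heading": re.compile(r"^#{1,6} ", re.MULTILINE),
    "rule": re.compile(r"^---$", re.MULTILINE),
    "bold": re.compile(r"\*\*"),
    "underscore": re.compile(r"_"),
    "numbered": re.compile(r"^\d+\. ", re.MULTILINE),
    "bullet": re.compile(r"^- ", re.MULTILINE),
}


def count_markers(text: str) -> dict:
    """Count occurrences of each marker type in the text."""
    return {name: len(p.findall(text)) for name, p in MARKER_PATTERNS.items()}


def changed_markers(source: str, translated: str) -> dict:
    """Map each marker whose count differs to a (before, after) tuple."""
    before = count_markers(source)
    after = count_markers(translated)
    return {
        name: (before[name], after[name])
        for name in before
        if before[name] != after[name]
    }
```

Calling `changed_markers(source, translated)` on the snippet and a model's output returns an empty dict when all counted markers survive. Matching counts do not prove the markers ended up around the right words, so this is a coarse check only.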
Related issues
mozilla/firefox-translations#486