
Inconsistent handling of Markdown syntax #68

Open
Fevol opened this issue Sep 8, 2022 · 0 comments

Originally opened in the Firefox Translations repo as issue 499

Describe the bug
Markdown formatting will in most cases not survive the translation process, either being mangled, closed improperly, mapped to other characters or outright omitted in the final translation.

This is mostly a nice-to-have: some parts of the syntax will never translate properly (e.g. code blocks and quotes). I also fully realize that this is a rather niche use case, and that supporting it might degrade overall translation performance.

Related issues
mozilla/firefox-translations#486

Potential solution
Include Markdown syntax as part of the training pipeline, similar to what was mentioned in the issue above
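
To make the idea a bit more concrete, here is a minimal sketch of what such an augmentation step could look like. It assumes the training data is available as plain (source, target) sentence pairs and only handles line-level markers, since inline markers such as ** or _ would need word alignment. The function name, the prefix list, the probability parameter and the example pair are all made up for illustration and are not taken from the actual training pipeline.

```python
import random

# Hypothetical set of line-level Markdown prefixes. Inline markers (**, _)
# are left out on purpose because placing them needs word alignment.
LINE_PREFIXES = ["# ", "## ", "- ", "1. "]


def add_markdown_prefix(source, target, rng, probability=0.1):
    """Prepend the same Markdown line prefix to both sides of a sentence pair.

    With the given probability the pair is augmented; otherwise it is
    returned unchanged.
    """
    if rng.random() >= probability:
        return source, target
    prefix = rng.choice(LINE_PREFIXES)
    return prefix + source, prefix + target


# Illustrative, hand-written sentence pair (not real training data).
pairs = [("The bergamot is a citrus fruit.", "La bergamote est un agrume.")]

rng = random.Random(0)  # fixed seed so the augmentation is reproducible
augmented = [add_markdown_prefix(s, t, rng, probability=1.0) for s, t in pairs]
```

Keeping the prefix identical on both sides is the point of the sketch: the model would see the marker as something to copy through rather than translate.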


Example

Test environment

  • Translations were run with a simplified BergamotWorker script and WASM, without postprocessing
  • Firefox Translations models, version 0.3.3, were used as the translation models

Example

Markdown snippet used for testing, covering most aspects of the syntax

--- 
# Markdown test
This is a *test* to see how **well** Bergamot _handles_ the [Markdown](https://www.markdownguide.org/) syntax. 

1. The **bergamot orange**, is a fragrant citrus fruit the size of an orange
2. Has a *yellow* or *green* color similar to a lime, depending on ripeness

- The word bergamot is derived from the Italian word _bergamotto_
	- It is a small tree that blossoms during the winter

\```js
variable = 10
if (variable == "10")
	variable = "10" + 1
\```

> “Beware of bugs in the above code; I have only proved it correct, not tried it.”
> — Donald E. Knuth.
--- 

Some failing examples

French:

  • # turns into -
  • ** disappears or gets changed into a single quote

Dutch:

  • --- turns into -- ---
  • # turns into
  • Numbering gets repeated

German:

  • # turns into -
  • _ turns into "
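
Failures like the ones listed above could be tallied automatically by counting Markdown markers before and after translation. The following is a rough sketch, not a Markdown parser: the regular expressions and function names are my own, and the "mangled" text is produced by applying two of the reported substitutions by hand to an English line, so it is not actual translator output.

```python
import re

# Rough patterns for a few markers from the test snippet (assumed, not exhaustive).
MARKER_PATTERNS = {
    "heading": re.compile(r"^#{1,6} ", re.MULTILINE),
    "rule": re.compile(r"^---\s*$", re.MULTILINE),
    "bold": re.compile(r"\*\*"),
    "italic_star": re.compile(r"(?<!\*)\*(?!\*)"),
    "underscore": re.compile(r"_"),
    "link": re.compile(r"\[[^\]]*\]\([^)]*\)"),
}


def count_markers(text):
    """Count occurrences of each marker pattern in the text."""
    return {name: len(p.findall(text)) for name, p in MARKER_PATTERNS.items()}


def marker_diff(source, translation):
    """Return {marker: (count_in_source, count_in_translation)} where counts differ."""
    before = count_markers(source)
    after = count_markers(translation)
    return {n: (before[n], after[n]) for n in MARKER_PATTERNS if before[n] != after[n]}


source = "# Markdown test\nThis is a *test* to see how **well** Bergamot _handles_ it."
# Simulate two reported failures by hand: "# " becoming "- " and "**" becoming "'".
mangled = source.replace("# ", "- ").replace("**", "'")

print(marker_diff(source, mangled))
```

A count-based check like this cannot tell whether a marker ended up around the right words, only whether it survived at all, which is enough to spot the omissions and substitutions described here.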