This repository has been archived by the owner on Sep 4, 2023. It is now read-only.

Inconsistent handling of Markdown formatting #499

Closed
Fevol opened this issue Aug 23, 2022 · 4 comments

@Fevol

Fevol commented Aug 23, 2022

Describe the bug
In most cases, Markdown formatting does not survive the translation process: it is mangled, closed improperly, mapped to other characters, or omitted outright from the final translation.

Related issues
#486 (Requires additional training to fix/implement)

To Reproduce
Steps to reproduce the behavior:

  1. Select any model (behavior varies depending on the model used)
  2. Translate any piece of text containing Markdown formatting

Example piece of markdown:

--- 
# Markdown test
This is a *test* to see how **well** Bergamot _handles_ the [Markdown](https://www.markdownguide.org/) syntax. 

1. The **bergamot orange**, is a fragrant citrus fruit the size of an orange
2. Has a *yellow* or *green* color similar to a lime, depending on ripeness

- The word bergamot is derived from the Italian word _bergamotto_
	- It is a small tree that blossoms during the winter

\```js
variable = 10
if (variable == "10")
	variable = "10" + 1
\```

> “Beware of bugs in the above code; I have only proved it correct, not tried it.”
> — Donald E. Knuth.
--- 

Expected behavior
Markdown formatting should survive the translation process.
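
One rough way to check this expectation is to count a few Markdown markers before and after translation and report which ones went missing. The sketch below is only an illustration: `MARKER_PATTERNS`, `count_markers`, and `lost_markers` are hypothetical names, the patterns cover just a small subset of Markdown, and the "translated" string in the usage note is a made-up example that mirrors the symptoms reported in this issue, not real model output.

```python
import re
from collections import Counter

# A handful of Markdown markers to track. This list is an illustrative
# assumption, not a full Markdown grammar.
MARKER_PATTERNS = {
    "heading": re.compile(r"^#{1,6}\s", re.MULTILINE),
    "bold": re.compile(r"\*\*[^*\n]+\*\*"),
    "link": re.compile(r"\[[^\]\n]+\]\([^)\s]+\)"),
    "rule": re.compile(r"^---\s*$", re.MULTILINE),
}


def count_markers(text: str) -> Counter:
    """Count how often each tracked marker occurs in the text."""
    return Counter(
        {name: len(pattern.findall(text)) for name, pattern in MARKER_PATTERNS.items()}
    )


def lost_markers(source: str, translated: str) -> dict:
    """Return, per marker, how many occurrences are missing after translation."""
    before = count_markers(source)
    after = count_markers(translated)
    return {
        name: before[name] - after[name]
        for name in before
        if before[name] > after[name]
    }
```

For example, `lost_markers("# Title\nSome **bold** text", "- Titre\nDu texte 'gras'")` reports one lost heading and one lost bold span, while passing the same text twice returns an empty dict. Counting markers only catches omissions, not markers that were moved to the wrong word.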

Actual behavior
French:

  • # turns into -
  • ** disappears or gets changed into a single quote

Dutch:

  • --- turns into -- ---
  • # turns into
  • List numbering gets repeated

Additional context
This would be a nice-to-have, though I do realize that some parts of the syntax can never be translated properly (i.e. code blocks and quotes).

Improving the Markdown handling would probably entail randomly adding Markdown syntax to words (similar to what was mentioned in #486). Admittedly, I have no experience in this field, and it might not be feasible to retrain the models for this small use case.
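
For illustration only, randomly adding Markdown syntax to words in training data could look roughly like the sketch below. It assumes tokenized sentence pairs and a word alignment produced by some external aligner, so that the same marker can be placed around corresponding words on both sides. `MARKERS`, `wrap`, and `augment_pair` are hypothetical names and not part of Bergamot or the students training pipeline; whether this matches the approach discussed in #486 is not something this sketch claims.

```python
import random

# Emphasis markers taken from the example Markdown in this issue.
MARKERS = ["*", "**", "_"]


def wrap(tokens, index, marker):
    """Return a copy of tokens with the token at index wrapped in marker."""
    out = list(tokens)
    out[index] = f"{marker}{out[index]}{marker}"
    return out


def augment_pair(src_tokens, tgt_tokens, alignment, rng):
    """Wrap one aligned word pair in the same randomly chosen marker.

    alignment is a list of (src_index, tgt_index) pairs; it is assumed to
    come from an external word aligner (hypothetical input).
    """
    if not alignment:
        # Nothing is aligned, so leave the pair unchanged.
        return list(src_tokens), list(tgt_tokens)
    src_i, tgt_i = rng.choice(alignment)
    marker = rng.choice(MARKERS)
    return wrap(src_tokens, src_i, marker), wrap(tgt_tokens, tgt_i, marker)
```

Called as `augment_pair(["This", "is", "a", "test"], ["Dit", "is", "een", "test"], [(3, 3)], random.Random(0))`, it wraps the last word on both sides in the same marker and leaves the other tokens untouched. Passing a seeded `random.Random` keeps the augmentation reproducible.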

Apologies if this bug report is misplaced; please let me know if I should move this issue to the students training repo.

@andrenatal
Contributor

Hi @Fevol, filing this in the students repo would definitely help bring the issue to the attention of the maintainers there, but I'll leave this open here for reference.

@Fevol
Author

Fevol commented Sep 8, 2022

@andrenatal, thanks for the response! Closing the issue, as it has been refiled as browsermt/students#68.

@Fevol Fevol closed this as completed Sep 8, 2022
@andrenatal
Contributor

@Fevol
Author

Fevol commented Sep 8, 2022

I'm guessing the issue is more related to browsermt/students than to the Bergamot engine itself. Pre- or post-processing the translation will probably not be able to fix the issue. I've also seen that there is a Firefox-specific repo for model training. I'm not entirely sure whether that repository is more suitable for the issue, as the two seem closely related.
