-
-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix deprecated tokenizer methods, add test cases #139
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @johnml1135 and @mshannon-sil)
machine/translation/huggingface/hugging_face_nmt_model_trainer.py
line 401 at r1 (raw file):
def _add_lang_code_to_tokenizer(tokenizer: Union[PreTrainedTokenizer, PreTrainedTokenizerFast], lang_code: str):
You can just make this a "public" function (i.e. from the underscore) and export it from the "huggingface" package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 of 4 files reviewed, 1 unresolved discussion (waiting on @ddaspit and @johnml1135)
machine/translation/huggingface/hugging_face_nmt_model_trainer.py
line 401 at r1 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
You can just make this a "public" function (i.e. from the underscore) and export it from the "huggingface" package.
Done.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #139 +/- ##
==========================================
+ Coverage 88.14% 88.27% +0.12%
==========================================
Files 273 273
Lines 16050 16133 +83
==========================================
+ Hits 14148 14242 +94
+ Misses 1902 1891 -11 ☔ View full report in Codecov by Sentry. |
Is this code tested? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 3 files at r1, 3 of 3 files at r2, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @ddaspit and @mshannon-sil)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @ddaspit and @johnml1135)
machine/translation/huggingface/hugging_face_nmt_engine.py
line 82 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Is this code tested?
Yes, although not rigorously. The if statement is covered by test_translate_greedy
in test_huggingface_nmt_engine.py
, which uses the "stas/tiny-m2m_100" model and corresponding M2M100Tokenizer. And the basic functionality of using the lang_code_to_token
method for M2M100Tokenizer is also tested in the new test case I added test_m2m_100_tokenizer_add_lang_code()
in test_hugginface_nmt_model_trainer.py
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r2, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @mshannon-sil)
tests/translation/huggingface/test_hugging_face_nmt_model_trainer.py
line 25 at r2 (raw file):
from machine.corpora import DictionaryTextCorpus, MemoryText, TextRow from machine.translation.huggingface import HuggingFaceNmtEngine, HuggingFaceNmtModelTrainer from machine.translation.huggingface.hugging_face_nmt_model_trainer import add_lang_code_to_tokenizer
Yoiu should import the function directly from the machine.translation.huggingface
package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit)
tests/translation/huggingface/test_hugging_face_nmt_model_trainer.py
line 25 at r2 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
Yoiu should import the function directly from the
machine.translation.huggingface
package.
Ope sorry I overlooked that, I'll go ahead and make the quick change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 3 of 4 files reviewed, 1 unresolved discussion (waiting on @ddaspit and @johnml1135)
tests/translation/huggingface/test_hugging_face_nmt_model_trainer.py
line 25 at r2 (raw file):
Previously, mshannon-sil wrote…
Ope sorry I overlooked that, I'll go ahead and make the quick change.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @mshannon-sil)
36f5448
to
9889d9f
Compare
This PR removes the lang_code_to_id method in places where it could be used by the NllbTokenizer, due to NllbTokenizer deprecating that method, and keeps it for the tokenizers that still use it. In its place, I compare against
tokenizer.added_tokens_encoder
, since that's what huggingface uses as their check inside theirtokenizer.convert_tokens_to_ids
, which was already being used in theadd_lang_code_to_tokenizer
method for obtaining the lang_id from the lang_code. I also added test cases for each of the different tokenizer types supported by machine.py.This change is