fix deprecated tokenizer methods, add test cases #139

mshannon-sil · 2024-11-06T18:05:14Z

This PR removes the lang_code_to_id method in places where it could be used by the NllbTokenizer, due to NllbTokenizer deprecating that method, and keeps it for the tokenizers that still use it. In its place, I compare against tokenizer.added_tokens_encoder, since that's what huggingface uses as their check inside their tokenizer.convert_tokens_to_ids, which was already being used in the add_lang_code_to_tokenizer method for obtaining the lang_id from the lang_code. I also added test cases for each of the different tokenizer types supported by machine.py.

This change is

ddaspit

Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @johnml1135 and @mshannon-sil)

machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 401 at r1 (raw file):

def _add_lang_code_to_tokenizer(tokenizer: Union[PreTrainedTokenizer, PreTrainedTokenizerFast], lang_code: str):

You can just make this a "public" function (i.e. from the underscore) and export it from the "huggingface" package.

mshannon-sil

Reviewable status: 1 of 4 files reviewed, 1 unresolved discussion (waiting on @ddaspit and @johnml1135)

machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 401 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You can just make this a "public" function (i.e. from the underscore) and export it from the "huggingface" package.

Done.

codecov-commenter · 2024-11-06T21:32:17Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.27%. Comparing base (3a9df2c) to head (36f5448).
Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #139      +/-   ##
==========================================
+ Coverage   88.14%   88.27%   +0.12%     
==========================================
  Files         273      273              
  Lines       16050    16133      +83     
==========================================
+ Hits        14148    14242      +94     
+ Misses       1902     1891      -11

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

johnml1135 · 2024-11-06T22:01:38Z

machine/translation/huggingface/hugging_face_nmt_engine.py line 82 at r2 (raw file):

            else:
                src_lang_token = src_lang
                tgt_lang_token = tgt_lang

Is this code tested?

johnml1135

Reviewed 1 of 3 files at r1, 3 of 3 files at r2, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @ddaspit and @mshannon-sil)

mshannon-sil

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @ddaspit and @johnml1135)

machine/translation/huggingface/hugging_face_nmt_engine.py line 82 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Is this code tested?

Yes, although not rigorously. The if statement is covered by test_translate_greedy in test_huggingface_nmt_engine.py, which uses the "stas/tiny-m2m_100" model and corresponding M2M100Tokenizer. And the basic functionality of using the lang_code_to_token method for M2M100Tokenizer is also tested in the new test case I added test_m2m_100_tokenizer_add_lang_code() in test_hugginface_nmt_model_trainer.py.

johnml1135

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit)

ddaspit

Reviewed 3 of 3 files at r2, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @mshannon-sil)

tests/translation/huggingface/test_hugging_face_nmt_model_trainer.py line 25 at r2 (raw file):

from machine.corpora import DictionaryTextCorpus, MemoryText, TextRow
from machine.translation.huggingface import HuggingFaceNmtEngine, HuggingFaceNmtModelTrainer
from machine.translation.huggingface.hugging_face_nmt_model_trainer import add_lang_code_to_tokenizer

Yoiu should import the function directly from the machine.translation.huggingface package.

mshannon-sil

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit)

tests/translation/huggingface/test_hugging_face_nmt_model_trainer.py line 25 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Yoiu should import the function directly from the machine.translation.huggingface package.

Ope sorry I overlooked that, I'll go ahead and make the quick change.

mshannon-sil

Reviewable status: 3 of 4 files reviewed, 1 unresolved discussion (waiting on @ddaspit and @johnml1135)

tests/translation/huggingface/test_hugging_face_nmt_model_trainer.py line 25 at r2 (raw file):

Previously, mshannon-sil wrote…

Ope sorry I overlooked that, I'll go ahead and make the quick change.

Done.

johnml1135

Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit)

ddaspit

Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @mshannon-sil)

mshannon-sil added the bug Something isn't working label Nov 6, 2024

mshannon-sil requested review from ddaspit and johnml1135 November 6, 2024 18:05

mshannon-sil self-assigned this Nov 6, 2024

mshannon-sil linked an issue Nov 6, 2024 that may be closed by this pull request

Replace deprecated tokenizer methods #137

Closed

ddaspit requested changes Nov 6, 2024

View reviewed changes

mshannon-sil commented Nov 6, 2024

View reviewed changes

johnml1135 reviewed Nov 6, 2024

View reviewed changes

mshannon-sil commented Nov 6, 2024

View reviewed changes

johnml1135 approved these changes Nov 7, 2024

View reviewed changes

ddaspit reviewed Nov 7, 2024

View reviewed changes

mshannon-sil commented Nov 7, 2024

View reviewed changes

johnml1135 approved these changes Nov 8, 2024

View reviewed changes

ddaspit approved these changes Nov 8, 2024

View reviewed changes

mshannon-sil added 3 commits November 8, 2024 12:33

fix deprecated tokenizer methods, add test cases

c3dd68f

make add_lang_code_to_tokenizer a public function again

29ab189

fix test import

9889d9f

johnml1135 force-pushed the #137_lang_code branch from 36f5448 to 9889d9f Compare November 8, 2024 17:33

johnml1135 merged commit 730ea67 into main Nov 8, 2024
12 of 13 checks passed

johnml1135 deleted the #137_lang_code branch November 8, 2024 17:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix deprecated tokenizer methods, add test cases #139

fix deprecated tokenizer methods, add test cases #139

mshannon-sil commented Nov 6, 2024 •

edited by ddaspit

Loading

ddaspit left a comment

mshannon-sil left a comment

codecov-commenter commented Nov 6, 2024 •

edited

Loading

johnml1135 commented Nov 6, 2024

johnml1135 left a comment

mshannon-sil left a comment

johnml1135 left a comment

ddaspit left a comment

mshannon-sil left a comment

mshannon-sil left a comment

johnml1135 left a comment

ddaspit left a comment

fix deprecated tokenizer methods, add test cases #139

fix deprecated tokenizer methods, add test cases #139

Conversation

mshannon-sil commented Nov 6, 2024 • edited by ddaspit Loading

ddaspit left a comment

Choose a reason for hiding this comment

mshannon-sil left a comment

Choose a reason for hiding this comment

codecov-commenter commented Nov 6, 2024 • edited Loading

Codecov Report

johnml1135 commented Nov 6, 2024

johnml1135 left a comment

Choose a reason for hiding this comment

mshannon-sil left a comment

Choose a reason for hiding this comment

johnml1135 left a comment

Choose a reason for hiding this comment

ddaspit left a comment

Choose a reason for hiding this comment

mshannon-sil left a comment

Choose a reason for hiding this comment

mshannon-sil left a comment

Choose a reason for hiding this comment

johnml1135 left a comment

Choose a reason for hiding this comment

ddaspit left a comment

Choose a reason for hiding this comment

mshannon-sil commented Nov 6, 2024 •

edited by ddaspit

Loading

codecov-commenter commented Nov 6, 2024 •

edited

Loading