Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix deprecated tokenizer methods, add test cases #139

Merged
merged 3 commits into from
Nov 8, 2024
Merged

Conversation

mshannon-sil
Copy link
Collaborator

@mshannon-sil mshannon-sil commented Nov 6, 2024

This PR removes the lang_code_to_id method in places where it could be used by the NllbTokenizer, due to NllbTokenizer deprecating that method, and keeps it for the tokenizers that still use it. In its place, I compare against tokenizer.added_tokens_encoder, since that's what huggingface uses as their check inside their tokenizer.convert_tokens_to_ids, which was already being used in the add_lang_code_to_tokenizer method for obtaining the lang_id from the lang_code. I also added test cases for each of the different tokenizer types supported by machine.py.


This change is Reviewable

@mshannon-sil mshannon-sil added the bug Something isn't working label Nov 6, 2024
@mshannon-sil mshannon-sil self-assigned this Nov 6, 2024
@mshannon-sil mshannon-sil linked an issue Nov 6, 2024 that may be closed by this pull request
Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @johnml1135 and @mshannon-sil)


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 401 at r1 (raw file):

def _add_lang_code_to_tokenizer(tokenizer: Union[PreTrainedTokenizer, PreTrainedTokenizerFast], lang_code: str):

You can just make this a "public" function (i.e. from the underscore) and export it from the "huggingface" package.

Copy link
Collaborator Author

@mshannon-sil mshannon-sil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 4 files reviewed, 1 unresolved discussion (waiting on @ddaspit and @johnml1135)


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 401 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You can just make this a "public" function (i.e. from the underscore) and export it from the "huggingface" package.

Done.

@codecov-commenter
Copy link

codecov-commenter commented Nov 6, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.27%. Comparing base (3a9df2c) to head (36f5448).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #139      +/-   ##
==========================================
+ Coverage   88.14%   88.27%   +0.12%     
==========================================
  Files         273      273              
  Lines       16050    16133      +83     
==========================================
+ Hits        14148    14242      +94     
+ Misses       1902     1891      -11     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@johnml1135
Copy link
Collaborator

machine/translation/huggingface/hugging_face_nmt_engine.py line 82 at r2 (raw file):

            else:
                src_lang_token = src_lang
                tgt_lang_token = tgt_lang

Is this code tested?

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 3 files at r1, 3 of 3 files at r2, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @ddaspit and @mshannon-sil)

Copy link
Collaborator Author

@mshannon-sil mshannon-sil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @ddaspit and @johnml1135)


machine/translation/huggingface/hugging_face_nmt_engine.py line 82 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Is this code tested?

Yes, although not rigorously. The if statement is covered by test_translate_greedy in test_huggingface_nmt_engine.py, which uses the "stas/tiny-m2m_100" model and corresponding M2M100Tokenizer. And the basic functionality of using the lang_code_to_token method for M2M100Tokenizer is also tested in the new test case I added test_m2m_100_tokenizer_add_lang_code() in test_hugginface_nmt_model_trainer.py.

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit)

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r2, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @mshannon-sil)


tests/translation/huggingface/test_hugging_face_nmt_model_trainer.py line 25 at r2 (raw file):

from machine.corpora import DictionaryTextCorpus, MemoryText, TextRow
from machine.translation.huggingface import HuggingFaceNmtEngine, HuggingFaceNmtModelTrainer
from machine.translation.huggingface.hugging_face_nmt_model_trainer import add_lang_code_to_tokenizer

Yoiu should import the function directly from the machine.translation.huggingface package.

Copy link
Collaborator Author

@mshannon-sil mshannon-sil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit)


tests/translation/huggingface/test_hugging_face_nmt_model_trainer.py line 25 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Yoiu should import the function directly from the machine.translation.huggingface package.

Ope sorry I overlooked that, I'll go ahead and make the quick change.

Copy link
Collaborator Author

@mshannon-sil mshannon-sil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 3 of 4 files reviewed, 1 unresolved discussion (waiting on @ddaspit and @johnml1135)


tests/translation/huggingface/test_hugging_face_nmt_model_trainer.py line 25 at r2 (raw file):

Previously, mshannon-sil wrote…

Ope sorry I overlooked that, I'll go ahead and make the quick change.

Done.

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit)

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @mshannon-sil)

@johnml1135 johnml1135 merged commit 730ea67 into main Nov 8, 2024
12 of 13 checks passed
@johnml1135 johnml1135 deleted the #137_lang_code branch November 8, 2024 17:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: ✅ Done
Development

Successfully merging this pull request may close these issues.

Replace deprecated tokenizer methods
4 participants