Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add str method to datamodule #20301

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

MrWhatZitToYaa
Copy link

@MrWhatZitToYaa MrWhatZitToYaa commented Sep 25, 2024

What does this PR do?

Added a str function to the DataModule in order to be able to print Datasets included in the module instead of an adress.

Fixes #9947

Before submitting
  • Was this discussed/agreed via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

📚 Documentation preview 📚: https://pytorch-lightning--20301.org.readthedocs.build/en/20301/

@github-actions github-actions bot added the pl Generic label for PyTorch Lightning package label Sep 25, 2024
@MrWhatZitToYaa
Copy link
Author

Will add documentation once code is fixed

@MrWhatZitToYaa MrWhatZitToYaa marked this pull request as ready for review October 3, 2024 15:22
@MrWhatZitToYaa
Copy link
Author

I have been waiting for quite some time. How do I go on? Do I just wait some more until someone looks at this?

@lantiga lantiga changed the title Feature/9947 Add str method to datamodule Add str method to datamodule Nov 12, 2024
@lantiga
Copy link
Collaborator

lantiga commented Nov 12, 2024

hey @MrWhatZitToYaa thanks for taking this up and sorry for the wait

can you add a bit more detail on what do we think it's good printing from a datamodule?

the original design in #9947 was a bit different, as was #9967, but happy to consider alternatives after there's some discussion on the rationale

@lantiga lantiga added waiting on author Waiting on user action, correction, or update lightningdatamodule pl.LightningDataModule labels Nov 12, 2024
@MrWhatZitToYaa
Copy link
Author

hey @MrWhatZitToYaa thanks for taking this up and sorry for the wait

can you add a bit more detail on what do we think it's good printing from a datamodule?

the original design in #9947 was a bit different, as was #9967, but happy to consider alternatives after there's some discussion on the rationale

The idea originally came to me when I had to to use the Datamodule to implement some datasets we have in our lab. It was really annoying to deal with the fact that you never really know what you are dealing with and makes debugging really hard. I talked to some colleagues and they agreed that they ran into the same problem. That's why I would like to add a str functionality for the Datamodule: primarily for debugging / logging.
I think the simplest approach is to scan the whole module for instances of datasets and then return a string where each line contains the name of the dataset and the length if it is available. This would be a big quality of life improvement when dealing with / constructing Datamodules.

@github-actions github-actions bot added docs Documentation related ci Continuous Integration fabric lightning.fabric.Fabric dependencies Pull requests that update a dependency file dockers package labels Nov 20, 2024
Added alternative Boring Data Module implementations
Added test cases for all possible options
Added additional check for NotImplementedError in string function of DataModule
Made changes to comply with requested suggestions
Switched from hardcoded \n to more general os.linesep
lantiga and others added 2 commits November 21, 2024 00:49
Corrected the annotation for the internal function and the list that is suppsoed to store the information on the datasets
@mergify mergify bot removed the has conflicts label Nov 25, 2024
Copy link
Collaborator

@lantiga lantiga left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution, I aligned it with the latest master and added a comment for a fix before we can merge. It should be quick.

attr = getattr(self, attr_name)

# Get Dataset information
if isinstance(attr, Dataset):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with this is that there are no guarantees that any datasets actually exist as attributes.

I think a better design would be to leverage the contract, and get dataloaders through self.train_dataloader (caveat: it may be a dataloader or an iterable of dataloaders), self.val_dataloader, etc, access their dataset property (https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) and produce a string from there in a similar way to the current way.

Also instead of ignoring non-Dataset subclasses in this case, we could call str from whatever is returned as dataset.

I think it's a valuable addition, wdyt?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good idea. I also had a similar idea in the beginning but discarded it because I thought it would add too much overhead. But I did some research an DataLoader should not add too much overhead to the actual dataset so we can use it in the str method.
Your suggestion would make the method more save, while still maintaining the same functionality, since one has to call prepare and setup anyway before one is able to access the dataset information.
I will try to implement your suggestion

Copy link

codecov bot commented Nov 25, 2024

Codecov Report

Attention: Patch coverage is 97.05882% with 1 line in your changes missing coverage. Please review.

Project coverage is 79%. Comparing base (1e88899) to head (1ce0f92).

❗ There is a different number of reports uploaded between BASE (1e88899) and HEAD (1ce0f92). Click for more details.

HEAD has 95 uploads less than BASE
Flag BASE (1e88899) HEAD (1ce0f92)
cpu 46 24
lightning 35 18
pytest 25 0
python3.9 12 6
lightning_fabric 7 0
python3.11 11 6
python3.12 18 9
python3.10 5 3
gpu 2 0
Additional details and impacted files
@@            Coverage Diff            @@
##           master   #20301     +/-   ##
=========================================
- Coverage      88%      79%     -9%     
=========================================
  Files         267      264      -3     
  Lines       23266    23245     -21     
=========================================
- Hits        20375    18297   -2078     
- Misses       2891     4948   +2057     
---- 🚨 Try these New Features:

MrWhatZitToYaa and others added 3 commits November 25, 2024 13:54
Fixed type annotation issue
Reduced code size by using Sized object from abc library
Switched from Dataset based implementation to Dataloader based implementation
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci Continuous Integration dependencies Pull requests that update a dependency file dockers docs Documentation related fabric lightning.fabric.Fabric lightningdatamodule pl.LightningDataModule package pl Generic label for PyTorch Lightning package waiting on author Waiting on user action, correction, or update
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support str(datamodule)
3 participants