Regarding CUDA out of memory error only during validation #22

Aafiya-H opened this issue Nov 27, 2024 · 0 comments

Hello, thank you so much for your work!
I am trying to finetune the Mantis model for multi-image question answering. For the time being I just want to check whether my script works. Using mixed precision causes this error, so I removed all of the encoder and decoder layers from the model except one of each, just to fit the model in memory. However, this still gives CUDA out of memory during validation, in spite of the reduced model size (trainable params: 1,322,308,128). I am running this on an A40 GPU. I am puzzled as to why training works fine but validation causes an out-of-memory issue.

Approaches tried:
- Using LoRA: training works fine, but CUDA goes out of memory after a certain number of validation steps (a simplified sketch of my LoRA setup follows this list).
- Using accelerate: the script hangs; not a single training iteration is performed even after a considerable time.
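
The LoRA attempt was along these lines (a simplified sketch; the rank, alpha, and target modules here are placeholder values, and model is the CustomLlavaForConditionalGeneration instance initialized below):

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                                 # placeholder rank
    lora_alpha=32,                        # placeholder scaling factor
    target_modules=["q_proj", "v_proj"],  # placeholder attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # confirms only the adapter weights are trainable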

I have also added torch.cuda.empty_cache() before every training_step and prediction_step call; a simplified sketch of these overrides is shown below.
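
For reference, the overrides look roughly like this (a simplified sketch; the exact training_step/prediction_step signatures vary across transformers versions, so I just forward the arguments):

import torch
from transformers import Trainer

class CustomTrainer(Trainer):
    # Clear the CUDA cache before each training and prediction step.
    # Arguments are forwarded unchanged because their signatures
    # differ between transformers versions.
    def training_step(self, *args, **kwargs):
        torch.cuda.empty_cache()
        return super().training_step(*args, **kwargs)

    def prediction_step(self, *args, **kwargs):
        torch.cuda.empty_cache()
        return super().prediction_step(*args, **kwargs)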

Here are the relevant portions of the code:
Imports that set the environment:

import sys
sys.path.append("") # path to some library
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2"  # must be set before CUDA is initialized to take effect

Model and processor initialization:

model_name = "TIGER-Lab/Mantis-8B-siglip-llama3"
processor = MLlavaProcessor.from_pretrained(model_name)
model = CustomLlavaForConditionalGeneration.from_pretrained(model_name).cuda()  # custom model inherits LlavaForConditionalGeneration and calls torch.cuda.empty_cache() before every prediction and training step
# Keep only the first layer of the vision encoder and of the language model to shrink the model.
model.vision_tower.vision_model.encoder.layers = torch.nn.ModuleList([model.vision_tower.vision_model.encoder.layers[0]])
model.language_model.model.layers = torch.nn.ModuleList([model.language_model.model.layers[0]])
torch.cuda.empty_cache()
print(f"Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad)}")
model.language_model.resize_token_embeddings(len(processor.tokenizer))
model.config.text_config.vocab_size = len(processor.tokenizer)

Trainer and training arguments:

training_args = TrainingArguments(
        output_dir=args.output_dir,
        per_device_train_batch_size=args.batch_size,  # train batch size is 2
        per_device_eval_batch_size=1,
        gradient_accumulation_steps=2,
        num_train_epochs=args.num_epochs,
        warmup_ratio=args.warmup_ratio,
        eval_strategy="steps",
        eval_steps=100,  # set to this value only to check eval
        logging_dir=os.path.join(args.output_dir, "logs"),
        logging_steps=10,
        remove_unused_columns=False,
        learning_rate=5e-6,
        lr_scheduler_type="linear",
        weight_decay=0.01,
        max_grad_norm=1.0,
        dataloader_pin_memory=True,
        report_to=None
)

trainer = CustomTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        data_collator=data_collator,
        tokenizer=processor.tokenizer,
        compute_metrics=compute_metrics
)
torch.cuda.empty_cache()
trainer.train()
torch.cuda.empty_cache()

Error message:
CUDA out of memory. Tried to allocate 13.89 GiB. GPU 0 has a total capacity of 44.35 GiB of which 13.54 GiB is free. Including non-PyTorch memory, this process has 30.79 GiB memory in use. Of the allocated memory 28.70 GiB is allocated by PyTorch, and 1.78 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables).

This occurs after 16 validation iterations have completed.
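
One guess on my side: since compute_metrics is passed to the trainer, I believe Trainer accumulates the logits from every prediction_step on the GPU until the whole evaluation pass finishes, so eval memory grows with each batch even at eval batch size 1. If that is the cause, something like the following (untested on my setup; the step count is an arbitrary example) should bound the growth:

training_args = TrainingArguments(
        output_dir=args.output_dir,
        per_device_eval_batch_size=1,
        eval_accumulation_steps=8,  # move accumulated eval tensors to the CPU every 8 steps
        # ... remaining arguments as above ...
)

Alternatively, passing preprocess_logits_for_metrics to the Trainer to reduce the logits (e.g. to an argmax) before they are accumulated would shrink the footprint further.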
Could you please help me figure out what could be causing this issue?
