Regarding CUDA out of memory error only during validation #22

Aafiya-H opened this issue Nov 27, 2024 · 0 comments

Hello, thank you so much for your work!
I am trying to finetune the Mantis model for multi-image question answering. For the time being I just want to check whether my script works. Using mixed precision causes this error, so I removed all of the encoder and decoder layers from the model except one of each, just to fit the model in memory. However, this still gives CUDA out of memory during validation, in spite of the reduced model size (trainable params: 1,322,308,128). I am running this on an A40 GPU. I am puzzled as to why training works fine but validation causes an out-of-memory issue.

Approaches tried:
- Using LoRA: training works fine, but CUDA goes out of memory after a certain number of validation steps (a simplified sketch of my LoRA setup follows this list).
- Using accelerate: the script hangs; not a single training iteration is performed even after a considerable time.
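
The LoRA attempt was along these lines (a simplified sketch; the rank, alpha, and target modules here are placeholder values, and model is the CustomLlavaForConditionalGeneration instance initialized below):

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                                 # placeholder rank
    lora_alpha=32,                        # placeholder scaling factor
    target_modules=["q_proj", "v_proj"],  # placeholder attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # confirms only the adapter weights are trainable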

I have also added torch.cuda.empty_cache() before every training_step and prediction_step call; a simplified sketch of these overrides is shown below.
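
For reference, the overrides look roughly like this (a simplified sketch; the exact training_step/prediction_step signatures vary across transformers versions, so I just forward the arguments):

import torch
from transformers import Trainer

class CustomTrainer(Trainer):
    # Clear the CUDA cache before each training and prediction step.
    # Arguments are forwarded unchanged because their signatures
    # differ between transformers versions.
    def training_step(self, *args, **kwargs):
        torch.cuda.empty_cache()
        return super().training_step(*args, **kwargs)

    def prediction_step(self, *args, **kwargs):
        torch.cuda.empty_cache()
        return super().prediction_step(*args, **kwargs)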

Here are the relevant portions of the code:
Imports that set the environment:

import sys
sys.path.append("") # path to some library
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2"  # must be set before CUDA is initialized to take effect

Model and processor initialization:

model_name = "TIGER-Lab/Mantis-8B-siglip-llama3"
processor = MLlavaProcessor.from_pretrained(model_name)
model = CustomLlavaForConditionalGeneration.from_pretrained(model_name).cuda()  # custom model inherits LlavaForConditionalGeneration and calls torch.cuda.empty_cache() before every prediction and training step
# Keep only the first layer of the vision encoder and of the language model to shrink the model.
model.vision_tower.vision_model.encoder.layers = torch.nn.ModuleList([model.vision_tower.vision_model.encoder.layers[0]])
model.language_model.model.layers = torch.nn.ModuleList([model.language_model.model.layers[0]])
torch.cuda.empty_cache()
print(f"Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad)}")
model.language_model.resize_token_embeddings(len(processor.tokenizer))
model.config.text_config.vocab_size = len(processor.tokenizer)

Trainer and training arguments:

training_args = TrainingArguments(
        output_dir=args.output_dir,
        per_device_train_batch_size=args.batch_size,  # train batch size is 2
        per_device_eval_batch_size=1,
        gradient_accumulation_steps=2,
        num_train_epochs=args.num_epochs,
        warmup_ratio=args.warmup_ratio,
        eval_strategy="steps",
        eval_steps=100,  # set to this value only to check eval
        logging_dir=os.path.join(args.output_dir, "logs"),
        logging_steps=10,
        remove_unused_columns=False,
        learning_rate=5e-6,
        lr_scheduler_type="linear",
        weight_decay=0.01,
        max_grad_norm=1.0,
        dataloader_pin_memory=True,
        report_to=None
)

trainer = CustomTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        data_collator=data_collator,
        tokenizer=processor.tokenizer,
        compute_metrics=compute_metrics
)
torch.cuda.empty_cache()
trainer.train()
torch.cuda.empty_cache()

Error message:
CUDA out of memory. Tried to allocate 13.89 GiB. GPU 0 has a total capacity of 44.35 GiB of which 13.54 GiB is free. Including non-PyTorch memory, this process has 30.79 GiB memory in use. Of the allocated memory 28.70 GiB is allocated by PyTorch, and 1.78 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables).

This occurs after 16 validation iterations have completed.
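
One guess on my side: since compute_metrics is passed to the trainer, I believe Trainer accumulates the logits from every prediction_step on the GPU until the whole evaluation pass finishes, so eval memory grows with each batch even at eval batch size 1. If that is the cause, something like the following (untested on my setup; the step count is an arbitrary example) should bound the growth:

training_args = TrainingArguments(
        output_dir=args.output_dir,
        per_device_eval_batch_size=1,
        eval_accumulation_steps=8,  # move accumulated eval tensors to the CPU every 8 steps
        # ... remaining arguments as above ...
)

Alternatively, passing preprocess_logits_for_metrics to the Trainer to reduce the logits (e.g. to an argmax) before they are accumulated would shrink the footprint further.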
Could you please help me figure out what could be causing this issue?
