How are we going to evaluate CLIP-like models for use in biology?
- Zero-shot classification accuracy on IID data
- Zero-shot classification accuracy on co-mimics
- Zero-shot classification accuracy on co-mimics with textual descriptions of the differences
- Data efficiency (OPEN QUESTION)
- Generalization to lab/museum photos (QUESTIONS)
- Trait presence (OPEN QUESTION)
We want to use 1K classes so we can compare to the ImageNet1K classes. We also want to compare performance on seen and unseen classes, so we might include 9K classes during pretraining, evaluate accuracy on 1K of those seen classes, and then evaluate accuracy on the remaining 1K unseen classes.
We also want to compare common names and taxonomic names for the text encoder: common names are likely easier.
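A minimal sketch of this zero-shot comparison, assuming open_clip and precomputed, L2-normalized image features; the model tag and prompt template are placeholder choices, not final decisions:

```python
# Zero-shot classification sketch (open_clip assumed; model tag and prompt
# template are placeholders).
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def class_text_features(names, template="a photo of a {}."):
    """Encode one prompt per class name and L2-normalize."""
    tokens = tokenizer([template.format(n) for n in names])
    with torch.no_grad():
        feats = model.encode_text(tokens)
    return feats / feats.norm(dim=-1, keepdim=True)

def zero_shot_accuracy(image_feats, text_feats, labels):
    """image_feats: (N, D) normalized image embeddings; labels: (N,) class ids."""
    preds = (image_feats @ text_feats.T).argmax(dim=-1)
    return (preds == labels).float().mean().item()

# Run once with common names and once with taxonomic names on the same
# image features to compare the two text-encoder inputs:
#   zero_shot_accuracy(img_feats, class_text_features(common_names), labels)
#   zero_shot_accuracy(img_feats, class_text_features(scientific_names), labels)
```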
Can we classify museum photos of co-mimics? They're deliberately visually challenging (to predators). We would want to do zero-shot classification, perhaps with textual descriptions of the differences between species. We could also use butterflies for zero-shot fine-grained classification of very similar classes, and to test whether CLIP generalizes to museum photos.
Data on /fs/ess/PAS2136/Butterfly/Jiggins_dataset/Jiggins_datav2 might be useful for this.
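One way to add the descriptions, as a rough sketch: append a short distinguishing-trait phrase to each class prompt and reuse the zero-shot pipeline above. The species pair below is just an illustrative co-mimic pair, and the descriptions are placeholders to be written by someone who knows the taxa:

```python
# Placeholder descriptions; real ones should come from an expert or field guide.
descriptions = {
    "Heliconius erato": "description of its distinguishing wing pattern",
    "Heliconius melpomene": "description of its distinguishing wing pattern",
}

def co_mimic_prompts(descriptions):
    """Build one description-augmented prompt per species."""
    return [
        f"a photo of {name}, a butterfly with {traits}."
        for name, traits in descriptions.items()
    ]

# Encode these with the text encoder in place of the name-only prompts and
# compare accuracy with vs. without the trait descriptions.
```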
How can we do few-shot evaluation of CLIP models? Should we fine-tune? Should we do some sort of prompting or in-context learning?
We can probably just use Tip-Adapter: it takes advantage of the text encoder but doesn't require any training. We could also use CLIP-Adapter, which inserts a bottleneck layer (a linear projection down to a lower dimension, then a linear projection back to the original dimension) with a residual connection after the vision and text encoders; the adapter weights are tuned on the few-shot examples. Another option is WiSE-FT, which takes a linear combination of the fine-tuned weights and the original weights. Linear probing is consistently worse than zero-shot CLIP and these other options; do not use it.
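A rough sketch of the Tip-Adapter idea on precomputed features, as I understand it; the alpha/beta defaults are assumptions and should be tuned on a validation split:

```python
import torch

def tip_adapter_logits(test_feats, text_feats, support_feats, support_labels,
                       num_classes, alpha=1.0, beta=5.5):
    """Training-free few-shot logits = zero-shot CLIP logits + cache-model logits.

    test_feats:     (M, D) L2-normalized image features to classify
    text_feats:     (C, D) L2-normalized class text features (zero-shot classifier)
    support_feats:  (C*K, D) L2-normalized features of the few-shot examples
    support_labels: (C*K,) integer class labels of the few-shot examples
    """
    # Standard zero-shot logits from the text encoder.
    clip_logits = 100.0 * test_feats @ text_feats.T
    # Cache model: affinities between test features and the few-shot "keys" ...
    affinity = test_feats @ support_feats.T                # (M, C*K)
    cache_weights = torch.exp(-beta * (1.0 - affinity))    # sharpen affinities
    # ... propagated to the one-hot labels ("values") of the few-shot examples.
    one_hot = torch.nn.functional.one_hot(support_labels, num_classes).float()
    cache_logits = cache_weights @ one_hot                 # (M, C)
    return clip_logits + alpha * cache_logits
```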
Can we do zero-shot classification of the fish photos from the Kenya team? These photos are OOD in the sense that they are on white backgrounds (same applies to the butterflies).
Ideally we would also have naturalist/citizen-science photos of the same species, so we can test whether classification generalizes across backgrounds alone, rather than across backgrounds and unseen species at the same time.
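To keep those two factors separate, the breakdown could be as simple as accuracy per (photo source, seen/unseen) group; the field names here are illustrative:

```python
from collections import defaultdict

def grouped_accuracy(records):
    """records: dicts with keys 'domain' ("inat" or "museum"), 'seen' (bool),
    'pred', and 'label'; returns accuracy per (domain, seen) group."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        key = (r["domain"], r["seen"])
        total[key] += 1
        correct[key] += int(r["pred"] == r["label"])
    return {k: correct[k] / total[k] for k in total}
```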
Can we identify traits in a picture? What about in a species? Given a picture of some animals, can we definitively say "these traits are present in this photo"? Can we say "this animal has these traits, even though they are not visible in this photo"?
What dataset can we use for this? Can we construct a small one (~200-500 examples)?
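One possible baseline while we figure out the dataset: treat each trait as an independent binary zero-shot query (a "with trait" prompt vs. a "without trait" prompt). The trait list, prompt wording, and 0.5 threshold are all placeholders, and this only covers traits visible in the photo, not traits the species has but the photo hides:

```python
import torch

TRAITS = ["stripes", "spots", "a forked tail"]  # illustrative trait list

def trait_presence(image_feat, encode_text):
    """image_feat: (D,) L2-normalized image feature.
    encode_text: callable mapping a list of strings to (len, D) normalized features."""
    scores = {}
    for trait in TRAITS:
        prompts = [f"an animal with {trait}", f"an animal without {trait}"]
        text_feats = encode_text(prompts)
        logits = 100.0 * image_feat @ text_feats.T
        scores[trait] = torch.softmax(logits, dim=-1)[0].item()  # P(trait present)
    return scores  # e.g. threshold at 0.5 to decide presence per trait
```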