Where can we get more training data?
iNaturalist has lots of image data that might be useful.
- The DarwinCore Archive contains the entire iNat taxonomic tree and all common names. Found here.
- The Amazon Open Data Program lets researchers download lots of iNat media in bulk, free of charge, for research purposes.
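A minimal sketch of how we might pull common names out of the DarwinCore Archive, assuming it is a zip of CSV files. The file names (`taxa.csv`, `VernacularNames-english.csv`) and column names (`id`, `scientificName`, `vernacularName`) are assumptions about the archive layout and should be checked against the actual download.

```python
import csv
import io
import zipfile


def read_common_names(archive, taxa_file="taxa.csv",
                      names_file="VernacularNames-english.csv"):
    """Map each scientific name to its list of common names.

    `archive` is a path or file-like object for the zip.
    File and column names are assumptions, not a confirmed schema.
    """
    names = {}
    with zipfile.ZipFile(archive) as zf:
        # First pass: taxon id -> scientific name.
        with zf.open(taxa_file) as fd:
            text = io.TextIOWrapper(fd, encoding="utf-8", newline="")
            sci_by_id = {row["id"]: row["scientificName"]
                         for row in csv.DictReader(text)}
        # Second pass: attach every common name to its taxon.
        with zf.open(names_file) as fd:
            text = io.TextIOWrapper(fd, encoding="utf-8", newline="")
            for row in csv.DictReader(text):
                sci = sci_by_id.get(row["id"])
                if sci is not None:  # skip names with no matching taxon
                    names.setdefault(sci, []).append(row["vernacularName"])
    return names
```

Pairing scientific names with common names this way would give us text labels for any iNat image whose taxon is known.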
There are comments on iNat as well. I haven't found any that aren't just species suggestions. But if we could find "debates" about an image's species, there might be textual data describing individual traits (supporting evidence for a particular classification).
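One way to look for such "debates" might be to flag observations whose identifications disagree. This is only a sketch: the record shape (dicts with `observation_id` and `taxon` keys) is hypothetical and would need to be mapped onto whatever the iNat export actually provides.

```python
from collections import defaultdict


def contested_observations(identifications, min_taxa=2):
    """Return ids of observations that received at least `min_taxa` distinct taxa.

    `identifications` is an iterable of dicts with the hypothetical keys
    "observation_id" and "taxon".
    """
    taxa_by_obs = defaultdict(set)
    for ident in identifications:
        taxa_by_obs[ident["observation_id"]].add(ident["taxon"])
    return sorted(obs for obs, taxa in taxa_by_obs.items()
                  if len(taxa) >= min_taxa)
```

Comments attached to the flagged observations are the ones most likely to contain supporting evidence rather than a bare species suggestion.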
Field guides will likely have great textual descriptions of traits. Many of them (for birds) should be available in PDF form. There are two challenges:
- Extracting images and text from PDFs. How hard this is depends on whether the text is embedded in the document or we need OCR.
- Copyright issues
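For the first challenge, a rough heuristic is to try embedded-text extraction first and fall back to OCR only for pages that yield almost nothing. The sketch below assumes some PDF library has already produced one string per page; the function name and the `min_chars` threshold are illustrative choices, not tuned values.

```python
def pages_needing_ocr(page_texts, min_chars=50):
    """Return indices of pages whose embedded text is too short to trust.

    `page_texts` holds one extracted string per page (empty for scanned
    pages). Whitespace is ignored when counting characters.
    """
    return [
        i for i, text in enumerate(page_texts)
        if len("".join(text.split())) < min_chars
    ]
```

A guide where most pages are flagged is probably a scan and should go through OCR in full.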
Butterfly field guides:
- Common Butterflies of the Chicago Region
- Field Studies Council Butterflies guide
- The Complete Field Guide to Butterflies of Australia
- eBMS Field Guides for butterfly Identification
- Field Guide to the Butterflies of Sri Lanka
Heliconius:
LILA has 10M labeled images. We don't have textual descriptions, but 10M images is nothing to sneeze at.
Reddit (and Twitter) have communities (r/whatisthisanimal, r/animalid) around identifying images of species.
There is likely lots of rich textual data describing animal traits.
However, there is also likely a lot of noisy text data.
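A crude first pass at separating trait descriptions from noise could be a keyword filter. The vocabulary below is a hypothetical starting list, not a vetted one, and a learned classifier would likely do better. This only illustrates the idea.

```python
import re

# Hypothetical trait vocabulary; extend per taxon group.
TRAIT_WORDS = {
    "stripe", "striped", "spots", "spotted", "wing", "wings", "tail",
    "beak", "bill", "head", "legs", "eyebrow", "pattern", "markings",
    "color", "colour",
}


def mentions_traits(comment, min_hits=2):
    """True if the comment uses at least `min_hits` distinct trait words."""
    tokens = set(re.findall(r"[a-z]+", comment.lower()))
    return len(tokens & TRAIT_WORDS) >= min_hits
```

Requiring two distinct trait words rather than one is meant to drop short replies that only name a species.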
Merlin has lots of detailed pictures of animals, labeled at varying levels of specificity (sex, age, subspecies):
- Adult male (Taiga)
- Adult male (Prairie)
- Female/immature (Taiga)
- Etc
There are also detailed text descriptions:
- "Small stocky falcon with a blocky head. Males are generally dark overall, but their color varies geographically. The Taiga subspecies is medium gray above with a pale mustache stripe and a thin white eyebrow."
This sort of data is exactly what we want!
How do we get access to it?