training-data-sources.md


Training Data

Where can we get more training data?

  1. iNaturalist
  2. Field Guides
  3. Bird datasets
  4. LILA BC
  5. Reddit
  6. Merlin

iNaturalist

iNaturalist has lots of image data that might be useful.

iNat observations also have comments. So far, every comment I have found is just a species suggestion. But if we could find "debates" about an image's species, the comments might contain textual data describing individual traits (supporting evidence for a particular classification).
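One way to look for such debates is to flag observations whose identifications disagree, then keep only the free-text comments attached to them. The sketch below assumes each identification is a plain dict with hypothetical `taxon` and `body` keys; these names are illustrative and are not taken from the iNaturalist API.

```python
def is_debated(identifications, min_taxa=2):
    """Return True if the identifications name at least `min_taxa` distinct taxa.

    `identifications` is a list of dicts with hypothetical keys
    "taxon" (suggested species) and "body" (optional comment text).
    """
    taxa = {ident["taxon"] for ident in identifications if ident.get("taxon")}
    return len(taxa) >= min_taxa


def debate_comments(identifications):
    """Collect the non-empty comment bodies from a list of identifications."""
    comments = []
    for ident in identifications:
        body = (ident.get("body") or "").strip()
        if body:
            comments.append(body)
    return comments


# Made-up example data, for illustration only.
observation = [
    {"taxon": "Falco columbarius", "body": "Pale mustache stripe suggests Merlin."},
    {"taxon": "Falco sparverius", "body": ""},
]

if is_debated(observation):
    print(debate_comments(observation))
```

Observations where everyone agrees are skipped, since their comments are unlikely to argue about traits.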

Field Guides

Field guides will likely have great textual descriptions of traits. Many of them (for birds) should be available in PDF form. There are two challenges:

  1. Extracting images and text from PDFs. The approach depends on whether the text is embedded in the document or we need OCR.
  2. Copyright issues
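For the first challenge, a rough per-page check could decide whether the embedded text is usable or the page should go to OCR. This is a minimal sketch that assumes the page text has already been pulled out by some PDF library (not shown). The function name `needs_ocr` and both thresholds are hypothetical and would need tuning on real field guides.

```python
def needs_ocr(page_text, min_chars=50, min_alpha_ratio=0.5):
    """Guess whether a page's embedded text is too sparse or garbled to use.

    `page_text` is whatever text a PDF library extracted for one page.
    Both thresholds are arbitrary starting points, not tuned values.
    """
    text = page_text.strip()
    # Scanned pages usually yield little or no embedded text.
    if len(text) < min_chars:
        return True
    # Garbled extraction tends to be dominated by symbols and digits.
    readable = sum(ch.isalpha() or ch.isspace() for ch in text)
    return readable / len(text) < min_alpha_ratio


pages = [
    "",  # e.g. a scanned page with no text layer
    "Small stocky falcon with a blocky head. Males are generally dark overall.",
]
for number, text in enumerate(pages, start=1):
    route = "OCR" if needs_ocr(text) else "embedded text"
    print(f"page {number}: use {route}")
```

Routing page by page matters because a single guide can mix scanned plates with pages that have a text layer.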

Butterfly field guides:

Heliconius:

Bird datasets

LILA BC

LILA has 10M labeled images. We don't have textual descriptions, but 10M images is nothing to sneeze at.

Reddit

Reddit (and Twitter) have communities (r/whatisthisanimal, r/animalid) devoted to identifying species from images. These threads likely contain rich text describing animal traits, but they likely contain a lot of noisy text as well.
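A first pass at separating trait descriptions from noise could be a simple keyword heuristic. The sketch below keeps comments that are long enough and mention enough trait vocabulary. The word list, the thresholds, and the function names are all assumptions for illustration; a real filter would need a much larger vocabulary or a learned classifier.

```python
import re

# Hypothetical, deliberately tiny trait vocabulary.
TRAIT_WORDS = {
    "stripe", "eyebrow", "tail", "wing", "beak", "bill",
    "head", "spots", "color", "gray", "pale", "dark",
}


def tokenize(comment):
    """Lowercase the comment and split it into alphabetic words."""
    return re.findall(r"[a-z]+", comment.lower())


def trait_score(comment):
    """Fraction of words in the comment that are trait words."""
    words = tokenize(comment)
    if not words:
        return 0.0
    return sum(word in TRAIT_WORDS for word in words) / len(words)


def filter_comments(comments, min_score=0.1, min_words=5):
    """Keep comments with at least `min_words` words and `min_score` trait density."""
    return [
        comment
        for comment in comments
        if len(tokenize(comment)) >= min_words and trait_score(comment) >= min_score
    ]


# Made-up comments, for illustration only.
comments = [
    "lol cute",
    "Looks like a Merlin: pale mustache stripe and a thin white eyebrow.",
    "Where did you take this picture my friend?",
]
print(filter_comments(comments))
```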

Merlin

Merlin has lots of detailed pictures of animals, labeled at varying levels of specificity (age, sex, subspecies):

  • Adult male (Taiga)
  • Adult male (Prairie)
  • Female/immature (Taiga)
  • Etc.

There are also detailed text descriptions:

  • "Small stocky falcon with a blocky head. Males are generally dark overall, but their color varies geographically. The Taiga subspecies is medium gray above with a pale mustache stripe and a thin white eyebrow."

This sort of data is exactly what we want!

How do we get access to it?
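If we do get access, each picture label could be split into its parts and paired with the text description. The sketch below parses labels such as "Adult male (Taiga)" into a plumage part and an optional subspecies. The `TraitExample` record, its field names, and `parse_label` are hypothetical and do not reflect any Merlin data format.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Matches "Adult male (Taiga)" as well as labels with no parenthesized part.
LABEL_RE = re.compile(r"^(?P<plumage>[^()]+?)\s*(?:\((?P<subspecies>[^()]+)\))?$")


@dataclass
class TraitExample:
    """One hypothetical training record: a labeled picture plus its description."""
    species: str
    plumage: str
    subspecies: Optional[str]
    description: str


def parse_label(label):
    """Split a label into (plumage, subspecies); subspecies is None if absent."""
    match = LABEL_RE.match(label.strip())
    if match is None:
        raise ValueError(f"Unrecognized label: {label!r}")
    return match.group("plumage").strip(), match.group("subspecies")


plumage, subspecies = parse_label("Adult male (Taiga)")
example = TraitExample(
    species="Merlin",  # assumed species for the description quoted above
    plumage=plumage,
    subspecies=subspecies,
    description="Small stocky falcon with a blocky head.",
)
print(example)
```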