Atlas of Living Australia #4918

Open
7 tasks done
sarayourfriend opened this issue Sep 12, 2024 · 0 comments
Labels
💻 aspect: code Concerns the software code in the repository 🌟 goal: addition Addition of new feature 🟩 priority: low Low priority and doesn't need to be rushed ☁️ provider: audio Audio provider ☁️ provider: images Image provider 🧱 stack: catalog Related to the catalog and Airflow DAGs

Comments

@sarayourfriend
Collaborator

Source API Endpoint / Documentation

https://support.ala.org.au/support/solutions/articles/6000196714-how-to-download-occurrence-records

Provider description

Atlas of Living Australia (ALA) aggregates open datasets from several sources around Australia. If you exclude iNaturalist Australia, they have over 2,000,000 images and nearly 40,000 sounds.

I don't know how many of those are openly licensed, but at a quick glance, every individual record I clicked on carried some variation of a CC license. According to their image-specific search tool, only 4003 images are all rights reserved. A large number have "unrecognised" licenses, but here is an example of one that has a CC license URI in the rights field: https://images.ala.org.au/image/8486cc13-4da9-4dd3-a0f4-5a3d1feea1dc

Some of the "unrecognised" records simply have no license listed. I suspect the vast majority carry CC license URIs, though.
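If the "unrecognised" records really do mostly hold CC license URIs, pulling the license type and version out of the rights value could look roughly like the sketch below. It assumes the value is a standard creativecommons.org URI; the function name and the "cc0"/"pdm" slugs are illustrative, and nothing here has been checked against real ALA data beyond the single example above.

```python
import re
from typing import Optional, Tuple

# Matches CC license and public domain tool URIs, for example
# https://creativecommons.org/licenses/by-nc/4.0/ or
# https://creativecommons.org/publicdomain/zero/1.0/
CC_URI = re.compile(
    r"^https?://creativecommons\.org/"
    r"(?:licenses/(?P<license>[a-z\-]+)|publicdomain/(?P<pd>zero|mark))"
    r"/(?P<version>\d+\.\d+)",
    re.IGNORECASE,
)


def parse_cc_uri(rights: Optional[str]) -> Optional[Tuple[str, str]]:
    """Return (license, version) if `rights` holds a CC URI, else None."""
    if not rights:
        return None
    match = CC_URI.match(rights.strip())
    if match is None:
        return None
    if match["pd"]:
        # Illustrative slugs for the two public domain tools.
        slug = "cc0" if match["pd"].lower() == "zero" else "pdm"
    else:
        slug = match["license"].lower()
    return slug, match["version"]
```

Anything that returns None (free-text values, missing licenses) would need separate handling or be skipped.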

Licenses Provided

CC licences

Provider API Technical info

ALA's data is organised similarly to Europeana's: it aggregates other sources, but is also a source itself.

There is an API. Here's an example (page size set to 1): https://biocache-ws.ala.org.au/ws/occurrences/search?q=*%3A*&disableAllQualityFilters=true&qualityProfile=ALA&fq=multimedia%3A%22Image%22&fq=-data_resource_uid%3A%22dr1411%22&qc=-_nest_parent_%3A*&pageSize=1

{
  "pageSize": 1,
  "startIndex": 0,
  "totalRecords": 2379535,
  "sort": "score",
  "dir": "asc",
  "status": "OK",
  "occurrences": [
    {
      "uuid": "c4262666-da59-4c89-964d-9ca5e4bcdb03",
      "occurrenceID": "https://canbr.gov.au/photo/apii/id/dig/905",
      "raw_catalogNumber": "dig 905.1",
      "taxonConceptID": "https://id.biodiversity.org.au/node/apni/2902845",
      "eventDate": 1126656000000,
      "scientificName": "Acacia subcaerulea",
      "vernacularName": "Blue-barked Acacia",
      "taxonRank": "species",
      "taxonRankID": 7000,
      "kingdom": "Plantae",
      "phylum": "Charophyta",
      "classs": "Equisetopsida",
      "order": "Fabales",
      "family": "Fabaceae",
      "genus": "Acacia",
      "genusGuid": "https://id.biodiversity.org.au/taxon/apni/51471290",
      "species": "Acacia subcaerulea",
      "speciesGuid": "https://id.biodiversity.org.au/node/apni/2902845",
      "year": 2005,
      "month": "09",
      "basisOfRecord": "HUMAN_OBSERVATION",
      "dataResourceUid": "dr413",
      "dataResourceName": "Australian Plant Image Index",
      "assertions": [
        "MODIFIED_DATE_INVALID",
        "MISSING_TAXONRANK",
        "TAXON_MISAPPLIED_MATCHED",
        "LOCATION_NOT_SUPPLIED",
        "COORDINATE_UNCERTAINTY_METERS_INVALID",
        "MISSING_GEOREFERENCE_DATE",
        "MISSING_GEOREFERENCEDBY",
        "MISSING_GEOREFERENCEPROTOCOL",
        "MISSING_GEOREFERENCESOURCES",
        "MISSING_GEOREFERENCEVERIFICATIONSTATUS"
      ],
      "speciesGroups": ["Plants", "Flowering plants", "Dicots"],
      "image": "a31fb54a-255e-4d74-a647-105d36626cc5",
      "images": ["a31fb54a-255e-4d74-a647-105d36626cc5"],
      "spatiallyValid": true,
      "recordedBy": ["Fagg, M."],
      "collectors": ["Fagg, M."],
      "raw_scientificName": "Acacia subcaerulea",
      "raw_basisOfRecord": "HumanObservation",
      "multimedia": ["Image"],
      "license": "CC-BY 3.0 (Au)",
      "imageUrl": "https://images.ala.org.au/image/proxyImage?imageId=a31fb54a-255e-4d74-a647-105d36626cc5",
      "largeImageUrl": "https://images.ala.org.au/image/proxyImageThumbnailLarge?imageId=a31fb54a-255e-4d74-a647-105d36626cc5",
      "smallImageUrl": "https://images.ala.org.au/image/proxyImageThumbnail?imageId=a31fb54a-255e-4d74-a647-105d36626cc5",
      "thumbnailUrl": "https://images.ala.org.au/image/proxyImageThumbnail?imageId=a31fb54a-255e-4d74-a647-105d36626cc5",
      "imageUrls": [
        "https://images.ala.org.au/image/proxyImageThumbnailLarge?imageId=a31fb54a-255e-4d74-a647-105d36626cc5"
      ],
      "geospatialKosher": "true",
      "collector": ["Fagg, M."],
      "namesLsid": "Acacia subcaerulea|https://id.biodiversity.org.au/node/apni/2902845|Blue-barked Acacia|Plantae|Fabaceae",
      "left": 587970,
      "right": 587970
    }
  ],
  "facetResults": [],
  "query": "?q=*%3A*&disableAllQualityFilters=true&qualityProfile=ALA&fq=multimedia%3A%22Image%22&fq=-data_resource_uid%3A%22dr1411%22&qc=-_nest_parent_%3A*",
  "urlParameters": "?q=*%3A*&disableAllQualityFilters=true&qualityProfile=ALA&fq=multimedia%3A%22Image%22&fq=-data_resource_uid%3A%22dr1411%22&qc=-_nest_parent_%3A*",
  "queryTitle": "[all records]",
  "activeFacetMap": {
    "multimedia": {
      "name": "multimedia",
      "displayName": "Multimedia:\"Image\"",
      "value": "\"Image\""
    },
    "-data_resource_uid": {
      "name": "-data_resource_uid",
      "displayName": "-<span>Data resource: iNaturalist Australia</span>",
      "value": "\"dr1411\""
    }
  },
  "activeFacetObj": {
    "multimedia": [
      {
        "name": "multimedia",
        "displayName": "Multimedia:\"Image\"",
        "value": "multimedia:\"Image\""
      }
    ],
    "-data_resource_uid": [
      {
        "name": "-data_resource_uid",
        "displayName": "-<span>Data resource: iNaturalist Australia</span>",
        "value": "-data_resource_uid:\"dr1411\""
      }
    ]
  }
}
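For reference, the query above can be assembled in Python along these lines. This is only a sketch: the filters are copied from the example URL, but using "startIndex" as a request parameter is an assumption based on the field the response echoes back, and both function names are made up for illustration.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://biocache-ws.ala.org.au/ws/occurrences/search"

# Filters copied from the example query; dr1411 is iNaturalist Australia.
BASE_PARAMS = [
    ("q", "*:*"),
    ("disableAllQualityFilters", "true"),
    ("qualityProfile", "ALA"),
    ("fq", 'multimedia:"Image"'),
    ("fq", '-data_resource_uid:"dr1411"'),
    ("qc", "-_nest_parent_:*"),
]


def build_search_url(page_size: int, start_index: int = 0) -> str:
    """Return a search URL for one page of image occurrences."""
    params = BASE_PARAMS + [
        ("pageSize", str(page_size)),
        # Assumption: the request accepts the same "startIndex" name
        # that appears in the response body.
        ("startIndex", str(start_index)),
    ]
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


def page_starts(total_records: int, page_size: int) -> range:
    """Offsets needed to walk `total_records` in pages of `page_size`."""
    return range(0, total_records, page_size)
```

With "totalRecords" at 2379535, paging through everything this way would take a very large number of requests, which is part of why the bulk download is attractive.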

However, I think the more powerful option is their bulk download of individual queries. If you visit the "advanced search" page for the above query (https://biocache.ala.org.au/occurrence/search?q=*%3A*&disableAllQualityFilters=true&qualityProfile=ALA&fq=multimedia%3A%22Image%22&qc=-_nest_parent_%3A*&fq=-data_resource_uid%3A%22dr1411%22), there is a download button that lets you export a CSV. The "download" is asynchronous: you trigger an export on their end, they generate a zip, and you get a link back later.

The API for that is documented here: https://docs.ala.org.au/openapi/index.html?urls.primaryName=occurrences#/Download

We'd need a DAG that completes this flow:

  1. Trigger a download
  2. Poke the status endpoint until it says it's complete
  3. Download the zip to disk
  4. Unzip it and upload the CSV to S3
  5. Then follow the iNaturalist approach (load the CSV into Postgres, etc.)
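Steps 2 and 4 of that flow could be sketched in plain Python as below. This deliberately avoids calling the ALA download API: `get_status` is a hypothetical callable that would wrap the status endpoint, and the "finished" and "failed" status strings are placeholders, not values taken from ALA's documentation. In the real DAG the polling would more likely be an Airflow sensor.

```python
import time
import zipfile
from pathlib import Path
from typing import Callable, List


def wait_for_export(
    get_status: Callable[[], str],
    poke_interval: float = 60.0,
    timeout: float = 6 * 60 * 60,
    done: str = "finished",  # placeholder status value
    failed: str = "failed",  # placeholder status value
) -> None:
    """Call `get_status` until it reports completion, failure, or timeout."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status == done:
            return
        if status == failed:
            raise RuntimeError("export failed on the provider side")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"export still '{status}' after {timeout}s")
        time.sleep(poke_interval)


def extract_csvs(zip_path: Path, dest_dir: Path) -> List[Path]:
    """Extract only the CSV members of the export zip into `dest_dir`."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".csv"):
                extracted.append(Path(archive.extract(name, dest_dir)))
    return extracted
```

The paths returned by `extract_csvs` are what would then be uploaded to S3 before the Postgres load.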

ALA has their own image proxying with various sizes of thumbnails.

Note that each "occurrence" may have more than one image! The "occurrenceID" only points to the "main" image, I think? The other UUIDs in "images" each have their own proxied image URL provided by ALA, and those URLs were distinct in every record where I saw this happen.
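To make the one-occurrence-to-many-images point concrete, here is a rough sketch of fanning an occurrence out into one record per image UUID. The URL patterns are inferred from the sample response and the image page linked earlier, so treat them as assumptions. The output field names are illustrative, and copying the occurrence-level "license" onto every image is also an assumption that would need verifying.

```python
from typing import Iterator

# Patterns inferred from the examples in this issue, not from ALA docs.
IMAGE_LANDING = "https://images.ala.org.au/image/{image_id}"
IMAGE_PROXY = "https://images.ala.org.au/image/proxyImage?imageId={image_id}"


def iter_image_records(occurrence: dict) -> Iterator[dict]:
    """Yield one record per distinct image UUID on an occurrence."""
    seen = set()
    for image_id in occurrence.get("images") or []:
        if image_id in seen:
            continue  # skip UUIDs repeated within the same occurrence
        seen.add(image_id)
        yield {
            "foreign_identifier": image_id,
            "foreign_landing_url": IMAGE_LANDING.format(image_id=image_id),
            "url": IMAGE_PROXY.format(image_id=image_id),
            # Assumption: the occurrence-level license covers each image.
            "license_raw": occurrence.get("license"),
            "occurrence_uuid": occurrence.get("uuid"),
        }
```

Keying records on the image UUID rather than the occurrence UUID keeps secondary images from being dropped.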

Checklist to complete before beginning development

  • Verify there is a way to retrieve the entire relevant portion of the provider's collection in a systematic way via their API.
  • Verify the API provides license info (license type and version; license URL provides both, and is preferred)
  • Verify the API provides stable direct links to individual works.
  • Verify the API provides a stable landing page URL to individual works.
  • Note other info the API provides, such as thumbnails, dimensions, attribution info (required if non-CC0 licenses will be kept), title, description, other metadata, tags, etc.
  • Attach example responses to API queries that have the relevant info.

Implementation

  • 🙋 I would be interested in implementing this feature.
@sarayourfriend sarayourfriend added 🟩 priority: low Low priority and doesn't need to be rushed 🧹 status: ticket work required Needs more details before it can be worked on 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work 🌟 goal: addition Addition of new feature 💻 aspect: code Concerns the software code in the repository 🧱 stack: catalog Related to the catalog and Airflow DAGs ☁️ provider: any Replace with the actual provider type labels Sep 12, 2024
@openverse-bot openverse-bot moved this to 📋 Backlog in Openverse Backlog Sep 12, 2024
@obulat obulat removed 🧹 status: ticket work required Needs more details before it can be worked on 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work labels Sep 16, 2024
@obulat obulat added ☁️ provider: audio Audio provider ☁️ provider: images Image provider and removed ☁️ provider: any Replace with the actual provider type labels Sep 30, 2024
Projects
Status: 📋 Backlog
Development

No branches or pull requests

2 participants