occurrence search using datasetid #108

Open

MathewBiddle opened this issue Nov 2, 2022 · 3 comments

@MathewBiddle (Collaborator)

I'm trying to rework an old notebook to use this package.

I have this piece of code:

from pyobis.occurrences import OccQuery

datasetid = '2ae2a2bd-8412-405b-8a9f-b71adc41d4c5'

occ = OccQuery()
dataset = occ.search(datasetid=datasetid)

but it didn't work; the request took too long.

Here are the expected details from my other process:
OBIS Dataset page: https://obis.org/dataset/2ae2a2bd-8412-405b-8a9f-b71adc41d4c5
API request: https://api.obis.org/v3/occurrence?datasetid=2ae2a2bd-8412-405b-8a9f-b71adc41d4c5
Found 698900 records.
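
For reference, here is roughly what the manual approach looks like as a minimal standard-library sketch mirroring the API request above (the size=10 cap and the total field are assumptions based on the public v3 API, not taken from the original notebook):

import json
import urllib.request

# Query the OBIS v3 occurrence endpoint directly; size=10 keeps the
# response small, whereas the real notebook would page through results.
datasetid = "2ae2a2bd-8412-405b-8a9f-b71adc41d4c5"
url = f"https://api.obis.org/v3/occurrence?datasetid={datasetid}&size=10"

with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

print(f"Found {data['total']} records.")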

@ayushanand18 (Collaborator)

I tried to reproduce the same error, and I found something interesting: there seems to be something going on with the request cache.

  • Step 1: executed occ.search(datasetid=...) and waited a long time but got no response, even though get_search_url() built the initial URL correctly.
  • Step 2: executed a dummy query with scientificname as the parameter, and no datasetid.
  • Step 3: repeated step 1, and this time it worked (see the sketch below).
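
A sketch of that sequence as one script; the call order mirrors the steps above, but the exact signature of get_search_url() is a guess from the comment, and the scientificname value is an arbitrary placeholder:

from pyobis.occurrences import OccQuery

occ = OccQuery()
datasetid = "2ae2a2bd-8412-405b-8a9f-b71adc41d4c5"

# Step 1: the built URL looks correct, but the search itself stalls.
print(occ.get_search_url(datasetid=datasetid))
# dataset = occ.search(datasetid=datasetid)  # hung on the first attempt

# Step 2: an unrelated dummy query, with no datasetid.
occ.search(scientificname="Mola mola")

# Step 3: the same datasetid search now returns promptly.
dataset = occ.search(datasetid=datasetid)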

I don't know why it behaved this way.

@MathewBiddle (Collaborator, Author)

So, pyobis is doing something odd with the query/response. The turnaround time is way too slow compared to just using urllib.request.urlopen() and manually building the URLs.

I think something is up with requests and how the query is being performed:

out = requests.get(url, params=args, **kwargs)

This Stack Overflow thread might be helpful in diagnosing the issue.
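
A rough way to see the gap between the two clients is a minimal timing sketch like the one below; it is not a careful benchmark, and the size=10 cap is added here so the payload itself doesn't dominate:

import time
import urllib.request

import requests

url = ("https://api.obis.org/v3/occurrence"
       "?datasetid=2ae2a2bd-8412-405b-8a9f-b71adc41d4c5&size=10")

# Time one request through each client against the same URL.
t0 = time.perf_counter()
urllib.request.urlopen(url).read()
print(f"urllib:   {time.perf_counter() - t0:.2f} s")

t0 = time.perf_counter()
requests.get(url).content
print(f"requests: {time.perf_counter() - t0:.2f} s")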

At this point, the package is not very useful to me because it takes 10+ minutes to run a search and get a response. Granted, I am trying to return 621,066 records, but it works just fine using urllib.

@ayushanand18 (Collaborator)

> At this point, the package is not very useful to me because it takes 10+ minutes to run a search and get a response. Granted, I am trying to return 621,066 records, but it works just fine using urllib.

Thank you so much for highlighting this issue. This had been on my to-do list for quite some time, and I had been experimenting with urllib but couldn't get satisfactory improvements. While the improvement from using urllib over requests was only around 25%, the method suggested in the Stack Overflow thread you attached saved more than 75% of the time.

I used a User-Agent string header in the request and found that the improvement was really significant. Something like this:

headers = {
    "User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36",
    "Connection": "close",
}
out = requests.get(url, params=args, headers=headers, **kwargs)
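
Since this fix changes two things at once (the User-Agent and Connection: close), it might be worth isolating which header accounts for the saving. A minimal experiment sketch, hitting the public endpoint directly rather than going through pyobis; the size=10 cap is an assumption to keep each run fast:

import time

import requests

url = "https://api.obis.org/v3/occurrence"
args = {"datasetid": "2ae2a2bd-8412-405b-8a9f-b71adc41d4c5", "size": 10}

# Time the same query under each header combination.
for headers in (
    {},                                                    # requests defaults
    {"Connection": "close"},                               # no keep-alive only
    {"User-Agent": "Mozilla/5.0"},                         # browser-like UA only
    {"User-Agent": "Mozilla/5.0", "Connection": "close"},  # both
):
    t0 = time.perf_counter()
    requests.get(url, params=args, headers=headers)
    print(f"{headers or 'defaults'}: {time.perf_counter() - t0:.2f} s")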

I'll initiate a PR for this at the earliest. Thanks again!
