occurrence search using datasetid #108

Open

MathewBiddle opened this issue Nov 2, 2022 · 3 comments

@MathewBiddle (Collaborator)

I'm trying to rework an old notebook to use this package.

I have this piece of code:

from pyobis.occurrences import OccQuery

datasetid = '2ae2a2bd-8412-405b-8a9f-b71adc41d4c5'

occ = OccQuery()
dataset = occ.search(datasetid=datasetid)

but it didn't work; the request took too long.

Here are the expected details from my other process:
OBIS Dataset page: https://obis.org/dataset/2ae2a2bd-8412-405b-8a9f-b71adc41d4c5
API request: https://api.obis.org/v3/occurrence?datasetid=2ae2a2bd-8412-405b-8a9f-b71adc41d4c5
Found 698900 records.
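
For reference, here is roughly what the manual approach looks like as a minimal standard-library sketch mirroring the API request above (the size=10 cap and the total field are assumptions based on the public v3 API, not taken from the original notebook):

import json
import urllib.request

# Query the OBIS v3 occurrence endpoint directly; size=10 keeps the
# response small, whereas the real notebook would page through results.
datasetid = "2ae2a2bd-8412-405b-8a9f-b71adc41d4c5"
url = f"https://api.obis.org/v3/occurrence?datasetid={datasetid}&size=10"

with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

print(f"Found {data['total']} records.")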

@ayushanand18 (Collaborator)

I tried to reproduce the same error, and I found something interesting: there seems to be something going on with the request cache.

  • Step 1: executed occ.search(datasetid=...) and waited a long time but got no response, even though get_search_url() built the initial URL correctly.
  • Step 2: executed a dummy query with scientificname as the parameter, and no datasetid.
  • Step 3: repeated step 1, and this time it worked (see the sketch below).
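
A sketch of that sequence as one script; the call order mirrors the steps above, but the exact signature of get_search_url() is a guess from the comment, and the scientificname value is an arbitrary placeholder:

from pyobis.occurrences import OccQuery

occ = OccQuery()
datasetid = "2ae2a2bd-8412-405b-8a9f-b71adc41d4c5"

# Step 1: the built URL looks correct, but the search itself stalls.
print(occ.get_search_url(datasetid=datasetid))
# dataset = occ.search(datasetid=datasetid)  # hung on the first attempt

# Step 2: an unrelated dummy query, with no datasetid.
occ.search(scientificname="Mola mola")

# Step 3: the same datasetid search now returns promptly.
dataset = occ.search(datasetid=datasetid)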

I don't know why it behaved this way.

@MathewBiddle (Collaborator, Author)

So, pyobis is doing something odd with the query/response. The turnaround time is way too slow compared to just using urllib.request.urlopen() and manually building the URLs.

I think something is up with requests and how the query is being performed:

out = requests.get(url, params=args, **kwargs)

This Stack Overflow thread might be helpful in diagnosing the issue.
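
A rough way to see the gap between the two clients is a minimal timing sketch like the one below; it is not a careful benchmark, and the size=10 cap is added here so the payload itself doesn't dominate:

import time
import urllib.request

import requests

url = ("https://api.obis.org/v3/occurrence"
       "?datasetid=2ae2a2bd-8412-405b-8a9f-b71adc41d4c5&size=10")

# Time one request through each client against the same URL.
t0 = time.perf_counter()
urllib.request.urlopen(url).read()
print(f"urllib:   {time.perf_counter() - t0:.2f} s")

t0 = time.perf_counter()
requests.get(url).content
print(f"requests: {time.perf_counter() - t0:.2f} s")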

At this point, the package is not very useful to me because it takes 10+ minutes to run a search and get a response. Granted, I am trying to return 621,066 records, but it works just fine using urllib.

@ayushanand18 (Collaborator)

> At this point, the package is not very useful to me because it takes 10+ minutes to run a search and get a response. Granted, I am trying to return 621,066 records, but it works just fine using urllib.

Thank you so much for highlighting this issue. This had been on my to-do list for quite some time, and I had been experimenting with urllib but couldn't get satisfactory improvements. While the improvement from using urllib over requests was only around 25%, the method suggested in the Stack Overflow thread you attached saved more than 75% of the time.

I used a User-Agent string header in the request and found that the improvement was really significant. Something like this:

headers = {
    "User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36",
    "Connection": "close",
}
out = requests.get(url, params=args, headers=headers, **kwargs)
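
Since this fix changes two things at once (the User-Agent and Connection: close), it might be worth isolating which header accounts for the saving. A minimal experiment sketch, hitting the public endpoint directly rather than going through pyobis; the size=10 cap is an assumption to keep each run fast:

import time

import requests

url = "https://api.obis.org/v3/occurrence"
args = {"datasetid": "2ae2a2bd-8412-405b-8a9f-b71adc41d4c5", "size": 10}

# Time the same query under each header combination.
for headers in (
    {},                                                    # requests defaults
    {"Connection": "close"},                               # no keep-alive only
    {"User-Agent": "Mozilla/5.0"},                         # browser-like UA only
    {"User-Agent": "Mozilla/5.0", "Connection": "close"},  # both
):
    t0 = time.perf_counter()
    requests.get(url, params=args, headers=headers)
    print(f"{headers or 'defaults'}: {time.perf_counter() - t0:.2f} s")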

I'll initiate a PR for this at the earliest. Thanks again!
