gateway tests
I was running a series of tests that I now doubt. I've cut them. First, I think I had some SQL doing the wrong thing in terms of individual domain PPM (pages per minute). I need to write some Python to do the right calculations. (And, I want some over-time graphs/histograms anyway.)
The tests did produce high throughput, which I believe was accurate. That is, with 400 `fetch` workers, I was able to cycle through a lot of content. However, it was 200 domains with hundreds of workers in every service. And, more importantly, it was a short test.
Short tests have lots of URLs to fetch. Long tests get into the tail of a site. Or: if you have 200 sites, and many of them are small, you'll finish them (and your overall throughput will look great), but the one remaining large site will still fetch at one page every two seconds. The rate-over-time for a given domain is always capped at one page every two seconds.
measure | value (v5) | v4 | v3 | v2 | v1 |
---|---|---|---|---|---|
guestbook pages | 49645 | 20163 | 23873 | 18724 | 25650 |
duration (minutes) | 47 | 20 | 28 | 21 | 14 |
overall PPM | 1056 | 1008 | 852 | 891 | 1832 |
entree total jobs | 252733 | 62581 | 68376 | 41196 | 69174 |
entree JPM | 5521 | 3129 | 2442 | 1961 | 4941 |
fetch total jobs | 1009510 | 242156 | 289043 | 66215 | 135531 |
fetch JPM | 22088 | 12107 | 10322 | 3153 | 9680 |
extract total jobs | 26297 | 6315 | 8797 | 4534 | 9797 |
extract JPM | 573 | 315 | 314 | 215 | 700 |
walk total jobs | 26130 | 6229 | 8690 | 4482 | 9689 |
walk JPM | 580 | 311 | 310 | 213 | 692 |
I tried one `entree` worker, one `fetch`, one `extract`, and one `walk`. At 2 seconds/page, I should see 30ppm as my optimal throughput. I saw closer to 12ppm. Why? Some things are PDFs, and are slow to extract. You see the same pages over and over, and those take time to walk and process. So, at 1 worker/service, it is slower than optimal, for expected/imaginable reasons.
This also confirmed that the services were "behaving," especially `entree` and `fetch`.
service | workers |
---|---|
entree | 50 |
extract | 20 |
fetch | 100 |
walk | 20 |
I ran the NIH URLs that we have.
This is a more accurate/honest test. It will take around 24 hours. A few things learned.
- I need to create some kind of error log. This is different from logging to NR or similar. This is where, when pages fail for known reasons, we log them in a DB for later investigation. I now have a number of points where a page can fail to fetch or extract, and I want to log those. (A sketch of what I have in mind follows this list.)
- I limited PDFs to 5MB, because I was running out of RAM. With 20 workers, I could get 20 large PDFs at the same time. Of all of the services, `extract` is the most RAM sensitive.
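Roughly what I have in mind for the error log, as a sketch (the table, column names, and the example row are all made up, not the service's actual schema):

```python
import sqlite3

# Sketch: an error-log table for known, per-page failures.
conn = sqlite3.connect("errors.sqlite")
conn.execute("""
    CREATE TABLE IF NOT EXISTS page_errors (
        host      TEXT,
        path      TEXT,
        stage     TEXT,   -- e.g. 'fetch' or 'extract'
        reason    TEXT,   -- the known failure reason (timeout, bad MIME type, too large, ...)
        logged_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

def log_page_error(host: str, path: str, stage: str, reason: str) -> None:
    """Record a known, per-page failure for later investigation / partner reporting."""
    conn.execute(
        "INSERT INTO page_errors (host, path, stage, reason) VALUES (?, ?, ?, ?)",
        (host, path, stage, reason),
    )
    conn.commit()

log_page_error("www.nichd.nih.gov", "/some-report.pdf", "extract", "pdf larger than 5MB limit")
```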
I saw high throughput to start, and am now down to a single domain, which is closing out the last 3000 items in the queue at one item per 2s. That's around 3000 × 2s = 6000s, or roughly 100 minutes.
- Create an error log for diagnostics/partner reporting.
- `pack` at intervals. At this point, I'd like the last domain to have packed every... 10000 pages, or every `n` minutes, or something. This way, we are providing incremental results, since we'll be crawling lots of things all the time. (A sketch of that trigger follows this list.)
- I'm wiping `completed` every three minutes. This is a bit aggressive. But, `completed` ends up filling the queue database, and having millions of rows in that DB by the end of the day is unnecessary.
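The pack trigger could be as simple as a pages-or-elapsed-time check. A minimal sketch (the thresholds are placeholders, since the notes above only say "10000 pages, or every `n` minutes"):

```python
import time

# Hypothetical thresholds; pack when either one is exceeded.
PACK_EVERY_PAGES = 10_000
PACK_EVERY_SECONDS = 10 * 60

def should_pack(pages_since_pack: int, last_pack_time: float) -> bool:
    """True when a domain is due for an incremental pack (page count OR elapsed time)."""
    return (pages_since_pack >= PACK_EVERY_PAGES
            or time.time() - last_pack_time >= PACK_EVERY_SECONDS)
```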
These numbers were when there were 3K fetch actions left from the NIH run. So, this is 99% of the NIH dataset.
for d in sqlite zstd ; do pushd $d ; stat -c "%n,%s" -- * | column -t -s, ; popd ; done
ncats.nih.gov.sqlite 75907072
nida.nih.gov.sqlite 296427520
search.gov.sqlite 6754304
www.cancer.gov.sqlite 751804416
www.cc.nih.gov.sqlite 261128192
www.cit.nih.gov.sqlite 2113536
www.fac.gov.sqlite 1511424
www.fic.nih.gov.sqlite 368394240
www.genome.gov.sqlite 371859456
www.nccih.nih.gov.sqlite 633974784
www.nei.nih.gov.sqlite 103837696
www.nhlbi.nih.gov.sqlite 390328320
www.niaaa.nih.gov.sqlite 161193984
www.niaid.nih.gov.sqlite 106496
www.niams.nih.gov.sqlite 196116480
www.nia.nih.gov.sqlite 114688
www.nibib.nih.gov.sqlite 92028928
www.nichd.nih.gov.sqlite 4045082624
www.nidcr.nih.gov.sqlite 68005888
www.niddk.nih.gov.sqlite 172163072
www.niehs.nih.gov.sqlite 3007397888
www.nigms.nih.gov.sqlite 259956736
www.nimhd.nih.gov.sqlite 112951296
www.nimh.nih.gov.sqlite 706301952
www.ninds.nih.gov.sqlite 106496
www.ninr.nih.gov.sqlite 19484672
www.nlm.nih.gov.sqlite 151867392
and
ncats.nih.gov.sqlite.zstd 7666732
nida.nih.gov.sqlite.zstd 34773743
search.gov.sqlite.zstd 1123866
www.cancer.gov.sqlite.zstd 107231310
www.cc.nih.gov.sqlite.zstd 16772282
www.cit.nih.gov.sqlite.zstd 163872
www.fac.gov.sqlite.zstd 267103
www.fic.nih.gov.sqlite.zstd 28904089
www.genome.gov.sqlite.zstd 40279650
www.nccih.nih.gov.sqlite.zstd 41647877
www.nei.nih.gov.sqlite.zstd 15527666
www.nhlbi.nih.gov.sqlite.zstd 43916005
www.niaaa.nih.gov.sqlite.zstd 23605389
www.niaid.nih.gov.sqlite.zstd 11797
www.niams.nih.gov.sqlite.zstd 22963428
www.nia.nih.gov.sqlite.zstd 8384
www.nibib.nih.gov.sqlite.zstd 12554245
www.nichd.nih.gov.sqlite.zstd 1111609743
www.nidcr.nih.gov.sqlite.zstd 8830463
www.niddk.nih.gov.sqlite.zstd 29633034
www.niehs.nih.gov.sqlite.zstd 225221029
www.nigms.nih.gov.sqlite.zstd 21056165
www.nimhd.nih.gov.sqlite.zstd 10288537
www.nimh.nih.gov.sqlite.zstd 86354587
www.ninds.nih.gov.sqlite.zstd 10625
www.ninr.nih.gov.sqlite.zstd 2137708
www.nlm.nih.gov.sqlite.zstd 30627062
are the baseline numbers. There are 12GB of SQLite files from the run, but only 1.8GB when they are compressed with `zstd`. This matters, because we can create read-only databases with zstd.
In deployment, this is the difference between being able to fit all of NIH on a single droplet, or having to use two. The compression seems to range from a factor of 3x to 10x with zstd. In this case, we're looking at being able to fit 3x the content on a single droplet.
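A quick way to see the per-file compression factor, assuming the same `sqlite/` and `zstd/` directories used in the `stat` loop above:

```python
from pathlib import Path

# Print the compression factor for each per-domain database.
for db in sorted(Path("sqlite").glob("*.sqlite")):
    z = Path("zstd") / (db.name + ".zstd")
    if z.exists() and z.stat().st_size > 0:
        print(f"{db.name}: {db.stat().st_size / z.stat().st_size:.1f}x")
```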
Ultimately, we're going to need a `balance` component (or `distribute`) that can take a query, ask multiple servers for results, and then return the agglomeration. The challenge will be that the distributor will want to know that NIH is made up of ~20 domains, and those are all on one or more droplets. However, when we pass an NIH query to a droplet, it needs to know that there might be 20 databases that need to be queried in order to return a result.
Without compression, we're around 2-10ms per query. With compression, we'll be in the 20-50ms/query range. If we have to query 20 databases... well, we can use gofuncs or similar, but in a straight line, 20 × 50ms is about 1000ms to resolve the query. So, distributing queries serially over large agency domains could mean on the order of 1-2s to resolve a query. There may be things we can do in order to provide incremental results, or speed things up (e.g. parallelism). That's both... ok but also not ideal.
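In sketch form, the fan-out looks like this (the real services would use goroutines; `query_one` here is a placeholder for the actual query path against a per-domain database):

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder: open the (possibly zstd-backed) per-domain SQLite file and run the FTS5 query.
def query_one(database: str, q: str) -> list[dict]:
    return []

def distribute(databases: list[str], q: str) -> list[dict]:
    """Fan one query out to every database behind an agency, then merge the results."""
    with ThreadPoolExecutor(max_workers=20) as pool:
        per_db = list(pool.map(lambda db: query_one(db, q), databases))
    # Flatten; a real version still needs a cross-database ranking/merge policy.
    return [row for rows in per_db for row in rows]

print(distribute(["www.nlm.nih.gov.sqlite", "ncats.nih.gov.sqlite"], "clinical trials"))
```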
However, it is a small/light system, and it scales. As can be seen, it looks like all of NIH fits on a single server.
There's more I want to explore here about performance. This is just the first glance, and some initial thoughts.
The NIH content is 12GB uncompressed, and 1.8GB compressed. That suggests (?) that I should be able to do the following:
- Create a new (agglomerated) nih.gov database
- For each subdomain, insert the contents into the agglomerated DB
- Compress that, and get a single file of ~1.8GB.
This would make querying easier. I only have to do one query against a single file, and the question of ranking is solved. (That is, the ranking algorithms in FTS5 will work across all the content regardless of domain.) I do need to think about expanding our current table to include domains as well as paths, because the assumption is that a single file contains a single DB. And, it implies we may want to provide a ranking. That is, should the root of nih.gov rank above a subdomain? I don't know.
This is an experiment "to do."
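A sketch of the agglomeration, assuming each per-domain DB has a `pages` table with `path` and `content` columns (the real schema may differ), and adding a `domain` column so FTS5 ranking can still distinguish subdomains:

```python
import sqlite3

# Build a single nih.gov database from the per-subdomain databases.
agg = sqlite3.connect("nih.gov.sqlite")
agg.execute("CREATE VIRTUAL TABLE IF NOT EXISTS pages USING fts5(domain, path, content)")

for db_file, domain in [("www.nlm.nih.gov.sqlite", "www.nlm.nih.gov"),
                        ("ncats.nih.gov.sqlite", "ncats.nih.gov")]:
    agg.execute("ATTACH DATABASE ? AS sub", (db_file,))
    # Carry the domain along with each row.
    agg.execute("INSERT INTO pages (domain, path, content) "
                "SELECT ?, path, content FROM sub.pages", (domain,))
    agg.commit()                      # DETACH can't run inside an open transaction
    agg.execute("DETACH DATABASE sub")
```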
I want to write some Python in a notebook to analyze the guestbook database. I'll first see if I can export that DB to SQLite for analysis. Or, CSV. `sling` will be useful here.
To start:
sling run --src-conn WORKDB --src-stream 'public.hosts' --tgt-object 'file://hosts.json'
and
sling run --src-conn WORKDB --src-stream 'public.guestbook' --tgt-object 'file://guestbook.json'
created data to work with. (The connections had to be set up first.)
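Loading the export and reproducing the per-host counts is a few lines of pandas. (Assumptions: the guestbook rows carry a `host` column, and sling wrote JSON lines; adjust `lines=` if it wrote a JSON array.)

```python
import pandas as pd

# One row per page in the guestbook; count pages per host.
guestbook = pd.read_json("guestbook.json", lines=True)
counts = guestbook.groupby("host").size().sort_values()
print(counts)
```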
 | host | count |
---|---|---|
17 | public.csr.nih.go | 1 |
16 | www.cit.nih.gov | 42 |
1 | www.nia.nih.gov | 162 |
14 | www.ninds.nih.gov | 184 |
8 | www.niaid.nih.gov | 229 |
4 | www.ninr.nih.gov | 450 |
18 | ncats.nih.gov | 909 |
19 | www.nidcr.nih.gov | 1092 |
22 | www.nibib.nih.gov | 1219 |
2 | www.nigms.nih.gov | 1555 |
23 | www.cc.nih.gov | 2040 |
15 | www.niaaa.nih.gov | 2176 |
24 | www.nccih.nih.gov | 2295 |
21 | www.niams.nih.gov | 2296 |
6 | www.fic.nih.gov | 2518 |
7 | www.nei.nih.gov | 2633 |
11 | nida.nih.gov | 2974 |
13 | www.nimhd.nih.gov | 3833 |
3 | www.nimh.nih.gov | 5410 |
10 | www.niddk.nih.gov | 6374 |
20 | www.nhlbi.nih.gov | 7197 |
9 | www.nichd.nih.gov | 7238 |
12 | www.niehs.nih.gov | 7497 |
0 | www.genome.gov | 7675 |
25 | www.cancer.gov | 10375 |
5 | www.nlm.nih.gov | 36171 |
Now, I asked and answered these questions...
Q: How long did all of these fetches take?
A: One answer is from the first "last_fetched" until the last "last_fetched"
Q: Is that an accurate measure?
A: Yes and no. No, because it really just measures how long the largest domain took. That is, a large domain can only fetch one page per 2s, so the total duration is that of nlm.nih.gov.
Q: What is a more accurate measure?
A: What is the first and last fetch for each domain?
Q: Is that more accurate?
A: ... yes? It measures how long things take to get through the queue as well. So, it is authentic.
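That per-host calculation is straightforward in pandas (again assuming `host` and `last_fetched` columns in the export; the column names are my guess at the guestbook schema):

```python
import pandas as pd

guestbook = pd.read_json("guestbook.json", lines=True)
guestbook["last_fetched"] = pd.to_datetime(guestbook["last_fetched"])

# Duration = last fetch minus first fetch, per host; PPM = pages / minutes.
per_host = guestbook.groupby("host")["last_fetched"].agg(["min", "max", "count"])
per_host["seconds"] = (per_host["max"] - per_host["min"]).dt.total_seconds()
per_host["ppm"] = per_host["count"] / (per_host["seconds"] / 60)
print(per_host[["seconds", "count", "ppm"]].sort_index())
```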
 | host | seconds | pages | ppm |
---|---|---|---|---|
18 | ncats.nih.gov | 25761 | 909 | 2.11715 |
11 | nida.nih.gov | 39405 | 2974 | 4.52836 |
25 | www.cancer.gov | 46067 | 10375 | 13.5129 |
23 | www.cc.nih.gov | 44972 | 2040 | 2.72169 |
16 | www.cit.nih.gov | 1830 | 42 | 1.37705 |
6 | www.fic.nih.gov | 26478 | 2518 | 5.70587 |
0 | www.genome.gov | 58988 | 7675 | 7.80667 |
24 | www.nccih.nih.gov | 36567 | 2295 | 3.76569 |
7 | www.nei.nih.gov | 36993 | 2633 | 4.27054 |
20 | www.nhlbi.nih.gov | 66826 | 7197 | 6.46186 |
1 | www.nia.nih.gov | 74162 | 162 | 0.131064 |
15 | www.niaaa.nih.gov | 37522 | 2176 | 3.47956 |
8 | www.niaid.nih.gov | 40060 | 229 | 0.342986 |
21 | www.niams.nih.gov | 46591 | 2296 | 2.95679 |
22 | www.nibib.nih.gov | 19958 | 1219 | 3.6647 |
9 | www.nichd.nih.gov | 48998 | 7238 | 8.86322 |
19 | www.nidcr.nih.gov | 66145 | 1092 | 0.990551 |
10 | www.niddk.nih.gov | 51985 | 6374 | 7.35674 |
12 | www.niehs.nih.gov | 34842 | 7497 | 12.9103 |
2 | www.nigms.nih.gov | 24848 | 1555 | 3.75483 |
3 | www.nimh.nih.gov | 74370 | 5410 | 4.36466 |
13 | www.nimhd.nih.gov | 50098 | 3833 | 4.5906 |
14 | www.ninds.nih.gov | 78265 | 184 | 0.141059 |
4 | www.ninr.nih.gov | 28805 | 450 | 0.937337 |
5 | www.nlm.nih.gov | 83849 | 36171 | 25.883 |
Now, this is a measure of "pages per minute per host."
Some of these are nearing optimal. For example, nlm.nih.gov is 25.8 ppm. The theoretical max is 30ppm, because I fetch once every two seconds.
However, for others, it is around 3ppm.
This seems abysmal.
But.
All of these are in one queue.
The logic is as follows:
- `walk` finds a URL.
- It passes the URL to `entree`.
- `entree` checks the guestbook.
- If it has been recently updated, we skip. (There's a bit more here, but this is good enough for getting on with.)
- If it hasn't, we continue.
- `entree` enqueues the URL for `fetch`.
- `fetch` grabs a URL from the queue.
- `fetch` asks if the host has been hit less than 2s ago (against a local/in-memory dictionary).
- If the host has been hit, that worker sleeps until it is time to fetch the content.
- If not, we continue immediately.
- We grab the content (if it is an allowed MIME type, etc.).
- `fetch` updates the guestbook and enqueues `walk` and `extract` jobs for this page.
The likely slowdown here is that we are inserting a lot of jobs into the `fetch` queue for all of those domains. And, because workers are sleeping, that means that the 100 workers I was using might all be sleeping on a domain.
So. This is why some smaller domains, in the face of a larger domain (or a more link-rich domain), will be "starved" in this model.
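The services themselves aren't Python, but the politeness check in `fetch` amounts to something like this sketch (`POLITENESS_SECONDS` and the dictionary shape are illustrative):

```python
import time

POLITENESS_SECONDS = 2.0
last_hit: dict[str, float] = {}   # in-memory "when did we last see this host" map

def fetch_politely(host: str, url: str) -> None:
    """Sleep until the host is allowed again, then fetch."""
    now = time.monotonic()
    wait = POLITENESS_SECONDS - (now - last_hit.get(host, 0.0))
    if wait > 0:
        # The worker blocks here; with one shared queue, many of the 100 workers
        # can end up asleep on the same busy host at once.
        time.sleep(wait)
    last_hit[host] = time.monotonic()
    # ... actually fetch url, update the guestbook, enqueue walk/extract ...
```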
Before I think about "fixes," I want to see some graphs. This is working, and doing "the right thing," but the model could, I think, have faster throughput.
I want to see some graphs.
This is one way of visualizing what happened. Each dot is a fetch.
The `y` axis is the host, and the `x` axis is the time that the fetch happened. The left-most is the "first" fetch, and the right-most is the "last" fetch. The timing is less important than the relative nature. Looking at `ncats.nih.gov`, we can see how the fetches get spread out over time. This is because the queue is being thrashed by other domains, and the pages aren't coming around. So, the PPM value for `ncats` is 2, but that is because `nlm` is thrashing through the `fetch` queue constantly. (As are some other domains.) So while `nlm` has a PPM of 25, other domains are actually being starved.
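The plot itself is a simple scatter, one dot per fetch (same assumed columns as above):

```python
import matplotlib.pyplot as plt
import pandas as pd

guestbook = pd.read_json("guestbook.json", lines=True)
guestbook["last_fetched"] = pd.to_datetime(guestbook["last_fetched"])

# x = when the fetch happened, y = the host.
hosts = sorted(guestbook["host"].unique())
y = guestbook["host"].map({h: i for i, h in enumerate(hosts)})

plt.figure(figsize=(12, 8))
plt.scatter(guestbook["last_fetched"], y, s=2)
plt.yticks(range(len(hosts)), hosts)
plt.xlabel("fetch time")
plt.ylabel("host")
plt.tight_layout()
plt.show()
```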
The ideal fix is to have one queue per domain (and a few workers per queue). Then, we can come close to (or hit) the theoretical 30 PPM limit per domain.
I'm attempting a pool-of-pools approach, where there are 10 `fetch` queues, and they are round-robin'd. This should see a speedup, because each pool of 10 workers is assigned to a smaller set of values to fetch (and we bounce them around the queues). This might mean that the overall throughput per-domain is higher.
The current approach---one queue with all the domains on it---could go faster if I 1) have a large number of workers, and 2) thrash the queue. This means that if we see 100 `nlm` jobs that have arrived too soon, they get thrown to the back of the queue, and we more quickly process other domains. It feels like a bad solution, but we'll see. Experiment #7 will demonstrate the round-robin queues.
I could have made some graphs.
I modified `fetch`.
Now, it does this:
- It spawns 10 `fetch` workers
- It spawns 10 `fetch-0` workers
- It spawns 10 `fetch-1` workers
- ...
- It spawns 10 `fetch-9` workers

(In other words, it does workers × workers.)
Whenever a domain isn't available, I throw the job at another queue. I'm not sleeping (I should be), so this will maximally thrash the queue(s). But, I want to see what happens.
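The change amounts to this (a Python sketch of the logic only; `enqueue` stands in for the real job-queue insert, and the `last_hit` map is the same shared "when did we last see this domain" structure):

```python
import random
import time

N_QUEUES = 10
POLITENESS_SECONDS = 2.0
last_hit: dict[str, float] = {}   # shared across all fetch-N workers

def enqueue(queue_name: str, url: str) -> None:
    print(f"enqueue {url} on {queue_name}")   # placeholder for the real queue insert

def handle_fetch_job(host: str, url: str) -> None:
    """A fetch-N worker: fetch if the host is ready, otherwise bounce to another queue."""
    now = time.monotonic()
    if now - last_hit.get(host, 0.0) < POLITENESS_SECONDS:
        # Not ready: no sleep, just throw it at a different fetch-N queue.
        enqueue(f"fetch-{random.randrange(N_QUEUES)}", url)
        return
    last_hit[host] = now
    # ... fetch, update the guestbook, enqueue walk/extract ...
```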
The watcher shows this:
queue | state | count
---------+-----------+------
entree | completed | 1519
entree | running | 6
extract | running | 1
extract | available | 1
extract | completed | 166
fetch | available | 14001
fetch | running | 10
fetch | completed | 313
fetch-0 | available | 298
fetch-0 | running | 2
fetch-0 | completed | 229
fetch-1 | available | 554
fetch-1 | completed | 219
fetch-2 | available | 527
fetch-2 | completed | 237
fetch-2 | running | 8
fetch-3 | available | 265
fetch-3 | completed | 233
fetch-3 | running | 19
fetch-4 | completed | 218
fetch-4 | running | 10
fetch-4 | available | 384
fetch-5 | completed | 215
fetch-5 | available | 338
fetch-5 | running | 8
fetch-6 | available | 425
fetch-6 | running | 3
fetch-6 | completed | 226
fetch-7 | completed | 161
fetch-7 | available | 37
fetch-7 | running | 7
fetch-8 | running | 9
fetch-8 | completed | 239
fetch-8 | available | 1289
fetch-9 | completed | 255
fetch-9 | running | 2
fetch-9 | available | 332
pack | available | 1
pack | completed | 211
walk | retryable | 1
walk | running | 1
walk | completed | 166
Now, I have a lot of fetchers. And, because they're all playing off the same "when did we last see this domain" structure, they're all being polite. But, each different queue will be picked up by a different set of workers, meaning that I should see better throughput than a single set of workers, sleeping, on a single set of URLs.
In truth, I wonder if this would be simpler if I just had more workers, and I was more willing to thrash the queue. That is, 1000 workers who are more willing to thrash the queue would probably be faster than 10 pools of 10.
But, I'll allow this experiment to play out while I'm building some graphs of experiment 6.
... hours later...
Late in the run, this is what the queues looked like, focusing only on `available`:
queue | state | count
---------+-----------+-------
fetch | available | 11
fetch-0 | available | 78
fetch-1 | available | 149
fetch-2 | available | 491
fetch-4 | available | 99
fetch-5 | available | 62
fetch-6 | available | 94
fetch-7 | available | 264
fetch-8 | available | 10755
fetch-9 | available | 20
And, there are two domains cycling: `nlm.nih.gov` and `cancer.gov`. They tick by one per second, alternating.
So, the rest ran. Which is good... if there were more domains queued, they, too, would run. However, this looks weird... there are 10 workers in each of those queue pools, and they're just shuffling jobs between each other.
I'm going to pull the guestbook at this point and run some analysis, even though there are 10K pages left to fetch, I suspect.
Here is the Experiment 6/Run-01 plot:
And Experiment 7/Run-02:
Smaller domains finished faster using the round-robin approach.
And experiment 7/Run-03:
This is a per-domain queueing approach, where each domain gets its own queue, and 5 workers. (The incoming queue gets 10 workers.)
Note the x-axis; the largest domain is nowhere near done yet.
The per-domain and round-robin have similar density for low-page-count domains. However, the round-robin involves a lot of queue thrashing. In this model, I have one client for `fetch`, and one client for all of the domains. The client managing the domains has a 500ms backoff, which means that it is hitting the queues very infrequently compared to the other approach. And, each domain has its own worker pool. So, the result is that each domain maintains a good PPM. Having per-domain queues is better in all but one case.
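Roughly, the per-domain routing looks like this (a sketch; `enqueue` and `spawn_workers` stand in for the real queue client):

```python
from urllib.parse import urlparse

domain_queues: dict[str, str] = {}     # host -> queue name
WORKERS_PER_DOMAIN_QUEUE = 5           # per the run above; the incoming queue gets 10

def enqueue(queue_name: str, url: str) -> None:
    print(f"enqueue {url} on {queue_name}")        # placeholder

def spawn_workers(queue_name: str, n: int) -> None:
    print(f"spawn {n} workers for {queue_name}")   # placeholder

def route(url: str) -> None:
    """The incoming queue routes each URL to a per-domain queue, creating it on first sight."""
    host = urlparse(url).netloc
    if host not in domain_queues:
        domain_queues[host] = f"fetch-{host}"
        spawn_workers(domain_queues[host], WORKERS_PER_DOMAIN_QUEUE)
    enqueue(domain_queues[host], url)
```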
ppm-1 is the number of pages-per-minute with a naive queuing approach.
ppm-2 is with round-robin queuing, where anything delayed is shuffled over to the next queue.
ppm-3 is where every domain has its own queue for slow (1-page-per-2s) fetching.
2 vs 1 is how many times faster the round-robin is than the naive approach.
3 vs 1 is how many times faster the per-domain queuing approach is.
While the two are similar... well, round-robin is more general, but thrashes the queue hard. Per-domain is conceptually clearer, thrashes the queue less, and is more performant. I'm inclined to go that direction for production.
host | ppm-1 | ppm-2 | ppm-3 | 2 vs 1 | 3 vs 1 | 3 better? |
---|---|---|---|---|---|---|
ncats.nih.gov | 2.12 | 4.36 | 5.65 | 2.06 | 2.67 | yes |
nida.nih.gov | 4.53 | 6.35 | 10.76 | 1.40 | 2.38 | yes |
www.cancer.gov | 13.51 | 26.38 | 27.89 | 1.95 | 2.06 | yes |
www.cc.nih.gov | 2.72 | 2.76 | 9.31 | 1.01 | 3.42 | yes |
www.cit.nih.gov | 1.38 | 1.38 | 1.68 | 1.00 | 1.22 | yes |
www.fic.nih.gov | 5.71 | 12.33 | 17.89 | 2.16 | 3.14 | yes |
www.genome.gov | 7.81 | 23.68 | 26.00 | 3.03 | 3.33 | yes |
www.nccih.nih.gov | 3.77 | 17.24 | 9.03 | 4.58 | 2.40 | no |
www.nei.nih.gov | 4.27 | 19.45 | 20.14 | 4.56 | 4.72 | yes |
www.nhlbi.nih.gov | 6.46 | 22.24 | 26.59 | 3.44 | 4.11 | yes |
www.nia.nih.gov | 0.13 | 0.29 | 0.47 | 2.18 | 3.59 | yes |
www.niaaa.nih.gov | 3.48 | 3.19 | 16.53 | 0.92 | 4.75 | yes |
www.niaid.nih.gov | 0.34 | 0.45 | 0.92 | 1.32 | 2.69 | yes |
www.niams.nih.gov | 2.96 | 5.88 | 13.46 | 1.99 | 4.55 | yes |
www.nibib.nih.gov | 3.66 | 11.56 | 17.52 | 3.16 | 4.78 | yes |
www.nichd.nih.gov | 8.86 | 26.74 | 27.28 | 3.02 | 3.08 | yes |
www.nidcr.nih.gov | 0.99 | 2.50 | 8.22 | 2.53 | 8.30 | yes |
www.niddk.nih.gov | 7.36 | 7.86 | 19.01 | 1.07 | 2.58 | yes |
www.niehs.nih.gov | 12.91 | 26.68 | 27.79 | 2.07 | 2.15 | yes |
www.nigms.nih.gov | 3.75 | 13.63 | 17.72 | 3.63 | 4.72 | yes |
www.nimh.nih.gov | 4.36 | 5.20 | 13.86 | 1.19 | 3.18 | yes |
www.nimhd.nih.gov | 4.59 | 15.01 | 18.96 | 3.27 | 4.13 | yes |
www.ninds.nih.gov | 0.14 | 0.24 | 0.42 | 1.71 | 2.97 | yes |
www.ninr.nih.gov | 0.94 | 0.46 | 1.82 | 0.49 | 1.94 | yes |
www.nlm.nih.gov | 25.88 | 31.67 | 53.08 | 1.22 | 2.05 | yes |
This is an idea if other approaches do not seem to be efficient...
Another approach is to have "backoff queues."
The incoming queue processes as fast as possible. Domains that are ready are fetched.
If a domain isn't ready, it goes to the next queue. It processes one URL every 100ms. So, it is a slower queue.
If the domain is ready, it is fetched. Otherwise, it goes to the next queue, which is a 200ms queue.
Then 400. 800. 1600.
At that point, we are fetching one URL every 1.6s from that queue. It should have a high hit rate. This will also be a long queue, but it should be the case that shorter/smaller domains end up with fewer pages on this queue?
This might have odd/unintended consequences. A domain with lots of URLs will end up clogging that queue, and if anyone else makes that queue, they'll take forever to get that page processed. (But, they should end up with very few pages on the queue.)
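As a sketch of the demotion rule (the intervals are the ones described above; everything else is illustrative):

```python
# Queue poll intervals: the incoming queue runs as fast as possible, then each
# subsequent queue is slower (100ms, 200ms, 400ms, 800ms, 1600ms).
BACKOFF_QUEUES_MS = [0, 100, 200, 400, 800, 1600]

def demote(level: int) -> int:
    """A job whose host wasn't ready moves one queue further down (never past the last)."""
    return min(level + 1, len(BACKOFF_QUEUES_MS) - 1)

# Example: a job that has missed twice now sits on the 200ms queue.
level = demote(demote(0))
print(level, BACKOFF_QUEUES_MS[level])   # -> 2 200
```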
A per-domain queueing system would get more pages through, but would still want a backoff. Why? Because we can only hit a given domain so fast, and there's no reason to thrash the queue. Per-domain queues could all check once every 1.5-2s, by definition. But, if we have thousands of domains, it implies thousands of queues. (But... I might only need 2-3 workers/queue, because the parallelism comes from the number of queues, not the number of workers.)