gateway tests

experiments

experiment 5

I was running a series of tests that I now doubt, and I've cut them. First, I think I had some SQL doing the wrong thing in terms of per-domain PPM (pages per minute). I need to write some Python to do the right calculations. (And, I want some over-time graphs/histograms anyway.)

The tests did produce high throughput, which I believe was accurate. That is, with 400 fetch workers, I was able to cycle through a lot of content. However, that was 200 domains with hundreds of workers in every service, and, more importantly, it was a short test.

Short tests have lots of URLs to fetch. Long tests get into the tail of a site. Or: if you have 200 sites, and many of them are small, you'll finish them (and your overall throughput will look great), but the one remaining large site will still fetch at one page every two seconds. The rate for any given domain is always capped at one page every two seconds.

| measure | v5 | v4 | v3 | v2 | v1 |
| --- | --- | --- | --- | --- | --- |
| guestbook pages | 49645 | 20163 | 23873 | 18724 | 25650 |
| duration (minutes) | 47 | 20 | 28 | 21 | 14 |
| overall PPM | 1056 | 1008 | 852 | 891 | 1832 |
| entree total jobs | 252733 | 62581 | 68376 | 41196 | 69174 |
| entree JPM | 5521 | 3129 | 2442 | 1961 | 4941 |
| fetch total jobs | 1009510 | 242156 | 289043 | 66215 | 135531 |
| fetch JPM | 22088 | 12107 | 10322 | 3153 | 9680 |
| extract total jobs | 26297 | 6315 | 8797 | 4534 | 9797 |
| extract JPM | 573 | 315 | 314 | 215 | 700 |
| walk total jobs | 26130 | 6229 | 8690 | 4482 | 9689 |
| walk JPM | 580 | 311 | 310 | 213 | 692 |

experiment 6

I tried one entree worker, one fetch, one extract, and one walk. At 2 seconds/page, I should see 30 PPM as my optimal throughput. I saw closer to 12 PPM. Why? Some things are PDFs, which are slow to extract. The same pages come up over and over, and those still take time to walk and process. So, at 1 worker/service, it is slower than optimal, for expected/imaginable reasons.

This also confirmed that the services were "behaving," especially entree and fetch.

experiment 7

| service | workers |
| --- | --- |
| entree | 50 |
| extract | 20 |
| fetch | 100 |
| walk | 20 |

I ran the NIH URLs that we have.

This is a more accurate/honest test. It will take around 24 hours. A few things learned.

  1. I need to create some kind of error log (a sketch follows this list). This is different from logging to NR or similar. This is where, when pages fail for known reasons, we log them in a DB for later investigation. I now have a number of points where a page can fail to fetch or extract, and I want to log those.
  2. I limited PDFs to 5MB, because I was running out of RAM. With 20 workers, I could get 20 large PDFs at the same time. Of all of the services, extract is the most RAM sensitive.
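
Roughly what I have in mind for item 1 is something like this sketch; the table name, columns, and reason strings are placeholders, not the current schema, and the placeholder style (`?` vs `$1`) depends on whether this lands in SQLite or Postgres.

```go
// errlog: sketch of a "known failure" log, separate from NR/stdout logging.
// Everything here (table name, columns, reason strings) is illustrative.
package errlog

import (
	"database/sql"
	"time"
)

// CreateErrorLog is the DDL for the hypothetical table.
const CreateErrorLog = `
CREATE TABLE IF NOT EXISTS error_log (
    host    TEXT NOT NULL,
    path    TEXT NOT NULL,
    service TEXT NOT NULL, -- entree, fetch, extract, walk
    reason  TEXT NOT NULL, -- e.g. "timeout", "pdf-too-large", "bad-mime-type"
    detail  TEXT,
    logged  TIMESTAMP NOT NULL
)`

// LogError records a known failure so it can be investigated (or reported to
// a partner) later, instead of disappearing into the service logs.
func LogError(db *sql.DB, host, path, service, reason, detail string) error {
	_, err := db.Exec(
		`INSERT INTO error_log (host, path, service, reason, detail, logged)
         VALUES (?, ?, ?, ?, ?, ?)`,
		host, path, service, reason, detail, time.Now().UTC())
	return err
}
```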

I saw high throughput to start, and am now down to a single domain, which is closing out the last 3000 items in the queue at one item per 2s. That is roughly 3000 × 2s ≈ 6000s, or about 100 minutes.

things to do

  1. Create an error log for diagnostics/partner reporting.
  2. Pack at intervals (a sketch follows this list). At this point, I'd like the last domain to have packed every... 10000 pages, or every n minutes, or something. This way, we provide incremental results, since we'll be crawling lots of things all the time.
  3. I'm wiping completed jobs every three minutes. This is a bit aggressive, but completed jobs end up filling the queue database, and having millions of rows in that DB by the end of the day is unnecessary.
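
For item 2, the shape I have in mind is a per-host counter that triggers a pack every n pages or every few minutes, whichever comes first. A minimal sketch, where the thresholds and the `pack` trigger are stand-ins:

```go
// pack: sketch of incremental packing. pagesPerPack / maxPackAge are
// illustrative thresholds; pack() stands in for enqueueing a real pack job.
package pack

import (
	"sync"
	"time"
)

const (
	pagesPerPack = 10000
	maxPackAge   = 10 * time.Minute
)

type tracker struct {
	mu       sync.Mutex
	pages    map[string]int
	lastPack map[string]time.Time
	pack     func(host string) // enqueue a pack job for this host
}

func newTracker(pack func(host string)) *tracker {
	return &tracker{
		pages:    map[string]int{},
		lastPack: map[string]time.Time{},
		pack:     pack,
	}
}

// pageFetched is called once per successful fetch; it decides whether the
// host is due for an incremental pack.
func (t *tracker) pageFetched(host string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.pages[host]++
	if t.lastPack[host].IsZero() {
		t.lastPack[host] = time.Now()
	}
	if t.pages[host] >= pagesPerPack || time.Since(t.lastPack[host]) >= maxPackAge {
		t.pack(host)
		t.pages[host] = 0
		t.lastPack[host] = time.Now()
	}
}
```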

filesizes

These numbers were taken when there were 3K fetch actions left from the NIH run. So, this is ~99% of the NIH dataset.

for d in sqlite zstd ; do pushd $d ; stat -c "%n,%s" -- * | column -t -s, ; popd ; done
ncats.nih.gov.sqlite      75907072
nida.nih.gov.sqlite       296427520
search.gov.sqlite         6754304
www.cancer.gov.sqlite     751804416
www.cc.nih.gov.sqlite     261128192
www.cit.nih.gov.sqlite    2113536
www.fac.gov.sqlite        1511424
www.fic.nih.gov.sqlite    368394240
www.genome.gov.sqlite     371859456
www.nccih.nih.gov.sqlite  633974784
www.nei.nih.gov.sqlite    103837696
www.nhlbi.nih.gov.sqlite  390328320
www.niaaa.nih.gov.sqlite  161193984
www.niaid.nih.gov.sqlite  106496
www.niams.nih.gov.sqlite  196116480
www.nia.nih.gov.sqlite    114688
www.nibib.nih.gov.sqlite  92028928
www.nichd.nih.gov.sqlite  4045082624
www.nidcr.nih.gov.sqlite  68005888
www.niddk.nih.gov.sqlite  172163072
www.niehs.nih.gov.sqlite  3007397888
www.nigms.nih.gov.sqlite  259956736
www.nimhd.nih.gov.sqlite  112951296
www.nimh.nih.gov.sqlite   706301952
www.ninds.nih.gov.sqlite  106496
www.ninr.nih.gov.sqlite   19484672
www.nlm.nih.gov.sqlite    151867392

and

ncats.nih.gov.sqlite.zstd      7666732
nida.nih.gov.sqlite.zstd       34773743
search.gov.sqlite.zstd         1123866
www.cancer.gov.sqlite.zstd     107231310
www.cc.nih.gov.sqlite.zstd     16772282
www.cit.nih.gov.sqlite.zstd    163872
www.fac.gov.sqlite.zstd        267103
www.fic.nih.gov.sqlite.zstd    28904089
www.genome.gov.sqlite.zstd     40279650
www.nccih.nih.gov.sqlite.zstd  41647877
www.nei.nih.gov.sqlite.zstd    15527666
www.nhlbi.nih.gov.sqlite.zstd  43916005
www.niaaa.nih.gov.sqlite.zstd  23605389
www.niaid.nih.gov.sqlite.zstd  11797
www.niams.nih.gov.sqlite.zstd  22963428
www.nia.nih.gov.sqlite.zstd    8384
www.nibib.nih.gov.sqlite.zstd  12554245
www.nichd.nih.gov.sqlite.zstd  1111609743
www.nidcr.nih.gov.sqlite.zstd  8830463
www.niddk.nih.gov.sqlite.zstd  29633034
www.niehs.nih.gov.sqlite.zstd  225221029
www.nigms.nih.gov.sqlite.zstd  21056165
www.nimhd.nih.gov.sqlite.zstd  10288537
www.nimh.nih.gov.sqlite.zstd   86354587
www.ninds.nih.gov.sqlite.zstd  10625
www.ninr.nih.gov.sqlite.zstd   2137708
www.nlm.nih.gov.sqlite.zstd    30627062

are the baseline numbers. There are 12GB of SQLite files from the run, but there are 1.8GB of files when they are compressed with zstd. This matters, because we can create read-only databases with zstd.

In deployment, this is the difference between being able to fit all of NIH on a single droplet, or having to use two. Here, the per-file compression ranges from roughly 3.5x to 15x with zstd, and the whole set compresses about 6.5x. In this case, we're looking at being able to fit at least 3x the content on a single droplet.

Ultimately, we're going to need a balance (or distribute) component that can take a query, ask multiple servers for results, and return the agglomeration. The challenge is that the distributor will want to know that NIH is made up of ~20 domains, and that those live on one or more droplets. And when we pass an NIH query to a droplet, the droplet needs to know that there might be 20 databases to query in order to return a result.

Without compression, we're around 2-10ms per query. With compression, we'll be in the 20-50ms/query range. If we have to query 20 databases... well, we can use goroutines or similar, but in a straight line we'd be looking at up to ~1000ms to resolve the query. So, distributing queries over large agency domains could mean on the order of 1-2s to resolve a query. There may be things we can do in order to provide incremental results, or speed things up (e.g. parallelism). That's both... ok but also not ideal.
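
To make the parallelism concrete, here is a sketch of the fan-out in Go. The `pages` FTS5 table, the `path` column, and the MATCH query are assumptions about the read-side schema, and the driver is whichever SQLite driver we end up using.

```go
// distribute: sketch of fanning one query out over N per-domain SQLite files
// and merging the results. The schema is assumed for illustration.
package distribute

import (
	"database/sql"
	"log"
	"sync"

	_ "github.com/mattn/go-sqlite3" // or any other SQLite driver
)

type Hit struct {
	DB   string
	Path string
}

func queryOne(dbfile, term string) ([]Hit, error) {
	db, err := sql.Open("sqlite3", "file:"+dbfile+"?mode=ro")
	if err != nil {
		return nil, err
	}
	defer db.Close()

	rows, err := db.Query(
		`SELECT path FROM pages WHERE pages MATCH ? ORDER BY rank LIMIT 10`, term)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var hits []Hit
	for rows.Next() {
		h := Hit{DB: dbfile}
		if err := rows.Scan(&h.Path); err != nil {
			return nil, err
		}
		hits = append(hits, h)
	}
	return hits, rows.Err()
}

// QueryAll runs queryOne against every file concurrently, so ~20 databases at
// 20-50ms each cost roughly one round-trip instead of ~1000ms in a straight line.
func QueryAll(dbfiles []string, term string) []Hit {
	var (
		mu  sync.Mutex
		wg  sync.WaitGroup
		all []Hit
	)
	for _, f := range dbfiles {
		wg.Add(1)
		go func(f string) {
			defer wg.Done()
			hits, err := queryOne(f, term)
			if err != nil {
				log.Printf("query failed for %s: %v", f, err)
				return
			}
			mu.Lock()
			all = append(all, hits...)
			mu.Unlock()
		}(f)
	}
	wg.Wait()
	return all
}
```

One wrinkle: rank values from separate FTS5 databases probably aren't directly comparable, so the merge step would need its own ordering.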

However, it is a small/light system, and it scales. As can be seen, it looks like all of NIH fits on a single server.

There's more I want to explore here about performance. This is just the first glance, and some initial thoughts.

all the content in one database

The NIH content is 12GB uncompressed, and 1.8GB compressed. That suggests (?) that I should be able to do the following:

  1. Create a new (agglomerated) nih.gov database
  2. For each subdomain, insert the contents into the agglomerated DB
  3. Compress that, and get a single file of ~1.8GB.

This would make querying easier. I only have to do one query against a single file, and the question of ranking is solved. (That is, the FTS5 ranking algorithms will work across all of the content regardless of domain.) I do need to think about expanding our current table to include domains as well as paths, because the current assumption is that a single file contains a single domain. And, it implies we may want to provide a ranking: should the root of nih.gov rank above a subdomain? I don't know.
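
A sketch of steps 1-3, using ATTACH to copy each subdomain's rows into the agglomerated file. The `pages` table with host/path/text columns is a placeholder for the real pack schema.

```go
// agglomerate: sketch of building one nih.gov database from per-subdomain files.
package main

import (
	"database/sql"
	"log"
	"strings"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	// A couple of the per-domain files, for illustration.
	subdomains := []string{"www.nlm.nih.gov.sqlite", "www.cancer.gov.sqlite"}

	agg, err := sql.Open("sqlite3", "nih.gov.sqlite")
	if err != nil {
		log.Fatal(err)
	}
	defer agg.Close()

	// Hypothetical schema: the agglomerated table gains a `host` column so the
	// root-vs-subdomain ranking question can be handled later.
	if _, err := agg.Exec(
		`CREATE TABLE IF NOT EXISTS pages (host TEXT, path TEXT, text TEXT)`); err != nil {
		log.Fatal(err)
	}

	for _, f := range subdomains {
		host := strings.TrimSuffix(f, ".sqlite")
		// Attach the per-domain file and copy its rows across in one statement.
		if _, err := agg.Exec(`ATTACH DATABASE ? AS src`, f); err != nil {
			log.Fatal(err)
		}
		if _, err := agg.Exec(
			`INSERT INTO pages (host, path, text) SELECT ?, path, text FROM src.pages`,
			host); err != nil {
			log.Fatal(err)
		}
		if _, err := agg.Exec(`DETACH DATABASE src`); err != nil {
			log.Fatal(err)
		}
		log.Println("merged", f)
	}
}
```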

This is an experiment "to do."

performance

I want to write some Python in a notebook to analyze the guestbook database. I'll first see if I can export that DB to SQLite for analysis. Or, CSV. sling will be useful here.

To start:

sling run --src-conn WORKDB --src-stream 'public.hosts' --tgt-object 'file://hosts.json'

and

sling run --src-conn WORKDB --src-stream 'public.guestbook' --tgt-object 'file://guestbook.json'

created data to work with. (The connections had to be set up first.)

| host | count |
| --- | --- |
| public.csr.nih.go | 1 |
| www.cit.nih.gov | 42 |
| www.nia.nih.gov | 162 |
| www.ninds.nih.gov | 184 |
| www.niaid.nih.gov | 229 |
| www.ninr.nih.gov | 450 |
| ncats.nih.gov | 909 |
| www.nidcr.nih.gov | 1092 |
| www.nibib.nih.gov | 1219 |
| www.nigms.nih.gov | 1555 |
| www.cc.nih.gov | 2040 |
| www.niaaa.nih.gov | 2176 |
| www.nccih.nih.gov | 2295 |
| www.niams.nih.gov | 2296 |
| www.fic.nih.gov | 2518 |
| www.nei.nih.gov | 2633 |
| nida.nih.gov | 2974 |
| www.nimhd.nih.gov | 3833 |
| www.nimh.nih.gov | 5410 |
| www.niddk.nih.gov | 6374 |
| www.nhlbi.nih.gov | 7197 |
| www.nichd.nih.gov | 7238 |
| www.niehs.nih.gov | 7497 |
| www.genome.gov | 7675 |
| www.cancer.gov | 10375 |
| www.nlm.nih.gov | 36171 |

Now, I asked and answered these questions...

Q: How long did all of these fetches take?

A: One answer is from the first "last_fetched" until the last "last_fetched"

Q: Is that an accurate measure?

A: Yes and no. No, because it really just measures the largest domain and how long it took. A large domain can only fetch one page per 2s, so the total duration is essentially that of nlm.nih.gov.

Q: What is a more accurate measure?

A: What is the first and last fetch for each domain?

Q: Is that more accurate?

A: ... yes? It measures how long things take to get through the queue as well. So, it is authentic.
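
The analysis itself will live in a Python notebook, but for concreteness, the calculation is just this (a sketch in Go; the field names are assumptions about the guestbook export):

```go
// analysis: per-domain PPM, where the duration is the span from a host's
// first fetch to its last fetch.
package analysis

import "time"

type FetchRecord struct {
	Host        string
	LastFetched time.Time
}

// PPMByHost returns pages-per-minute for each host: pages fetched divided by
// the minutes between that host's first and last fetch.
func PPMByHost(records []FetchRecord) map[string]float64 {
	type span struct {
		first, last time.Time
		pages       int
	}
	spans := map[string]*span{}
	for _, r := range records {
		s, ok := spans[r.Host]
		if !ok {
			spans[r.Host] = &span{first: r.LastFetched, last: r.LastFetched, pages: 1}
			continue
		}
		if r.LastFetched.Before(s.first) {
			s.first = r.LastFetched
		}
		if r.LastFetched.After(s.last) {
			s.last = r.LastFetched
		}
		s.pages++
	}
	ppm := map[string]float64{}
	for host, s := range spans {
		if minutes := s.last.Sub(s.first).Minutes(); minutes > 0 {
			ppm[host] = float64(s.pages) / minutes
		}
	}
	return ppm
}
```

As a sanity check against the table below: ncats.nih.gov is 909 pages over 25761 seconds, or 909 / (25761/60) ≈ 2.12 PPM.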

| host | seconds | pages | ppm |
| --- | --- | --- | --- |
| ncats.nih.gov | 25761 | 909 | 2.11715 |
| nida.nih.gov | 39405 | 2974 | 4.52836 |
| www.cancer.gov | 46067 | 10375 | 13.5129 |
| www.cc.nih.gov | 44972 | 2040 | 2.72169 |
| www.cit.nih.gov | 1830 | 42 | 1.37705 |
| www.fic.nih.gov | 26478 | 2518 | 5.70587 |
| www.genome.gov | 58988 | 7675 | 7.80667 |
| www.nccih.nih.gov | 36567 | 2295 | 3.76569 |
| www.nei.nih.gov | 36993 | 2633 | 4.27054 |
| www.nhlbi.nih.gov | 66826 | 7197 | 6.46186 |
| www.nia.nih.gov | 74162 | 162 | 0.131064 |
| www.niaaa.nih.gov | 37522 | 2176 | 3.47956 |
| www.niaid.nih.gov | 40060 | 229 | 0.342986 |
| www.niams.nih.gov | 46591 | 2296 | 2.95679 |
| www.nibib.nih.gov | 19958 | 1219 | 3.6647 |
| www.nichd.nih.gov | 48998 | 7238 | 8.86322 |
| www.nidcr.nih.gov | 66145 | 1092 | 0.990551 |
| www.niddk.nih.gov | 51985 | 6374 | 7.35674 |
| www.niehs.nih.gov | 34842 | 7497 | 12.9103 |
| www.nigms.nih.gov | 24848 | 1555 | 3.75483 |
| www.nimh.nih.gov | 74370 | 5410 | 4.36466 |
| www.nimhd.nih.gov | 50098 | 3833 | 4.5906 |
| www.ninds.nih.gov | 78265 | 184 | 0.141059 |
| www.ninr.nih.gov | 28805 | 450 | 0.937337 |
| www.nlm.nih.gov | 83849 | 36171 | 25.883 |

Now, this is a measure of "pages per minute per host."

Some of these are nearing optimal. For example, nlm.nih.gov is 25.8 ppm. The theoretical max is 30ppm, because I fetch once every two seconds.

However, for others, it is around 3ppm.

This seems abysmal.

But.

All of these are in one queue.

The logic is as follows (a sketch of the politeness check in steps 8-10 follows the list):

  1. walk finds a URL.
  2. It passes the URL to entree.
  3. entree checks the guestbook.
  4. If it has been recently updated, we skip. (There's a bit more here, but this is good enough for getting on with.)
  5. If it hasn't, we continue.
  6. entree enqueues the URL for fetch
  7. fetch grabs a URL from the queue
  8. fetch asks if the host has been hit less than 2s ago (against a local/in-memory dictionary)
  9. If the host has been hit, that worker sleeps until it is time to fetch the content
  10. If not, we continue immediately
  11. We grab the content (if it is an allowed MIME type, etc.)
  12. fetch updates the Guestbook and enqueues walk and extract jobs for this page
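
Steps 8-10, roughly, as a sketch (names are illustrative; the real fetch service presumably differs in detail):

```go
// fetch: the per-host politeness check. One process-local map of
// host -> last fetch time, consulted by every worker before touching a domain.
package fetch

import (
	"sync"
	"time"
)

const politeDelay = 2 * time.Second

var (
	mu        sync.Mutex
	lastFetch = map[string]time.Time{}
)

// waitPolitely blocks the calling worker until at least politeDelay has passed
// since the last fetch against this host, then claims the slot (steps 9/10).
func waitPolitely(host string) {
	for {
		mu.Lock()
		wait := politeDelay - time.Since(lastFetch[host])
		if wait <= 0 {
			lastFetch[host] = time.Now()
			mu.Unlock()
			return
		}
		mu.Unlock()
		time.Sleep(wait) // step 9: the worker sleeps until the host is ready
	}
}
```

This is also where the slowdown comes from: a worker that pulls a job it can't run yet sleeps while holding that job.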

The likely slowdown here is that we are inserting a lot of jobs into the fetch queue for all of those domains. And, because workers are sleeping, that means that the 100 workers I was using might all be sleeping on a domain.

So. This is why some smaller domains, in the face of a larger domain (or a more link-rich domain), will be "starved" in this model.

Before I think about "fixes," I want to see some graphs. This is working, and doing "the right thing," but the model could, I think, have faster throughput.

[figure: fetches_by_domain]

This is one way of visualizing what happened. Each dot is a fetch.

The y axis is the host, and the x axis is the time that the fetch happened. The left-most is the "first" fetch, and the right-most is the "last" fetch. The timing is less important than the relative nature. Looking at ncats.nih.gov, we can see how the fetches get spread out over time. This is because the queue is being thrashed by other domains, and the pages aren't coming around. So, the PPM value for ncats is 2, but that is because nlm is thrashing through the fetch queue constantly. (As are some other domains.) So while nlm has a PPM of 25, other domains are actually being starved.

The ideal fix is to have one queue per domain (and a few workers per queue). Then, we can come close to, or hit, the theoretical 30 PPM limit per domain.

I'm attempting a pool-of-pools approach, where there are 10 fetch queues, and they are round-robin'd. This should see a speedup, because each pool of 10 workers is assigned to a smaller set of URLs to fetch (and we bounce jobs around the queues). This might mean that the overall per-domain throughput is higher.

The current approach---one queue with all the domains on it---could go faster if I 1) have a large number of workers, and 2) I thrash the queue. This means that if we see 100 nlm jobs that have arrived too soon, they get thrown to the back of the queue, and we more quickly process other domains. It feels like a bad solution, but we'll see. Experiment #7 will demonstrate the round-robin queues.

experiment 7

I could have made some graphs first. Instead, I modified fetch.

Now, it does this:

  1. It spawns 10 fetch workers
  2. It spawns 10 fetch-0 workers
  3. It spawns 10 fetch-1 workers...
  4. ...
  5. It spawns 10 fetch-9 workers

(in other words, it does workers × workers: 10 workers on each of the fetch-0 through fetch-9 queues, in addition to the 10 on the original fetch queue)

Now, whenever a domain isn't ready (it was hit too recently), I throw the job at another queue. I'm not sleeping (I should be), so this will maximally thrash the queue(s). But, I want to see what happens.
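
The worker logic is now roughly this sketch (the queue ring and the enqueue function are stand-ins for the real job system):

```go
// fetch: pool-of-pools sketch. A worker that pulls a job for a host hit too
// recently bounces the job to the next queue in the ring instead of sleeping.
package fetch

import (
	"log"
	"sync"
	"time"
)

const politeDelay = 2 * time.Second

var queueRing = []string{
	"fetch", "fetch-0", "fetch-1", "fetch-2", "fetch-3", "fetch-4",
	"fetch-5", "fetch-6", "fetch-7", "fetch-8", "fetch-9",
}

var (
	mu        sync.Mutex
	lastFetch = map[string]time.Time{}
)

// hostReady claims the politeness slot if the host hasn't been hit within
// politeDelay; otherwise it reports "not yet" without blocking.
func hostReady(host string) bool {
	mu.Lock()
	defer mu.Unlock()
	if time.Since(lastFetch[host]) < politeDelay {
		return false
	}
	lastFetch[host] = time.Now()
	return true
}

// handle is what a worker on queue index i does with a job for host/url.
// enqueue stands in for inserting a job onto a named queue.
func handle(i int, host, url string, enqueue func(queue, host, url string)) {
	if !hostReady(host) {
		// Bounce to the next queue in the ring; some other worker will retry,
		// and nobody sleeps while holding a job.
		enqueue(queueRing[(i+1)%len(queueRing)], host, url)
		return
	}
	log.Println("fetching", url) // the real fetch + guestbook update goes here
}
```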

The watcher shows this:

  queue  |   state   | count
---------+-----------+------
 entree  | completed |  1519
 entree  | running   |     6
 extract | running   |     1
 extract | available |     1
 extract | completed |   166
 fetch   | available | 14001
 fetch   | running   |    10
 fetch   | completed |   313
 fetch-0 | available |   298
 fetch-0 | running   |     2
 fetch-0 | completed |   229
 fetch-1 | available |   554
 fetch-1 | completed |   219
 fetch-2 | available |   527
 fetch-2 | completed |   237
 fetch-2 | running   |     8
 fetch-3 | available |   265
 fetch-3 | completed |   233
 fetch-3 | running   |    19
 fetch-4 | completed |   218
 fetch-4 | running   |    10
 fetch-4 | available |   384
 fetch-5 | completed |   215
 fetch-5 | available |   338
 fetch-5 | running   |     8
 fetch-6 | available |   425
 fetch-6 | running   |     3
 fetch-6 | completed |   226
 fetch-7 | completed |   161
 fetch-7 | available |    37
 fetch-7 | running   |     7
 fetch-8 | running   |     9
 fetch-8 | completed |   239
 fetch-8 | available |  1289
 fetch-9 | completed |   255
 fetch-9 | running   |     2
 fetch-9 | available |   332
 pack    | available |     1
 pack    | completed |   211
 walk    | retryable |     1
 walk    | running   |     1
 walk    | completed |   166

Now, I have a lot of fetchers. And, because they're all playing off the same "when did we last see this domain" structure, they're all being polite. But, each different queue will be picked up by a different set of workers, meaning that I should see better throughput than a single set of workers, sleeping, on a single set of URLs.

In truth, I wonder if this would be simpler if I just had more workers, and I was more willing to thrash the queue. That is, 1000 workers who are more willing to thrash the queue would probably be faster than 10 pools of 10.

But, I'll allow this experiment to play out while I'm building some graphs of experiment 6.

... hours later...

Late in the run, this is what the queues looked like, focusing only on available:

  queue  |   state   | count
---------+-----------+-------
 fetch   | available |    11
 fetch-0 | available |    78
 fetch-1 | available |   149
 fetch-2 | available |   491
 fetch-4 | available |    99
 fetch-5 | available |    62
 fetch-6 | available |    94
 fetch-7 | available |   264
 fetch-8 | available | 10755
 fetch-9 | available |    20

And, there are two domains cycling: nlm.nih.gov and cancer.gov. They tick by one per second, alternating.

So, the rest ran. Which is good... if there were more domains queued, they, too, would run. However, this looks weird: there are 10 workers in each of those queue pools, and they're just shuffling jobs between each other.

I'm going to pull the guestbook at this point and run some analysis, even though there are 10K pages left to fetch, I suspect.

experiment 7 analysis

Here is the Experiment 6/Run-01 plot:

[figure: fetches_by_domain]

And Experiment 7/Run-02:

[figure: fetches_by_domain_run_02]

Smaller domains finished faster using the round-robin approach.

And experiment 7/Run-03:

This is a per-domain queueing approach, where each domain gets its own queue, and 5 workers. (The incoming queue gets 10 workers.)

[figure: fetches-by-domain-by-run-03]

Note the x-axis; the largest domain is nowhere near done yet.

The per-domain and round-robin approaches have similar density for low-page-count domains. However, round-robin involves a lot of queue thrashing. In the per-domain model, I have one client for the incoming fetch queue, and one client for all of the domain queues. The client managing the domain queues has a 500ms backoff, which means it is hitting the queues very infrequently compared to the round-robin approach. And, each domain has its own worker pool. The result is that each domain maintains a good PPM. Per-domain queues are better in all but one case.
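
For reference, the per-domain shape is roughly this sketch (channels stand in for the DB-backed queues; 5 workers per domain, as in run-03):

```go
// fetchpd: per-domain queue sketch. An incoming router creates one queue and a
// small worker pool per host, the first time that host is seen.
package fetchpd

import "sync"

type job struct{ host, url string }

type router struct {
	mu     sync.Mutex
	queues map[string]chan job
	fetch  func(job) // the polite, one-page-per-2s fetch for a single host
}

func newRouter(fetch func(job)) *router {
	return &router{queues: map[string]chan job{}, fetch: fetch}
}

// enqueue routes a job to its host's queue, creating the queue and a pool of
// 5 workers (as in run-03) the first time we see the host.
func (r *router) enqueue(j job) {
	r.mu.Lock()
	q, ok := r.queues[j.host]
	if !ok {
		q = make(chan job, 1024)
		r.queues[j.host] = q
		for i := 0; i < 5; i++ {
			go func() {
				for j := range q {
					r.fetch(j)
				}
			}()
		}
	}
	r.mu.Unlock()
	q <- j
}
```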

ppm-1 is the number of pages-per-minute with a naive queuing approach.

ppm-2 is with round-robin queuing, where anything delayed is shuffled over to the next queue.

ppm-3 is where every domain has its own queue for slow (1-page-per-2s) fetching.

2 vs 1 is how many times faster the round-robin is than the naive approach.

3 vs 1 is how many times faster the per-domain queuing approach is.

While the two are similar... round-robin is more general, but thrashes the queue hard. Per-domain is conceptually clearer, thrashes the queue less, and is more performant. I'm inclined to go that direction for production.

| host | ppm-1 | ppm-2 | ppm-3 | 2 vs 1 | 3 vs 1 | 3 better? |
| --- | --- | --- | --- | --- | --- | --- |
| ncats.nih.gov | 2.12 | 4.36 | 5.65 | 2.06 | 2.67 | yes |
| nida.nih.gov | 4.53 | 6.35 | 10.76 | 1.40 | 2.38 | yes |
| www.cancer.gov | 13.51 | 26.38 | 27.89 | 1.95 | 2.06 | yes |
| www.cc.nih.gov | 2.72 | 2.76 | 9.31 | 1.01 | 3.42 | yes |
| www.cit.nih.gov | 1.38 | 1.38 | 1.68 | 1.00 | 1.22 | yes |
| www.fic.nih.gov | 5.71 | 12.33 | 17.89 | 2.16 | 3.14 | yes |
| www.genome.gov | 7.81 | 23.68 | 26.00 | 3.03 | 3.33 | yes |
| www.nccih.nih.gov | 3.77 | 17.24 | 9.03 | 4.58 | 2.40 | no |
| www.nei.nih.gov | 4.27 | 19.45 | 20.14 | 4.56 | 4.72 | yes |
| www.nhlbi.nih.gov | 6.46 | 22.24 | 26.59 | 3.44 | 4.11 | yes |
| www.nia.nih.gov | 0.13 | 0.29 | 0.47 | 2.18 | 3.59 | yes |
| www.niaaa.nih.gov | 3.48 | 3.19 | 16.53 | 0.92 | 4.75 | yes |
| www.niaid.nih.gov | 0.34 | 0.45 | 0.92 | 1.32 | 2.69 | yes |
| www.niams.nih.gov | 2.96 | 5.88 | 13.46 | 1.99 | 4.55 | yes |
| www.nibib.nih.gov | 3.66 | 11.56 | 17.52 | 3.16 | 4.78 | yes |
| www.nichd.nih.gov | 8.86 | 26.74 | 27.28 | 3.02 | 3.08 | yes |
| www.nidcr.nih.gov | 0.99 | 2.50 | 8.22 | 2.53 | 8.30 | yes |
| www.niddk.nih.gov | 7.36 | 7.86 | 19.01 | 1.07 | 2.58 | yes |
| www.niehs.nih.gov | 12.91 | 26.68 | 27.79 | 2.07 | 2.15 | yes |
| www.nigms.nih.gov | 3.75 | 13.63 | 17.72 | 3.63 | 4.72 | yes |
| www.nimh.nih.gov | 4.36 | 5.20 | 13.86 | 1.19 | 3.18 | yes |
| www.nimhd.nih.gov | 4.59 | 15.01 | 18.96 | 3.27 | 4.13 | yes |
| www.ninds.nih.gov | 0.14 | 0.24 | 0.42 | 1.71 | 2.97 | yes |
| www.ninr.nih.gov | 0.94 | 0.46 | 1.82 | 0.49 | 1.94 | yes |
| www.nlm.nih.gov | 25.88 | 31.67 | 53.08 | 1.22 | 2.05 | yes |

another approach

This is an idea in case the other approaches don't turn out to be efficient...

Another approach is to have "backoff queues."

The incoming queue processes as fast as possible. Domains that are ready are fetched.

If a domain isn't ready, the job goes to the next queue, which processes one URL every 100ms. So, it is a slower queue.

If the domain is ready, it is fetched. Otherwise, it goes to the next queue, which is a 200ms queue.

Then 400. 800. 1600.

At that point, we are fetching one URL every 1.6s from that queue. It should have a high hit rate. This will also be a long queue, but shorter/smaller domains should, I think, end up with fewer pages on it.
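
Sketched, the ladder is just an ordered list of queues with increasing poll intervals, plus a "demote" step for jobs whose host isn't ready yet (names and intervals are illustrative):

```go
// fetchbackoff: the backoff-queue ladder idea, sketched.
package fetchbackoff

import "time"

type rung struct {
	queue    string
	interval time.Duration // how often workers pull from this queue
}

// The incoming queue runs flat out; each later rung is slower, so by the last
// rung we only try a job's host every 1.6s and most attempts should succeed.
var ladder = []rung{
	{"fetch", 0},
	{"fetch-100ms", 100 * time.Millisecond},
	{"fetch-200ms", 200 * time.Millisecond},
	{"fetch-400ms", 400 * time.Millisecond},
	{"fetch-800ms", 800 * time.Millisecond},
	{"fetch-1600ms", 1600 * time.Millisecond},
}

// demote returns the rung a not-ready job should move to: the next, slower
// queue, or the last rung if the job is already there.
func demote(current int) int {
	if current+1 < len(ladder) {
		return current + 1
	}
	return current
}
```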

This might have odd/unintended consequences. A domain with lots of URLs will end up clogging that queue, and if another domain's pages land on it, they'll take forever to get processed. (But, those domains should end up with very few pages on the queue.)

A per-domain queueing system would get more pages through, but would still want a backoff. Why? Because we can only hit a given domain so fast, and there's no reason to thrash the queue. Per-domain queues could all check once every 1.5-2s, by definition. But, if we have thousands of domains, it implies thousands of queues. (But... I might only need 2-3 workers/queue, because the parallelism comes from the number of queues, not the number of workers.)