gateway tests

experiments

experiment 5

I was running a series of tests that I now doubt, and I've cut them. First, I think I had some SQL doing the wrong thing in terms of per-domain PPM (pages per minute). I need to write some Python to do the right calculations. (And, I want some over-time graphs/histograms anyway.)

The tests did produce high throughput, which I believe was accurate. That is, with 400 fetch workers, I was able to cycle through a lot of content. However, that was 200 domains with hundreds of workers in every service, and, more importantly, it was a short test.

Short tests have lots of URLs to fetch. Long tests get into the tail of a site. Or: if you have 200 sites, and many of them are small, you'll finish them (and your overall throughput will look great), but the one remaining large site will still fetch at one page every two seconds. The rate for any given domain is always capped at one page every two seconds.

| measure | v5 | v4 | v3 | v2 | v1 |
| --- | --- | --- | --- | --- | --- |
| guestbook pages | 49645 | 20163 | 23873 | 18724 | 25650 |
| duration (minutes) | 47 | 20 | 28 | 21 | 14 |
| overall PPM | 1056 | 1008 | 852 | 891 | 1832 |
| entree total jobs | 252733 | 62581 | 68376 | 41196 | 69174 |
| entree JPM | 5521 | 3129 | 2442 | 1961 | 4941 |
| fetch total jobs | 1009510 | 242156 | 289043 | 66215 | 135531 |
| fetch JPM | 22088 | 12107 | 10322 | 3153 | 9680 |
| extract total jobs | 26297 | 6315 | 8797 | 4534 | 9797 |
| extract JPM | 573 | 315 | 314 | 215 | 700 |
| walk total jobs | 26130 | 6229 | 8690 | 4482 | 9689 |
| walk JPM | 580 | 311 | 310 | 213 | 692 |

experiment 6

I tried one entree worker, one fetch, one extract, and one walk. At 2 seconds/page, I should see 30 PPM as my optimal throughput. I saw closer to 12 PPM. Why? Some things are PDFs, which are slow to extract. The same pages come up over and over, and those still take time to walk and process. So, at 1 worker/service, it is slower than optimal, for expected/imaginable reasons.

This also confirmed that the services were "behaving," especially entree and fetch.

experiment 7

| service | workers |
| --- | --- |
| entree | 50 |
| extract | 20 |
| fetch | 100 |
| walk | 20 |

I ran the NIH URLs that we have.

This is a more accurate/honest test. It will take around 24 hours. A few things learned.

  1. I need to create some kind of error log (a sketch follows this list). This is different from logging to NR or similar. This is where, when pages fail for known reasons, we log them in a DB for later investigation. I now have a number of points where a page can fail to fetch or extract, and I want to log those.
  2. I limited PDFs to 5MB, because I was running out of RAM. With 20 workers, I could get 20 large PDFs at the same time. Of all of the services, extract is the most RAM sensitive.
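
Roughly what I have in mind for item 1 is something like this sketch; the table name, columns, and reason strings are placeholders, not the current schema, and the placeholder style (`?` vs `$1`) depends on whether this lands in SQLite or Postgres.

```go
// errlog: sketch of a "known failure" log, separate from NR/stdout logging.
// Everything here (table name, columns, reason strings) is illustrative.
package errlog

import (
	"database/sql"
	"time"
)

// CreateErrorLog is the DDL for the hypothetical table.
const CreateErrorLog = `
CREATE TABLE IF NOT EXISTS error_log (
    host    TEXT NOT NULL,
    path    TEXT NOT NULL,
    service TEXT NOT NULL, -- entree, fetch, extract, walk
    reason  TEXT NOT NULL, -- e.g. "timeout", "pdf-too-large", "bad-mime-type"
    detail  TEXT,
    logged  TIMESTAMP NOT NULL
)`

// LogError records a known failure so it can be investigated (or reported to
// a partner) later, instead of disappearing into the service logs.
func LogError(db *sql.DB, host, path, service, reason, detail string) error {
	_, err := db.Exec(
		`INSERT INTO error_log (host, path, service, reason, detail, logged)
         VALUES (?, ?, ?, ?, ?, ?)`,
		host, path, service, reason, detail, time.Now().UTC())
	return err
}
```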

I saw high throughput to start, and am now down to a single domain, which is closing out the last 3000 items in the queue at one item per 2s. That is roughly 3000 × 2s ≈ 6000s, or about 100 minutes.

things to do

  1. Create an error log for diagnostics/partner reporting.
  2. Pack at intervals (a sketch follows this list). At this point, I'd like the last domain to have packed every... 10000 pages, or every n minutes, or something. This way, we provide incremental results, since we'll be crawling lots of things all the time.
  3. I'm wiping completed jobs every three minutes. This is a bit aggressive, but completed jobs end up filling the queue database, and having millions of rows in that DB by the end of the day is unnecessary.
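
For item 2, the shape I have in mind is a per-host counter that triggers a pack every n pages or every few minutes, whichever comes first. A minimal sketch, where the thresholds and the `pack` trigger are stand-ins:

```go
// pack: sketch of incremental packing. pagesPerPack / maxPackAge are
// illustrative thresholds; pack() stands in for enqueueing a real pack job.
package pack

import (
	"sync"
	"time"
)

const (
	pagesPerPack = 10000
	maxPackAge   = 10 * time.Minute
)

type tracker struct {
	mu       sync.Mutex
	pages    map[string]int
	lastPack map[string]time.Time
	pack     func(host string) // enqueue a pack job for this host
}

func newTracker(pack func(host string)) *tracker {
	return &tracker{
		pages:    map[string]int{},
		lastPack: map[string]time.Time{},
		pack:     pack,
	}
}

// pageFetched is called once per successful fetch; it decides whether the
// host is due for an incremental pack.
func (t *tracker) pageFetched(host string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.pages[host]++
	if t.lastPack[host].IsZero() {
		t.lastPack[host] = time.Now()
	}
	if t.pages[host] >= pagesPerPack || time.Since(t.lastPack[host]) >= maxPackAge {
		t.pack(host)
		t.pages[host] = 0
		t.lastPack[host] = time.Now()
	}
}
```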

filesizes

These numbers were taken when there were 3K fetch actions left from the NIH run. So, this is ~99% of the NIH dataset.

for d in sqlite zstd ; do pushd $d ; stat -c "%n,%s" -- * | column -t -s, ; popd ; done
ncats.nih.gov.sqlite      75907072
nida.nih.gov.sqlite       296427520
search.gov.sqlite         6754304
www.cancer.gov.sqlite     751804416
www.cc.nih.gov.sqlite     261128192
www.cit.nih.gov.sqlite    2113536
www.fac.gov.sqlite        1511424
www.fic.nih.gov.sqlite    368394240
www.genome.gov.sqlite     371859456
www.nccih.nih.gov.sqlite  633974784
www.nei.nih.gov.sqlite    103837696
www.nhlbi.nih.gov.sqlite  390328320
www.niaaa.nih.gov.sqlite  161193984
www.niaid.nih.gov.sqlite  106496
www.niams.nih.gov.sqlite  196116480
www.nia.nih.gov.sqlite    114688
www.nibib.nih.gov.sqlite  92028928
www.nichd.nih.gov.sqlite  4045082624
www.nidcr.nih.gov.sqlite  68005888
www.niddk.nih.gov.sqlite  172163072
www.niehs.nih.gov.sqlite  3007397888
www.nigms.nih.gov.sqlite  259956736
www.nimhd.nih.gov.sqlite  112951296
www.nimh.nih.gov.sqlite   706301952
www.ninds.nih.gov.sqlite  106496
www.ninr.nih.gov.sqlite   19484672
www.nlm.nih.gov.sqlite    151867392

and

ncats.nih.gov.sqlite.zstd      7666732
nida.nih.gov.sqlite.zstd       34773743
search.gov.sqlite.zstd         1123866
www.cancer.gov.sqlite.zstd     107231310
www.cc.nih.gov.sqlite.zstd     16772282
www.cit.nih.gov.sqlite.zstd    163872
www.fac.gov.sqlite.zstd        267103
www.fic.nih.gov.sqlite.zstd    28904089
www.genome.gov.sqlite.zstd     40279650
www.nccih.nih.gov.sqlite.zstd  41647877
www.nei.nih.gov.sqlite.zstd    15527666
www.nhlbi.nih.gov.sqlite.zstd  43916005
www.niaaa.nih.gov.sqlite.zstd  23605389
www.niaid.nih.gov.sqlite.zstd  11797
www.niams.nih.gov.sqlite.zstd  22963428
www.nia.nih.gov.sqlite.zstd    8384
www.nibib.nih.gov.sqlite.zstd  12554245
www.nichd.nih.gov.sqlite.zstd  1111609743
www.nidcr.nih.gov.sqlite.zstd  8830463
www.niddk.nih.gov.sqlite.zstd  29633034
www.niehs.nih.gov.sqlite.zstd  225221029
www.nigms.nih.gov.sqlite.zstd  21056165
www.nimhd.nih.gov.sqlite.zstd  10288537
www.nimh.nih.gov.sqlite.zstd   86354587
www.ninds.nih.gov.sqlite.zstd  10625
www.ninr.nih.gov.sqlite.zstd   2137708
www.nlm.nih.gov.sqlite.zstd    30627062

are the baseline numbers. There are 12GB of SQLite files from the run, but there are 1.8GB of files when they are compressed with zstd. This matters, because we can create read-only databases with zstd.

In deployment, this is the difference between being able to fit all of NIH on a single droplet, or having to use two. Here, the per-file compression ranges from roughly 3.5x to 15x with zstd, and the whole set compresses about 6.5x. In this case, we're looking at being able to fit at least 3x the content on a single droplet.

Ultimately, we're going to need a balance (or distribute) component that can take a query, ask multiple servers for results, and return the agglomeration. The challenge is that the distributor will want to know that NIH is made up of ~20 domains, and that those live on one or more droplets. And when we pass an NIH query to a droplet, the droplet needs to know that there might be 20 databases to query in order to return a result.

Without compression, we're around 2-10ms per query. With compression, we'll be in the 20-50ms/query range. If we have to query 20 databases... well, we can use goroutines or similar, but in a straight line we'd be looking at up to ~1000ms to resolve the query. So, distributing queries over large agency domains could mean on the order of 1-2s to resolve a query. There may be things we can do in order to provide incremental results, or speed things up (e.g. parallelism). That's both... ok but also not ideal.
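
To make the parallelism concrete, here is a sketch of the fan-out in Go. The `pages` FTS5 table, the `path` column, and the MATCH query are assumptions about the read-side schema, and the driver is whichever SQLite driver we end up using.

```go
// distribute: sketch of fanning one query out over N per-domain SQLite files
// and merging the results. The schema is assumed for illustration.
package distribute

import (
	"database/sql"
	"log"
	"sync"

	_ "github.com/mattn/go-sqlite3" // or any other SQLite driver
)

type Hit struct {
	DB   string
	Path string
}

func queryOne(dbfile, term string) ([]Hit, error) {
	db, err := sql.Open("sqlite3", "file:"+dbfile+"?mode=ro")
	if err != nil {
		return nil, err
	}
	defer db.Close()

	rows, err := db.Query(
		`SELECT path FROM pages WHERE pages MATCH ? ORDER BY rank LIMIT 10`, term)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var hits []Hit
	for rows.Next() {
		h := Hit{DB: dbfile}
		if err := rows.Scan(&h.Path); err != nil {
			return nil, err
		}
		hits = append(hits, h)
	}
	return hits, rows.Err()
}

// QueryAll runs queryOne against every file concurrently, so ~20 databases at
// 20-50ms each cost roughly one round-trip instead of ~1000ms in a straight line.
func QueryAll(dbfiles []string, term string) []Hit {
	var (
		mu  sync.Mutex
		wg  sync.WaitGroup
		all []Hit
	)
	for _, f := range dbfiles {
		wg.Add(1)
		go func(f string) {
			defer wg.Done()
			hits, err := queryOne(f, term)
			if err != nil {
				log.Printf("query failed for %s: %v", f, err)
				return
			}
			mu.Lock()
			all = append(all, hits...)
			mu.Unlock()
		}(f)
	}
	wg.Wait()
	return all
}
```

One wrinkle: rank values from separate FTS5 databases probably aren't directly comparable, so the merge step would need its own ordering.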

However, it is a small/light system, and it scales. As can be seen, it looks like all of NIH fits on a single server.

There's more I want to explore here about performance. This is just the first glance, and some initial thoughts.

all the content in one database

The NIH content is 12GB uncompressed, and 1.8GB compressed. That suggests (?) that I should be able to do the following:

  1. Create a new (agglomerated) nih.gov database
  2. For each subdomain, insert the contents into the agglomerated DB
  3. Compress that, and get a single file of ~1.8GB.

This would make querying easier. I only have to do one query against a single file, and the question of ranking is solved. (That is, the FTS5 ranking algorithms will work across all of the content regardless of domain.) I do need to think about expanding our current table to include domains as well as paths, because the current assumption is that a single file contains a single domain. And, it implies we may want to provide a ranking: should the root of nih.gov rank above a subdomain? I don't know.
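
A sketch of steps 1-3, using ATTACH to copy each subdomain's rows into the agglomerated file. The `pages` table with host/path/text columns is a placeholder for the real pack schema.

```go
// agglomerate: sketch of building one nih.gov database from per-subdomain files.
package main

import (
	"database/sql"
	"log"
	"strings"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	// A couple of the per-domain files, for illustration.
	subdomains := []string{"www.nlm.nih.gov.sqlite", "www.cancer.gov.sqlite"}

	agg, err := sql.Open("sqlite3", "nih.gov.sqlite")
	if err != nil {
		log.Fatal(err)
	}
	defer agg.Close()

	// Hypothetical schema: the agglomerated table gains a `host` column so the
	// root-vs-subdomain ranking question can be handled later.
	if _, err := agg.Exec(
		`CREATE TABLE IF NOT EXISTS pages (host TEXT, path TEXT, text TEXT)`); err != nil {
		log.Fatal(err)
	}

	for _, f := range subdomains {
		host := strings.TrimSuffix(f, ".sqlite")
		// Attach the per-domain file and copy its rows across in one statement.
		if _, err := agg.Exec(`ATTACH DATABASE ? AS src`, f); err != nil {
			log.Fatal(err)
		}
		if _, err := agg.Exec(
			`INSERT INTO pages (host, path, text) SELECT ?, path, text FROM src.pages`,
			host); err != nil {
			log.Fatal(err)
		}
		if _, err := agg.Exec(`DETACH DATABASE src`); err != nil {
			log.Fatal(err)
		}
		log.Println("merged", f)
	}
}
```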

This is an experiment "to do."

performance

I want to write some Python in a notebook to analyze the guestbook database. I'll first see if I can export that DB to SQLite for analysis. Or, CSV. sling will be useful here.

To start:

sling run --src-conn WORKDB --src-stream 'public.hosts' --tgt-object 'file://hosts.json'

and

sling run --src-conn WORKDB --src-stream 'public.guestbook' --tgt-object 'file://guestbook.json'

created data to work with. (The connections had to be set up first.)

| host | count |
| --- | --- |
| public.csr.nih.go | 1 |
| www.cit.nih.gov | 42 |
| www.nia.nih.gov | 162 |
| www.ninds.nih.gov | 184 |
| www.niaid.nih.gov | 229 |
| www.ninr.nih.gov | 450 |
| ncats.nih.gov | 909 |
| www.nidcr.nih.gov | 1092 |
| www.nibib.nih.gov | 1219 |
| www.nigms.nih.gov | 1555 |
| www.cc.nih.gov | 2040 |
| www.niaaa.nih.gov | 2176 |
| www.nccih.nih.gov | 2295 |
| www.niams.nih.gov | 2296 |
| www.fic.nih.gov | 2518 |
| www.nei.nih.gov | 2633 |
| nida.nih.gov | 2974 |
| www.nimhd.nih.gov | 3833 |
| www.nimh.nih.gov | 5410 |
| www.niddk.nih.gov | 6374 |
| www.nhlbi.nih.gov | 7197 |
| www.nichd.nih.gov | 7238 |
| www.niehs.nih.gov | 7497 |
| www.genome.gov | 7675 |
| www.cancer.gov | 10375 |
| www.nlm.nih.gov | 36171 |

Now, I asked and answered these questions...

Q: How long did all of these fetches take?

A: One answer is from the first "last_fetched" until the last "last_fetched"

Q: Is that an accurate measure?

A: Yes and no. No, because it really just measures the largest domain and how long it took. A large domain can only fetch one page per 2s, so the total duration is essentially that of nlm.nih.gov.

Q: What is a more accurate measure?

A: What is the first and last fetch for each domain?

Q: Is that more accurate?

A: ... yes? It measures how long things take to get through the queue as well. So, it is authentic.
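
The analysis itself will live in a Python notebook, but for concreteness, the calculation is just this (a sketch in Go; the field names are assumptions about the guestbook export):

```go
// analysis: per-domain PPM, where the duration is the span from a host's
// first fetch to its last fetch.
package analysis

import "time"

type FetchRecord struct {
	Host        string
	LastFetched time.Time
}

// PPMByHost returns pages-per-minute for each host: pages fetched divided by
// the minutes between that host's first and last fetch.
func PPMByHost(records []FetchRecord) map[string]float64 {
	type span struct {
		first, last time.Time
		pages       int
	}
	spans := map[string]*span{}
	for _, r := range records {
		s, ok := spans[r.Host]
		if !ok {
			spans[r.Host] = &span{first: r.LastFetched, last: r.LastFetched, pages: 1}
			continue
		}
		if r.LastFetched.Before(s.first) {
			s.first = r.LastFetched
		}
		if r.LastFetched.After(s.last) {
			s.last = r.LastFetched
		}
		s.pages++
	}
	ppm := map[string]float64{}
	for host, s := range spans {
		if minutes := s.last.Sub(s.first).Minutes(); minutes > 0 {
			ppm[host] = float64(s.pages) / minutes
		}
	}
	return ppm
}
```

As a sanity check against the table below: ncats.nih.gov is 909 pages over 25761 seconds, or 909 / (25761/60) ≈ 2.12 PPM.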

| host | seconds | pages | ppm |
| --- | --- | --- | --- |
| ncats.nih.gov | 25761 | 909 | 2.11715 |
| nida.nih.gov | 39405 | 2974 | 4.52836 |
| www.cancer.gov | 46067 | 10375 | 13.5129 |
| www.cc.nih.gov | 44972 | 2040 | 2.72169 |
| www.cit.nih.gov | 1830 | 42 | 1.37705 |
| www.fic.nih.gov | 26478 | 2518 | 5.70587 |
| www.genome.gov | 58988 | 7675 | 7.80667 |
| www.nccih.nih.gov | 36567 | 2295 | 3.76569 |
| www.nei.nih.gov | 36993 | 2633 | 4.27054 |
| www.nhlbi.nih.gov | 66826 | 7197 | 6.46186 |
| www.nia.nih.gov | 74162 | 162 | 0.131064 |
| www.niaaa.nih.gov | 37522 | 2176 | 3.47956 |
| www.niaid.nih.gov | 40060 | 229 | 0.342986 |
| www.niams.nih.gov | 46591 | 2296 | 2.95679 |
| www.nibib.nih.gov | 19958 | 1219 | 3.6647 |
| www.nichd.nih.gov | 48998 | 7238 | 8.86322 |
| www.nidcr.nih.gov | 66145 | 1092 | 0.990551 |
| www.niddk.nih.gov | 51985 | 6374 | 7.35674 |
| www.niehs.nih.gov | 34842 | 7497 | 12.9103 |
| www.nigms.nih.gov | 24848 | 1555 | 3.75483 |
| www.nimh.nih.gov | 74370 | 5410 | 4.36466 |
| www.nimhd.nih.gov | 50098 | 3833 | 4.5906 |
| www.ninds.nih.gov | 78265 | 184 | 0.141059 |
| www.ninr.nih.gov | 28805 | 450 | 0.937337 |
| www.nlm.nih.gov | 83849 | 36171 | 25.883 |

Now, this is a measure of "pages per minute per host."

Some of these are nearing optimal. For example, nlm.nih.gov is 25.8 ppm. The theoretical max is 30ppm, because I fetch once every two seconds.

However, for others, it is around 3ppm.

This seems abysmal.

But.

All of these are in one queue.

The logic is as follows (a sketch of the politeness check in steps 8-10 follows the list):

  1. walk finds a URL.
  2. It passes the URL to entree.
  3. entree checks the guestbook.
  4. If it has been recently updated, we skip. (There's a bit more here, but this is good enough for getting on with.)
  5. If it hasn't, we continue.
  6. entree enqueues the URL for fetch
  7. fetch grabs a URL from the queue
  8. fetch asks if the host has been hit less than 2s ago (against a local/in-memory dictionary)
  9. If the host has been hit, that worker sleeps until it is time to fetch the content
  10. If not, we continue immediately
  11. We grab the content (if it is an allowed MIME type, etc.)
  12. fetch updates the Guestbook and enqueues walk and extract jobs for this page
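
Steps 8-10, roughly, as a sketch (names are illustrative; the real fetch service presumably differs in detail):

```go
// fetch: the per-host politeness check. One process-local map of
// host -> last fetch time, consulted by every worker before touching a domain.
package fetch

import (
	"sync"
	"time"
)

const politeDelay = 2 * time.Second

var (
	mu        sync.Mutex
	lastFetch = map[string]time.Time{}
)

// waitPolitely blocks the calling worker until at least politeDelay has passed
// since the last fetch against this host, then claims the slot (steps 9/10).
func waitPolitely(host string) {
	for {
		mu.Lock()
		wait := politeDelay - time.Since(lastFetch[host])
		if wait <= 0 {
			lastFetch[host] = time.Now()
			mu.Unlock()
			return
		}
		mu.Unlock()
		time.Sleep(wait) // step 9: the worker sleeps until the host is ready
	}
}
```

This is also where the slowdown comes from: a worker that pulls a job it can't run yet sleeps while holding that job.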

The likely slowdown here is that we are inserting a lot of jobs into the fetch queue for all of those domains. And, because workers are sleeping, that means that the 100 workers I was using might all be sleeping on a domain.

So. This is why some smaller domains, in the face of a larger domain (or a more link-rich domain), will be "starved" in this model.

Before I think about "fixes," I want to see some graphs. This is working, and doing "the right thing," but the model could, I think, have faster throughput.

[figure: fetches_by_domain]

This is one way of visualizing what happened. Each dot is a fetch.

The y axis is the host, and the x axis is the time that the fetch happened. The left-most is the "first" fetch, and the right-most is the "last" fetch. The timing is less important than the relative nature. Looking at ncats.nih.gov, we can see how the fetches get spread out over time. This is because the queue is being thrashed by other domains, and the pages aren't coming around. So, the PPM value for ncats is 2, but that is because nlm is thrashing through the fetch queue constantly. (As are some other domains.) So while nlm has a PPM of 25, other domains are actually being starved.

The ideal fix is to have one queue per domain (and a few workers per queue). Then, we can come close to, or hit, the theoretical 30 PPM limit per domain.

I'm attempting a pool-of-pools approach, where there are 10 fetch queues, and they are round-robin'd. This should see a speedup, because each pool of 10 workers is assigned to a smaller set of URLs to fetch (and we bounce jobs around the queues). This might mean that the overall per-domain throughput is higher.

The current approach---one queue with all the domains on it---could go faster if I 1) have a large number of workers, and 2) I thrash the queue. This means that if we see 100 nlm jobs that have arrived too soon, they get thrown to the back of the queue, and we more quickly process other domains. It feels like a bad solution, but we'll see. Experiment #7 will demonstrate the round-robin queues.

experiment 7

I could have made some graphs first. Instead, I modified fetch.

Now, it does this:

  1. It spawns 10 fetch workers
  2. It spawns 10 fetch-0 workers
  3. It spawns 10 fetch-1 workers...
  4. ...
  5. It spawns 10 fetch-9 workers

(in other words, it does workers × workers: 10 workers on each of the fetch-0 through fetch-9 queues, in addition to the 10 on the original fetch queue)

Now, whenever a domain isn't ready (it was hit too recently), I throw the job at another queue. I'm not sleeping (I should be), so this will maximally thrash the queue(s). But, I want to see what happens.
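
The worker logic is now roughly this sketch (the queue ring and the enqueue function are stand-ins for the real job system):

```go
// fetch: pool-of-pools sketch. A worker that pulls a job for a host hit too
// recently bounces the job to the next queue in the ring instead of sleeping.
package fetch

import (
	"log"
	"sync"
	"time"
)

const politeDelay = 2 * time.Second

var queueRing = []string{
	"fetch", "fetch-0", "fetch-1", "fetch-2", "fetch-3", "fetch-4",
	"fetch-5", "fetch-6", "fetch-7", "fetch-8", "fetch-9",
}

var (
	mu        sync.Mutex
	lastFetch = map[string]time.Time{}
)

// hostReady claims the politeness slot if the host hasn't been hit within
// politeDelay; otherwise it reports "not yet" without blocking.
func hostReady(host string) bool {
	mu.Lock()
	defer mu.Unlock()
	if time.Since(lastFetch[host]) < politeDelay {
		return false
	}
	lastFetch[host] = time.Now()
	return true
}

// handle is what a worker on queue index i does with a job for host/url.
// enqueue stands in for inserting a job onto a named queue.
func handle(i int, host, url string, enqueue func(queue, host, url string)) {
	if !hostReady(host) {
		// Bounce to the next queue in the ring; some other worker will retry,
		// and nobody sleeps while holding a job.
		enqueue(queueRing[(i+1)%len(queueRing)], host, url)
		return
	}
	log.Println("fetching", url) // the real fetch + guestbook update goes here
}
```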

The watcher shows this:

  queue  |   state   | count
---------+-----------+------
 entree  | completed |  1519
 entree  | running   |     6
 extract | running   |     1
 extract | available |     1
 extract | completed |   166
 fetch   | available | 14001
 fetch   | running   |    10
 fetch   | completed |   313
 fetch-0 | available |   298
 fetch-0 | running   |     2
 fetch-0 | completed |   229
 fetch-1 | available |   554
 fetch-1 | completed |   219
 fetch-2 | available |   527
 fetch-2 | completed |   237
 fetch-2 | running   |     8
 fetch-3 | available |   265
 fetch-3 | completed |   233
 fetch-3 | running   |    19
 fetch-4 | completed |   218
 fetch-4 | running   |    10
 fetch-4 | available |   384
 fetch-5 | completed |   215
 fetch-5 | available |   338
 fetch-5 | running   |     8
 fetch-6 | available |   425
 fetch-6 | running   |     3
 fetch-6 | completed |   226
 fetch-7 | completed |   161
 fetch-7 | available |    37
 fetch-7 | running   |     7
 fetch-8 | running   |     9
 fetch-8 | completed |   239
 fetch-8 | available |  1289
 fetch-9 | completed |   255
 fetch-9 | running   |     2
 fetch-9 | available |   332
 pack    | available |     1
 pack    | completed |   211
 walk    | retryable |     1
 walk    | running   |     1
 walk    | completed |   166

Now, I have a lot of fetchers. And, because they're all playing off the same "when did we last see this domain" structure, they're all being polite. But, each different queue will be picked up by a different set of workers, meaning that I should see better throughput than a single set of workers, sleeping, on a single set of URLs.

In truth, I wonder if this would be simpler if I just had more workers, and I was more willing to thrash the queue. That is, 1000 workers who are more willing to thrash the queue would probably be faster than 10 pools of 10.

But, I'll allow this experiment to play out while I'm building some graphs of experiment 6.

... hours later...

Late in the run, this is what the queues looked like, focusing only on available:

  queue  |   state   | count
---------+-----------+-------
 fetch   | available |    11
 fetch-0 | available |    78
 fetch-1 | available |   149
 fetch-2 | available |   491
 fetch-4 | available |    99
 fetch-5 | available |    62
 fetch-6 | available |    94
 fetch-7 | available |   264
 fetch-8 | available | 10755
 fetch-9 | available |    20

And, there are two domains cycling: nlm.nih.gov and cancer.gov. They tick by one per second, alternating.

So, the rest ran. Which is good... if there were more domains queued, they, too, would run. However, this looks weird: there are 10 workers in each of those queue pools, and they're just shuffling jobs between each other.

I'm going to pull the guestbook at this point and run some analysis, even though there are 10K pages left to fetch, I suspect.

experiment 7 analysis

Here is the Experiment 6/Run-01 plot:

[figure: fetches_by_domain]

And Experiment 7/Run-02:

[figure: fetches_by_domain_run_02]

Smaller domains finished faster using the round-robin approach.

And experiment 7/Run-03:

This is a per-domain queueing approach, where each domain gets its own queue, and 5 workers. (The incoming queue gets 10 workers.)

[figure: fetches-by-domain-by-run-03]

Note the x-axis; the largest domain is nowhere near done yet.

The per-domain and round-robin approaches have similar density for low-page-count domains. However, round-robin involves a lot of queue thrashing. In the per-domain model, I have one client for the incoming fetch queue, and one client for all of the domain queues. The client managing the domain queues has a 500ms backoff, which means it is hitting the queues very infrequently compared to the round-robin approach. And, each domain has its own worker pool. The result is that each domain maintains a good PPM. Per-domain queues are better in all but one case.
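
For reference, the per-domain shape is roughly this sketch (channels stand in for the DB-backed queues; 5 workers per domain, as in run-03):

```go
// fetchpd: per-domain queue sketch. An incoming router creates one queue and a
// small worker pool per host, the first time that host is seen.
package fetchpd

import "sync"

type job struct{ host, url string }

type router struct {
	mu     sync.Mutex
	queues map[string]chan job
	fetch  func(job) // the polite, one-page-per-2s fetch for a single host
}

func newRouter(fetch func(job)) *router {
	return &router{queues: map[string]chan job{}, fetch: fetch}
}

// enqueue routes a job to its host's queue, creating the queue and a pool of
// 5 workers (as in run-03) the first time we see the host.
func (r *router) enqueue(j job) {
	r.mu.Lock()
	q, ok := r.queues[j.host]
	if !ok {
		q = make(chan job, 1024)
		r.queues[j.host] = q
		for i := 0; i < 5; i++ {
			go func() {
				for j := range q {
					r.fetch(j)
				}
			}()
		}
	}
	r.mu.Unlock()
	q <- j
}
```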

ppm-1 is the number of pages-per-minute with a naive queuing approach.

ppm-2 is with round-robin queuing, where anything delayed is shuffled over to the next queue.

ppm-3 is where every domain has its own queue for slow (1-page-per-2s) fetching.

2 vs 1 is how many times faster the round-robin is than the naive approach.

3 vs 1 is how many times faster the per-domain queuing approach is.

While the two are similar... round-robin is more general, but thrashes the queue hard. Per-domain is conceptually clearer, thrashes the queue less, and is more performant. I'm inclined to go that direction for production.

| host | ppm-1 | ppm-2 | ppm-3 | 2 vs 1 | 3 vs 1 | 3 better? |
| --- | --- | --- | --- | --- | --- | --- |
| ncats.nih.gov | 2.12 | 4.36 | 5.65 | 2.06 | 2.67 | yes |
| nida.nih.gov | 4.53 | 6.35 | 10.76 | 1.40 | 2.38 | yes |
| www.cancer.gov | 13.51 | 26.38 | 27.89 | 1.95 | 2.06 | yes |
| www.cc.nih.gov | 2.72 | 2.76 | 9.31 | 1.01 | 3.42 | yes |
| www.cit.nih.gov | 1.38 | 1.38 | 1.68 | 1.00 | 1.22 | yes |
| www.fic.nih.gov | 5.71 | 12.33 | 17.89 | 2.16 | 3.14 | yes |
| www.genome.gov | 7.81 | 23.68 | 26.00 | 3.03 | 3.33 | yes |
| www.nccih.nih.gov | 3.77 | 17.24 | 9.03 | 4.58 | 2.40 | no |
| www.nei.nih.gov | 4.27 | 19.45 | 20.14 | 4.56 | 4.72 | yes |
| www.nhlbi.nih.gov | 6.46 | 22.24 | 26.59 | 3.44 | 4.11 | yes |
| www.nia.nih.gov | 0.13 | 0.29 | 0.47 | 2.18 | 3.59 | yes |
| www.niaaa.nih.gov | 3.48 | 3.19 | 16.53 | 0.92 | 4.75 | yes |
| www.niaid.nih.gov | 0.34 | 0.45 | 0.92 | 1.32 | 2.69 | yes |
| www.niams.nih.gov | 2.96 | 5.88 | 13.46 | 1.99 | 4.55 | yes |
| www.nibib.nih.gov | 3.66 | 11.56 | 17.52 | 3.16 | 4.78 | yes |
| www.nichd.nih.gov | 8.86 | 26.74 | 27.28 | 3.02 | 3.08 | yes |
| www.nidcr.nih.gov | 0.99 | 2.50 | 8.22 | 2.53 | 8.30 | yes |
| www.niddk.nih.gov | 7.36 | 7.86 | 19.01 | 1.07 | 2.58 | yes |
| www.niehs.nih.gov | 12.91 | 26.68 | 27.79 | 2.07 | 2.15 | yes |
| www.nigms.nih.gov | 3.75 | 13.63 | 17.72 | 3.63 | 4.72 | yes |
| www.nimh.nih.gov | 4.36 | 5.20 | 13.86 | 1.19 | 3.18 | yes |
| www.nimhd.nih.gov | 4.59 | 15.01 | 18.96 | 3.27 | 4.13 | yes |
| www.ninds.nih.gov | 0.14 | 0.24 | 0.42 | 1.71 | 2.97 | yes |
| www.ninr.nih.gov | 0.94 | 0.46 | 1.82 | 0.49 | 1.94 | yes |
| www.nlm.nih.gov | 25.88 | 31.67 | 53.08 | 1.22 | 2.05 | yes |

another approach

This is an idea in case the other approaches don't turn out to be efficient...

Another approach is to have "backoff queues."

The incoming queue processes as fast as possible. Domains that are ready are fetched.

If a domain isn't ready, the job goes to the next queue, which processes one URL every 100ms. So, it is a slower queue.

If the domain is ready, it is fetched. Otherwise, it goes to the next queue, which is a 200ms queue.

Then 400. 800. 1600.

At that point, we are fetching one URL every 1.6s from that queue. It should have a high hit rate. This will also be a long queue, but shorter/smaller domains should, I think, end up with fewer pages on it.
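
Sketched, the ladder is just an ordered list of queues with increasing poll intervals, plus a "demote" step for jobs whose host isn't ready yet (names and intervals are illustrative):

```go
// fetchbackoff: the backoff-queue ladder idea, sketched.
package fetchbackoff

import "time"

type rung struct {
	queue    string
	interval time.Duration // how often workers pull from this queue
}

// The incoming queue runs flat out; each later rung is slower, so by the last
// rung we only try a job's host every 1.6s and most attempts should succeed.
var ladder = []rung{
	{"fetch", 0},
	{"fetch-100ms", 100 * time.Millisecond},
	{"fetch-200ms", 200 * time.Millisecond},
	{"fetch-400ms", 400 * time.Millisecond},
	{"fetch-800ms", 800 * time.Millisecond},
	{"fetch-1600ms", 1600 * time.Millisecond},
}

// demote returns the rung a not-ready job should move to: the next, slower
// queue, or the last rung if the job is already there.
func demote(current int) int {
	if current+1 < len(ladder) {
		return current + 1
	}
	return current
}
```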

This might have odd/unintended consequences. A domain with lots of URLs will end up clogging that queue, and if another domain's pages land on it, they'll take forever to get processed. (But, those domains should end up with very few pages on the queue.)

A per-domain queueing system would get more pages through, but would still want a backoff. Why? Because we can only hit a given domain so fast, and there's no reason to thrash the queue. Per-domain queues could all check once every 1.5-2s, by definition. But, if we have thousands of domains, it implies thousands of queues. (But... I might only need 2-3 workers/queue, because the parallelism comes from the number of queues, not the number of workers.)