
initial tests


I want to see how different numbers of workers impact performance.

Experiments

  1. en10,ex30,f100,p10,w10
  2. en50,ex50,f150,p10,w50
  3. multiple services

Some suspicions...

experiment 1

I set the config to

| service | workers |
| ------- | ------- |
| entree  | 10      |
| extract | 30      |
| fetch   | 100     |
| pack    | 10      |
| walk    | 10      |

I then ran

run-list.bash va-200.txt

which has 200 va.gov domains.

The polite backoff was set to 2 seconds (meaning, I shouldn't hit any one subdomain more than once every two seconds).

JPM = jobs per minute

| measure            | value  |
| ------------------ | ------ |
| guestbook pages    | 25650  |
| duration (minutes) | 14     |
| overall PPM        | 1832   |
| entree JPM         | 4941   |
| entree total jobs  | 69174  |
| fetch JPM          | 9680   |
| fetch total jobs   | 135531 |
| extract JPM        | 700    |
| extract total jobs | 9797   |
| walk JPM           | 692    |
| walk total jobs    | 9689   |

I saw an overall PPM of 1832. I calculate this by:

  1. The duration is the newest entry in the guestbook minus the oldest last_fetched value in the table.
  2. Take the count of entries in the guestbook (pages) and divide it by that duration.
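
For experiment 1, that works out to 25650 pages / 14 minutes ≈ 1832 pages per minute.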

Walk experienced a number of failures -- not many, but some. This may have to do with a bug that was observed yesterday in the goquery parsing. There may be situations where a failed page fetch is creating an empty object in the S3 bucket (?), and then we fetch the empty object and try and parse it. (Or, something.)

I noted some domains in the VA space that had x509 cert problems; this would account for a crawl that yielded pages that would not fetch. So, some investigation is required. (And, those kinds of errors probably need to be logged in a way that we can use them to investigate, and then inform Federal partners.)

experiment 2

The point of the second experiment was to run the same domains with more workers. I had to add a pgxpool to entree, as the increase in workers exposed a concurrency issue in EvaluateEntree. So, these are not the same code.
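
As a rough sketch of that change (assuming entree talks to Postgres via pgx; the names below are illustrative, not the actual EvaluateEntree code), the idea is to share a pgxpool.Pool across the workers instead of a single connection:

```go
package entree

import (
	"context"
	"os"

	"github.com/jackc/pgx/v5/pgxpool"
)

// newEntreePool builds a connection pool that all entree workers can share.
// A single *pgx.Conn is not safe for concurrent use, which is presumably the
// kind of problem that surfaced when the worker count jumped from 10 to 50;
// *pgxpool.Pool hands each concurrent query its own connection.
func newEntreePool(ctx context.Context) (*pgxpool.Pool, error) {
	cfg, err := pgxpool.ParseConfig(os.Getenv("DATABASE_URL"))
	if err != nil {
		return nil, err
	}
	// Cap the pool a little above the worker count so workers rarely block
	// waiting for a connection. (Illustrative number, not a measured one.)
	cfg.MaxConns = 60
	return pgxpool.NewWithConfig(ctx, cfg)
}
```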

| service | workers |
| ------- | ------- |
| entree  | 50      |
| extract | 50      |
| fetch   | 150     |
| pack    | 10      |
| walk    | 50      |

I am still suspicious of what happens if we have too many packers, so for now I'm keeping that number low. I also suspect this run went on longer? (I can't remember if I stopped the previous experiment.) This run was stopped after roughly 30 minutes.

| measure            | value  |
| ------------------ | ------ |
| guestbook pages    | 32675  |
| duration (minutes) | 29     |
| overall PPM        | 1126   |
| entree JPM         | 3472   |
| entree total jobs  | 100713 |
| fetch JPM          | 7746   |
| fetch total jobs   | 224641 |
| extract JPM        | 518    |
| extract total jobs | 9339   |
| walk JPM           | 479    |
| walk total jobs    | 13899  |

So, the PPM was lower, but... the runtime was longer. The PPM is a measure of total guestbook entries over the duration from the oldest entry to the newest. As the crawl slows down, the apparent rate drops. (That is, if we find ourselves walking a lot of HEAD requests and doing fewer GET requests, we will do a smaller number of inserts, and the density will go down. Anyway.)

Without some repeatability, I'm not sure I can get a good measure. I'll use a smaller set of sites, and see if I can get a repeatable number.

| host                     | host id | duration        | count | ppm |
| ------------------------ | ------- | --------------- | ----- | --- |
| www.va.gov               | 350     | 00:29:07.63452  | 5634  | 194 |
| news.va.gov              | 311     | 00:28:59.413167 | 3615  | 129 |
| www.cfm.va.gov           | 694     | 00:28:53.334614 | 2498  | 89  |
| www.hsrd.research.va.gov | 481     | 00:29:12.130379 | 2315  | 79  |
| www.research.va.gov      | 1027    | 00:28:02.625225 | 1871  | 66  |
| department.va.gov        | 1257    | 00:28:57.112199 | 1848  | 66  |

multiple services

In this experiment, I added a multi.yaml Docker Compose config.

https://medium.com/@vinodkrane/microservices-scaling-and-load-balancing-using-docker-compose-78bf8dc04da9

I then spawned the stack with this configuration:

docker compose -f multi.yaml up \
  --scale entree=2 \
  --scale fetch=2 \
  --scale extract=2 \
  --scale walk=2

The entree service should still be polite; we only fetch against a domain if 2 seconds have passed. So, the fetch rate should not increase (per se). However, our total processing capability should go up.
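
For reference, the politeness gate I have in mind looks roughly like the sketch below. This is an illustration rather than the actual entree code: it assumes the hosts table carries a last_fetched timestamp, and it pushes the 2-second check into Postgres so that two entree replicas cannot both claim the same host at the same moment.

```go
package entree

import (
	"context"

	"github.com/jackc/pgx/v5/pgxpool"
)

// claimHost reports whether we may fetch from host right now. The update only
// matches when at least 2 seconds have passed since the last fetch, so the
// check-and-set is atomic even with several entree replicas running.
// (Sketch only: the hosts(host, last_fetched) columns are an assumption.)
func claimHost(ctx context.Context, pool *pgxpool.Pool, host string) (bool, error) {
	tag, err := pool.Exec(ctx, `
		update hosts
		   set last_fetched = now()
		 where host = $1
		   and (last_fetched is null
		        or last_fetched < now() - interval '2 seconds')`,
		host)
	if err != nil {
		return false, err
	}
	return tag.RowsAffected() == 1, nil
}
```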

At 6 minutes, the effective overall PPM was around 3200.

At 12 minutes, it was 2300. I seriously suspect the queue thrashing is slowing things down, and also wonder if we're running out of "fresh" URLs (which makes the number look lower?).

At approximately 22 minutes, I stopped the run.

| host                     | host id | duration        | count | ppm |
| ------------------------ | ------- | --------------- | ----- | --- |
| www.va.gov               | 350     | 00:21:50.76345  | 6691  | 318 |
| news.va.gov              | 311     | 00:21:32.693246 | 4132  | 196 |
| www.cfm.va.gov           | 694     | 00:21:59.184202 | 2604  | 124 |
| va.ng.mil                | 1146    | 00:21:58.52216  | 2578  | 122 |
| www.hsrd.research.va.gov | 481     | 00:21:51.007061 | 2528  | 120 |
| www.research.va.gov      | 1027    | 00:21:12.910271 | 1899  | 90  |
| department.va.gov        | 1257    | 00:22:02.370991 | 1550  | 70  |
| www.mirecc.va.gov        | 360     | 00:21:01.910669 | 1263  | 60  |
| www.cem.va.gov           | 385     | 00:22:14.338689 | 1150  | 52  |
| digital.va.gov           | 1167    | 00:22:18.169547 | 1016  | 46  |
| www.publichealth.va.gov  | 793     | 00:22:12.555373 | 1001  | 45  |

I have no idea how I'm calculating a PPM of more than 30 on a given domain; with a 2-second backoff, the ceiling should be 60 s / 2 s = 30 pages per minute per host. This doesn't make sense. I'm supposed to be preventing a DDoS attack, and now that I look at this calculation, I wonder how I ever get a PPM greater than 30 on a single domain.

The overall PPM was 1666.

| measure            | value  |
| ------------------ | ------ |
| guestbook pages    | 36656  |
| duration (minutes) | 22     |
| overall PPM        | 1666   |
| entree JPM         | 6219   |
| entree total jobs  | 136826 |
| fetch JPM          | 9903   |
| fetch total jobs   | 217869 |
| extract JPM        | 957    |
| extract total jobs | 12444  |
| walk JPM           | 831    |
| walk total jobs    | 18286  |

fetch is operating near maximum throughput; doubling the number of extract and walk services roughly doubled the throughput of those services. This suggests that... hm. I wonder if just increasing the worker pool from 50 to 100 would double the throughput of extract and walk. The RAM usage will go up, for sure... extract loads things into RAM to do extraction. Multiple services might be "cheaper"?
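
If I go that route, the knob is the per-queue worker count in the River client configuration. A minimal sketch, assuming the services use River's Go client in the usual way (the worker registration is only indicated in a comment, and none of these names are the project's actual code):

```go
package extract

import (
	"github.com/jackc/pgx/v5"
	"github.com/jackc/pgx/v5/pgxpool"
	"github.com/riverqueue/river"
	"github.com/riverqueue/river/riverdriver/riverpgxv5"
)

// newRiverClient builds a River client whose default queue runs up to 100
// concurrent workers instead of 50. More workers means more pages held in
// memory at once during extraction, so RAM usage is the thing to watch.
func newRiverClient(dbPool *pgxpool.Pool) (*river.Client[pgx.Tx], error) {
	workers := river.NewWorkers()
	// river.AddWorker(workers, &ExtractWorker{}) // project-specific worker registration

	return river.NewClient(riverpgxv5.New(dbPool), &river.Config{
		Queues: map[string]river.QueueConfig{
			river.QueueDefault: {MaxWorkers: 100}, // up from 50
		},
		Workers: workers,
	})
}
```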

Anyway. I need to look at queue thrashing. And, I might want to clear the queue periodically; completed jobs don't need to be kept around.
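
One way to do that clearing, sketched against the same river_job table the analysis queries use (the one-hour retention window is arbitrary, and River may have a built-in retention setting that makes this unnecessary):

```go
package queue

import (
	"context"

	"github.com/jackc/pgx/v5/pgxpool"
)

// pruneCompletedJobs deletes completed river_job rows older than an hour so
// the queue table stays small. It returns the number of rows removed.
func pruneCompletedJobs(ctx context.Context, pool *pgxpool.Pool) (int64, error) {
	tag, err := pool.Exec(ctx, `
		delete from river_job
		 where state = 'completed'
		   and finalized_at < now() - interval '1 hour'`)
	if err != nil {
		return 0, err
	}
	return tag.RowsAffected(), nil
}
```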

suspicions

I have a few different suspicions/thoughts.

If entree has 50 workers, the largest number of domains that will pass through the front door close to the 2s deadline will be... roughly 50. And it might be worse than that, because we're thrashing the queue; in other words, we're waiting for jobs to come back around on a very noisy queue.

Since the host level is only thousands of domains (at worst tens of thousands), the 2s timeout could be determined by an in-memory map instead of going out to the hosts table, and the worker could simply wait until it is time to fetch. That way, an entree service can hold (say) 200 or 400 domains at once, potentially sleeping for fractions of a second on each. This feels like a better approach than requeueing jobs and waiting for them to come back around.
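
A minimal sketch of that in-memory approach (nothing here is the real entree code; the type and method names are made up):

```go
package entree

import (
	"sync"
	"time"
)

// politeness hands out per-host fetch slots from an in-memory map, so a
// worker can sleep until its slot arrives instead of requeueing the job and
// waiting for it to come back around on a noisy queue.
type politeness struct {
	mu   sync.Mutex
	next map[string]time.Time // host -> earliest time we may fetch it again
}

func newPoliteness() *politeness {
	return &politeness{next: make(map[string]time.Time)}
}

// wait blocks until host may be fetched, then reserves the following 2 s slot
// for whoever asks about this host next. With hundreds of hosts in flight,
// most sleeps are fractions of a second.
func (p *politeness) wait(host string) {
	p.mu.Lock()
	slot := p.next[host]
	if now := time.Now(); slot.Before(now) {
		slot = now
	}
	p.next[host] = slot.Add(2 * time.Second)
	p.mu.Unlock()

	time.Sleep(time.Until(slot))
}
```

The obvious caveat is that this map lives inside a single entree process; with multiple replicas (as in the third experiment), the hosts would have to be partitioned across replicas or the state shared somewhere.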

I wonder if I must be (?) hitting domains faster. I think. I am not certain that I'm being polite at this point. Although... if the entree services are doing their job, then it should still be the case that a given domain is getting hit only once every 2s, and therefore the pair of replicas are doing more work overall, but not necessarily more against any single domain?

Parallel work at scale is hard. I need better visibility into what is going on in the aggregate (e.g. how do I know that a given domain is being treated nicely?) in order to be confident about scaling up like this.

SQL

The following SQL was used to generate the numbers in the analysis.

entree

-- Note: extract(minute ...) reads only the minutes field of the interval, so this
-- only works for runs shorter than an hour; extract(epoch ...) / 60 would be more robust.
select (select count(*) from river_job where kind='entree' and state='completed') /
(select extract(minute from (
	select maxf-minf from
		(select max(rj.finalized_at) as maxf, min(rj.finalized_at) as minf from river_job rj
		where kind='entree') as bounds))) as entree_jobs_per_minute,
	(select count(*) from river_job where kind='entree' and state='completed') as entree_total

fetch

select (select count(*) from river_job where kind='fetch' and state='completed') /
(select extract(minute from (
	select maxf-minf from
		(select max(rj.finalized_at) as maxf, min(rj.finalized_at) as minf from river_job rj
		where kind='fetch') as bounds))) as fetch_jobs_per_minute,
	(select count(*) from river_job where kind='fetch' and state='completed') as fetch_total

extract

select (select count(*) from river_job where kind='extract' and state='completed') /
(select extract(minute from (
	select maxf-minf from
		(select max(rj.finalized_at) as maxf, min(rj.finalized_at) as minf from river_job rj
		where kind='extract') as bounds))) as extract_jobs_per_minute,
	(select count(*) from river_job where kind='extract' and state='completed') as extract_total

walk

select (select count(*) from river_job where kind='walk' and state='completed') /
(select extract(minute from (
	select maxf-minf from
		(select max(rj.finalized_at) as maxf, min(rj.finalized_at) as minf from river_job rj
		where kind='walk') as bounds))) as walk_jobs_per_minute,
	(select count(*) from river_job where kind='walk' and state='completed') as walk_total

PPM per domain

The SQL to get a PPM per domain:

select 
	hname.host, gb.host, max(gb.last_fetched)-min(gb.last_fetched) as duration,
	count(gb.path) as count, 
	count(gb.path)/nullif(extract(minute from max(gb.last_fetched)-min(gb.last_fetched)), 0) as ppm
from guestbook gb, hosts hname
where hname.id = gb.host
group by gb.host, hname.host
order by count desc, ppm asc

which yields something like:

| host                     | host id | duration        | count | ppm    |
| ------------------------ | ------- | --------------- | ----- | ------ |
| www.va.gov               | 350     | 00:13:48.627931 | 3603  | 277.15 |
| news.va.gov              | 311     | 00:13:32.876906 | 2267  | 174.38 |
| www.research.va.gov      | 1027    | 00:13:07.186824 | 1803  | 138.69 |
| department.va.gov        | 1257    | 00:14:03.299157 | 1744  | 124.57 |
| www.hsrd.research.va.gov | 481     | 00:14:00.124224 | 1653  | 118.07 |
| www.cfm.va.gov           | 694     | 00:14:06.303193 | 1421  | 101.50 |
| va.ng.mil                | 1146    | 00:14:03.421357 | 1109  | 79.21  |
| www.mirecc.va.gov        | 360     | 00:13:37.558012 | 1050  | 80.77  |

And, then

select count(gb.path) / extract(minute from (select maxf-minf from
(select max(gb.last_fetched) as maxf, min(gb.last_fetched) as minf
from guestbook gb) as duration)) as ppm from guestbook gb

which yields a single number (overall pages per minute, or ppm)

In pieces:

select count(gb.path) from guestbook gb

select extract(minute from (select maxf-minf from
(select max(gb.last_fetched) as maxf, min(gb.last_fetched) as minf
from guestbook gb) as duration)) minutes

Pooling notes...
