
initial tests


I want to see how different numbers of workers impact performance.

Experiments

  1. en10,ex30,f100,p10,w10
  2. en50,ex50,f150,p10,w50
  3. multiple services

Some suspicions...

experiment 1

I set the config to

| service | workers |
| ------- | ------- |
| entree  | 10      |
| extract | 30      |
| fetch   | 100     |
| pack    | 10      |
| walk    | 10      |

I then ran

run-list.bash va-200.txt

which has 200 va.gov domains.

The polite backoff was set to 2 seconds (meaning, I shouldn't hit any one subdomain more than once every two seconds).

JPM = jobs per minute

| measure            | value  |
| ------------------ | ------ |
| guestbook pages    | 25650  |
| duration (minutes) | 14     |
| overall PPM        | 1832   |
| entree JPM         | 4941   |
| entree total jobs  | 69174  |
| fetch JPM          | 9680   |
| fetch total jobs   | 135531 |
| extract JPM        | 700    |
| extract total jobs | 9797   |
| walk JPM           | 692    |
| walk total jobs    | 9689   |

I saw an overall PPM of 1832. I calculate this by:

  1. The duration is the newest entry in the guestbook minus the oldest last_fetched value in the table.
  2. Take the count of entries in the guestbook (pages) and divide it by that duration.
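
For experiment 1, that works out to 25650 pages / 14 minutes ≈ 1832 pages per minute.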

Walk experienced a number of failures -- not many, but some. This may have to do with a bug that was observed yesterday in the goquery parsing. There may be situations where a failed page fetch is creating an empty object in the S3 bucket (?), and then we fetch the empty object and try and parse it. (Or, something.)

I noted some domains in the VA space that had x509 cert problems; this would account for a crawl that yielded pages that would not fetch. So, some investigation is required. (And, those kinds of errors probably need to be logged in a way that we can use them to investigate, and then inform Federal partners.)

experiment 2

The point of the second experiment was to run the same domains with more workers. I had to add a pgxpool to entree, as the increase in workers exposed a concurrency issue in EvaluateEntree. So, these are not the same code.
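
As a rough sketch of that change (assuming entree talks to Postgres via pgx; the names below are illustrative, not the actual EvaluateEntree code), the idea is to share a pgxpool.Pool across the workers instead of a single connection:

```go
package entree

import (
	"context"
	"os"

	"github.com/jackc/pgx/v5/pgxpool"
)

// newEntreePool builds a connection pool that all entree workers can share.
// A single *pgx.Conn is not safe for concurrent use, which is presumably the
// kind of problem that surfaced when the worker count jumped from 10 to 50;
// *pgxpool.Pool hands each concurrent query its own connection.
func newEntreePool(ctx context.Context) (*pgxpool.Pool, error) {
	cfg, err := pgxpool.ParseConfig(os.Getenv("DATABASE_URL"))
	if err != nil {
		return nil, err
	}
	// Cap the pool a little above the worker count so workers rarely block
	// waiting for a connection. (Illustrative number, not a measured one.)
	cfg.MaxConns = 60
	return pgxpool.NewWithConfig(ctx, cfg)
}
```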

| service | workers |
| ------- | ------- |
| entree  | 50      |
| extract | 50      |
| fetch   | 150     |
| pack    | 10      |
| walk    | 50      |

I am still suspicious of what happens if we have too many packers, so for now I'm keeping that number low. I also suspect this run went on longer? (I can't remember if I stopped the previous experiment.) This run was stopped after roughly 30 minutes.

| measure            | value  |
| ------------------ | ------ |
| guestbook pages    | 32675  |
| duration (minutes) | 29     |
| overall PPM        | 1126   |
| entree JPM         | 3472   |
| entree total jobs  | 100713 |
| fetch JPM          | 7746   |
| fetch total jobs   | 224641 |
| extract JPM        | 518    |
| extract total jobs | 9339   |
| walk JPM           | 479    |
| walk total jobs    | 13899  |

So, the PPM was lower, but... the runtime was longer. The PPM is a measure of total guestbook entries over the duration from the oldest entry to the newest. As the crawl slows down, the apparent rate drops. (That is, if we find ourselves walking a lot of HEAD requests and doing fewer GET requests, we will do a smaller number of inserts, and the density will go down. Anyway.)

Without some repeatability, I'm not sure I can get a good measure. I'll use a smaller set of sites, and see if I can get a repeatable number.

| host                     | host id | duration        | count | ppm |
| ------------------------ | ------- | --------------- | ----- | --- |
| www.va.gov               | 350     | 00:29:07.63452  | 5634  | 194 |
| news.va.gov              | 311     | 00:28:59.413167 | 3615  | 129 |
| www.cfm.va.gov           | 694     | 00:28:53.334614 | 2498  | 89  |
| www.hsrd.research.va.gov | 481     | 00:29:12.130379 | 2315  | 79  |
| www.research.va.gov      | 1027    | 00:28:02.625225 | 1871  | 66  |
| department.va.gov        | 1257    | 00:28:57.112199 | 1848  | 66  |

multiple services

In this experiment, I added a multi.yaml Docker Compose config.

https://medium.com/@vinodkrane/microservices-scaling-and-load-balancing-using-docker-compose-78bf8dc04da9

I then spawned the stack with this configuration:

docker compose -f multi.yaml up \
  --scale entree=2 \
  --scale fetch=2 \
  --scale extract=2 \
  --scale walk=2

The entree service should still be polite; we only fetch against a domain if 2 seconds have passed. So, the fetch rate should not increase (per se). However, our total processing capability should go up.
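
For reference, the politeness gate I have in mind looks roughly like the sketch below. This is an illustration rather than the actual entree code: it assumes the hosts table carries a last_fetched timestamp, and it pushes the 2-second check into Postgres so that two entree replicas cannot both claim the same host at the same moment.

```go
package entree

import (
	"context"

	"github.com/jackc/pgx/v5/pgxpool"
)

// claimHost reports whether we may fetch from host right now. The update only
// matches when at least 2 seconds have passed since the last fetch, so the
// check-and-set is atomic even with several entree replicas running.
// (Sketch only: the hosts(host, last_fetched) columns are an assumption.)
func claimHost(ctx context.Context, pool *pgxpool.Pool, host string) (bool, error) {
	tag, err := pool.Exec(ctx, `
		update hosts
		   set last_fetched = now()
		 where host = $1
		   and (last_fetched is null
		        or last_fetched < now() - interval '2 seconds')`,
		host)
	if err != nil {
		return false, err
	}
	return tag.RowsAffected() == 1, nil
}
```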

At 6 minutes, the effective overall PPM was around 3200.

At 12 minutes, it was 2300. I seriously suspect the queue thrashing is slowing things down, and also wonder if we're running out of "fresh" URLs (which makes the number look lower?).

At approximately 22 minutes, I stopped the run.

| host                     | host id | duration        | count | ppm |
| ------------------------ | ------- | --------------- | ----- | --- |
| www.va.gov               | 350     | 00:21:50.76345  | 6691  | 318 |
| news.va.gov              | 311     | 00:21:32.693246 | 4132  | 196 |
| www.cfm.va.gov           | 694     | 00:21:59.184202 | 2604  | 124 |
| va.ng.mil                | 1146    | 00:21:58.52216  | 2578  | 122 |
| www.hsrd.research.va.gov | 481     | 00:21:51.007061 | 2528  | 120 |
| www.research.va.gov      | 1027    | 00:21:12.910271 | 1899  | 90  |
| department.va.gov        | 1257    | 00:22:02.370991 | 1550  | 70  |
| www.mirecc.va.gov        | 360     | 00:21:01.910669 | 1263  | 60  |
| www.cem.va.gov           | 385     | 00:22:14.338689 | 1150  | 52  |
| digital.va.gov           | 1167    | 00:22:18.169547 | 1016  | 46  |
| www.publichealth.va.gov  | 793     | 00:22:12.555373 | 1001  | 45  |

I have no idea how I'm calculating a PPM of more than 30 on a given domain; with a 2-second backoff, the ceiling should be 60 s / 2 s = 30 pages per minute per host. This doesn't make sense. I'm supposed to be preventing a DDoS attack, and now that I look at this calculation, I wonder how I ever get a PPM greater than 30 on a single domain.

The overall PPM was 1666.

| measure            | value  |
| ------------------ | ------ |
| guestbook pages    | 36656  |
| duration (minutes) | 22     |
| overall PPM        | 1666   |
| entree JPM         | 6219   |
| entree total jobs  | 136826 |
| fetch JPM          | 9903   |
| fetch total jobs   | 217869 |
| extract JPM        | 957    |
| extract total jobs | 12444  |
| walk JPM           | 831    |
| walk total jobs    | 18286  |

fetch is operating near maximum throughput; doubling the number of extract and walk services roughly doubled the throughput of those services. This suggests that... hm. I wonder if just increasing the worker pool from 50 to 100 would double the throughput of extract and walk. The RAM usage will go up, for sure... extract loads things into RAM to do extraction. Multiple services might be "cheaper"?
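
If I go that route, the knob is the per-queue worker count in the River client configuration. A minimal sketch, assuming the services use River's Go client in the usual way (the worker registration is only indicated in a comment, and none of these names are the project's actual code):

```go
package extract

import (
	"github.com/jackc/pgx/v5"
	"github.com/jackc/pgx/v5/pgxpool"
	"github.com/riverqueue/river"
	"github.com/riverqueue/river/riverdriver/riverpgxv5"
)

// newRiverClient builds a River client whose default queue runs up to 100
// concurrent workers instead of 50. More workers means more pages held in
// memory at once during extraction, so RAM usage is the thing to watch.
func newRiverClient(dbPool *pgxpool.Pool) (*river.Client[pgx.Tx], error) {
	workers := river.NewWorkers()
	// river.AddWorker(workers, &ExtractWorker{}) // project-specific worker registration

	return river.NewClient(riverpgxv5.New(dbPool), &river.Config{
		Queues: map[string]river.QueueConfig{
			river.QueueDefault: {MaxWorkers: 100}, // up from 50
		},
		Workers: workers,
	})
}
```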

Anyway. I need to look at queue thrashing. And, I might want to clear the queue periodically; completed jobs don't need to be kept around.
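
One way to do that clearing, sketched against the same river_job table the analysis queries use (the one-hour retention window is arbitrary, and River may have a built-in retention setting that makes this unnecessary):

```go
package queue

import (
	"context"

	"github.com/jackc/pgx/v5/pgxpool"
)

// pruneCompletedJobs deletes completed river_job rows older than an hour so
// the queue table stays small. It returns the number of rows removed.
func pruneCompletedJobs(ctx context.Context, pool *pgxpool.Pool) (int64, error) {
	tag, err := pool.Exec(ctx, `
		delete from river_job
		 where state = 'completed'
		   and finalized_at < now() - interval '1 hour'`)
	if err != nil {
		return 0, err
	}
	return tag.RowsAffected(), nil
}
```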

suspicions

I have a few different suspicions/thoughts.

If entree has 50 workers, the largest number of domains that will pass through the front door close to the 2s deadline will be... roughly 50. And it might be worse than that, because we're thrashing the queue; in other words, we're waiting for jobs to come back around on a very noisy queue.

Since the host level is only thousands of domains (at worst tens of thousands), the 2s timeout could be determined by an in-memory map instead of going out to the hosts table, and the worker could simply wait until it is time to fetch. That way, an entree service can hold (say) 200 or 400 domains at once, potentially sleeping for fractions of a second on each. This feels like a better approach than requeueing jobs and waiting for them to come back around.
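
A minimal sketch of that in-memory approach (nothing here is the real entree code; the type and method names are made up):

```go
package entree

import (
	"sync"
	"time"
)

// politeness hands out per-host fetch slots from an in-memory map, so a
// worker can sleep until its slot arrives instead of requeueing the job and
// waiting for it to come back around on a noisy queue.
type politeness struct {
	mu   sync.Mutex
	next map[string]time.Time // host -> earliest time we may fetch it again
}

func newPoliteness() *politeness {
	return &politeness{next: make(map[string]time.Time)}
}

// wait blocks until host may be fetched, then reserves the following 2 s slot
// for whoever asks about this host next. With hundreds of hosts in flight,
// most sleeps are fractions of a second.
func (p *politeness) wait(host string) {
	p.mu.Lock()
	slot := p.next[host]
	if now := time.Now(); slot.Before(now) {
		slot = now
	}
	p.next[host] = slot.Add(2 * time.Second)
	p.mu.Unlock()

	time.Sleep(time.Until(slot))
}
```

The obvious caveat is that this map lives inside a single entree process; with multiple replicas (as in the third experiment), the hosts would have to be partitioned across replicas or the state shared somewhere.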

I wonder if I must be (?) hitting domains faster. I think. I am not certain that I'm being polite at this point. Although... if the entree services are doing their job, then it should still be the case that a given domain is getting hit only once every 2s, and therefore the pair of replicas are doing more work overall, but not necessarily more against any single domain?

Parallel work at scale is hard. I need better visibility into what is going on in the aggregate (e.g. how do I know that a given domain is being treated nicely?) in order to be confident about scaling up like this.

SQL

The following SQL was used to generate the numbers in the analysis.

entree

-- Note: extract(minute ...) reads only the minutes field of the interval, so this
-- only works for runs shorter than an hour; extract(epoch ...) / 60 would be more robust.
select (select count(*) from river_job where kind='entree' and state='completed') /
(select extract(minute from (
	select maxf-minf from
		(select max(rj.finalized_at) as maxf, min(rj.finalized_at) as minf from river_job rj
		where kind='entree') as bounds))) as entree_jobs_per_minute,
	(select count(*) from river_job where kind='entree' and state='completed') as entree_total

fetch

select (select count(*) from river_job where kind='fetch' and state='completed') /
(select extract(minute from (
	select maxf-minf from
		(select max(rj.finalized_at) as maxf, min(rj.finalized_at) as minf from river_job rj
		where kind='fetch') as bounds))) as fetch_jobs_per_minute,
	(select count(*) from river_job where kind='fetch' and state='completed') as fetch_total

extract

select (select count(*) from river_job where kind='extract' and state='completed') /
(select extract(minute from (
	select maxf-minf from
		(select max(rj.finalized_at) as maxf, min(rj.finalized_at) as minf from river_job rj
		where kind='extract') as bounds))) as extract_jobs_per_minute,
	(select count(*) from river_job where kind='extract' and state='completed') as extract_total

walk

select (select count(*) from river_job where kind='walk' and state='completed') /
(select extract(minute from (
	select maxf-minf from
		(select max(rj.finalized_at) as maxf, min(rj.finalized_at) as minf from river_job rj
		where kind='walk') as bounds))) as walk_jobs_per_minute,
	(select count(*) from river_job where kind='walk' and state='completed') as walk_total

PPM per domain

The SQL to get a PPM per domain:

select 
	hname.host, gb.host, max(gb.last_fetched)-min(gb.last_fetched) as duration,
	count(gb.path) as count, 
	count(gb.path)/nullif(extract(minute from max(gb.last_fetched)-min(gb.last_fetched)), 0) as ppm
from guestbook gb, hosts hname
where hname.id = gb.host
group by gb.host, hname.host
order by count desc, ppm asc

which yields something like:

| host                     | host id | duration        | count | ppm    |
| ------------------------ | ------- | --------------- | ----- | ------ |
| www.va.gov               | 350     | 00:13:48.627931 | 3603  | 277.15 |
| news.va.gov              | 311     | 00:13:32.876906 | 2267  | 174.38 |
| www.research.va.gov      | 1027    | 00:13:07.186824 | 1803  | 138.69 |
| department.va.gov        | 1257    | 00:14:03.299157 | 1744  | 124.57 |
| www.hsrd.research.va.gov | 481     | 00:14:00.124224 | 1653  | 118.07 |
| www.cfm.va.gov           | 694     | 00:14:06.303193 | 1421  | 101.50 |
| va.ng.mil                | 1146    | 00:14:03.421357 | 1109  | 79.21  |
| www.mirecc.va.gov        | 360     | 00:13:37.558012 | 1050  | 80.77  |

And, then

select count(gb.path) / extract(minute from (select maxf-minf from
(select max(gb.last_fetched) as maxf, min(gb.last_fetched) as minf
from guestbook gb) as duration)) as ppm from guestbook gb

which yields a single number (overall pages per minute, or ppm)

In pieces:

select count(gb.path) from guestbook gb

select extract(minute from (select maxf-minf from
(select max(gb.last_fetched) as maxf, min(gb.last_fetched) as minf
from guestbook gb) as duration)) minutes

Pooling notes...
