initial tests
I want to see how different numbers of workers impact performance.
Some suspicions...
I set the config to:
service | workers |
---|---|
entree | 10 |
extract | 30 |
fetch | 100 |
pack | 10 |
walk | 10 |
I then ran `run-list.bash va-200.txt`, which has 200 va.gov domains.
The polite backoff was set to 2 seconds (meaning, I shouldn't hit any one subdomain more than once every two seconds).
JPM = jobs per minute
measure | value |
---|---|
guestbook pages | 25650 |
duration (minutes) | 14 |
overall PPM | 1832 |
entree JPM | 4941 |
entree total jobs | 69174 |
fetch JPM | 9680 |
fetch total jobs | 135531 |
extract JPM | 700 |
extract total jobs | 9797 |
walk JPM | 692 |
walk total jobs | 9689 |
I saw a total PPM of 1832. I calculate this by:

- `duration` is the earliest `last_fetched` value in the `guestbook` table subtracted from the newest entry in the table.
- Take the `count` of entries in the guestbook (pages), and divide it by the duration.

For this run, that works out to 25650 pages / 14 minutes ≈ 1832 PPM.
Walk experienced a number of failures -- not many, but some. This may have to do with a bug observed yesterday in the `goquery` parsing. There may be situations where a failed page fetch creates an empty object in the S3 bucket (?), and then we fetch the empty object and try to parse it. (Or, something.)
I noted some domains in the VA space that had x509 cert problems; this would account for a crawl that yielded pages that would not fetch. So, some investigation is required. (And, those kinds of errors probably need to be logged in a way that we can use them to investigate, and then inform Federal partners.)
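Those errors aren't captured in a useful way yet. As a placeholder, here is a rough sketch (not existing code) of tagging certificate failures at fetch time; `fetchPage` and the example URL are invented, and matching on the string "x509" is a blunt heuristic for spotting cert problems in Go's error text.

```go
package main

import (
	"log"
	"net/http"
	"strings"
)

// fetchPage is a hypothetical wrapper around the fetch step that records
// certificate problems separately, so they can be investigated and reported
// to Federal partners later.
func fetchPage(url string) (*http.Response, error) {
	resp, err := http.Get(url)
	if err != nil {
		if strings.Contains(err.Error(), "x509") {
			// Cert problem: unknown authority, hostname mismatch, expiry, etc.
			log.Printf("cert_error url=%s err=%v", url, err)
		} else {
			log.Printf("fetch_error url=%s err=%v", url, err)
		}
		return nil, err
	}
	return resp, nil
}

func main() {
	// Hypothetical URL; any host with a bad certificate exercises the
	// cert_error branch above.
	resp, err := fetchPage("https://example.va.gov/")
	if err != nil {
		log.Printf("fetch failed: %v", err)
		return
	}
	resp.Body.Close()
}
```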
The point of the second experiment was to run the same domains with more workers. I had to add a `pgxpool` to `entree`, as the increase in workers exposed a concurrency issue in `EvaluateEntree`. So, these are not the same code.
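For context, a minimal sketch of the shared-pool pattern, assuming a `DATABASE_URL` environment variable; the query and the worker loop are stand-ins, not the actual `entree` code.

```go
package main

import (
	"context"
	"log"
	"os"
	"sync"

	"github.com/jackc/pgx/v5/pgxpool"
)

func main() {
	ctx := context.Background()

	// One pool shared by every worker goroutine. A single pgx.Conn is not
	// safe for concurrent use, which is roughly the problem more workers
	// exposed; the pool hands each caller its own connection.
	pool, err := pgxpool.New(ctx, os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatalf("creating pool: %v", err)
	}
	defer pool.Close()

	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func(worker int) {
			defer wg.Done()
			// Stand-in query; the real workers call EvaluateEntree.
			var hosts int
			if err := pool.QueryRow(ctx, "select count(*) from hosts").Scan(&hosts); err != nil {
				log.Printf("worker %d: %v", worker, err)
				return
			}
			log.Printf("worker %d sees %d hosts", worker, hosts)
		}(i)
	}
	wg.Wait()
}
```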
service | workers |
---|---|
entree | 50 |
extract | 50 |
fetch | 150 |
pack | 10 |
walk | 50 |
I am still suspicious of what happens if we have too many packers. So, for now, I'm keeping that number low. I also suspect this run went longer? (I can't remember if I stopped the previous experiment.) It was stopped after roughly 30 minutes.
measure | value |
---|---|
guestbook pages | 32675 |
duration (minutes) | 29 |
overall PPM | 1126 |
entree JPM | 3472 |
entree total jobs | 100713 |
fetch JPM | 7746 |
fetch total jobs | 224641 |
extract JPM | 518 |
extract total jobs | 9339 |
walk JPM | 479 |
walk total jobs | 13899 |
So, the PPM was lower, but... the runtime was longer. The PPM is a measure of total guestbook entries over the duration from last to first. As things slow down, the apparent rate will slow down. (That is, if we find ourselves walking a lot of `HEAD` requests, and doing fewer `GET` requests, we will do a smaller number of inserts, and the density will go down. Anyway.)
Without some repeatability, I'm not sure I can get a good measure. I'll use a smaller set of sites, and see if I can get a repeatable number.
host | host id | duration | count | ppm |
---|---|---|---|---|
www.va.gov | 350 | 00:29:07.63452 | 5634 | 194 |
news.va.gov | 311 | 00:28:59.413167 | 3615 | 129 |
www.cfm.va.gov | 694 | 00:28:53.334614 | 2498 | 89 |
www.hsrd.research.va.gov | 481 | 00:29:12.130379 | 2315 | 79 |
www.research.va.gov | 1027 | 00:28:02.625225 | 1871 | 66 |
department.va.gov | 1257 | 00:28:57.112199 | 1848 | 66 |
In this experiment, I added a `multi.yaml` Docker config.
I then spawned the stack with this configuration:
```bash
docker compose -f multi.yaml up \
  --scale entree=2 \
  --scale fetch=2 \
  --scale extract=2 \
  --scale walk=2
```
The `entree` service should still be polite; we only fetch against a domain if 2 seconds have passed. So, the fetch rate should not increase (per se). However, our total processing capability should go up.
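To convince myself that politeness survives scaling, here's a sketch of a database-backed gate where every replica races on the same row and at most one can win a given 2-second window. The `last_hit` column and the `mayFetch` helper are assumptions about the schema, not the real `entree` code.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
	"os"

	"github.com/jackc/pgx/v5"
	"github.com/jackc/pgx/v5/pgxpool"
)

// claimHost atomically claims the right to fetch a host: the UPDATE only
// succeeds if no worker, in any replica, has touched the host in the last
// 2 seconds. The last_hit column is an assumption about the hosts schema.
const claimHost = `
update hosts
   set last_hit = now()
 where id = $1
   and (last_hit is null or last_hit < now() - interval '2 seconds')
returning id`

func mayFetch(ctx context.Context, pool *pgxpool.Pool, hostID int64) (bool, error) {
	var id int64
	err := pool.QueryRow(ctx, claimHost, hostID).Scan(&id)
	if errors.Is(err, pgx.ErrNoRows) {
		return false, nil // some other worker fetched this host within 2s
	}
	if err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	ctx := context.Background()
	pool, err := pgxpool.New(ctx, os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatalf("creating pool: %v", err)
	}
	defer pool.Close()

	// 350 is www.va.gov's id in the tables above.
	ok, err := mayFetch(ctx, pool, 350)
	if err != nil {
		log.Fatalf("claiming host: %v", err)
	}
	fmt.Println("may fetch www.va.gov:", ok)
}
```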
At 6 minutes, the effective overall PPM was around 3200.
At 12 minutes, it was 2300. I seriously suspect the queue thrashing is slowing things down, and also wonder if we're running out of "fresh" URLs (which makes the number look lower?).
At approximately 22 minutes, I stopped the run.
host | host id | duration | count | ppm |
---|---|---|---|---|
www.va.gov | 350 | 00:21:50.76345 | 6691 | 318 |
news.va.gov | 311 | 00:21:32.693246 | 4132 | 196 |
www.cfm.va.gov | 694 | 00:21:59.184202 | 2604 | 124 |
va.ng.mil | 1146 | 00:21:58.52216 | 2578 | 122 |
www.hsrd.research.va.gov | 481 | 00:21:51.007061 | 2528 | 120 |
www.research.va.gov | 1027 | 00:21:12.910271 | 1899 | 90 |
department.va.gov | 1257 | 00:22:02.370991 | 1550 | 70 |
www.mirecc.va.gov | 360 | 00:21:01.910669 | 1263 | 60 |
www.cem.va.gov | 385 | 00:22:14.338689 | 1150 | 52 |
digital.va.gov | 1167 | 00:22:18.169547 | 1016 | 46 |
www.publichealth.va.gov | 793 | 00:22:12.555373 | 1001 | 45 |
I have no idea how I'm calculating a PPM of more than 30 on a given domain. That doesn't make sense: with a 2-second backoff, the ceiling should be 30 pages per minute per host, and I'm supposed to be preventing a DDoS attack. This calculation (now that I look at it) makes me wonder how I ever get a PPM greater than 30 on a given domain.
The overall PPM was 1666.
measure | value |
---|---|
guestbook pages | 36656 |
duration (minutes) | 22 |
overall PPM | 1666 |
entree JPM | 6219 |
entree total jobs | 136826 |
fetch JPM | 9903 |
fetch total jobs | 217869 |
extract JPM | 957 |
extract total jobs | 12444 |
walk JPM | 831 |
walk total jobs | 18286 |
`fetch` is operating near maximum throughput; doubling the number of `extract` and `walk` services roughly doubled the throughput of those services. This suggests that... hm. I wonder if just increasing the worker pool to 100 from 50 would double the throughput of `extract` and `walk`. The RAM usage will go up, for sure... `extract` loads things into RAM to do extraction. Multiple services might be "cheaper"?
Anyway. I need to look at queue thrashing. And, I might want to clear the queue periodically; `completed` jobs don't need to be kept around.
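A blunt sketch of that periodic clearing (the one-hour retention window is an arbitrary choice of mine, and River may well have a cleaner built-in way to do this):

```go
package main

import (
	"context"
	"log"
	"os"

	"github.com/jackc/pgx/v5/pgxpool"
)

// clearCompleted deletes finished jobs older than an hour so the queue
// table stays small. The retention window is arbitrary.
func clearCompleted(ctx context.Context, pool *pgxpool.Pool) (int64, error) {
	tag, err := pool.Exec(ctx, `
		delete from river_job
		 where state = 'completed'
		   and finalized_at < now() - interval '1 hour'`)
	if err != nil {
		return 0, err
	}
	return tag.RowsAffected(), nil
}

func main() {
	ctx := context.Background()
	pool, err := pgxpool.New(ctx, os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatalf("creating pool: %v", err)
	}
	defer pool.Close()

	deleted, err := clearCompleted(ctx, pool)
	if err != nil {
		log.Fatalf("clearing completed jobs: %v", err)
	}
	log.Printf("removed %d completed jobs", deleted)
}
```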
I have a few different suspicions/thoughts.
If `entree` has 50 workers, the largest number of domains that will pass through the front door close to the 2s deadline will be... roughly 50. And, it might be worse than that, because we're thrashing the queue. (In other words, we're waiting for things to come around on a very noisy queue.) It could be that, at the host level (which is only thousands of domains, at worst tens of thousands), the 2s timeout should be determined by an in-memory map, as opposed to going out to the `hosts` table, and... perhaps the worker should just wait until it is time to fetch. This way, an `entree` service can hold (say) 200 or 400 domains at once, and potentially be sleeping for fractions of a second on each. This feels like a better way than requeueing jobs and waiting for them to come back around; there's a sketch of the idea below.
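Here's the sketch, assuming politeness only needs to hold within a single `entree` process; all of the names are invented. Each worker reserves the next 2-second slot for its host in a shared map and sleeps out the remainder, rather than requeueing the job.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// politeness tracks, per host, the earliest time we may fetch it again.
type politeness struct {
	mu   sync.Mutex
	next map[string]time.Time
	gap  time.Duration
}

func newPoliteness(gap time.Duration) *politeness {
	return &politeness{next: make(map[string]time.Time), gap: gap}
}

// wait blocks until the host is allowed to be fetched, then reserves the
// next slot. With a 2s gap and a few hundred hosts in flight, the sleeps
// are typically fractions of a second.
func (p *politeness) wait(host string) {
	p.mu.Lock()
	now := time.Now()
	allowed := p.next[host]
	if allowed.Before(now) {
		allowed = now
	}
	p.next[host] = allowed.Add(p.gap)
	p.mu.Unlock()

	time.Sleep(time.Until(allowed))
}

func main() {
	p := newPoliteness(2 * time.Second)
	for i := 0; i < 3; i++ {
		p.wait("www.va.gov")
		fmt.Println("fetching www.va.gov at", time.Now().Format(time.RFC3339))
	}
}
```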
I wonder whether I must be hitting domains faster. I think. I am not certain that I'm being polite at this point. Although... if the `entree` services are doing their job, then it should still be the case that a given domain is getting hit at most once every 2s, and therefore the pair is doing more work overall, but not necessarily against a single domain?
Parallel work at scale is hard. I need better visibility into what is going on in the aggregate (e.g. how do I know that a given domain is being treated nicely?) in order to be confident about scaling up like this.
The following SQL was used for generating the numbers in the analysis:

```sql
-- Jobs per minute and total completed jobs, per job kind.
select (select count(*) from river_job where kind='entree' and state='completed') /
       (select extract(minute from (
          select maxf-minf from
            (select max(rj.finalized_at) as maxf, min(rj.finalized_at) as minf
               from river_job rj where kind='entree') as bounds))) as entree_jobs_per_minute,
       (select count(*) from river_job where kind='entree' and state='completed') as entree_total;

select (select count(*) from river_job where kind='fetch' and state='completed') /
       (select extract(minute from (
          select maxf-minf from
            (select max(rj.finalized_at) as maxf, min(rj.finalized_at) as minf
               from river_job rj where kind='fetch') as bounds))) as fetch_jobs_per_minute,
       (select count(*) from river_job where kind='fetch' and state='completed') as fetch_total;

select (select count(*) from river_job where kind='extract' and state='completed') /
       (select extract(minute from (
          select maxf-minf from
            (select max(rj.finalized_at) as maxf, min(rj.finalized_at) as minf
               from river_job rj where kind='extract') as bounds))) as extract_jobs_per_minute,
       (select count(*) from river_job where kind='extract' and state='completed') as extract_total;

select (select count(*) from river_job where kind='walk' and state='completed') /
       (select extract(minute from (
          select maxf-minf from
            (select max(rj.finalized_at) as maxf, min(rj.finalized_at) as minf
               from river_job rj where kind='walk') as bounds))) as walk_jobs_per_minute,
       (select count(*) from river_job where kind='walk' and state='completed') as walk_total;
```
The SQL to get a PPM per domain:
```sql
select
  hname.host, gb.host, max(gb.last_fetched)-min(gb.last_fetched) as duration,
  count(gb.path) as count,
  count(gb.path)/nullif(extract(minute from max(gb.last_fetched)-min(gb.last_fetched)), 0) as ppm
from guestbook gb, hosts hname
where hname.id = gb.host
group by gb.host, hname.host
order by count desc, ppm asc;
```
which yields something like:
host | host id | duration | count | ppm |
---|---|---|---|---|
www.va.gov | 350 | 00:13:48.627931 | 3603 | 277.15 |
news.va.gov | 311 | 00:13:32.876906 | 2267 | 174.38 |
www.research.va.gov | 1027 | 00:13:07.186824 | 1803 | 138.69 |
department.va.gov | 1257 | 00:14:03.299157 | 1744 | 124.57 |
www.hsrd.research.va.gov | 481 | 00:14:00.124224 | 1653 | 118.07 |
www.cfm.va.gov | 694 | 00:14:06.303193 | 1421 | 101.50 |
va.ng.mil | 1146 | 00:14:03.421357 | 1109 | 79.21 |
www.mirecc.va.gov | 360 | 00:13:37.558012 | 1050 | 80.77 |
And, then:

```sql
select count(gb.path) / extract(minute from (select maxf-minf from
  (select max(gb.last_fetched) as maxf, min(gb.last_fetched) as minf
     from guestbook gb) as duration)) as ppm
from guestbook gb;
```
which yields a single number (overall pages per minute, or PPM).
In pieces:
```sql
select count(gb.path) from guestbook gb;

select extract(minute from (select maxf-minf from
  (select max(gb.last_fetched) as maxf, min(gb.last_fetched) as minf
     from guestbook gb) as duration)) as minutes;
```