Replace replicator filters by selector objects #11192

Open
amaltaro opened this issue Jun 22, 2022 · 4 comments · Fixed by #12143 · May be fixed by #12150

Comments

@amaltaro
Contributor

amaltaro commented Jun 22, 2022

Impact of the new feature
CouchDB and WMAgent

Is your feature request related to a problem? Please describe.
Not a problem, but according to CouchDB experts, selector objects deliver better performance than JavaScript filter functions. Further information about selector objects can be found at:
https://docs.couchdb.org/en/latest/replication/replicator.html#selector-objects

Describe the solution you'd like
Evaluate the 3 standard WMAgent replications and see if a filter function can be replaced by an equivalent selector object.

UPDATE: note that the ParentQueueUrl filter can likely be removed, given that the source or target database will always be pointing to one specific environment.
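
As a rough illustration of what such a replacement could look like, the sketch below builds a `_replicator` document that uses a `selector` instead of a `filter`. The `source`, `target`, `continuous` and `selector` keys are standard replication document fields; the URLs are placeholders and the `ChildQueueUrl` field name is an assumption made for illustration, not taken from the WMAgent code. In line with the update above, `ParentQueueUrl` is left out.

```python
import json


def build_replication_doc(source, target, child_url):
    """Return a _replicator document that filters with a Mango selector.

    Hypothetical helper: the ``ChildQueueUrl`` field name is an assumption
    about the workqueue element schema, not taken from WMCore.
    """
    return {
        "source": source,
        "target": target,
        "continuous": True,
        # The selector is evaluated by CouchDB itself, so no JavaScript
        # filter function is involved.
        "selector": {"ChildQueueUrl": child_url},
    }


doc = build_replication_doc(
    "https://global.example.org/couchdb/workqueue",  # placeholder URL
    "http://localhost:5984/workqueue_inbox",         # placeholder URL
    "http://agent.example.org:5984",                 # placeholder URL
)
print(json.dumps(doc, indent=2))
```

The real selectors would have to be derived from the existing filter functions, one replication at a time.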

Describe alternatives you've considered
Not do anything.

Additional context
None

@amaltaro
Contributor Author

We are currently not able to start a fresh new database replication from global workqueue to local workqueue_inbox, as the HTTP request times out at 5min.

The database itself does not have many deleted documents:

  "doc_del_count": 219678,
  "doc_count": 30614,

so I would not expect any problems to go through the list of deleted documents within the 5min timeout.

Checking this on the node running the CouchDB workqueue database, I do see a very slow response for the relevant HTTP call, e.g.:

$ curl "localhost:5984/workqueue/_changes?filter=WorkQueue%2FqueueFilter&parentUrl=https%3A%2F%2Fcmsweb.cern.ch%2Fcouchdb%2Fworkqueue&childUrl=http%3A%2F%2Fvocms0282.cern.ch%3A5984&feed=continuous&style=all_docs&since=0&timeout=300000" > vocms0282.log
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  867k    0  867k    0     0   1079      0 --:--:--  0:13:42 --:--:--    41

so we can likely discard any effect from the reverse proxy/APS.

Having had a brief chat with CouchDB developers in Slack, they acknowledged that this is something to be improved in CouchDB, so that checkpoints can also be saved while it iterates through deleted documents.

Their suggestion is to actually implement the selector model, which is evaluated in Erlang, hence it is expected to be much faster.
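
To compare timings against the curl call above, the same changes feed can be queried with CouchDB's built-in `_selector` filter, which takes the selector in the body of a POST request. The sketch below only builds the request without sending it. The `ChildQueueUrl` field name is an assumption and would need to match the actual document schema; `feed=continuous` is dropped so the call returns once the feed has been read.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_selector_changes_request(base_url, db, selector, since=0, timeout_ms=300000):
    """Build (but do not send) a POST /{db}/_changes?filter=_selector request."""
    query = urlencode({
        "filter": "_selector",
        "style": "all_docs",
        "since": since,
        "timeout": timeout_ms,
    })
    body = json.dumps({"selector": selector}).encode("utf-8")
    return Request(
        f"{base_url}/{db}/_changes?{query}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_selector_changes_request(
    "http://localhost:5984",
    "workqueue",
    {"ChildQueueUrl": "http://vocms0282.cern.ch:5984"},  # assumed field name
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` on the CouchDB node would then issue the call.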

@amaltaro
Contributor Author

It turns out the JSON file didn't make it into the WMAgent PyPI package, and this is the exception we get when we start a WMAgent 2.3.7 release:

2024-10-17 14:25:07,024:140717629667136:CRITICAL:Harness:PostMortem: choked when initializing with error: Could not find CouchDB replication JSON file at: /usr/local/lib/python3.8/site-packages/WMComponent/AgentStatusWatcher/replication_selector.json
  File "/usr/local/lib/python3.8/site-packages/WMCore/Agent/Harness.py", line 416, in startComponent
    self.prepareToStart()
  File "/usr/local/lib/python3.8/site-packages/WMCore/Agent/Harness.py", line 319, in prepareToStart
    self.preInitialization()
  File "/usr/local/lib/python3.8/site-packages/WMComponent/AgentStatusWatcher/AgentStatusWatcher.py", line 44, in preInitialization
    myThread.workerThreadManager.addWorker(AgentStatusPoller(self.config),
  File "/usr/local/lib/python3.8/site-packages/WMComponent/AgentStatusWatcher/AgentStatusPoller.py", line 72, in __init__
    raise RuntimeError(f"Could not find CouchDB replication JSON file at: {replicationFile}")

and here are all the json files that we can find in the installation area of the agent (with pip):

(WMAgent-2.3.7) [xxx@xxx:current]$ find /usr/local/ -name \*.json
/usr/local/lib/python3.8/site-packages/decorator-5.1.1.dist-info/pbr.json
/usr/local/lib/python3.8/site-packages/jsonschema/schemas/draft3.json
/usr/local/lib/python3.8/site-packages/jsonschema/schemas/draft4.json
/usr/local/lib/python3.8/site-packages/jsonschema/schemas/draft6.json
/usr/local/lib/python3.8/site-packages/jsonschema/schemas/draft7.json
/usr/local/lib/python3.8/site-packages/retry-0.9.2.dist-info/metadata.json
/usr/local/lib/python3.8/site-packages/retry-0.9.2.dist-info/pbr.json
/usr/local/lib/python3.8/site-packages/stevedore-5.3.0.dist-info/pbr.json
/usr/local/lib/python3.8/site-packages/zmq/utils/compiler.json
/usr/local/lib/python3.8/site-packages/zmq/utils/config.json
/usr/local/data/couchapps/LogDB/couchapp.json
/usr/local/data/couchapps/SummaryStats/couchapp.json
/usr/local/data/couchapps/T0Request/couchapp.json
/usr/local/data/couchapps/WMStats/rewrites.json
/usr/local/data/couchapps/WorkQueue/couchapp.json
/usr/local/data/couchapps/WorkQueue/rewrites.json

For now, Dario and I grabbed https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/AgentStatusWatcher/replication_selector.json and placed it directly into vocms0192, such that we can move forward with this validation.
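
For reference, the fail-fast behaviour seen in the traceback can be sketched as below. This is not the actual `AgentStatusPoller` code: `load_replication_selector` is a hypothetical helper name, and only the error message is taken from the log above.

```python
import json
import os


def load_replication_selector(replication_file):
    """Load the replication selector JSON, failing early if it is not installed."""
    if not os.path.isfile(replication_file):
        raise RuntimeError(f"Could not find CouchDB replication JSON file at: {replication_file}")
    with open(replication_file, encoding="utf-8") as fobj:
        return json.load(fobj)


try:
    # Placeholder path, chosen so that the lookup fails.
    load_replication_selector("/nonexistent/replication_selector.json")
except RuntimeError as exc:
    print(exc)
```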

I am reopening this issue to have this resolved and available in the next patch release (likely 2.3.7.1).

@amaltaro amaltaro reopened this Oct 17, 2024
@amaltaro amaltaro linked a pull request Oct 17, 2024 that will close this issue
@amaltaro
Contributor Author

As reported in our WM Team Mattermost channel, the ParentQueueUrl filter poses a problem when we have multiple instances of global workqueue running under different domains, but talking to the same database. For instance, the integration CouchDB database is shared between cmsweb-testbed and cmsweb-preprod. Depending on which global workqueue instance creates those elements, a workqueue element could have either:

            "ParentQueueUrl": "https://cmsweb-testbed.blah",

or

            "ParentQueueUrl": "https://cmsweb-preprod.blah",

If in our filter we say we want the cmsweb-testbed parent queue, that means any preprod WQE would not be replicated to the agent, even though the agent could start "negotiating" those elements.
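
The effect of an exact match on ParentQueueUrl can be sketched with a few lines of Python. The document ids and the flat document layout are made up for illustration; only the two URLs come from the example above.

```python
def matches_parent_queue(element, parent_url):
    """Mimic a filter that keeps only elements with the requested parent queue."""
    return element.get("ParentQueueUrl") == parent_url


# Hypothetical workqueue elements stored in the same shared database.
elements = [
    {"_id": "wqe-1", "ParentQueueUrl": "https://cmsweb-testbed.blah"},
    {"_id": "wqe-2", "ParentQueueUrl": "https://cmsweb-preprod.blah"},
]

replicated = [
    element["_id"]
    for element in elements
    if matches_parent_queue(element, "https://cmsweb-testbed.blah")
]
print(replicated)  # ['wqe-1']
```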

Having said that, I have updated the initial description here and I think we should loosen the workqueue replication filter and stop passing this ParentQueueUrl filter.

@amaltaro
Contributor Author

And now I understand why we ended up with cmsweb-prod in the workqueue element parent queue.

While trying to make the replication work, I decided to change the agent configuration from cmsweb to cmsweb-prod:

config.WorkQueueManager.queueParams = {'ParentQueueCouchUrl': 'https://cmsweb-prod...

which is then used to negotiate WQEs between local and global, setting the ParentQueueUrl accordingly in this line:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WorkQueue/WorkQueue.py#L483

With that said, it looks like my previous comment is wrong and we would not fail to acquire workflows when there are multiple domains/instances talking to the same database. Nonetheless, I still think it is better to stop using ParentQueueUrl, as the source/target database is already doing the job. In addition, this makes the replication filter cheaper.
