Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace WorkQueue replication filter by selector function #12143

Merged
merged 2 commits into from
Oct 17, 2024

Conversation

amaltaro
Copy link
Contributor

@amaltaro amaltaro commented Oct 11, 2024

Fixes #11192

Status

ready

Description

As the title says, stop using a JavaScript filter in favor of a native Erlang selector function, for database replication between (T0)WMAgent and central services.
A new JSON file has been provided with the relevant selector configuration, using the same name as the filter javascript file in the couchapps package.

Note that string field with dot separators has to be escaped, otherwise it gets expanded in the selector to a nested JSON, failing to match any documents.

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

None

External dependencies / deployment changes

Both vocms0255 and submit11 have been running with an old first commit.

@cmsdmwmbot
Copy link

Can one of the admins verify this patch?

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 25 warnings and errors that must be fixed
    • 7 warnings
    • 201 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 38 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15296/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 25 warnings and errors that must be fixed
    • 7 warnings
    • 201 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 38 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15297/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 tests no longer failing
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 25 warnings and errors that must be fixed
    • 7 warnings
    • 201 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 38 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15298/artifact/artifacts/PullRequestReport.html

@amaltaro
Copy link
Contributor Author

I think the current testbed rules are making the validation of this patch very hard (with wrong ingress rules), so I decided to apply it to the fresh new agent running in vocms0255 and, even though the replication managed to go through within the frontend 5min limit, I don't see any of the expected documents replicated to the agent, unfortunately!

My suspicious is that the selector field 'WMCore.WorkQueue.DataStructs.WorkQueueElement.WorkQueueElement' is actually getting expanded to a nested data structure. So, instead of matching against that specific field in a workqueue element, it is expanding it to a nested structure like:

{"WMCore":
    "WorkQueue":
        "DataStructs": ...
    }

which of course won't ever match any documents.

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 25 warnings and errors that must be fixed
    • 7 warnings
    • 201 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 42 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15299/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 tests no longer failing
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 25 warnings and errors that must be fixed
    • 7 warnings
    • 201 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 46 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15300/artifact/artifacts/PullRequestReport.html

@amaltaro
Copy link
Contributor Author

I finally managed to get it working and it has been deployed to vocms0255 for now (as condor pressure was really bad over the past few days). All commits have been squashed in only 1, but further work is still necessary for:

  1. remove the workqueue filter function, if no longer needed
  2. migrate the wmstats replication filter to selector mode as well.

@amaltaro
Copy link
Contributor Author

After verifying vocms0255, everything seems to be fine with this patch. So I went ahead and applied it to submit11 as well.
Replication went through and the agent is already creating work for the WQEs that were pending replication since last week.

@amaltaro
Copy link
Contributor Author

I still have to converge on the wmstats replication, but any feedback on the already implemented feature for workqueue would be appreciated.
Note that I have most of the replication logic in the component configuration. If people prefer, I can move to the previous logic, which is creating the parent/child queue logic in the component source code.

Copy link
Member

@mapellidario mapellidario left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Alan!

I trust your tests, so I will only comment on what I perceive the logic of this PR to be.

If I understood correctly, the new code achieves the same thing as before: the selector replaces src/couchapps/WorkQueue/filters/queueFilter.js:

  • excludes soft deleted documents (aka marked as deleted but not removed from the DB, yet?)
  • selects the type "[...].WorkQueueElement"
  • adds the properties ParentQueueURL and ChildQueueURL to the result

I have only a simple comment, but it is not a blocker, feel free to ignore it if you do not agree.

@@ -61,6 +61,9 @@ def __init__(self, config):
self.topicAMQ = getattr(config.AgentStatusWatcher, "topicAMQ", None)
self.hostPortAMQ = getattr(config.AgentStatusWatcher, "hostPortAMQ", [('cms-mb.cern.ch', 61313)])

# WorkQueue replication
self.workqueueSelector = getattr(config.AgentStatusWatcher, "workqueueSelector", {})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am inclined to think that we should not import the selector from the configuration but keep it in the code here. I do not expect the selector to be different across wmagent instances, and it was part of "code" and not "configuration" before, so i do not see any reason to move the selector logic away from where it will be used.

Am i missing anything?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the replication logic is not supposed to change and having it in the code is likely better. I have been debating this as well, so it's good to have a 2nd opinion. I am going to update this PR later with those changes. Thanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't mix Erlang code with Python, keep Erlang code apart since if you merge it here you'll be forced to change Python code every time you change a selector. While if you keep it separate you only may need to change it once and restart the Python code. From my experience, e.g. DBS using embedded SQL statements, it is much harder to maintain the code with embedded snippets than otherwise. I prefer to keep such stuff separate, e.g. for that separate area exists which can keep JS, images, templates, erlang, etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure we can call it an Erlang code, in the end it is just a JSON with rules. However, the values of the rule are defined during WMAgent runtime (deployment), so separating it can potentially add more confusion than what it is right now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I provided my feedback for consideration. Given that the content is in JSON format, it might be tempting to incorporate it directly into Python code. However, doing so could blur the distinction, as Python does not differentiate between JSON and Python dictionaries, treating both the same way. This could create ambiguity—whether it's a standalone JSON structure or a Python map or object.

If maintaining clear separation between models (i.e., distinguishing what belongs to Python versus Erlang or other systems) is a priority, then I believe such separation is beneficial. However, I'll leave the final decision up to you. If you will find putting JSON in place I suggest to put appropriate comment that it is JSON (not python map) and where it belongs (Erlang).

# NOTE that we need to escape the 'type' field to avoid JSON dot notation interpretation
config.AgentStatusWatcher.workqueueSelector = {'_deleted': {"$exists": False},
'type': "WMCore.WorkQueue.DataStructs.WorkQueueElement.WorkQueueElement",
'WMCore\.WorkQueue\.DataStructs\.WorkQueueElement\.WorkQueueElement':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few things here:

  • why do you need to escape dots, is it CouchDB things (for my education)?
  • why not to use as a key just WorkQueueElement, why full Python path?
  • lines below, are you sure that config will always provides queueParams dictionary with your keys? Is it better to check that first before assigning to selector?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you need to escape dots, is it CouchDB things (for my education)?

Because if I don't escape it, CouchDB will interpret it as a JSON dot notation and will try to expand it to {'WMCore': {'Workqueue': {'DataStructs':...}}}, which will never match against the field with dots in the string. Further details in: https://docs.couchdb.org/en/stable/api/database/find.html#subfields

why not to use as a key just WorkQueueElement, why full Python path?

I would love to change it! Or not to even put it inside that key, but that would require some decent amount of refactoring in WMCore.

lines below, are you sure that config will always provides queueParams dictionary with your keys? Is it better to check that first before assigning to selector?

It is a WMAgent configuration, so yes, queueParams will always be there. If it is not, everything broke during the agent deployment/execution.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I am planning to delete this configuration and instead use a json file to load these selector filters in the component (provided in my 3rd and 4th commits in here). You might want to have a look and share your thoughts.

@@ -61,6 +61,9 @@ def __init__(self, config):
self.topicAMQ = getattr(config.AgentStatusWatcher, "topicAMQ", None)
self.hostPortAMQ = getattr(config.AgentStatusWatcher, "hostPortAMQ", [('cms-mb.cern.ch', 61313)])

# WorkQueue replication
self.workqueueSelector = getattr(config.AgentStatusWatcher, "workqueueSelector", {})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't mix Erlang code with Python, keep Erlang code apart since if you merge it here you'll be forced to change Python code every time you change a selector. While if you keep it separate you only may need to change it once and restart the Python code. From my experience, e.g. DBS using embedded SQL statements, it is much harder to maintain the code with embedded snippets than otherwise. I prefer to keep such stuff separate, e.g. for that separate area exists which can keep JS, images, templates, erlang, etc.

'filter': wqfilter, 'query_params': query_params})
self.replicatorDocs.append({'source': localQInboxURL, 'target': parentQURL,
'filter': wqfilter, 'query_params': query_params})
self.replicatorDocs.append({'source': parentQueueUrl,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if replicator docs already have such source or target? is it better to check it first. I know that CouchDB will create a revision, but may be you don't want that. It depends on logic of handling multiple versions of the same document.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replication documents are deleted before new replications are defined (see the beginning of this method).

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 26 warnings and errors that must be fixed
    • 7 warnings
    • 203 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 46 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15323/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 26 warnings and errors that must be fixed
    • 7 warnings
    • 203 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 46 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15326/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 25 warnings and errors that must be fixed
    • 7 warnings
    • 196 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 41 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15328/artifact/artifacts/PullRequestReport.html

Update parameter names to reflect WQE

Nested workqueue structure

Escape dots in the field name

Escape dots in the right field

Escape dots in AgentStatusPoller as well

Load JSON file with replications in AgentStatusPoller

Fix location of json filter; correct WQ filter reference

Remove component configuration; perform safe print for replication
@amaltaro
Copy link
Contributor Author

I ran a bunch of DMWM tests in vocms0261 and I can now confirm that all the 3 replications are working well (wmagent_summary and the bidirectional workqueue one).

@LinaresToine we are likely cutting a new release today, then I will ask you to also validate this new selector function for the agent (I think it is the tier0_request replication).

@amaltaro
Copy link
Contributor Author

Given that it had already been approved by Dario in the previous review, I will just move forward with this PR.
Nonetheless, I would be happy to hear about any follow up on this even after merge.

One thing that I am not sure is where is the best place for the new json file. Under src/couchapps would be natural, but this is not part of the application that gets "deployed" on CouchDB, so I prefer not to have it there.

@amaltaro amaltaro merged commit f34b241 into dmwm:master Oct 17, 2024
1 check failed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Replace replicator filters by selector objects
4 participants