
Consume raw/generator unmerged dump data in MSUnmerged #12059

Open
amaltaro wants to merge 8 commits into master
Conversation

@amaltaro (Contributor) commented Jul 29, 2024

Fixes #12061

Status

In development

Description

The scope of this PR has inflated quite a lot, as I started out just investigating the adoption of compressed RucioConMon data in MSUnmerged. The summary of changes is:

  • added a new log record as an RSE goes through the pipeline
  • modified RucioConMon to actually stream/generate data when it is retrieved in binary mode (raw compressed data)
  • tweaked WMStatsServer to actually return a unique list of protected LFNs
  • by default, made MSUnmerged retrieve compressed data from RucioConMon (zipped option)
  • removed the MSUnmerged.filterUnmergedFiles method; its logic is now embedded in getUnmergedFiles and _isDeletable
  • refactored the cleanRSE method: first try to delete the whole directory; if that fails, list all the content in that directory and delete it in slices; finally, try to remove the (now) empty directory
  • provided a new method called _listDir to list root files (only) inside a directory
  • lastly, provided a script called test_gfal.py to test directory removal with gfal (simulating behavior similar to MSUnmerged)

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

Gzipped support was added with this PR:
#11142

but never adopted by MSUnmerged. With this PR, we actually adopt it.

External dependencies / deployment changes

Gzipped data is not currently functional:

>>> allUnmerged = rucioConMon.getRSEUnmerged("T1_ES_PIC_Disk", zipped=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/site-packages/WMCore/Services/RucioConMon/RucioConMon.py", line 127, in getRSEUnmerged
    rseUnmerged = self._getResultZipped(uri,  callname=callname, clearCache=True)
  File "/usr/local/lib/python3.8/site-packages/WMCore/Services/RucioConMon/RucioConMon.py", line 92, in _getResultZipped
    data = self._getResult(uri, callname, clearCache, args, binary=True)
  File "/usr/local/lib/python3.8/site-packages/WMCore/Services/RucioConMon/RucioConMon.py", line 73, in _getResult
    results = gzip.decompress(istream.read())
  File "/usr/local/lib/python3.8/gzip.py", line 548, in decompress
    return f.read()
  File "/usr/local/lib/python3.8/gzip.py", line 292, in read
    return self._buffer.read(size)
  File "/usr/local/lib/python3.8/gzip.py", line 479, in read
    if not self._read_gzip_header():
  File "/usr/local/lib/python3.8/gzip.py", line 427, in _read_gzip_header
    raise BadGzipFile('Not a gzipped file (%r)' % magic)
gzip.BadGzipFile: Not a gzipped file (b'/s')
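
For reference, the failure above happens because the payload returned by the server is apparently already plain text (it starts with b'/s', i.e. '/store/...'), not gzip-compressed bytes. A minimal guard, just as an illustrative sketch and not the fix adopted in this PR, could check the gzip magic number before decompressing:

import gzip

def maybeDecompress(payload):
    """Hypothetical helper: decode either gzip-compressed or plain bytes."""
    if payload[:2] == b'\x1f\x8b':  # gzip magic number
        return gzip.decompress(payload).decode('utf-8')
    # payload is already uncompressed text, e.g. b'/store/unmerged/...'
    return payload.decode('utf-8')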

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 2 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 7 warnings
    • 36 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 4 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15125/artifact/artifacts/PullRequestReport.html

@amaltaro (Contributor Author)

I have this patch applied to the production pod named ms-unmer-t1-7dcf5f577b-dcj4w, and it is also using the following configuration change (inside the pod):

data.skipRSEs = ['T1_US_FNAL_Disk', "T1_DE_KIT_Disk", "T1_RU_JINR_Disk", "T1_UK_RAL_Disk", "T1_IT_CNAF_Disk"]

At least CNAF and JINR are currently failing with permission issues.

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
  • Python3 Pylint check: succeeded
    • 7 warnings
    • 36 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 4 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15126/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 7 warnings
    • 59 comments to review
  • Pylint py3k check: failed
    • 2 errors and warnings that should be fixed
  • Pycodestyle check: succeeded
    • 14 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15128/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 3 warnings and errors that must be fixed
    • 7 warnings
    • 59 comments to review
  • Pylint py3k check: failed
    • 2 errors and warnings that should be fixed
  • Pycodestyle check: succeeded
    • 14 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15129/artifact/artifacts/PullRequestReport.html

@amaltaro (Contributor Author) commented Aug 1, 2024

The auth/authz issues that we had before were actually coming from a missing environment variable in the manage script (X509_USER_PROXY). Fix has been provided here: dmwm/CMSKubernetes#1532

Regarding FNAL, I have been testing both the json and the raw formats, and the RAM allocation gain is not very substantial (about 20%), as one can see with a simple script I wrote (sketched below):

(WMAgent-2.3.4) [***@vocms0282:install]$ python get_unmer.py 
Size of all unmerged in unzipped format: 2489 MB and type <class 'list'>
['/store/unmerged/RunIISummer20ULPrePremix/Neutrino_E-10_gun/PREMIX/BParking_106X_upgrade2018_realistic_v16_L1v1-v1/2560000/CD81A5B4-B20C-2840-92B5-8FE53886885C.root', '/store/unmerged/RunIISummer20ULPrePremix/Neutrino_E-10_gun/PREMIX/BParking_106X_upgrade2018_realistic_v16_L1v1-v1/2560001/0DB454A7-E044-304B-AA56-2FBE12EC156E.root']

Size of all unmerged in zipped format: 1946 MB and type <class 'bytes'>
b'/s'
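
(For reference, the measurement above came from roughly the sketch below; get_unmer.py itself is not part of this PR, and the RucioConMon constructor argument and the sys.getsizeof-based accounting are my assumptions, so details may differ from the actual script.)

#!/usr/bin/env python
# Rough sketch of get_unmer.py (not in this PR). Assumes RucioConMon takes the
# service URL as its first argument; sizes are approximated with sys.getsizeof.
import sys
from WMCore.Services.RucioConMon.RucioConMon import RucioConMon

def sizeMB(obj):
    """Approximate in-memory size in MB of a list of strings or a bytes object."""
    if isinstance(obj, list):
        total = sys.getsizeof(obj) + sum(sys.getsizeof(item) for item in obj)
    else:
        total = sys.getsizeof(obj)
    return total // (1024 * 1024)

if __name__ == "__main__":
    rucioConMon = RucioConMon("https://cmsweb.cern.ch/rucioconmon/unmerged")
    plainData = rucioConMon.getRSEUnmerged("T1_US_FNAL_Disk", zipped=False)
    print("Size of all unmerged in unzipped format: %s MB and type %s" % (sizeMB(plainData), type(plainData)))
    print(plainData[:2])
    zippedData = rucioConMon.getRSEUnmerged("T1_US_FNAL_Disk", zipped=True)
    print("Size of all unmerged in zipped format: %s MB and type %s" % (sizeMB(zippedData), type(zippedData)))
    print(zippedData[:2])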

Lastly, in terms of network traffic when pulling the unmerged dump for FNAL, we see an outstanding improvement in both time and data volume. Here are the results:

(WMAgent-2.3.4) [cmst1@vocms0282:install]$ scurl "https://cmsweb.cern.ch/rucioconmon/unmerged/files?rse=T1_US_FNAL_Disk&format=json" > out_FNAL.json
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1984M    0 1984M    0     0  71.6M      0 --:--:--  0:00:27 --:--:-- 64.3M

(WMAgent-2.3.4) [cmst1@vocms0282:install]$ scurl "https://cmsweb.cern.ch/rucioconmon/unmerged/files?rse=T1_US_FNAL_Disk&format=raw" > out_FNAL.gzip
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  225M    0  225M    0     0   162M      0 --:--:--  0:00:01 --:--:--  162M

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 3 warnings and errors that must be fixed
    • 7 warnings
    • 58 comments to review
  • Pylint py3k check: failed
    • 2 errors and warnings that should be fixed
  • Pycodestyle check: succeeded
    • 13 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15136/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 tests no longer failing
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 3 warnings and errors that must be fixed
    • 7 warnings
    • 68 comments to review
  • Pylint py3k check: failed
    • 2 errors and warnings that should be fixed
  • Pycodestyle check: succeeded
    • 13 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15141/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 5 new failures
    • 1 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 3 warnings and errors that must be fixed
    • 7 warnings
    • 69 comments to review
  • Pylint py3k check: failed
    • 2 errors and warnings that should be fixed
  • Pycodestyle check: succeeded
    • 13 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15142/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 5 new failures
    • 1 tests no longer failing
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 3 warnings and errors that must be fixed
    • 7 warnings
    • 68 comments to review
  • Pylint py3k check: failed
    • 2 errors and warnings that should be fixed
  • Pycodestyle check: succeeded
    • 14 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15143/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 5 new failures
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 5 warnings and errors that must be fixed
    • 7 warnings
    • 69 comments to review
  • Pylint py3k check: failed
    • 2 errors and warnings that should be fixed
  • Pycodestyle check: succeeded
    • 14 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15162/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 6 new failures
    • 1 tests added
  • Python3 Pylint check: failed
    • 5 warnings and errors that must be fixed
    • 7 warnings
    • 69 comments to review
  • Pylint py3k check: failed
    • 2 errors and warnings that should be fixed
  • Pycodestyle check: succeeded
    • 14 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15161/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 5 new failures
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 5 warnings and errors that must be fixed
    • 7 warnings
    • 71 comments to review
  • Pylint py3k check: failed
    • 2 errors and warnings that should be fixed
  • Pycodestyle check: succeeded
    • 14 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15163/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 6 new failures
    • 3 tests no longer failing
    • 1 tests added
  • Python3 Pylint check: failed
    • 5 warnings and errors that must be fixed
    • 7 warnings
    • 71 comments to review
  • Pylint py3k check: failed
    • 2 errors and warnings that should be fixed
  • Pycodestyle check: succeeded
    • 14 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15164/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 5 new failures
    • 1 tests added
  • Python3 Pylint check: failed
    • 3 warnings and errors that must be fixed
    • 7 warnings
    • 70 comments to review
  • Pylint py3k check: failed
    • 2 errors and warnings that should be fixed
  • Pycodestyle check: succeeded
    • 14 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15168/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 5 new failures
    • 2 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 3 warnings and errors that must be fixed
    • 7 warnings
    • 70 comments to review
  • Pylint py3k check: failed
    • 2 errors and warnings that should be fixed
  • Pycodestyle check: succeeded
    • 14 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15173/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 5 new failures
    • 2 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 3 warnings and errors that must be fixed
    • 7 warnings
    • 70 comments to review
  • Pylint py3k check: failed
    • 2 errors and warnings that should be fixed
  • Pycodestyle check: succeeded
    • 14 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15174/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 4 new failures
    • 2 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 3 warnings and errors that must be fixed
    • 7 warnings
    • 70 comments to review
  • Pylint py3k check: failed
    • 2 errors and warnings that should be fixed
  • Pycodestyle check: succeeded
    • 14 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15176/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 7 tests deleted
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 3 warnings and errors that must be fixed
    • 7 warnings
    • 70 comments to review
  • Pylint py3k check: failed
    • 2 errors and warnings that should be fixed
  • Pycodestyle check: succeeded
    • 14 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15181/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 7 tests deleted
    • 1 tests no longer failing
    • 1 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 3 warnings and errors that must be fixed
    • 7 warnings
    • 70 comments to review
  • Pylint py3k check: failed
    • 2 errors and warnings that should be fixed
  • Pycodestyle check: succeeded
    • 14 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15191/artifact/artifacts/PullRequestReport.html

@amaltaro changed the title from "Log each step in the pipeline cleaning up RSE" to "Consume raw/generator unmerged dump data in MSUnmerged" on Sep 8, 2024
Commits in this PR:

  • filterUnmergedFiles method no longer exists
  • Fix check for isDeletable
  • Fix key name for dirsDeletedFail
  • check if ctx object exist before freeing it
  • temporarily remove integration tag for unit tests
  • fix RucioConMon unit test
  • fix MSUnmerged unit tests
  • resolve MSUnmerged unit tests
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 2 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 3 warnings and errors that must be fixed
    • 9 warnings
    • 81 comments to review
  • Pylint py3k check: failed
    • 2 errors and warnings that should be fixed
  • Pycodestyle check: succeeded
    • 21 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15275/artifact/artifacts/PullRequestReport.html

@amaltaro (Contributor Author) commented Oct 4, 2024

I might have to do some polishing, but I think the bulk logic is in place now and I welcome any feedback.

@@ -0,0 +1,42 @@
#!/usr/bin/env python
import logging
@amaltaro (Contributor Author):

SELF-REMINDER: This script will be removed before merging this PR.

@vkuznet (Contributor) left a comment:

The code seems very complex to me and very hard to follow; to properly understand its logic, someone needs to dig very deep into the code flow. I suggest further refactoring the code into smaller functions with a clear scope, e.g. if a function is given an fobj, which can be either a file or a dir, the code can be re-organized to use a concurrent pattern on the nested dir structure (see the sketch after this comment). The function behavior can then be well defined with respect to its action, e.g.:

  • for directory get list of its content and call itself
  • for a file, unlink the file and return the status

Without knowing the exact logic of how the clean-up procedure should be done and which exceptions should be made, reviewing the logic is very cumbersome.
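
A minimal sketch of that decomposition, using the gfal2 Python bindings as I understand them (hypothetical helper, not the code in this PR):

import stat
import gfal2  # gfal2-python bindings, assumed available in the MSUnmerged environment

def deleteFobj(ctx, pfn):
    """Delete one file or directory PFN; return True on success, False on failure.

    For a directory: list its content, recurse, then remove the (now empty) dir.
    For a file: unlink it.
    """
    try:
        if stat.S_ISDIR(ctx.stat(pfn).st_mode):
            for entry in ctx.listdir(pfn):
                deleteFobj(ctx, pfn.rstrip('/') + '/' + entry)
            ctx.rmdir(pfn)
        else:
            ctx.unlink(pfn)
        return True
    except gfal2.GError as exc:
        print("Failed to delete %s: %s" % (pfn, str(exc)))
        return False

# usage sketch:
# ctx = gfal2.creat_context()
# deleteFobj(ctx, "davs://storage.example.org:2880/store/unmerged/SomeDir")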

if dirLfn in rse['dirs']['deletedSuccess']:
    self.logger.info("RSE: %s, dir: %s already successfully deleted.", rse['name'], dirLfn)
    continue
for idx, dirLfn in enumerate(rse['dirs']['toDelete']):

As far as I can tell, the toDelete part of the rse['dirs'] object is never cleaned up, even though the logic below deletes dirs/files.


Instead of a loop, I suggest clearly defining function(s) to delete an object. The function can take an fobj (either dir or file), perform the operation and return the status. The code can then be parallelized (to speed up nested operations) and deletion can be done concurrently.
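
For illustration only, the parallelization mentioned above could look roughly like the sketch below (hypothetical names; whether concurrent gfal operations are safe in MSUnmerged is not something this sketch asserts):

from concurrent.futures import ThreadPoolExecutor

def deleteMany(fobjs, deleteOne, maxWorkers=4):
    """Apply a single-object delete function to many file/dir objects concurrently.

    deleteOne is any callable returning True/False for one fobj (see the sketch above).
    Returns the fobjs that failed, so the caller can retry them or raise an alarm.
    """
    with ThreadPoolExecutor(max_workers=maxWorkers) as pool:
        results = list(pool.map(deleteOne, fobjs))
    return [fobj for fobj, ok in zip(fobjs, results) if not ok]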

@@ -306,9 +304,24 @@ def _execute(self, rseList):
def cleanRSE(self, rse):
    """
    The method to implement the actual deletion of files for an RSE.
    Order of deletion attempts is:

It would be beneficial to clearly define the logic of the deletion; here you outline the steps rather than the logic. The underlying code seems very complex, with different rules about what should be deleted and how. I suggest describing that logic in the docstring.

@vkuznet (Contributor) commented Oct 9, 2024

Minor comment: the code relies on class methods/functions with leading underscores. I expect those to be protected methods, not directly accessible, but Python does not provide such isolation and the notation does not enforce it either. I don't know the rationale behind this notation.

@amaltaro (Contributor Author) commented Oct 9, 2024

I appreciate your code review, thanks Valentin!

Making this code parallel is out of scope; the main issue to be addressed is memory consumption and removal optimizations where possible. As you stated, it is already very complex and I would rather not make further changes that increase that even more.

I will see how to break some implementations into smaller functions/methods and will also provide a diagram for data removal.

@todor-ivanov (Contributor)

hi @amaltaro,

in regards to:

the main issue to be addressed is memory consumption and removal optimizations where possible

May I ask two questions?

  • What does "removal optimizations where possible" mean, and how would one expect to increase code efficiency by removing optimizations? ...this was kind of strange.
  • Do you have any memory measurements from before and after your code changes? I am talking only about the bits related to code refactoring here, not including the addition of the streamer logic, because the latter would definitely have a memory impact.

Let me put it this way: I was surprised to see you refactoring the logic of the service starting from an issue related to creating a streamer to feed the component. This mixes things up a lot and we could never see the benefits, or even the need, of this. Such efforts should be separated. As we know, in the team we have always requested high issue granularity and splitting effort by subject; I have been constantly flagged with this very same argument when putting code changes in a PR.
To me it seems like an effort driven mostly by a desire to refactor code styling, and I cannot see the clear benefit of this, especially without any measurement of resource consumption before and after the change, which should be possible because the current code still works one way or another.

So, in order to continue with this, my suggestion would be to split the changes thematically into two issues:

  • first one related to refactoring generators and pointers logic
  • second one related to including streamer logic

Then we measure CPU && memory consumption (measuring only one of the parameters gives no clear picture) before and after applying the patches to a standalone instance of the service/component (definitely not in Kubernetes); this is the only way to see the actual effect. And I stand behind my words on this: we cannot measure component resource consumption only by monitoring Kubernetes cluster behavior. We should profile the code in much better detail, which is exactly what I was doing on every change before merging, for at least a month, back when I was adding these optimizations to the code.

@todor-ivanov (Contributor) commented Oct 11, 2024

one more thing @amaltaro
about this bullet from the PR description:

refactored the cleanRSE method: first try to delete the whole directory; if that fails, list all the content in that directory and delete it in slices; finally, try to remove the (now) empty directory.

This is already in place. If you are simply changing how it is done with this PR, that is another story, but it is definitely nothing new for MSUnmerged.

@amaltaro (Contributor Author)

@todor-ivanov thank you for looking into this.

What does "removal optimizations where possible" mean, and how would one expect to increase code efficiency by removing optimizations? ...this was kind of strange.

I need to update that comment, because what the code is actually doing right now is:

  1. try to remove the parent directory; if that fails,
  2. try to delete the sub-directories; if that fails,
  3. get a list of contents (files) in all these sub-directories and remove them in slices of 100 files;
  4. finally, try to remove any leftover directory (which should now be empty).

I will update the PR description accordingly (a rough sketch of these steps follows below). I also need to revisit what was already in place and see the differences.
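
To make those four steps concrete, here is a rough sketch using the gfal2 Python bindings (my own illustration with assumed calls, not the actual MSUnmerged implementation; files sitting directly under the parent directory are ignored for brevity):

import stat
import gfal2  # gfal2-python bindings, assumed available where MSUnmerged runs

def cleanDirectory(ctx, dirPfn, sliceSize=100):
    """Tiered removal sketch: whole dir, then sub-dirs, then files in slices, then empty dirs."""
    def tryRmdir(pfn):
        try:
            ctx.rmdir(pfn)
            return True
        except gfal2.GError:
            return False

    if tryRmdir(dirPfn):                                    # 1. remove the parent directory
        return True
    entries = [dirPfn.rstrip('/') + '/' + e for e in ctx.listdir(dirPfn)]
    subDirs = [e for e in entries if stat.S_ISDIR(ctx.stat(e).st_mode)]
    for sub in [s for s in subDirs if not tryRmdir(s)]:     # 2. remove each sub-directory
        files = [sub + '/' + f for f in ctx.listdir(sub)]
        for idx in range(0, len(files), sliceSize):         # 3. unlink files in slices of 100
            for filePfn in files[idx:idx + sliceSize]:
                ctx.unlink(filePfn)
        tryRmdir(sub)                                       # 4. remove the now-empty directory
    return tryRmdir(dirPfn)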

So, in order to continue with this, my suggestion would be to split the changes thematically into two issues:

  • first one related to refactoring generators and pointers logic
  • second one related to including streamer logic

I don't think it makes sense to separate those developments. Having a generator object is of no help if we then load all the data into memory, so implementing a generator and consuming it properly have to go together. Otherwise we have the same fate as the compressed RucioConMon feature, which was implemented but never adopted in MSUnmerged.

That said, my goal was indeed just to consume a generator, but the memory footprint was still extremely high because of the parsing and assignment of directories and files into data structures.

For the CPU and memory footprint, I actually performed an isolated study with RucioConMon only, which was actually discussed in this transient PR:
#12089

I think it should be simple enough to run the current MSUnmerged only with those changes, versus those changes + modified MSUnmerged. I will try to get back to this next week.

@todor-ivanov (Contributor)

try to remove the parent directory; if that fails,
try to delete the sub-directories; if that fails,
get a list of contents (files) in all these sub-directories and remove them in slices of 100 files;
finally, try to remove any leftover directory (which should now be empty).

I will update the PR description accordingly. I also need to revisit what was already in place and see the differences.

All of this is already in place... I do not see a point in rewriting this logic... Well, if you like it your way, of course there is a point.

@todor-ivanov (Contributor)

What needs to be done is:

  • First, to loosen the initial constraints of this service, and
  • Second, to isolate its (and respectively our) responsibility.

Let me rephrase with some more details:

  • We do not need to depend on RucioConMon: neither on its cycle (which basically introduces a lot of failure modes on our side) nor on the level of granularity of the scans it produces. Reqmgr does not care about the contents of any directory deeper than 3 levels down the tree starting from the workflow name, meaning it stops at the highest possible level in the output, the dataset name, and there is a solid reason for that: from this point down, the amount of information grows exponentially. Hence the inability of the service to scale. No matter how many tricks we apply, once we get into the spiral of recursions the complexity is going to be nothing less than exponential. So we do not need all this information to be transferred to us; we should never even try recursive deletions on a file-by-file basis and shoot ourselves in the foot with remote recursive tree traversals (which, on top of everything, add protocol overhead etc.). We must stop at the level at which Reqmgr protects the data and cut the branch there, i.e. delete everything directly from this level and down below. Which means we simply do not need these GBs of scanned data on a file-by-file basis.
  • When it comes to failures to cut the branch at the level we know is protected by Reqmgr, or other issues such as certificate problems, any failure gets logged and alarmed to Site Support, and we never chase those one by one on a site-by-site basis. We simply need to deliver a previously agreed set of failures + alarms to the relevant team (if needed, create the proper REST interfaces for that, which already exist btw, and we already accumulate this information). The relevant team is then assigned to communicate the resource issues with the resource providers. If we try to own all the problems of the system, that would simply be a disaster if you ask me; we won't hold up under such pressure.

@todor-ivanov (Contributor)

But, if we strictly insist on keeping the dependency on RucioConMon, ok: then we should download the file or stream it (our choice), parse it in a separate thread down to the level of granularity our service speaks, and record it in a shorter file, which would reduce the size from GBs to MBs. This logic is already in the current implementation of MSUnmerged; what I suggest here is simply to exchange one resource for another, trading memory for I/O expenses. The action item to achieve this with minimal effort would be to transform the MSUnmerged service into producer-consumer mode, which we already know very well. The current code should not be difficult to split into two pieces without heavy logic rewriting: one thread to parse and preserve the file with reduced contents, and one thread to consume it, benefiting from all the possible resource optimizations that come out of this.
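
As a rough illustration of the reduction suggested here (my own sketch; the cut-off depth of 5 path components, which maps the example LFN further down in this thread to its .../GEN-SIM directory, is an assumption and not a value taken from Reqmgr or from this PR):

def reduceDump(lfnIterator, depth=5):
    """Collapse a stream of unmerged file LFNs into unique directories of a fixed depth.

    E.g. /store/unmerged/<acq-era>/<primary-dataset>/<data-tier>/.../file.root
    becomes /store/unmerged/<acq-era>/<primary-dataset>/<data-tier> with depth=5.
    """
    seen = set()
    for lfn in lfnIterator:
        parent = '/'.join(lfn.strip().split('/')[:depth + 1])  # +1 for the leading empty field
        if parent and parent not in seen:
            seen.add(parent)
            yield parent

# usage sketch: stream the (possibly gzipped) dump line by line and write the reduced
# directory list to a small local file that a consumer thread can then work from.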

@amaltaro (Contributor Author)

If we don't rely on RucioConMon, how do we know which directories have been created in which storage? Are you saying that the scanning logic implemented in Rucio should become part of WMCore? I hope not!

Second, we have no choice but to go down to the file level once a directory fails to be deleted (which, from what I have seen in the logs, is often because it has too much content in it).

About your last bullet, I totally agree that we have no way to communicate issues with the sites. However, we are responsible for the MSUnmerged functionality. Everyone has to take up responsibility in this process.

@todor-ivanov (Contributor) commented Oct 11, 2024

If we don't rely on RucioConMon, how do we know which directories have been created in which storage? Are you saying that the scanning logic implemented in Rucio should become part of WMCore? I hope not!

We are already doing worse: once we fail a deletion, we start all of this process recursively, with depth-first logic && remotely, which is a disaster. So what I suggest is to do this scan ourselves, but to stop at level -3, the level at which our services talk, and never dive deeper. This is a fairly fast and easy process, because at that level of the tree the amount of information is still in the lower (almost linear) part of the exponential curve; this is something we can confidently manage ourselves. And, as I said, we are already doing it, much much deeper, so this needs no proof of concept.

@amaltaro (Contributor Author)

We only scan deeper levels, not the top-level directories. In addition, we only scan in case of failures, because otherwise there is no way we can delete the files.

Just to make sure we are speaking the same language, an example of a file to be deleted would be:

          "/store/unmerged/RunIISummer20UL17SIM/DPSToJpsiJpsi_FourMuonFilter_SoftQCDnonD_TuneCP5_13TeV-pythia8-evtgen/GEN-SIM/106X_mc2017_realistic_v6-v3/00000/blah.root",

Are you saying that we need to list content up to /store/unmerged/RunIISummer20UL17SIM/DPSToJpsiJpsi_FourMuonFilter_SoftQCDnonD_TuneCP5_13TeV-pythia8-evtgen/GEN-SIM only? If so, you understand that to get to this point, we need to recursively list directories starting at the root directory /store/unmerged/, right?

I feel like we are deviating from the original goal of this issue, which is:

  • to stop breaking the service and make it able to actually delete unmerged files.

We can have the MSUnmerged redesign discussion at another time. For now, let us please try to resolve this one issue.

@todor-ivanov (Contributor)

Second, we have no choice but to go down to the file level once a directory fails to be deleted (which, from what I have seen in the logs, is often because it has too much content in it).

Yes, we have. It was announced that all sites are now supposed to be using WebDAV, which btw has hit us heavily in the past, and exactly in that context I did some research. The outcome was that we should be able to confidently and rightfully ask for recursive deletions to be supported by the sites (similarly to how we ask for write permissions in the unmerged area by certificate role), and we should not own all the protocol-level complexities. All of those were supposed to be encapsulated and isolated from us by using gfal, which btw is the whole point of its existence.

@todor-ivanov (Contributor)

We can have the MSUnmerged redesign discussion at another time. For now, let us please try to resolve this one issue.

Your code change does not merely resolve an issue: with it you are already redesigning the service, but without having the discussion which I intentionally triggered here.

@amaltaro (Contributor Author)

I guess you didn't see the memory plots I provided in here then:
#12089 (comment)

I will try to get the performance measurements you requested at some point next week. Thanks

@todor-ivanov (Contributor)

about this:

Are you saying that we need to list content up to /store/unmerged/RunIISummer20UL17SIM/DPSToJpsiJpsi_FourMuonFilter_SoftQCDnonD_TuneCP5_13TeV-pythia8-evtgen/GEN-SIM

I may be citing the wrong tree level (it could be -4 or close), because I do not remember it by heart. The point is, we should simply check the depth of the protectedlfn REST API from Reqmgr and stop at that level.

@todor-ivanov (Contributor)

I guess you didn't see the memory plots I provided in here then:
#12089 (comment)

Alan, I saw them. I have been running similar tests hundreds of times; I said it a few comments above. If we insist on keeping the dependency on RucioConMon, so be it, but let's separate it into a different thread.

And running any test on the current state of the code would be speculative, because you did not isolate the changes into different steps, so it would be difficult to attribute the results either to some code optimization or to the change in how you feed the service. So they would not be a justification for rewriting the rest of the service logic.


Successfully merging this pull request may close these issues.

MSUnmerged T1 service crashing when pulling large storage json dump