collect file corruption report from jobs and act on the info #7883
Comments
Waiting to get experience before deciding if/how to progress |
keep this open |
the version "which reports" is now deployed in production with v3.231006. I will enable reporting on all schedulers. |
Check enabled on Oct 9. Still no corruption reported. Did I do something wrong? |
First "not from me" event!
This proves that the tool is working, and so far it looks good to keep SUSPICIOUS separated from CORRUPTED. |
Today another bunch of fatal errors from HammerCloud, of this kind, all at T2_US_MIT.
All of them went away in the following automatic job resubmissions. We already knew that MIT is not top-of-the-class in robustness, but other than that... have not learned anything yet. |
Still the only corruption reports are the read errors at MIT
I will
|
More than a month into this. The only recorded problems are HammerCloud transient read errors (usually fixed in the next automatic retry). All such errors have exit code 8021 (meaning the file open was OK and there was a read error). I think I have spent too much time on this already.
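A minimal sketch of how such exit codes could be bucketed when scanning job reports. Only 8021 ("open OK, read error") and the SUSPICIOUS/CORRUPTED distinction come from this thread; the other code and the helper itself are illustrative assumptions, not the actual CRAB classification logic:

```python
# Bucket job exit codes into coarse corruption categories.
# 8021 ("file opened fine, error while reading") is documented in this thread;
# the open-error code below is a hypothetical placeholder.
READ_ERROR_CODES = {8021}   # open OK, read error; often transient
OPEN_ERROR_CODES = {8020}   # assumed: failure to open the file at all

def classify(exit_code: int) -> str:
    """Map a job exit code to a coarse corruption category."""
    if exit_code in READ_ERROR_CODES:
        # Transient read errors frequently vanish on automatic retry,
        # so flag as SUSPICIOUS rather than CORRUPTED.
        return "SUSPICIOUS"
    if exit_code in OPEN_ERROR_CODES:
        return "CORRUPTED"
    return "OTHER"

print(classify(8021))  # SUSPICIOUS
```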
That script:

```bash
#!/bin/bash
mkdir -p /tmp/belforte
cd /tmp/belforte
rm -rf ProcessBadFilesList.py
wget -q https://github.com/dmwm/CRABServer/raw/master/scripts/Utils/ProcessBadFilesList.py
python3 ProcessBadFilesList.py > /dev/null
```
|
DAMN, I was fooled by the fact that HC jobs could write reports, but the directory tree was not in place; I have now created it. |
corrupted file reports are now coming in like rain... oh well. |
I have now set up automatic daily reporting of bad input files, a check of disk replica size and Adler32 checksum, and a suggestion of which replica can be marked BAD after making sure that there is a tape replica, and handed this over to CMSRucio and CMS DM folks. Daily reports are available at links in https://monit-grafana.cern.ch/d/CsnjLe6Mk/crab-overview?orgId=11&refresh=1h&from=1700063958182&to=1700236758182&viewPanel=116 |
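The "suggest marking a disk replica BAD only when a tape copy exists" rule described above can be sketched as follows. The replica record shape and helper names are hypothetical for illustration; this is not the actual Rucio API or the deployed script:

```python
from dataclasses import dataclass

@dataclass
class Replica:
    rse: str          # storage element name, e.g. "T2_US_MIT"
    is_tape: bool     # tape vs disk replica
    size_ok: bool     # on-disk size matches the catalog
    adler_ok: bool    # Adler32 checksum matches the catalog

def suggest_bad_replicas(replicas: list) -> list:
    """Suggest disk replicas to mark BAD, but only when a tape
    replica exists, so the last copy of a file is never invalidated."""
    if not any(r.is_tape for r in replicas):
        return []  # no tape copy: never suggest invalidation
    return [r.rse for r in replicas
            if not r.is_tape and not (r.size_ok and r.adler_ok)]

replicas = [
    Replica("T2_US_MIT", is_tape=False, size_ok=True, adler_ok=False),
    Replica("T1_US_FNAL_Tape", is_tape=True, size_ok=True, adler_ok=True),
]
print(suggest_bad_replicas(replicas))  # ['T2_US_MIT']
```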
After #7867, CRAB PostJob will create one line on CERN EOS every time a corrupted file is found.
As indicated in #7548 (comment), we need a script to run periodically to create daily (?) tallies, and possibly report to Rucio.
Breakdown:
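The periodic tally script requested above could be sketched as below. The one-line-per-file report format (whitespace-separated date, LFN, site) is an assumption, since the actual EOS line format is not shown in this thread:

```python
from collections import Counter

def daily_tally(lines):
    """Count corrupted-file reports per (day, site).
    Assumed line format: '<YYYY-MM-DD> <LFN> <site>' (hypothetical)."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip malformed lines
        day, _lfn, site = parts
        counts[(day, site)] += 1
    return counts

reports = [
    "2023-11-15 /store/data/fileA.root T2_US_MIT",
    "2023-11-15 /store/data/fileB.root T2_US_MIT",
    "2023-11-16 /store/data/fileC.root T2_DE_DESY",
]
print(daily_tally(reports))
```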