collect file corruption report from jobs and act on the info #7883

Closed
5 tasks done
belforte opened this issue Sep 18, 2023 · 11 comments

@belforte
Member

belforte commented Sep 18, 2023

After #7867, CRAB PostJob will create one line on CERN EOS every time a corrupted file is found.
As indicated in #7548 (comment), we need a script that runs periodically to create daily (?) tallies and possibly report to Rucio.

Breakdown:

  • have a parser + summary-maker script
  • deploy in production
  • collect some experience
  • decide how to integrate calls to Rucio (mark suspicious? mark bad? need privileged account?)
  • finalize the script and make it run automatically e.g. in TW
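A minimal sketch of the first item, the parser + summary step. It assumes that each report is a JSON file carrying `RSE` and `DID` keys, as in the report quoted later in this thread; the function names and the directory layout are hypothetical, and this is not the actual ProcessBadFilesList.py.

```python
import json
from collections import Counter
from pathlib import Path


def load_reports(directory):
    """Read every *.json report found in `directory` (non-recursive)."""
    reports = []
    for path in sorted(Path(directory).glob("*.json")):
        try:
            with open(path, encoding="utf-8") as handle:
                reports.append(json.load(handle))
        except (OSError, json.JSONDecodeError):
            # Skip unreadable or half-written files instead of aborting the tally.
            continue
    return reports


def tally(reports):
    """Count reports per (RSE, DID) pair, most frequently reported first."""
    counts = Counter((r.get("RSE"), r.get("DID")) for r in reports)
    return counts.most_common()
```

A daily tally would then be `tally(load_reports(some_directory))`, where the directory is whatever location the reports are written to.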
@belforte belforte self-assigned this Sep 18, 2023
belforte added a commit to belforte/CRABServer that referenced this issue Sep 18, 2023
@belforte
Member Author

Waiting to get experience before deciding if/how to progress

@belforte
Member Author

belforte commented Oct 3, 2023

keep this open

@belforte belforte reopened this Oct 3, 2023
@belforte
Member Author

belforte commented Oct 9, 2023

The version "which reports" is now deployed in production with v3.231006. I will enable reporting by

sudo touch /etc/use_corruption_check

on all schedulers.
This way we can quickly roll back if the change to PostJob results in problems.
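The flag-file switch described above can be sketched as follows. Only the path `/etc/use_corruption_check` comes from this comment; the helper names are hypothetical and the real PostJob code may be organized differently.

```python
import os

# Path taken from the comment above; everything else here is illustrative.
FLAG_FILE = "/etc/use_corruption_check"


def corruption_check_enabled(flag_file=FLAG_FILE):
    """Return True when the operator has created the flag file."""
    return os.path.exists(flag_file)


def maybe_report(report_fn, flag_file=FLAG_FILE):
    """Call `report_fn` only when the feature is switched on.

    Returns True if the report function was called, False otherwise.
    """
    if corruption_check_enabled(flag_file):
        report_fn()
        return True
    return False
```

With this pattern, rolling back is just a matter of removing the flag file on the scheduler, with no redeployment.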

@belforte
Member Author

Check enabled on Oct 9. Still no corruption reported. Did I do something wrong?

@belforte
Member Author

belforte commented Oct 17, 2023

First "not from me" event!
A HammerCloud job reported a suspicious file:

belforte@lxplus810/BadInputFiles> cat suspicious/new/231017_080559:sciaba_crab_HC-208-T2_AT_Vienna-105008-20231017100501.job.76.0.json|jq
{
  "DID": "cms:/store/mc/HC/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v2/00000/E238F708-2176-E711-B5DF-FA163EAD7847.root",
  "RSE": "T2_AT_Vienna",
  "exitCode": 8021,
  "message": [
    "== CMSSW:       [a] Fatal Root Error: @SUB=TBasket::Streamer",
    "== CMSSW: The value of fNbytes is incorrect (-322747004) ; trying to recover by setting it to zero",
    "NOT CLEARLY CORRUPTED, OTHER ROOT ERROR ?",
    "DID Identification may not be corrrect",
    "stdout: https://cmsweb.cern.ch:8443/scheddmon/0144/sciaba/231017_080559:sciaba_crab_HC-208-T2_AT_Vienna-105008-20231017100501/job_out.76.0.txt",
    "postjob: https://cmsweb.cern.ch:8443/scheddmon/0144/sciaba/231017_080559:sciaba_crab_HC-208-T2_AT_Vienna-105008-20231017100501/postjob.76.0.txt"
  ]
}
belforte@lxplus810/BadInputFiles> 

But after rucio get --rses T2_AT_Vienna cms:/store/mc/HC/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v2/00000/E238F708-2176-E711-B5DF-FA163EAD7847.root the file appears OK, and indeed CRAB resubmitted that job
and 76.1 completed OK.

This proves that the tool is working and that, so far, it looks good to keep SUSPICIOUS separate from CORRUPTED.
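For bookkeeping, the report file name can be split into task, job and retry. The sketch below assumes the pattern `<task>.job.<job id>.<retry>.json`, which is inferred only from the single file name shown above (job 76, retry 0, later retried as 76.1); the function name is hypothetical.

```python
import re

# Assumed naming pattern, inferred from one example in this thread.
REPORT_NAME = re.compile(r"^(?P<task>.+)\.job\.(?P<job>[^.]+)\.(?P<retry>\d+)\.json$")


def parse_report_name(name):
    """Return task, job id and retry number, or None if the name does not match."""
    match = REPORT_NAME.match(name)
    if match is None:
        return None
    return {
        "task": match.group("task"),
        "job": match.group("job"),  # kept as a string, in case job ids are not plain integers
        "retry": int(match.group("retry")),
    }
```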

@belforte
Member Author

Today another bunch of fatal errors from HammerCloud, all at T2_US_MIT and all of this kind:

== CMSSW: ----- Begin Fatal Exception 18-Oct-2023 14:40:04 EDT-----------------------
== CMSSW: An exception of category 'FileReadError' occurred while
== CMSSW:    [0] Rethrowing an exception that happened on a different thread.
== CMSSW:    [1] Reading branch HcalNoiseSummary_hcalnoise__RECO.
== CMSSW:    Additional Info:
== CMSSW:       [a] Fatal Root Error: @SUB=TBranchElement::GetBasket
== CMSSW: File: ///mnt/hadoop/cms/store/mc/HC/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v2/00000/0A11E55B-0876-E711-9647-FA163EB8F562.root at byte:153972288, branch:HcalNoiseSummary_hcalnoise__RECO.present, entry:1485, badread=1, nerrors=1, basketnumber=9
== CMSSW:
== CMSSW: ----- End Fatal Exception -------------------------------------------------

All of them went away in the following automatic job resubmissions.
Only one job failed 3 times in a row, but on direct inspection (rucio get plus gfal-sum) the adler32 checksum is fine.

We already knew that MIT is not top-of-the-class in robustness, but other than that... we have not learned anything yet.
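For reference, the same adler32 check can be done on a downloaded copy with the Python standard library. This is a sketch, not the procedure used above (which was gfal-sum); it assumes the catalog checksum is available as a hex string, and the function names are hypothetical.

```python
import zlib


def adler32_of_file(path, chunk_size=1024 * 1024):
    """Return the adler32 of a file as an 8-digit lowercase hex string."""
    value = 1  # adler32 starting value
    with open(path, "rb") as handle:
        # Read in chunks so that multi-GB ROOT files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return f"{value & 0xFFFFFFFF:08x}"


def matches_catalog(path, expected_hex):
    """Compare the local checksum with the expected one, ignoring case and padding."""
    return adler32_of_file(path) == expected_hex.lower().zfill(8)
```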

@belforte
Member Author

belforte commented Oct 31, 2023

Still, the only corruption reports are the read errors at MIT with badread=1, nerrors=1, basketnumber=... plus one case with badread=0, nerrors=1.
The only suspicious reports are again at MIT, with the very cryptic

    "== CMSSW:       [a] Fatal Root Error: @SUB=TBasket::ReadBasketBuffers",
    "== CMSSW: fNbytes = 581760, fKeylen = 119, fObjlen = 600644, noutot = 0, nout=0, nin=581641, nbuf=600644",

I will

  • change the "detection" code to move the former from corrupted to suspicious.
  • review the conditions under which I now set exit code to 8022
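The first item could look roughly like the sketch below: detect the `badread=..., nerrors=..., basketnumber=...` counters in the CMSSW message and, when present, file the report as suspicious instead of corrupted. The rule and the function names are assumptions for illustration, not the actual detection code.

```python
import re

# Matches the counters printed at the end of a ROOT read-error line.
READ_ERROR = re.compile(r"badread=(\d+), nerrors=(\d+), basketnumber=(\d+)")


def parse_read_error(line):
    """Return the counters found in `line` as a dict, or None if absent."""
    match = READ_ERROR.search(line)
    if match is None:
        return None
    badread, nerrors, basket = (int(group) for group in match.groups())
    return {"badread": badread, "nerrors": nerrors, "basketnumber": basket}


def recategorize(category, message_lines):
    """Hypothetical rule: a 'corrupted' report becomes 'suspicious' when its
    message contains a read error of the kind matched above."""
    has_read_error = any(parse_read_error(line) is not None for line in message_lines)
    if category == "corrupted" and has_read_error:
        return "suspicious"
    return category
```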

@belforte
Member Author

belforte commented Nov 7, 2023

More than a month into this. The only recorded problems are from HammerCloud: transient read errors (usually fixed in the next automatic retry). All such errors have exit code 8021 (file open is OK and there was a read error).
Plain corruption, like a truncated file, only happened in my test and has exit code 8020 (error in open).

I think I have spent too much time on this already.
Will try to keep checking summaries once a month, but that's all.
I have added this to my acrontab

# every week make a summary of CRAB corrupted file reports
30 05 */7 * * lxplus8.cern.ch /afs/cern.ch/user/b/belforte/utils/MakeCorruptedFileReportsSummary.sh

That script, MakeCorruptedFileReportsSummary.sh, is:

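#!/bin/bash
# work in a private scratch directory; stop if it cannot be entered
mkdir -p /tmp/belforte
cd /tmp/belforte || exit 1
# always fetch the latest version of the script from GitHub
rm -f ProcessBadFilesList.py
wget -q https://github.com/dmwm/CRABServer/raw/master/scripts/Utils/ProcessBadFilesList.py
# run it, discarding stdout
python3 ProcessBadFilesList.py > /dev/null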

@belforte
Member Author

DAMN, I was fooled by the fact that HC jobs could write reports, while the directory tree /eos/cms/temp/user/BadInputFiles lacked the permission needed for users to write. I guess the Sciaba identity has special privileges on EOS.
I had initially tested that user jobs could write, but forgot about the needed configuration when moving to a better-named directory.

I have now done chmod g+w -R and chmod o+w -R and tested again with a user job.
Need to reset the "wait and see".
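A quick way to catch this kind of problem is to walk the tree and list the directories that lack the group and other write bits. The sketch below looks only at POSIX mode bits as seen through a mounted file system; any EOS-specific ACLs are not considered, and the function name is hypothetical.

```python
import os
import stat

# Both bits set by the chmod g+w / o+w commands mentioned above.
REQUIRED = stat.S_IWGRP | stat.S_IWOTH


def dirs_missing_write_bits(root):
    """Return the directories under `root` (inclusive) lacking g+w or o+w."""
    missing = []
    for dirpath, _dirnames, _filenames in os.walk(root):
        mode = os.stat(dirpath).st_mode
        if (mode & REQUIRED) != REQUIRED:
            missing.append(dirpath)
    return sorted(missing)
```

An empty list means every directory in the tree carries both write bits.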

@belforte
Member Author

corrupted file reports are now coming in like rain... oh well.

@belforte
Member Author

belforte commented Nov 17, 2023

I have now set up automatic daily reporting of bad input files, with a check of the size and Adler32 checksum of the disk replicas and a suggestion of which replica can be marked as BAD after making sure that there is a tape replica. I have handed this over to the CMSRucio and CMS DM folks.
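The decision logic described above can be sketched with plain data structures. This is an illustration under assumptions, not the deployed code: the `Replica` class and the function name are hypothetical, no Rucio API is called, and the expected size and checksum are assumed to come from the catalog.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    rse: str
    size: int
    adler32: str
    is_tape: bool = False


def bad_replica_candidates(replicas, expected_size, expected_adler32):
    """Suggest disk replicas that could be marked BAD.

    Nothing is suggested unless a tape replica exists, so that the file
    can still be recovered after the disk copy is declared bad.
    """
    if not any(r.is_tape for r in replicas):
        return []
    expected = expected_adler32.lower()
    return [
        r.rse
        for r in replicas
        if not r.is_tape and (r.size != expected_size or r.adler32.lower() != expected)
    ]
```

Returning suggestions rather than acting on them keeps a human (or a privileged service) in the loop for the actual declaration.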
References:
https://its.cern.ch/jira/browse/CMSTRANSF-766
dmwm/CMSRucio#403

Daily reports are available at links in https://monit-grafana.cern.ch/d/CsnjLe6Mk/crab-overview?orgId=11&refresh=1h&from=1700063958182&to=1700236758182&viewPanel=116
