Investigate automatic suspicious replica recovery #403

Closed
ericvaandering opened this issue Jan 13, 2023 · 58 comments
Labels: waiting task in pending for external deps

Comments

@ericvaandering
Member

Would need something done with traces (declaring things suspicious, maybe some logic to deal with xrootd and specific exit codes?)

Then need to run the replica recoverer daemon.

@yuyiguo
Member

yuyiguo commented Jan 17, 2023

@ericvaandering ,
Can you give more details on this issue?

@belforte
Member

@yuyiguo @ericvaandering I guess I am the one who started this. Shall we try to have a quick chat on Zoom?
I am generally available in your early morning (8-10), but if you prefer that I expand here, sure, let me know.

@yuyiguo
Member

yuyiguo commented Jan 20, 2023

Yes, I am available today. Let me know when we can chat.

@belforte
Member

Well... today I had things to do. Let's plan it a little bit so that Eric may also join. 10 minutes should suffice.

@belforte
Member

what about tomorrow in your 8-10am window ?

@belforte
Member

belforte commented Jan 24, 2023

Adding some description after a chat with Eric and Yuyi (Katy was also there)
@KatyEllis FYI

where we start from

  • in this meeting Dimitrios Christidis reported in slide 11 that
    • For each failed transfer, if the error message matches some pattern, then the replica is marked as suspicious. The Rucio Replica Recoverer daemon processes these and does automatic recovery

what we want to do

  • use that Replica Recoverer daemon to automatically check and fix files which we suspect are corrupted because they make cmsRun (CMSSW) raise a fatal exception when reading.

things we know

  1. cmsRun sends UDP messages when it closes files which it has read successfully; those are turned somehow/somewhere into messages to AMQ which we (CMS) consume (via some daemon which Yuyi is in control of)
  2. the path above is not viable to report read failures, since in that case cmsRun just exits via a fatal exception and has no way to go through the UDP-sending part
  3. Writing to AMQ requires authentication, so it is secure and can be tuned/throttled rate-wise
  4. Sending UDP messages is free for anyone to do; if too many are sent some get lost, so there is no DDOS risk. Should something wrong/bad/malicious happen, an extra ton of "I read this file" messages will do no permanent harm, but a flood of "this file is bad" may be very very bad.
  5. there are other AMQ queues which are consumed by Yuyi's daemon and where we could send the information about bad files. In particular one referred to as "the job report parser which reads from ES and writes to AMQ"
  6. the Rucio python client has declare_suspicious_file_replicas(self, pfns, reason) in lib/rucio/client/, by which a replica (PFN) can be declared suspicious, and a (hopefully matching) declare_suspicious_file_replicas(self, pfns, reason) in lib/rucio/api (see the sketch after this list)
  7. Same as above can also be done via a REST call.
  8. There is also a way to declare a file as suspicious via a non-authenticated URL (?? did I misunderstand ?)
  9. We are not running the Rucio Replica Recoverer daemon now, but can (and should)
  10. Currently job wrappers (WMA or CRAB) communicate with the world via the FJR, which is returned to the scheduler and sent to WMArchive (and from there to HDFS and/or ES, but Stefano does not know), and via condor ClassAds which are read by the monitoring spider and sent to ES and used e.g. to populate Grafana dashboards in CMS Monitoring.
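
For concreteness, a minimal sketch of how the client call in item 6 could look (the PFN and the reason string are made up for illustration, and the call assumes a Rucio account with the privilege to declare suspicious replicas):

from rucio.client.replicaclient import ReplicaClient

# Hypothetical PFN that a job failed to read at some RSE (illustration only)
bad_pfn = "davs://redirector.t2.ucsd.edu:1095/store/mc/SomeDataset/file.root"

client = ReplicaClient()  # picks up the caller's Rucio account and credentials
# Declare the replica(s) behind these PFNs as suspicious, with a free-text reason
client.declare_suspicious_file_replicas(
    pfns=[bad_pfn],
    reason="cmsRun fatal read error (suspected corruption)",
)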

things we do not know, but want to know

  1. what exactly happens when a replica is declared suspicious (is it still used as a source? processed by workflows? e.g. is the state of the (Rucio) dataset affected? CRAB only processes datasets with status AVAILABLE)
  2. how exactly data are currently fed to AMQ: by which tools, run by whom, which data they are etc.
  3. would it be possible (sensible?) to send read errors to AMQ and have "Yuyi's daemon" use those messages to declare suspicious replicas?
  4. if we decide instead to declare replicas directly from some upstream process (instead of sending a message to AMQ), do we have the needed information and authorization there? CRAB server currently authenticates with the account crab_server or with crab_tape_recall. What about other daemons/processes which currently write to AMQ? What would be the risk in granting the CRAB server the needed privilege?
  5. we detect errors via a "local read PFN" and will have the file LFN, but declare_suspicious_file_replica needs a PFN in... which format? Can we really send {'scope': , 'name': , 'rse': <rse_name>} instead?
  6. do we have multiple options to declare suspicious replicas in Rucio? Or, in the end, do the python client, the URL where it POSTs, and the call in api all proceed via the same code (in core?)? In the end, what exactly happens?

things we should do (i.e. The Plan)

  1. complete our knowledge of topics listed above (Rucio part: Yuyi. AMQ: Yuyi. FJR to AMQ: Stefano)
  2. evaluate and prototype a change to the job wrapper that identifies read errors and puts the info in the FJR, exit code and ClassAds (the exit code already goes into a ClassAd, but the failing PFN does not) (Stefano); a parsing sketch follows this list
  3. evaluate what would be the best way for the job wrapper to report read errors so that the replica can be declared suspicious. Stefano will propose a plan once he has the needed knowledge. Stefano prefers a solution where CRAB and WMA signal errors by parsing cmsRun stderr on grid nodes, but reporting to Rucio is done in a single place common to both.
  4. agree on the plan (all) and assign specific TODO issues.
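
As a starting point for item 2, a minimal sketch of the kind of log parsing the job wrapper could do (the patterns below are only illustrative guesses based on the error strings discussed in this thread; the real list of "telling strings" still has to be collected):

import re

# Illustrative error patterns; the actual list of telling strings is still to be agreed
READ_ERROR_PATTERNS = [
    re.compile(r"Fatal Root Error"),
    re.compile(r"file (\S+\.root) is truncated at \d+ bytes: should be \d+"),
]

def find_suspect_files(cmsrun_log_path):
    """Return file names (or raw lines) mentioned in read-error messages of a cmsRun log."""
    suspects = set()
    with open(cmsrun_log_path) as log:
        for line in log:
            for pattern in READ_ERROR_PATTERNS:
                match = pattern.search(line)
                if match:
                    suspects.add(match.group(1) if match.groups() else line.strip())
    return suspects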

@dciangot added the bug and triage (Issues that need investigation before action) labels Feb 22, 2023
@dciangot
Contributor

Did this issue get followed up somewhere else? Or is it just stale and do we still need to validate @belforte's proposal with the various tasks?

@belforte
Member

It is still on my todo list and I am not tracking it elsewhere. I should. Then this can be put on hold until I have a proposal.

@belforte
Member

currently this breaks up as

@yuyiguo
Member

yuyiguo commented Feb 22, 2023

This is on my to-do list too but in a low priority.

@belforte
Member

The first thing is to flag jobs which hit corrupted files, and monitor that, so we can quantify the problem.

@yuyiguo
Member

yuyiguo commented Mar 28, 2023

What is the definition of "suspicious replicas"? If a transfer fails, FTS will try to retransfer it. If the failure is permanent, how can the Replica Recoverer daemon fix it? Why would CMSSW or processing jobs read a suspicious replica?

@belforte
Member

Hi @yuyiguo . Let me try to list here what I (think that I) understand. Hopefully it answers some of your questions.

  1. If we plan to use declare_suspicious_file_replicas, the definition is up to us.
  2. Current wisdom is to check cmsRun stdout for fatal errors during ROOT readout; we may have to collect some telling strings and improve with time.
  3. ATLAS does a rucio download of each file to the WN before reading, so Rucio will verify the checksum there and IIUC automatically mark the replica as suspicious. I have tested this myself [1] and the error is duly detected, but I think something else is needed to mark the replica as suspicious. Most likely there is something else going on when ATLAS jobs try to read files and we need to ask Dimitrios to be more explicit than in his talk mentioned in the comment above. Anyhow we do not download files from storage to WNs.
  4. We do not know why replicas get corrupted; most likely it is due to some problem with a site's storage system. As you say, FTS transfers are expected to verify checksums. Of course, sometimes a bit can still be flipped in the last step of file write/close and one only notices when the file is read back.
  5. Data processing jobs do not know that a given replica is corrupted, so they read it and fail. I do not know what will happen if a replica is marked suspicious. We only process datasets/blocks which are marked as AVAILABLE in a Disk RSE in Rucio; I can't say how a SUSPICIOUS replica affects that. Anyhow, currently bad replicas stay on disk until manually removed by DM operators.
  6. And finally, about how the Replica Recoverer can fix it: all that I know is this, and I can't say if this is really what is in use for ATLAS. I have not tried to read the daemon code itself, but it has more comments with details. I definitely have no clue what metadata means here.

Hope this helps !

[1]

belforte@lxplus701/belforte> rucio download cms:/store/mc/RunIIFall17MiniAODv2/PMSSM_set_2_prompt_2_TuneCP2_13TeV-pythia8/MINIAODSIM/PUFall17Fast_GridpackScan_94X_mc2017_realistic_v15-v1/30000/BAA88CF6-7C17-ED11-8883-34800D2F7FF0.root --rses T2_US_UCSD
2023-03-27 17:57:51,728	INFO	Processing 1 item(s) for input
2023-03-27 17:57:51,847	INFO	No preferred protocol impl in rucio.cfg: No section: 'download'
2023-03-27 17:57:51,848	INFO	Using main thread to download 1 file(s)
2023-03-27 17:57:51,848	INFO	Preparing download of cms:/store/mc/RunIIFall17MiniAODv2/PMSSM_set_2_prompt_2_TuneCP2_13TeV-pythia8/MINIAODSIM/PUFall17Fast_GridpackScan_94X_mc2017_realistic_v15-v1/30000/BAA88CF6-7C17-ED11-8883-34800D2F7FF0.root
2023-03-27 17:57:51,866	INFO	Trying to download with davs and timeout of 4713s from T2_US_UCSD: cms:/store/mc/RunIIFall17MiniAODv2/PMSSM_set_2_prompt_2_TuneCP2_13TeV-pythia8/MINIAODSIM/PUFall17Fast_GridpackScan_94X_mc2017_realistic_v15-v1/30000/BAA88CF6-7C17-ED11-8883-34800D2F7FF0.root 
2023-03-27 17:57:51,936	INFO	Using PFN: davs://redirector.t2.ucsd.edu:1095/store/mc/RunIIFall17MiniAODv2/PMSSM_set_2_prompt_2_TuneCP2_13TeV-pythia8/MINIAODSIM/PUFall17Fast_GridpackScan_94X_mc2017_realistic_v15-v1/30000/BAA88CF6-7C17-ED11-8883-34800D2F7FF0.root
2023-03-27 17:59:58,713	WARNING	Checksum validation failed for file: cms:/store/mc/RunIIFall17MiniAODv2/PMSSM_set_2_prompt_2_TuneCP2_13TeV-pythia8/MINIAODSIM/PUFall17Fast_GridpackScan_94X_mc2017_realistic_v15-v1/30000/BAA88CF6-7C17-ED11-8883-34800D2F7FF0.root
2023-03-27 17:59:58,714	WARNING	Download attempt failed. Try 1/2
2023-03-27 18:02:02,656	WARNING	Checksum validation failed for file: cms:/store/mc/RunIIFall17MiniAODv2/PMSSM_set_2_prompt_2_TuneCP2_13TeV-pythia8/MINIAODSIM/PUFall17Fast_GridpackScan_94X_mc2017_realistic_v15-v1/30000/BAA88CF6-7C17-ED11-8883-34800D2F7FF0.root
2023-03-27 18:02:02,657	WARNING	Download attempt failed. Try 2/2
2023-03-27 18:02:02,670	ERROR	Failed to download file cms:/store/mc/RunIIFall17MiniAODv2/PMSSM_set_2_prompt_2_TuneCP2_13TeV-pythia8/MINIAODSIM/PUFall17Fast_GridpackScan_94X_mc2017_realistic_v15-v1/30000/BAA88CF6-7C17-ED11-8883-34800D2F7FF0.root
2023-03-27 18:02:02,672	ERROR	None of the requested files have been downloaded.
belforte@lxplus701/belforte>

@belforte
Member

btw, the file in my example above has been fixed by Felipe and download is now OK.
https://mattermost.web.cern.ch/cms-o-and-c/pl/uw4mauamr3dxmf1aebpfsybcjo
https://ggus.eu/?mode=ticket_info&ticket_id=160902

@ivmfnal
Contributor

ivmfnal commented Aug 1, 2023

Some time ago I wrote this document with a proposal for how we can handle this.
Has anybody had a chance to read it?
Do we need to revisit that proposal? I am a bit confused about the state of that proposal.

@ivmfnal self-assigned this Aug 1, 2023
@ivmfnal
Contributor

ivmfnal commented Aug 2, 2023

Basically here is my proposal:

  1. Detection
    • Add code to WM and other similar activities to report a replica as "suspicious" to Rucio on a read error via standard Rucio interfaces (API or CLI)
    • Write instructions for individual users how to do the same
  2. Recovery. Configure the existing Rucio suspicious_replica_recoverer to declare the replica as "bad" after so many detections. Once the replica is declared "bad", Rucio will re-replicate it. We can discuss separately the case when this replica is in fact the last replica of the file. (A simplified sketch of this threshold logic follows below.)
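
For illustration only, a simplified sketch of the threshold logic meant in step 2 (this is not the actual suspicious_replica_recoverer code; the threshold and window values are placeholders to be agreed):

from datetime import datetime, timedelta

NATTEMPTS = 3               # placeholder: how many suspicious declarations before "bad"
WINDOW = timedelta(days=3)  # placeholder: look-back window for counting declarations

def should_declare_bad(suspicious_declaration_times, now=None):
    """Decide whether a replica was declared suspicious often enough, recently enough."""
    now = now or datetime.utcnow()
    recent = [t for t in suspicious_declaration_times if now - t <= WINDOW]
    return len(recent) >= NATTEMPTS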

I would like to get some feedback on this proposal. Perhaps we can add more details to it.
Once we agree on the plan, we can work out the details.

@belforte
Member

belforte commented Aug 2, 2023

thanks Igor, and apologies for not having replied earlier.
I had indeed read your document and fully agree with it and the plan outlined above.
I have item 1 above on my list, reasonably close to the top, to do in CRAB: dmwm/CRABServer#7548

I would like to see this at work from automatic/automated tools for a while before we think about enabling users; at that point we may have to introduce some way to "trust machines more than humans", IMHO.

One thing that I expect we can talk about later, but let me mention now:
we can detect both missing and possibly corrupted files, and tell one from the other.
Should we also flag the missing ones (i.e. clean open failures w/o a zip error or wrong checksum) as suspicious and sort of try to shortcut the CE? I am a bit worried, when looking at the CE page, by how many sites simply fail week after week to give any useful result; it looks like only half the sites or so have a "done". I am not saying to abandon that effort, simply to complement it.

We can surely resume this once I have code which parses CMSSW stderr !

@ivmfnal
Contributor

ivmfnal commented Aug 2, 2023

Stefano @belforte , thanks for the reply and the feedback. I appreciate it.

I was thinking that it would make sense to have another meeting (I think we had one already some time ago) among the involved people to re-sync, discuss use cases and maybe come up with an action plan.

I think we need to get at least @yuyiguo @dynamic-entropy (Rahul) @klannon there. I would invite @ericvaandering too but he is on vacation. Who am I missing ?

@ivmfnal
Contributor

ivmfnal commented Aug 2, 2023

@belforte said: "we can detect both missing and possibly corrupted files, and tell one from the other."

I think it is important to differentiate between several types of failures.

I would add another dimension to this:

  • potentially recoverable failures - those which have chances to be caused by some transient condition like networking failure or the site downtime
  • non-recoverable failures - e.g. checksum/file size/file format error - conditions which are unlikely to fix themselves

My understanding of the problem is that we want to use the "suspicious" replica state in the first case for a while before we declare the replica "bad" if things do not improve, whereas if we believe this is not a recoverable error, then we go straight to the "bad" replica state.

@belforte
Member

belforte commented Aug 3, 2023

I do not think we need a (longish) meeting now. I'd like to have code which parses stdout/stderr and does a few "mark as suspicious" first. There can be questions arising during that which we can address as needed. In a way, I have my action plan.
Maybe simply a 5-min slot in the usual CMS-Rucio dev meeting to make sure that everybody agrees with the plan which you outlined? I think you missed @dciangot; anyhow this is not urgent IMHO.

As to recoverable vs. non-recoverable: yes, I know, we already discussed it in the meeting where you first presented this. The problem here is how to be sure that the specific error is really a bad file, not a transient problem in the storage server: is the file really truncated, or was a connection dropped somewhere? So again, I'd like to get experience with the simpler path first. All in all, CRAB already retries file read failures 3 times (and WMA 10, IIUC), so if we e.g. say "3 times suspicious = bad", it may be good enough.

@ivmfnal
Contributor

ivmfnal commented Aug 3, 2023

Just as FYI, here is the suspicious replica recoverer config for ATLAS:

[
    {
        "action": "declare bad",
        "datatype": ["HITS"],
        "scope": [],
        "scope_wildcard": []
    },
    {
        "action": "ignore",
        "datatype": ["RAW"],
        "scope": [],
        "scope_wildcard": []
    }
]

https://gitlab.cern.ch/atlas-adc-ddm/flux-rucio/-/blob/master/releases/base/secrets/suspicious_replica_recoverer.json

@ericvaandering
Member Author

ericvaandering commented Sep 11, 2023 via email

@belforte
Member

belforte commented Sep 11, 2023 via email

@ivmfnal
Contributor

ivmfnal commented Sep 11, 2023

Would it not be easier to have CRAB be specific about which replica failed to read?

@belforte
Member

CRAB needs CMSSW to say whether the error happened opening a local file or a fallback one.
See my example. The corrupted file is at CERN, but cmsRun only mentions FNAL :-(

@ivmfnal
Contributor

ivmfnal commented Sep 11, 2023

I know.
Would it not be easier to have CRAB tell which replica was corrupt - the remote or the local one?

@belforte
Member

Which means: I can run that daemon! Yes, I know. I may have to do something of that sort anyhow.
Mostly I am realizing that we do not have a good classification/reporting of the file open/read errors.

@ericvaandering
Member Author

I wasn't suggesting that you need to write or run it. Just the process you are suggesting. WMAgent will probably run into the same issue.

So I understand the problem:

CRAB gets this error from CMSSW but CMSSW does not give enough information to know if the file is read locally or remotely?

Then, even if we knew "remotely" I could imagine problems knowing which remote file was read. CMSSW may not have any way of knowing that.

@belforte
Member

Correct. Usually when xrootd succeeds in opening a remote file, the new PFN is printed in the cmsRun log, but much to my disappointment, when the file open fails, nothing gets printed.
In a way, since the file is remote to begin with, CMSSW handles it like a "first time open" and fails with 8020, not as a "fallback open" which would result in 8028.

Of course, many times the file will be opened locally and the site where the job runs is sufficient information. But I am not sure how to reliably tell.

I think we need some "exploratory work", so I am going to simply report "suspected corruptions" as files on CERN EOS, to be consumed by some script (a crontab e.g.) which can cross-check and possibly make a Rucio call, so that the script can check the multiple replicas, if any.

IIUC Dima's plans, NANO* aside, we are going to have a single disk copy of files.
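
A minimal sketch of the replica lookup such a cross-checking script could start from (assuming the Rucio Python client and the cms scope; the actual checksum/size verification of each PFN is left out):

from rucio.client.replicaclient import ReplicaClient

def replicas_for_lfn(lfn, scope="cms"):
    """Map RSE name -> list of PFNs for all replicas of an LFN, to be checked one by one."""
    client = ReplicaClient()
    rse_to_pfns = {}
    for rep in client.list_replicas(dids=[{"scope": scope, "name": lfn}], schemes=["davs"]):
        for rse, pfns in rep.get("rses", {}).items():
            rse_to_pfns[rse] = pfns
    return rse_to_pfns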

@belforte
Member

(Somehow) later on we can think of incorporating all that code into something that parses cmsRun stdout and makes a call to Rucio on the fly. But I am not sure that we want to run the Rucio client on worker nodes, although it should be available since it is on cvmfs.
I guess I will not post more for a bit, while I try to map what's the situation out there.

@KatyEllis
Contributor

With my site admin hat on, I would also advocate for clearer information on when a job fails (whether production or analysis) due to local or remote reads.

@dynamic-entropy removed the triage (Issues that need investigation before action) label Sep 29, 2023
@ericvaandering
Member Author

I finally found my notes from a discussion with @ivmfnal and Cedric regarding this, so I paste them here. This would help us declare suspicious replicas. What we'd need to start investigating again is how to move things from suspicious to bad. There was a way to do that too and I've forgotten.

# rucio-admin config get --section conveyor --option suspicious_pattern
[conveyor]
suspicious_pattern=.*No such file or directory.*,.*CHECKSUM MISMATCH Source and destination checksums do not match.*,.*SOURCE CHECKSUM MISMATCH.*,.*Unable to read file - wrong file checksum.*,.*checksum verification failed.*
# rucio-admin config get --section kronos --option bad_files_patterns
[kronos]
bad_files_patterns=.*No such file or directory.*,.*no such file or directory.*,.*CHECKSUM MISMATCH Source and destination checksums do not match.*,.*SOURCE CHECKSUM MISMATCH.*,.*Unable to read file - wrong file checksum.*,.*checksum verification failed.*,.*direct_access.*,.*Copy failed with mode 3rd pull, with error: Transfer failed: failure: Server returned nothing.*,.*HTTP 404 : File not found.*

Here is the part of the code responsible for the declaration in the conveyor : https://github.com/rucio/rucio/blob/master/lib/rucio/daemons/conveyor/finisher.py#L301-L331
and in kronos : https://github.com/rucio/rucio/blob/master/lib/rucio/daemons/tracer/kronos.py#L131-L147
As you can see, for kronos, the traces are expected to have some specific key/values
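
If we ever want similar patterns on the CMS instance, they could presumably be set the same way they were read above, e.g. (a single illustrative pattern taken from the list above):

# rucio-admin config set --section conveyor --option suspicious_pattern --value ".*checksum verification failed.*"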

@belforte
Member

I will report in here as well what I wrote in the MM channel.
CRAB has been running now for 11 days in a mode where it reports (via files on EOS) all occurrences of FatalRootError;
so far only a few cases of "error in reading from an otherwise good file replica" have surfaced.
dmwm/CRABServer#7883 (comment)

I suggest we collect more experience before doing more work. The fear (at least on my side) was that there are way more problems than the few that users report. But if we stay at a handful a year, manual action is less effort than what we have already spent. I have plans to write a script to facilitate suspicious file checking, so that users (or the CRAB team) can report trustable information to Data Management operators who have the needed permissions to mark replicas bad.

@ivmfnal
Contributor

ivmfnal commented Oct 20, 2023

@ericvaandering based on my conversation with Cedric, some time ago I made this proposal. The idea is to use the replica recoverer to move replicas from suspicious to bad.

Here is my summary on how to configure the replica recoverer: #403 (comment)

@belforte
Member

I had the wrong permission on the directory where read-failure error reports are collected.
I have now collected 11 truncated files in a few hours :-(
I have a script which runs over all those reports, but so far it only makes daily summaries, not particularly good to read. @ivmfnal are we ready for me to add a suspicious replica report? In cases where ROOT says file blahblah.root is truncated at 1542455296 bytes: should be 2672978503, there is not much to argue, I presume. I can add a size check via gfal-ls e.g., but once that says that the size is different from what Rucio memorized... no point in doing more, right?
IIUC we can't report a reason for suspicious replicas; shall we look into marking those as BAD directly?
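
For the size check mentioned above, a minimal sketch of what the comparison against the Rucio catalog could look like (assuming the gfal2 Python bindings and the Rucio DIDClient are available where the script runs; the PFN and LFN come from the error report):

import gfal2                                    # assumes the gfal2 python bindings are installed
from rucio.client.didclient import DIDClient

def size_matches_catalog(pfn, lfn, scope="cms"):
    """Compare the size seen on storage (gfal stat) with the size recorded in the Rucio catalog."""
    catalog_size = DIDClient().get_did(scope=scope, name=lfn)["bytes"]
    storage_size = gfal2.creat_context().stat(pfn).st_size
    return storage_size == catalog_size, storage_size, catalog_size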

@belforte
Member

I have now set up an automated reporting pipeline, and am waiting for suggestions from DM ops and experts in here on how to progress.
Please see: https://its.cern.ch/jira/browse/CMSTRANSF-766

I think I am stuck since:

  1. I do not have permission to mark a replica as BAD myself, nor do I really want to
  2. If those bad replicas are not marked and resolved "promptly" we end up with some kind of "home grown DB" to keep track, which is the last thing I want to do
  3. my scripts are meant to give us a sense of what's going on and show what is possible, but they are not really a solid tool to be used for years to come. I know that I am not a good developer and will be more than happy if someone takes over.

Overall I think that with respect to ATLAS we have:

  • better information (truncated, not a ROOT file, potential read error)
  • no indication of the source from CMSSW due to the xrootd fall-back: when we detect a possibly corrupted LFN we do not know where the replica is
  • once we go through the trouble of checking all replicas to find the one which potentially caused the error, we know enough to take action if needed

So we need either a smarter Replica Recoverer daemon or something in the middle like what I scripted here, but I have become very cold about what to expect from current Rucio. The Replica Recoverer could have a great role for replicas found bad or missing during transfers. I am quite puzzled that there is no automation there yet.

In any case I have reached the limit of what CRAB support can do

Please, step in.

@belforte
Member

Since my time and patience are limited, I have set up crontabs to update all of that daily.
Access to information is unified by this Grafana panel
https://monit-grafana.cern.ch/d/CsnjLe6Mk/crab-overview?orgId=11&refresh=1h&from=1700063958182&to=1700236758182&viewPanel=116

IMHO it would make a lot of sense for DM operators to have a look daily and, if they agree, declare BAD replicas as suggested, after all the additional checks that they consider relevant.

I hope with this I can stop answering user reports of corrupted files.

@haozturk
Contributor

I'll close this one as we have a roadmap. We will track this effort at #805.
