Hello,
This is about testing a second EOS cluster attached to the existing CTA dev instance. This EOS cluster can talk to only one tape server (getafix-ts), which has the daemon SSS key of the cluster.
The file is written to tape: it gets an archive.file_id and it is visible in the list obtained via cta-admin tf ls --vid XXXX. However, EOS does not get notified about the archival and, as a result, does not handle/send a SYNC::ARCHIVED event.
Looking inside the object of the file, I saw (in the “jobs” field) a reference to another tape server (cta-ts10), which is on the same CTA dev instance but cannot talk to the second EOS cluster (it has a different SSS key):
“owner”: “ArchiveQueueFailed-clf_test-Maintenance-cta-ts10.scd.rl.ac.uk-2019-20220811-10:38:39-0-8”
Trying to grep for the object ID in the cta-taped.log of cta-ts10, I found the following about the owner process (I don't know what it is there for):
2022-08-18T09:32:13.661955+01:00 cta-ts10 cta-taped: LVL=“INFO” PID=“2019” TID=“2019” MSG=“In ArchiveJob::failReport(): requeued job for report retry.” SubprocessName=“maintenanceHandler” fileId=“4294969644” reportType=“CompletionReport” exceptionMSG=“In EOSReporter::AsyncQueryHandler::HandleResponse(): failed to XrdCl::FileSystem::Query() [FATAL] Auth failed code:204 errNo:0 status:3” copyNb=“1” failureReason=“In EOSReporter::AsyncQueryHandler::HandleResponse(): failed to XrdCl::FileSystem::Query() [FATAL] Auth failed code:204 errNo:0 status:3” requestObject=“ArchiveRequest-Frontend-cta-front02.scd.rl.ac.uk-1135-20220817-17:01:02-0-3” reportRetries=“1” maxReportRetries=“0”
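For anyone who wants to pull the relevant fields out of such lines rather than reading them by eye, here is a minimal sketch in Python. It assumes only the key="value" layout visible in the line above; the function name parse_fields and the shortened sample line are mine, not part of CTA, and the regex also accepts curly quotes because the forum may have converted them.

```python
import re

# Matches key="value" pairs. Curly quotes are accepted as well, since a forum
# renderer may have replaced the straight quotes of the original log file.
FIELD_RE = re.compile(r'(\w+)=["“](.*?)["”]')

def parse_fields(line):
    """Return the key="value" pairs of one log line as a dict (hypothetical helper)."""
    return dict(FIELD_RE.findall(line))

# Shortened copy of the log line quoted above.
sample = (
    '2022-08-18T09:32:13.661955+01:00 cta-ts10 cta-taped: LVL="INFO" PID="2019" '
    'MSG="In ArchiveJob::failReport(): requeued job for report retry." '
    'SubprocessName="maintenanceHandler" fileId="4294969644" '
    'reportType="CompletionReport" '
    'exceptionMSG="In EOSReporter::AsyncQueryHandler::HandleResponse(): '
    'failed to XrdCl::FileSystem::Query() [FATAL] Auth failed code:204 errNo:0 status:3" '
    'requestObject="ArchiveRequest-Frontend-cta-front02.scd.rl.ac.uk-1135-20220817-17:01:02-0-3"'
)

fields = parse_fields(sample)
for key in ("SubprocessName", "fileId", "reportType", "exceptionMSG", "requestObject"):
    print(f"{key}: {fields[key]}")
```

This only splits the line into fields; it does not interpret them. Values that themselves contain a double quote would be cut short by this simple pattern.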
What exactly is going on? Archiving works perfectly well for me, but not for another user. Would removing this object help?
ArchiveQueueFailed-clf_test-Maintenance-cta-ts10.scd.rl.ac.uk-2019-20220811-10:38:39-0-8
In the case of multiple EOS clusters, do the tape servers need to have the SSS keys of all the EOS instances attached to the same CTA instance?
Thanks,
George