Archive failed because of a Maintenance process on another tape server

Hello,

This is about testing a second EOS cluster attached to the existing CTA dev instance. This EOS cluster can talk to only one tape server (getafix-ts), which has the daemon SSS key of the cluster.

The file is written on tape: it gets an archive.file_id and it is visible in the list obtained via cta-admin tf ls --vid XXXX. However, EOS does not get notified about the archival and, as a result, does not handle/send a SYNC::ARCHIVED event.

Looking inside the file's object, I saw (in the "jobs" field) a reference to another tape server (cta-ts10), which is on the same CTA dev instance but cannot talk to the second EOS cluster (it has a different SSS key):

"owner": "ArchiveQueueFailed-clf_test-Maintenance-cta-ts10.scd.rl.ac.uk-2019-20220811-10:38:39-0-8"
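(For reference, I looked inside the object with the CTA objectstore tools; the objectstore URL and object name below are just placeholders for our setup, and the exact invocation may differ between versions:

cta-objectstore-dump-object <objectstore URL> <ArchiveRequest object name>
)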

Trying to grep for the object ID in the cta-taped.log of cta-ts10, I found the following about the owner process (which I don't know what it is there for…)

2022-08-18T09:32:13.661955+01:00 cta-ts10 cta-taped: LVL="INFO" PID="2019" TID="2019" MSG="In ArchiveJob::failReport(): requeued job for report retry." SubprocessName="maintenanceHandler" fileId="4294969644" reportType="CompletionReport" exceptionMSG="In EOSReporter::AsyncQueryHandler::HandleResponse(): failed to XrdCl::FileSystem::Query() [FATAL] Auth failed code:204 errNo:0 status:3" copyNb="1" failureReason="In EOSReporter::AsyncQueryHandler::HandleResponse(): failed to XrdCl::FileSystem::Query() [FATAL] Auth failed code:204 errNo:0 status:3" requestObject="ArchiveRequest-Frontend-cta-front02.scd.rl.ac.uk-1135-20220817-17:01:02-0-3" reportRetries="1" maxReportRetries="0"

What exactly is going on? I can archive perfectly OK myself, but another user cannot. Will removing this object help?

ArchiveQueueFailed-clf_test-Maintenance-cta-ts10.scd.rl.ac.uk-2019-20220811-10:38:39-0-8

In the case of multiple EOS clusters, do the tapeservers need to have the SSS keys of all the EOS instances attached to the same CTA?

Thanks,

George

Hi George,

The scheduler is distributed over all tape servers. It is expected that one tape server will handle the archival request and a different (idle) tape server will handle the reporting for the same job. All tape servers should be able to talk to all EOS instances to facilitate this.
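As a quick check that a given tape server has the keys it needs, you can list its SSS keytab with the standard XRootD tool and confirm there is an entry for each EOS instance. The keytab path below is only an example; use whatever path your deployment configures for cta-taped:

xrdsssadmin list /etc/cta/eos.sss.keytab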

If you want, you can disable the maintenance process on individual tape servers (see the latest cta-taped manual page for details). However, there is at present no way to restrict certain tape servers to certain EOS instances.

In your case above, all retry attempts have been exhausted and the request is now in the failed jobs queue. It will stay there until handled by an operator. See cta-admin failedrequest for commands to query or delete failed jobs.
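For example (I am going from memory, so please check the exact subcommands against cta-admin --help on your version), listing the failed requests would be along the lines of:

cta-admin failedrequest ls

and there is a corresponding subcommand for removing a failed request once you are sure it can be dropped.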

Cheers,

Michael

Thanks for this, Michael

George