Hi Michael,
First of all, thanks for the explanation and you were right that removing this parameter wouldn’t solve our problems, as our happiness has lasted only a few hours and we have seen that disabling this parameters doesn’t make our system work properly.
We met with our colleagues at DESY to see if they could help, since we are both using dCache+CTA.
After the meeting, Eli took a deeper look at the logs and she saw the part that says SubprocessName="maintenanceHandler"
and she recalled to have seen this in a config file.
We set the parameter UseMaintenanceProcess
to no
just to test, and we wrote the people at DESY to see if they had it. They said no, and we thought it could have something to do with it.
The default value of UseMaintenanceProcess
is yes
, so removing it from the config file wouldn’t change the behavior on the error appearing.
We honestly have no idea were to continue, we can explain to you and the team, if you prefer in a private thread, our setup, settings, etc, to see if you have any possible idea or if you can point us to another possible source of the problems.
We are not using EOS, but it seems that the exception comes from and EOSReporter class process that is used to query back our dCache system to notify the transfer finished correctly, with the particularity that it does it twice, so the first time it works fine, and the second raises an error.
We think for the moment we are not going to look a lot more into it, since we in a kind of cul-de-sac, and we are not testing other things we want to. Of course we can not go into production with this error appearing, but for the moment we-ll leave it a little bit aside for our own mental health
I leave the logs we see here in case in can be of any help.
Dec 13 16:11:55.136914 ctatps001 cta-taped: LVL="ERROR" PID="28593" TID="28593" MSG="In Scheduler::reportArchiveJobsBatch(): failed to report." SubprocessName="maintenanceHandler" fileId="266" reportType="CompletionReport" exceptionMSG="In EOSReporter::AsyncQueryHandler::HandleResponse(): failed to XrdCl::FileSystem::Query() [ERROR] Operation expired code:206 errNo:0 status:1"
Dec 13 16:11:55.161024 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In Helpers::getLockedAndFetchedQueue<ArchiveQueue>(): Successfully found and locked an archive queue." SubprocessName="maintenanceHandler" fileId="266" reportType="CompletionReport" exceptionMSG="In EOSReporter::AsyncQueryHandler::HandleResponse(): failed to XrdCl::FileSystem::Query() [ERROR] Operation expired code:206 errNo:0 status:1" attemptNb="1" queueObject="ArchiveQueueToReportForUser-dteam-Maintenance-ctatps001.pic.es-28593-20221213-16:11:39-0-0" rootFetchNoLockTime="0.000802" rootRelockExclusiveTime="0.008640" rootRefetchTime="0.000337" addOrGetQueueandCommitTime="0.007413" rootUnlockExclusiveTime="0.000479" queueLockTime="0.000670" queueFetchTime="0.000414"
Dec 13 16:11:55.172387 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In ContainerAlgorithms::referenceAndSwitchOwnership(): Requeued a batch of elements." SubprocessName="maintenanceHandler" fileId="266" reportType="CompletionReport" exceptionMSG="In EOSReporter::AsyncQueryHandler::HandleResponse(): failed to XrdCl::FileSystem::Query() [ERROR] Operation expired code:206 errNo:0 status:1" C="ArchiveQueueToReportForUser" tapepool="dteam" containerAddress="ArchiveQueueToReportForUser-dteam-Maintenance-ctatps001.pic.es-28593-20221213-16:11:39-0-0" queueJobsBefore="0" queueBytesBefore="0" queueJobsAfter="1" queueBytesAfter="1048576000" queueLockFetchTime="0.019046" queueProcessAndCommitTime="0.003286" asyncUpdateLaunchTime="0.000155" asyncUpdateCompletionTime="0.003682" requestsUpdatingTime="0.000037" queueUnlockTime="0.000388"
Dec 13 16:11:55.172720 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In ArchiveJob::failReport(): requeued job for report retry." SubprocessName="maintenanceHandler" fileId="266" reportType="CompletionReport" exceptionMSG="In EOSReporter::AsyncQueryHandler::HandleResponse(): failed to XrdCl::FileSystem::Query() [ERROR] Operation expired code:206 errNo:0 status:1" copyNb="1" failureReason="In EOSReporter::AsyncQueryHandler::HandleResponse(): failed to XrdCl::FileSystem::Query() [ERROR] Operation expired code:206 errNo:0 status:1" requestObject="ArchiveRequest-Frontend-ctafrontend01.pic.es-4637-20221212-11:42:22-0-15" reportRetries="1" maxReportRetries="0"
Dec 13 16:11:55.173027 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In Scheduler::reportArchiveJobsBatch(): reported a batch of archive jobs." SubprocessName="maintenanceHandler" totalReports="1" failedReports="1" successfulReports="0" getArchiveJobsToReportTime="0.140234" asyncReportLaunchTime="0.002518" reportCompletionTime="15.038548" reportRecordingInSchedDbTime="0.000055"
Dec 13 16:11:55.173204 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In DiskReportRunner::runOnePass(): did one round of archive reports." SubprocessName="maintenanceHandler" roundTime="15.181584"
Dec 13 16:11:55.174060 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In OStoreDB::ArchiveJob::~ArchiveJob(): will leave the job owned after destruction." agentObject="Maintenance-ctatps001.pic.es-28593-20221213-16:11:39-0" jobObject="ArchiveRequest-Frontend-ctafrontend01.pic.es-4637-20221212-11:42:22-0-15"
Dec 13 16:11:55.289966 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In RootEntry::removeArchiveQueueAndCommit(): removed archive queue." SubprocessName="maintenanceHandler" archiveQueueObject="ArchiveQueueToReportForUser-dteam-Maintenance-ctatps001.pic.es-28593-20221213-16:11:39-0-0"
Dec 13 16:11:55.291861 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In RootEntry::removeArchiveQueueAndCommit(): removed archive queue reference." SubprocessName="maintenanceHandler" tapePool="dteam" queueType="JobsToReportToUser"
Dec 13 16:11:55.292285 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In ContainerTraits<ArchiveQueue_t,ArchiveQueue>::trimContainerIfNeeded(): deleted empty queue" SubprocessName="maintenanceHandler" tapepool="dteam" queueObject="ArchiveQueueToReportForUser-dteam-Maintenance-ctatps001.pic.es-28593-20221213-16:11:39-0-0" queueUnlockTime="0.000378" rootEntryLockTime="0.000487" rootEntryFetchTime="0.000270" rootEntryRemoveArchiveQueueAndCommitTime="0.104882"
Dec 13 16:11:55.293058 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In Algorithms::popNextBatch(): did one round of elements retrieval." SubprocessName="maintenanceHandler" C="ArchiveQueueToReportForUser" tapepool="dteam" containerAddress="ArchiveQueueToReportForUser-dteam-Maintenance-ctatps001.pic.es-28593-20221213-16:11:39-0-0" filesAdded="1" filesBefore="0" filesAfter="1" queueJobsBefore="1" queueBytesBefore="1048576000" queueJobsAfter="0" queueBytesAfter="0" findLockFetchQueueTime="0.001490" jobSelectionTime="0.000499" ownershipAdditionTime="0.000010" asyncUpdateLaunchTime="0.003238" asyncUpdateCompletionTime="0.003439" updateResultProcessingTime="0.000023" containerUpdateTime="0.002829" containerTrimmingTime="0.106705" containerUnlockTime="0.000008" structureProcessingTime="0.000018"
Dec 13 16:11:55.293457 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In Algorithms::popNextBatch(): elements retrieval complete." SubprocessName="maintenanceHandler" C="ArchiveQueueToReportForUser" tapepool="dteam" files="1" findLockFetchQueueTime="0.001490" jobSelectionTime="0.000499" ownershipAdditionTime="0.000010" asyncUpdateLaunchTime="0.003238" asyncUpdateCompletionTime="0.003439" updateResultProcessingTime="0.000023" containerUpdateTime="0.002829" containerTrimmingTime="0.106705" containerUnlockTime="0.000008" structureProcessingTime="0.000018" schedulerDbTime="0.118792" iterationCount="1"
Dec 13 16:12:10.138993 ctatps001 cta-taped: LVL="ERROR" PID="28593" TID="28593" MSG="In Scheduler::reportArchiveJobsBatch(): failed to report." SubprocessName="maintenanceHandler" fileId="266" reportType="CompletionReport" exceptionMSG="In EOSReporter::AsyncQueryHandler::HandleResponse(): failed to XrdCl::FileSystem::Query() [ERROR] Operation expired code:206 errNo:0 status:1"
Dec 13 16:12:10.145407 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In Helpers::getLockedAndFetchedQueue<ArchiveQueue>(): Successfully found and locked an archive queue." SubprocessName="maintenanceHandler" fileId="266" reportType="CompletionReport" exceptionMSG="In EOSReporter::AsyncQueryHandler::HandleResponse(): failed to XrdCl::FileSystem::Query() [ERROR] Operation expired code:206 errNo:0 status:1" attemptNb="1" queueObject="ArchiveQueueFailed-dteam-Maintenance-ctatps001.pic.es-27672-20221212-11:23:15-0-2" rootFetchNoLockTime="0.000851" rootRelockExclusiveTime="0.000000" rootRefetchTime="0.000000" addOrGetQueueandCommitTime="0.000000" rootUnlockExclusiveTime="0.000000" queueLockTime="0.000759" queueFetchTime="0.000482"
Dec 13 16:12:10.157498 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In ContainerAlgorithms::referenceAndSwitchOwnership(): Requeued a batch of elements." SubprocessName="maintenanceHandler" fileId="266" reportType="CompletionReport" exceptionMSG="In EOSReporter::AsyncQueryHandler::HandleResponse(): failed to XrdCl::FileSystem::Query() [ERROR] Operation expired code:206 errNo:0 status:1" C="ArchiveQueueFailed" tapepool="dteam" containerAddress="ArchiveQueueFailed-dteam-Maintenance-ctatps001.pic.es-27672-20221212-11:23:15-0-2" queueJobsBefore="1" queueBytesBefore="1048576000" queueJobsAfter="2" queueBytesAfter="2097152000" queueLockFetchTime="0.002329" queueProcessAndCommitTime="0.004125" asyncUpdateLaunchTime="0.000112" asyncUpdateCompletionTime="0.003651" requestsUpdatingTime="0.000035" queueUnlockTime="0.000461"
Dec 13 16:12:10.157814 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In ArchiveJob::failReport(): stored job in failed container for operator handling." SubprocessName="maintenanceHandler" fileId="266" reportType="CompletionReport" exceptionMSG="In EOSReporter::AsyncQueryHandler::HandleResponse(): failed to XrdCl::FileSystem::Query() [ERROR] Operation expired code:206 errNo:0 status:1" copyNb="1" failureReason="In EOSReporter::AsyncQueryHandler::HandleResponse(): failed to XrdCl::FileSystem::Query() [ERROR] Operation expired code:206 errNo:0 status:1" requestObject="ArchiveRequest-Frontend-ctafrontend01.pic.es-4637-20221212-11:42:22-0-15" reportRetries="2" maxReportRetries="0"
Dec 13 16:12:10.158056 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In Scheduler::reportArchiveJobsBatch(): reported a batch of archive jobs." SubprocessName="maintenanceHandler" totalReports="1" failedReports="1" successfulReports="0" getArchiveJobsToReportTime="0.119599" asyncReportLaunchTime="0.000182" reportCompletionTime="14.864025" reportRecordingInSchedDbTime="0.000017"
Dec 13 16:12:10.158201 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In DiskReportRunner::runOnePass(): did one round of archive reports." SubprocessName="maintenanceHandler" roundTime="14.984011"
Dec 13 16:12:10.158765 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In OStoreDB::ArchiveJob::~ArchiveJob(): will leave the job owned after destruction." agentObject="Maintenance-ctatps001.pic.es-28593-20221213-16:11:39-0" jobObject="ArchiveRequest-Frontend-ctafrontend01.pic.es-4637-20221212-11:42:22-0-15"
Dec 13 16:12:10.158874 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In DiskReportRunner::runOnePass(): finished one pass." SubprocessName="maintenanceHandler" roundCount="0" passTime="30.167261"
Dec 13 16:13:40.526458 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In GarbageCollector::cleanupDeadAgent(): will cleanup dead agent." SubprocessName="maintenanceHandler" agentAddress="DriveProcess-257-ctatps001.pic.es-6156-20221213-12:45:52-0" gcAgentAddress="Maintenance-ctatps001.pic.es-28593-20221213-16:11:39-0" currentHeartbeat="2" GCRequested="false" timeout="120.000000" timer="120.538182" heartbeatAtTimerStart="2"
Dec 13 16:13:40.527771 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In Agent::removeAndUnregisterSelf(): Removed agent object." SubprocessName="maintenanceHandler" agentAddress="DriveProcess-257-ctatps001.pic.es-6156-20221213-12:45:52-0" gcAgentAddress="Maintenance-ctatps001.pic.es-28593-20221213-16:11:39-0" agentObject="DriveProcess-257-ctatps001.pic.es-6156-20221213-12:45:52-0"
Dec 13 16:13:40.531285 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In GarbageCollector::cleanupDeadAgent(): agent entry removed." SubprocessName="maintenanceHandler" agentAddress="DriveProcess-257-ctatps001.pic.es-6156-20221213-12:45:52-0" gcAgentAddress="Maintenance-ctatps001.pic.es-28593-20221213-16:11:39-0"
Dec 13 16:13:40.545606 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In GarbageCollector::cleanupDeadAgent(): will cleanup dead agent." SubprocessName="maintenanceHandler" agentAddress="DriveProcess-257-ctatps001.pic.es-6275-20221213-12:47:04-0" gcAgentAddress="Maintenance-ctatps001.pic.es-28593-20221213-16:11:39-0" currentHeartbeat="2" GCRequested="false" timeout="120.000000" timer="120.549931" heartbeatAtTimerStart="2"
Dec 13 16:13:40.546758 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In Agent::removeAndUnregisterSelf(): Removed agent object." SubprocessName="maintenanceHandler" agentAddress="DriveProcess-257-ctatps001.pic.es-6275-20221213-12:47:04-0" gcAgentAddress="Maintenance-ctatps001.pic.es-28593-20221213-16:11:39-0" agentObject="DriveProcess-257-ctatps001.pic.es-6275-20221213-12:47:04-0"
Dec 13 16:13:40.550540 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In GarbageCollector::cleanupDeadAgent(): agent entry removed." SubprocessName="maintenanceHandler" agentAddress="DriveProcess-257-ctatps001.pic.es-6275-20221213-12:47:04-0" gcAgentAddress="Maintenance-ctatps001.pic.es-28593-20221213-16:11:39-0"
Dec 13 16:13:40.565345 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In GarbageCollector::cleanupDeadAgent(): will cleanup dead agent." SubprocessName="maintenanceHandler" agentAddress="DriveProcess-257-ctatps001.pic.es-6873-20221213-12:51:49-0" gcAgentAddress="Maintenance-ctatps001.pic.es-28593-20221213-16:11:39-0" currentHeartbeat="399" GCRequested="false" timeout="120.000000" timer="120.567059" heartbeatAtTimerStart="399"
Dec 13 16:13:40.566507 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In Agent::removeAndUnregisterSelf(): Removed agent object." SubprocessName="maintenanceHandler" agentAddress="DriveProcess-257-ctatps001.pic.es-6873-20221213-12:51:49-0" gcAgentAddress="Maintenance-ctatps001.pic.es-28593-20221213-16:11:39-0" agentObject="DriveProcess-257-ctatps001.pic.es-6873-20221213-12:51:49-0"
Dec 13 16:13:40.570384 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In GarbageCollector::cleanupDeadAgent(): agent entry removed." SubprocessName="maintenanceHandler" agentAddress="DriveProcess-257-ctatps001.pic.es-6873-20221213-12:51:49-0" gcAgentAddress="Maintenance-ctatps001.pic.es-28593-20221213-16:11:39-0"
Again, thank you, and sorry for bothering with this, but we are surely coming back to it at some point in the future.
Cheers,
Jordi