What is the function of UseMaintenanceProcess parameter?

Hello everyone,

As the title says, I wanted to ask you about this parameter that comes by default at config file /etc/cta/cta-taped.conf:

# Enable MaintenanceProcess for repack-related operations, garbage collection and disk reporting.
taped UseMaintenanceProcess yes

What does it do exactly? What is the difference of being enabled or disabled?
We had this error (EOSReporter fails with CTA+dCache) and we can only get rid of it if we disable this.

My concern is if having this disabled will affect other operations as said in the comment or will have some kind of impact. At our logs I also see that this “MaintenanceHandler” seems to be doing more stuff other than reporting the success of the transfer to the user.

Take into consideration that our CTA instance is working with dCache. Is this parameter perhaps unnecessary in this case and can we keep it disabled?

Thank you very much in advance :slight_smile:

Eli

Hi Eli,

The UseMaintenanceProcess should be used if you want to restrict which cta-taped daemons will run maintenance jobs like garbage collection and reporting.

The CTA workflow is fully distributed across all tape servers, so for example tape server 23 might archive a file and queue the reporting job, which could be picked up by tape server 76. This distributes the load of maintenance jobs to servers which are not busy with archive/retrieve work.

The disadvantage is that if something goes wrong and you want to trace it, you have to figure out which tape servers processed the work. The UseMaintenanceProcess option was provided to allow you to restrict the maintenance jobs to a small pool of tape servers for easier monitoring. In theory you could even have some tape servers fully dedicated to maintenance jobs, though we have not tested this configuration.

However—this option should not be used as a solution to your EOSReporter problem, as if you disable this on all your tape servers, then no maintenance jobs will run. This means no reporting and no garbage collection.

Cheers,

Michael

Hi Michael,

First of all, thanks for the explanation and you were right that removing this parameter wouldn’t solve our problems, as our happiness has lasted only a few hours and we have seen that disabling this parameters doesn’t make our system work properly.

We met with our colleagues at DESY to see if they could help, since we are both using dCache+CTA.
After the meeting, Eli took a deeper look at the logs and she saw the part that says SubprocessName="maintenanceHandler" and she recalled to have seen this in a config file.
We set the parameter UseMaintenanceProcess to no just to test, and we wrote the people at DESY to see if they had it. They said no, and we thought it could have something to do with it.

The default value of UseMaintenanceProcess is yes, so removing it from the config file wouldn’t change the behavior on the error appearing.
We honestly have no idea were to continue, we can explain to you and the team, if you prefer in a private thread, our setup, settings, etc, to see if you have any possible idea or if you can point us to another possible source of the problems.

We are not using EOS, but it seems that the exception comes from and EOSReporter class process that is used to query back our dCache system to notify the transfer finished correctly, with the particularity that it does it twice, so the first time it works fine, and the second raises an error.

We think for the moment we are not going to look a lot more into it, since we in a kind of cul-de-sac, and we are not testing other things we want to. Of course we can not go into production with this error appearing, but for the moment we-ll leave it a little bit aside for our own mental health :slight_smile:

I leave the logs we see here in case in can be of any help.

Dec 13 16:11:55.136914 ctatps001 cta-taped: LVL="ERROR" PID="28593" TID="28593" MSG="In Scheduler::reportArchiveJobsBatch(): failed to report." SubprocessName="maintenanceHandler" fileId="266" reportType="CompletionReport" exceptionMSG="In EOSReporter::AsyncQueryHandler::HandleResponse(): failed to XrdCl::FileSystem::Query() [ERROR] Operation expired code:206 errNo:0 status:1"
Dec 13 16:11:55.161024 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In Helpers::getLockedAndFetchedQueue<ArchiveQueue>(): Successfully found and locked an archive queue." SubprocessName="maintenanceHandler" fileId="266" reportType="CompletionReport" exceptionMSG="In EOSReporter::AsyncQueryHandler::HandleResponse(): failed to XrdCl::FileSystem::Query() [ERROR] Operation expired code:206 errNo:0 status:1" attemptNb="1" queueObject="ArchiveQueueToReportForUser-dteam-Maintenance-ctatps001.pic.es-28593-20221213-16:11:39-0-0" rootFetchNoLockTime="0.000802" rootRelockExclusiveTime="0.008640" rootRefetchTime="0.000337" addOrGetQueueandCommitTime="0.007413" rootUnlockExclusiveTime="0.000479" queueLockTime="0.000670" queueFetchTime="0.000414"
Dec 13 16:11:55.172387 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In ContainerAlgorithms::referenceAndSwitchOwnership(): Requeued a batch of elements." SubprocessName="maintenanceHandler" fileId="266" reportType="CompletionReport" exceptionMSG="In EOSReporter::AsyncQueryHandler::HandleResponse(): failed to XrdCl::FileSystem::Query() [ERROR] Operation expired code:206 errNo:0 status:1" C="ArchiveQueueToReportForUser" tapepool="dteam" containerAddress="ArchiveQueueToReportForUser-dteam-Maintenance-ctatps001.pic.es-28593-20221213-16:11:39-0-0" queueJobsBefore="0" queueBytesBefore="0" queueJobsAfter="1" queueBytesAfter="1048576000" queueLockFetchTime="0.019046" queueProcessAndCommitTime="0.003286" asyncUpdateLaunchTime="0.000155" asyncUpdateCompletionTime="0.003682" requestsUpdatingTime="0.000037" queueUnlockTime="0.000388"
Dec 13 16:11:55.172720 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In ArchiveJob::failReport(): requeued job for report retry." SubprocessName="maintenanceHandler" fileId="266" reportType="CompletionReport" exceptionMSG="In EOSReporter::AsyncQueryHandler::HandleResponse(): failed to XrdCl::FileSystem::Query() [ERROR] Operation expired code:206 errNo:0 status:1" copyNb="1" failureReason="In EOSReporter::AsyncQueryHandler::HandleResponse(): failed to XrdCl::FileSystem::Query() [ERROR] Operation expired code:206 errNo:0 status:1" requestObject="ArchiveRequest-Frontend-ctafrontend01.pic.es-4637-20221212-11:42:22-0-15" reportRetries="1" maxReportRetries="0"
Dec 13 16:11:55.173027 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In Scheduler::reportArchiveJobsBatch(): reported a batch of archive jobs." SubprocessName="maintenanceHandler" totalReports="1" failedReports="1" successfulReports="0" getArchiveJobsToReportTime="0.140234" asyncReportLaunchTime="0.002518" reportCompletionTime="15.038548" reportRecordingInSchedDbTime="0.000055"
Dec 13 16:11:55.173204 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In DiskReportRunner::runOnePass(): did one round of archive reports." SubprocessName="maintenanceHandler" roundTime="15.181584"
Dec 13 16:11:55.174060 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In OStoreDB::ArchiveJob::~ArchiveJob(): will leave the job owned after destruction." agentObject="Maintenance-ctatps001.pic.es-28593-20221213-16:11:39-0" jobObject="ArchiveRequest-Frontend-ctafrontend01.pic.es-4637-20221212-11:42:22-0-15"
Dec 13 16:11:55.289966 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In RootEntry::removeArchiveQueueAndCommit(): removed archive queue." SubprocessName="maintenanceHandler" archiveQueueObject="ArchiveQueueToReportForUser-dteam-Maintenance-ctatps001.pic.es-28593-20221213-16:11:39-0-0"
Dec 13 16:11:55.291861 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In RootEntry::removeArchiveQueueAndCommit(): removed archive queue reference." SubprocessName="maintenanceHandler" tapePool="dteam" queueType="JobsToReportToUser"
Dec 13 16:11:55.292285 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In ContainerTraits<ArchiveQueue_t,ArchiveQueue>::trimContainerIfNeeded(): deleted empty queue" SubprocessName="maintenanceHandler" tapepool="dteam" queueObject="ArchiveQueueToReportForUser-dteam-Maintenance-ctatps001.pic.es-28593-20221213-16:11:39-0-0" queueUnlockTime="0.000378" rootEntryLockTime="0.000487" rootEntryFetchTime="0.000270" rootEntryRemoveArchiveQueueAndCommitTime="0.104882"
Dec 13 16:11:55.293058 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In Algorithms::popNextBatch(): did one round of elements retrieval." SubprocessName="maintenanceHandler" C="ArchiveQueueToReportForUser" tapepool="dteam" containerAddress="ArchiveQueueToReportForUser-dteam-Maintenance-ctatps001.pic.es-28593-20221213-16:11:39-0-0" filesAdded="1" filesBefore="0" filesAfter="1" queueJobsBefore="1" queueBytesBefore="1048576000" queueJobsAfter="0" queueBytesAfter="0" findLockFetchQueueTime="0.001490" jobSelectionTime="0.000499" ownershipAdditionTime="0.000010" asyncUpdateLaunchTime="0.003238" asyncUpdateCompletionTime="0.003439" updateResultProcessingTime="0.000023" containerUpdateTime="0.002829" containerTrimmingTime="0.106705" containerUnlockTime="0.000008" structureProcessingTime="0.000018"
Dec 13 16:11:55.293457 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In Algorithms::popNextBatch(): elements retrieval complete." SubprocessName="maintenanceHandler" C="ArchiveQueueToReportForUser" tapepool="dteam" files="1" findLockFetchQueueTime="0.001490" jobSelectionTime="0.000499" ownershipAdditionTime="0.000010" asyncUpdateLaunchTime="0.003238" asyncUpdateCompletionTime="0.003439" updateResultProcessingTime="0.000023" containerUpdateTime="0.002829" containerTrimmingTime="0.106705" containerUnlockTime="0.000008" structureProcessingTime="0.000018" schedulerDbTime="0.118792" iterationCount="1"
Dec 13 16:12:10.138993 ctatps001 cta-taped: LVL="ERROR" PID="28593" TID="28593" MSG="In Scheduler::reportArchiveJobsBatch(): failed to report." SubprocessName="maintenanceHandler" fileId="266" reportType="CompletionReport" exceptionMSG="In EOSReporter::AsyncQueryHandler::HandleResponse(): failed to XrdCl::FileSystem::Query() [ERROR] Operation expired code:206 errNo:0 status:1"
Dec 13 16:12:10.145407 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In Helpers::getLockedAndFetchedQueue<ArchiveQueue>(): Successfully found and locked an archive queue." SubprocessName="maintenanceHandler" fileId="266" reportType="CompletionReport" exceptionMSG="In EOSReporter::AsyncQueryHandler::HandleResponse(): failed to XrdCl::FileSystem::Query() [ERROR] Operation expired code:206 errNo:0 status:1" attemptNb="1" queueObject="ArchiveQueueFailed-dteam-Maintenance-ctatps001.pic.es-27672-20221212-11:23:15-0-2" rootFetchNoLockTime="0.000851" rootRelockExclusiveTime="0.000000" rootRefetchTime="0.000000" addOrGetQueueandCommitTime="0.000000" rootUnlockExclusiveTime="0.000000" queueLockTime="0.000759" queueFetchTime="0.000482"
Dec 13 16:12:10.157498 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In ContainerAlgorithms::referenceAndSwitchOwnership(): Requeued a batch of elements." SubprocessName="maintenanceHandler" fileId="266" reportType="CompletionReport" exceptionMSG="In EOSReporter::AsyncQueryHandler::HandleResponse(): failed to XrdCl::FileSystem::Query() [ERROR] Operation expired code:206 errNo:0 status:1" C="ArchiveQueueFailed" tapepool="dteam" containerAddress="ArchiveQueueFailed-dteam-Maintenance-ctatps001.pic.es-27672-20221212-11:23:15-0-2" queueJobsBefore="1" queueBytesBefore="1048576000" queueJobsAfter="2" queueBytesAfter="2097152000" queueLockFetchTime="0.002329" queueProcessAndCommitTime="0.004125" asyncUpdateLaunchTime="0.000112" asyncUpdateCompletionTime="0.003651" requestsUpdatingTime="0.000035" queueUnlockTime="0.000461"
Dec 13 16:12:10.157814 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In ArchiveJob::failReport(): stored job in failed container for operator handling." SubprocessName="maintenanceHandler" fileId="266" reportType="CompletionReport" exceptionMSG="In EOSReporter::AsyncQueryHandler::HandleResponse(): failed to XrdCl::FileSystem::Query() [ERROR] Operation expired code:206 errNo:0 status:1" copyNb="1" failureReason="In EOSReporter::AsyncQueryHandler::HandleResponse(): failed to XrdCl::FileSystem::Query() [ERROR] Operation expired code:206 errNo:0 status:1" requestObject="ArchiveRequest-Frontend-ctafrontend01.pic.es-4637-20221212-11:42:22-0-15" reportRetries="2" maxReportRetries="0"
Dec 13 16:12:10.158056 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In Scheduler::reportArchiveJobsBatch(): reported a batch of archive jobs." SubprocessName="maintenanceHandler" totalReports="1" failedReports="1" successfulReports="0" getArchiveJobsToReportTime="0.119599" asyncReportLaunchTime="0.000182" reportCompletionTime="14.864025" reportRecordingInSchedDbTime="0.000017"
Dec 13 16:12:10.158201 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In DiskReportRunner::runOnePass(): did one round of archive reports." SubprocessName="maintenanceHandler" roundTime="14.984011"
Dec 13 16:12:10.158765 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In OStoreDB::ArchiveJob::~ArchiveJob(): will leave the job owned after destruction." agentObject="Maintenance-ctatps001.pic.es-28593-20221213-16:11:39-0" jobObject="ArchiveRequest-Frontend-ctafrontend01.pic.es-4637-20221212-11:42:22-0-15"
Dec 13 16:12:10.158874 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In DiskReportRunner::runOnePass(): finished one pass." SubprocessName="maintenanceHandler" roundCount="0" passTime="30.167261"
Dec 13 16:13:40.526458 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In GarbageCollector::cleanupDeadAgent(): will cleanup dead agent." SubprocessName="maintenanceHandler" agentAddress="DriveProcess-257-ctatps001.pic.es-6156-20221213-12:45:52-0" gcAgentAddress="Maintenance-ctatps001.pic.es-28593-20221213-16:11:39-0" currentHeartbeat="2" GCRequested="false" timeout="120.000000" timer="120.538182" heartbeatAtTimerStart="2"
Dec 13 16:13:40.527771 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In Agent::removeAndUnregisterSelf(): Removed agent object." SubprocessName="maintenanceHandler" agentAddress="DriveProcess-257-ctatps001.pic.es-6156-20221213-12:45:52-0" gcAgentAddress="Maintenance-ctatps001.pic.es-28593-20221213-16:11:39-0" agentObject="DriveProcess-257-ctatps001.pic.es-6156-20221213-12:45:52-0"
Dec 13 16:13:40.531285 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In GarbageCollector::cleanupDeadAgent(): agent entry removed." SubprocessName="maintenanceHandler" agentAddress="DriveProcess-257-ctatps001.pic.es-6156-20221213-12:45:52-0" gcAgentAddress="Maintenance-ctatps001.pic.es-28593-20221213-16:11:39-0"
Dec 13 16:13:40.545606 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In GarbageCollector::cleanupDeadAgent(): will cleanup dead agent." SubprocessName="maintenanceHandler" agentAddress="DriveProcess-257-ctatps001.pic.es-6275-20221213-12:47:04-0" gcAgentAddress="Maintenance-ctatps001.pic.es-28593-20221213-16:11:39-0" currentHeartbeat="2" GCRequested="false" timeout="120.000000" timer="120.549931" heartbeatAtTimerStart="2"
Dec 13 16:13:40.546758 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In Agent::removeAndUnregisterSelf(): Removed agent object." SubprocessName="maintenanceHandler" agentAddress="DriveProcess-257-ctatps001.pic.es-6275-20221213-12:47:04-0" gcAgentAddress="Maintenance-ctatps001.pic.es-28593-20221213-16:11:39-0" agentObject="DriveProcess-257-ctatps001.pic.es-6275-20221213-12:47:04-0"
Dec 13 16:13:40.550540 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In GarbageCollector::cleanupDeadAgent(): agent entry removed." SubprocessName="maintenanceHandler" agentAddress="DriveProcess-257-ctatps001.pic.es-6275-20221213-12:47:04-0" gcAgentAddress="Maintenance-ctatps001.pic.es-28593-20221213-16:11:39-0"
Dec 13 16:13:40.565345 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In GarbageCollector::cleanupDeadAgent(): will cleanup dead agent." SubprocessName="maintenanceHandler" agentAddress="DriveProcess-257-ctatps001.pic.es-6873-20221213-12:51:49-0" gcAgentAddress="Maintenance-ctatps001.pic.es-28593-20221213-16:11:39-0" currentHeartbeat="399" GCRequested="false" timeout="120.000000" timer="120.567059" heartbeatAtTimerStart="399"
Dec 13 16:13:40.566507 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In Agent::removeAndUnregisterSelf(): Removed agent object." SubprocessName="maintenanceHandler" agentAddress="DriveProcess-257-ctatps001.pic.es-6873-20221213-12:51:49-0" gcAgentAddress="Maintenance-ctatps001.pic.es-28593-20221213-16:11:39-0" agentObject="DriveProcess-257-ctatps001.pic.es-6873-20221213-12:51:49-0"
Dec 13 16:13:40.570384 ctatps001 cta-taped: LVL="INFO" PID="28593" TID="28593" MSG="In GarbageCollector::cleanupDeadAgent(): agent entry removed." SubprocessName="maintenanceHandler" agentAddress="DriveProcess-257-ctatps001.pic.es-6873-20221213-12:51:49-0" gcAgentAddress="Maintenance-ctatps001.pic.es-28593-20221213-16:11:39-0"

Again, thank you, and sorry for bothering with this, but we are surely coming back to it at some point in the future.

Cheers,

Jordi