Few Files Selected for writing despite Long Queue

Mwai · 24 October 2023 10:27

Hello,

Anyone with an idea why the scheduler sometimes selects very few files for writing from a tapepool despite having thousands in the queue? In this example, we have around 23k queued but only around ~13 get selected for a mount which then leads to a lot of mounts and dismounts

Oct 24 11:54:12 tpm102 cta-taped: LVL="INFO" PID="22130" TID="22130" MSG="In Scheduler::getNextMountDryRun(): Found a potential mount (archive)" thread="MainThread" tapeDrive="rz1-2,3,1" tapePool="MID-202302" tapeVid="L09876" mountType="ARCHIVE_FOR_USER" existingMountsDistinctTypeForThisTapepool="0" existingMountsBasicTypeForThisVo="0" bytesQueued="196757071817215" minBytesToWarrantMount="500000000000" filesQueued="23300" minFilesToWarrantMount="1000" oldestJobAge="310740" youngestJobAge="879" minArchiveRequestAge="1000" getMountInfoTime="0.016226" getTapeInfoTime="0.043179" candidateSortingTime="0.005268" getTapeForWriteTime="0.006439" decisionTime="0.000078" schedulerDbTime="0.016226" catalogueTime="0.049618"

Oct 24 11:54:13 tpm102 cta-taped: LVL="INFO" PID="22130" TID="22130" MSG="In Algorithms::popNextBatch(): did one round of elements retrieval." thread="MainThread" tapeDrive="rz1-2,3,1" tapeVid="L09876" mountId="2392" C="ArchiveQueueToTransferForUser" tapepool="MID-202302" containerAddress="ArchiveQueueToTransferForUser-MID-202302-Frontend-cta-frontend.desy.de-60178-20231019-14:15:59-0-7" filesAdded="13" bytesAdded="108163318755" filesBefore="0" bytesBefore="0" filesAfter="13" bytesAfter="108163318755" queueJobsBefore="23300" queueBytesBefore="196757071817215" queueJobsAfter="23287" queueBytesAfter="196648908498460" findLockFetchQueueTime="0.001930" jobSelectionTime="0.054522" ownershipAdditionTime="0.000037" asyncUpdateLaunchTime="0.005853" asyncUpdateCompletionTime="0.069838" updateResultProcessingTime="0.000224" containerUpdateTime="0.198190" containerTrimmingTime="0.000057" containerUnlockTime="0.000431" structureProcessingTime="0.004213"

Oct 24 11:54:13 tpm102 cta-taped: LVL="INFO" PID="22130" TID="22130" MSG="In Algorithms::popNextBatch(): elements retrieval complete." thread="MainThread" tapeDrive="rz1-2,3,1" tapeVid="L09876" mountId="2392" C="ArchiveQueueToTransferForUser" tapepool="MID-202302" files="13" findLockFetchQueueTime="0.001930" jobSelectionTime="0.054522" ownershipAdditionTime="0.000037" asyncUpdateLaunchTime="0.005853" asyncUpdateCompletionTime="0.069838" updateResultProcessingTime="0.000224" containerUpdateTime="0.198190" containerTrimmingTime="0.000057" containerUnlockTime="0.000431" structureProcessingTime="0.004213" schedulerDbTime="0.336694" iterationCount="1"

It’s not an often occurrence but mostly affects tapepools with a lot of data

Mwai

rbachman · 25 October 2023 14:22

Hi @Mwai,
it is hard to tell exactly what happened from the details provided. Would you be able to share which version of CTA you are running, a larger view of your logs, and perhaps a view of the queues?

A few things to note:

The getNextMountDryRun() function is intended to determine whether or not there is any useful mount action available for that particular cta-taped instance, without locking the queue. The actual mount is done using getNextMount().
Scheduling priority in CTA is determined by priority, mount type (archive/retrieve), and the job age. Selecting the most files possible for one continuous archival is not in these criteria.
popNextBatch() may be called multiple times during one mount, and will not necessarily select all of the files in one go.

In other words, I believe the above could be ‘working as intended’, and might not necessarily be related to any frequent mount/unmount behaviors you are seeing.

Mwai · 26 October 2023 15:00

hi @rbachman,

Thanks.

We are running 5.8.7-1. I would assume that in this case, popNextBatch() is only called once per mount, with only a few files despite there being more than enough work for the drive. Thus, what should ideally be done in one mount (since the mount criteria is already met) is done in multiple mounts, making it very inefficient.

The mp for this VO:

    "name": "xfelmp",
    "archivePriority": "20",
    "archiveMinRequestAge": "1000",
    "retrievePriority": "10",
    "retrieveMinRequestAge": "3600",

Maybe you can spot sth in the log?

Cheers