Help understanding what is happening on archive

dmitrylitvintsev · 14 July 2025 20:18

Summary

Not seeing files getting archive id for a long time after transfer complteted

Details

  {
    "clientVersion": {
      "ctaVersion": "5.11.9.0-1",
      "xrootdSsiProtobufInterfaceVersion": "v1.26"
    },
    "serverVersion": {
      "ctaVersion": "5.11.9.0-1",
      "xrootdSsiProtobufInterfaceVersion": "v1.26"
    },
    "catalogueConnectionString": "postgresql:postgresql://cta:******@ifdb11.fnal.gov:5468/cta_prd?gssencmode=disable",
    "catalogueVersion": "15.0",
    "isUpgrading": false,
    "schedulerBackendName": "cephUser",
    "instanceName": "prd"
  }
]

Details

This is Fermilab production CTA installation working with dCache as cache layer. We migrated from Enstore to CTA in the beginning of April and it has been running fine. Last week we started to notice frequent dismounts on archive and started to look at the system more vigorously.

I have noticed in the log:

{
  "epoch_time": 1752509471.182319686,
  "local_time": "2025-07-14T11:11:11-0500",
  "hostname": "tpsrvf2201",
  "program": "cta-taped",
  "log_level": "INFO",
  "pid": 338027,
  "tid": 338150,
  "message": "File successfully read from disk",
  "drive_name": "F1_F4B2D2",
  "instance": "prd",
  "sched_backend": "cephUser",
  "thread": "DiskRead",
  "tapeDrive": "F1_F4B2D2",
  "tapeVid": "FA9585",
  "mountId": "139837",
  "vo": "cms",
  "tapePool": "cms.OORun2025RAW",
  "threadID": 0,
  "path": "root://cmsstor807:1095/0000367B58EE527445749619F70D655A7BA8",
  "actualURL": "root://cmsstor807:1095/0000367B58EE527445749619F70D655A7BA8?xrdcl.requuid=9146f910-8883-4d20-a4e9-dedad5ea56df",
  "fileId": 4526908306,
  "readWriteTime": 26.573199,
  "checksumingTime": 0.0,
  "waitFreeMemoryTime": 0.003905,
  "waitDataTime": 0.0,
  "waitReportingTime": 0.0,
  "checkingErrorTime": 0.00115900000000001,
  "openingTime": 0.022237,
  "transferTime": 26.602023,
  "totalTime": 26.602023,
  "dataVolume": 5361627550,
  "globalPayloadTransferSpeedMBps": 201.549617109947,
  "diskPerformanceMBps": 201.549617109947,
  "openRWCloseToTransferTimeRatio": 0.999752387252654
}

which I believe corresponds to a successful archive. However a few hours later this file still has no
archive id:

# cta-admin tf ls -i cms_prd -f 0000367B58EE527445749619F70D655A7BA8
archive id copy no vid fseq block id instance disk fxid size checksum type checksum value storage class owner group creation time 
#

I assume this is because files are inserted into CTA DB in batches. So I assume there is file(s) holding the whole batch. Is that correct?

Question, how can I find out what files are in the batch? For instance I see in the log file (grepping for 0000367B58EE527445749619F70D655A7BA8 :

{
  "epoch_time": 1752509444.682011138,
  "local_time": "2025-07-14T11:10:44-0500",
  "hostname": "tpsrvf2201",
  "program": "cta-taped",
  "log_level": "INFO",
  "pid": 450022,
  "tid": 450022,
  "message": "Handler received SIGCHILD. Propagating to all handlers.",
  "drive_name": "F1_F4B2D2",
  "instance": "prd",
  "sched_backend": "cephUser",
  "capacityInBytes": "18000000000000",
  "logicalLibrary": "F1_LTO9",
  "mediaType": "LTO9",
  "mountAttempted": "1",
  "mountId": "139837",
  "mountType": "ArchiveForUser",
  "stillOpenFileForThread0": "root://cmsstor807:1095/0000367B58EE527445749619F70D655A7BA8?xrdcl.requuid=9146f910-8883-4d20-a4e9-dedad5ea56df",
  "stillOpenFileForThread1": "root://cmsstor817:1095/0000ADB98390C459498284CEB9F8FD8F5534?xrdcl.requuid=e1de0dd1-184e-47b5-8786-167f6380f63b",
  "stillOpenFileForThread2": "root://cmsstor808:1096/00003A8898E9FD084DCA9199274F170D9AF2?xrdcl.requuid=d926b82b-5bb2-40eb-90d6-664def504f10",
  "stillOpenFileForThread4": "root://cmsdata321:1096/0000E008335098BA4E46B83E8ACACE18445B?xrdcl.requuid=24ab240f-3f33-4e33-b0a5-ff7f71166ba3",
  "stillOpenFileForThread5": "root://cmsstor801:1095/0000EDEDE752D5E243F8816549D2DA436C72?xrdcl.requuid=4725afa3-7892-439a-b23f-2b33b6ccd38f",
  "stillOpenFileForThread6": "root://cmsstor807:1094/0000B2104B4B604E40A9912A76AE7F0B8CCB?xrdcl.requuid=7e39bd20-6899-49b6-868c-b93e42f6b468",
  "stillOpenFileForThread7": "root://cmsstor807:1094/000032C2664722C640059DC3B51FE8FEB4B0?xrdcl.requuid=2ab359a5-7f7f-4ef0-995c-78a02a3457b5",
  "stillOpenFileForThread8": "root://cmsstor824:1094/0000773A856E883844E19513F42BA2BCEF85?xrdcl.requuid=ee8bdbb5-5018-49ad-bcdc-96a5f6d487c9",
  "stillOpenFileForThread9": "root://cmsstor820:1096/00001AC7CAD99FAE49CFA824F5FF8D9BC7AC?xrdcl.requuid=1c434280-fd1a-4813-945c-8a2d89a51d66",
  "tapePool": "cms.OORun2025RAW",
  "tapeVid": "FA9585",
  "vendor": "Fujifilm",
  "vo": "cms",
  "volReqId": "139837",
  "SubprocessName": "signalHandler"
}

What does above mean? Does this mean that these 10 files are part of one batch? None of these other files have associated "File successfully read from disk" yet. Am I correct to expect the archive file ids assigned once all 10 files have been transferred to tape? IOW could you help me understand what is “holding” 0000367B58EE527445749619F70D655A7BA8 from being declared on tape.

Interestingly I do not see 'File successfully transmitted to drive' message associated w/ 0000367B58EE527445749619F70D655A7BA8

Thank you,
Dmitry

mdavis · 15 July 2025 07:17

Hi Dmitry,

The file is allocated an archive ID as soon as it arrives in the archive buffer, before it is sent to CTA for archival. Perhaps you mean that the file is not flagged as being on tape for a long time after transfer is completed?

Frequent dismounts on archive: did you establish the reason for this? Are there errors on archive, or is the drive process being starved (running out of work and therefore dismounting)? Do you have enough disk space in the archive buffer?

Although files are inserted into the catalogue in batches, a separate report is generated for each file. The sequence is as follows:

(drive process) → file is written to tape. CTA catalogue updated. Report job queued.

(maintenance process) → Report job popped, notification event sent to disk system (ARCHIVED or ARCHIVE_FAILED)

It’s not clear to me if the delays you are seeing are during archival or during reporting. Can you clarify?

dmitrylitvintsev · 15 July 2025 12:45

Hi Michael,

As often happens, once I open a topic I realize that I might be misinterpreting the log. It looks like the file was transferred to the taped host but hasn’t been written to tape (no mesasge 'File successfully transmitted to drive’, I added that line in the topic later). Why that is - I am still trying to figure out. It looks like something crashed/failed and I missed it.

May be if you could elucidate in that message that I cut&pasted "Handler received SIGCHILD. Propagating to all handlers.”. What part of the workflow does this correspond to? A parent process got notified that child process changed state. What state is it?

Thank you,
Dmitry

mdavis · 15 July 2025 13:06

It’s hard to draw any conclusions from these two log messages in isolation.

File successfully read from disk tells us that the file was queued for archive, the queued job was popped and we got as far as reading it from disk. You need to look at the subsequent log messages to see what happened next. Did it get archived to tape or was there an error? Was the report job queued and was it processed?

Handler received SIGCHILD. Propagating to all handlers. means that the tape server parent process died and killed all its children. You need to look in the logs for the corresponding parent process error message to see the reason.

dmitrylitvintsev · 15 July 2025 15:05

Found it. Mount failed (looking for messages associated with vid):

{
  "epoch_time": 1752509445.370978363,
  "local_time": "2025-07-14T11:10:45-0500",
  "hostname": "tpsrvf2201",
  "program": "cta-taped",
  "log_level": "ERROR",
  "pid": 338027,
  "tid": 338164,
  "message": "Failed to mount the tape for read/write access",
  "drive_name": "F1_F4B2D2",
  "instance": "prd",
  "sched_backend": "cephUser",
  "thread": "TapeWrite",
  "tapeDrive": "F1_F4B2D2",
  "tapeVid": "FA9585",
  "mountId": "139837",
  "vo": "cms",
  "tapePool": "cms.OORun2025RAW",
  "mediaType": "LTO9",
  "logicalLibrary": "F1_LTO9",
  "mountType": "ArchiveForUser",
  "vendor": "Fujifilm",
  "capacityInBytes": 18000000000000,
  "drive_Slot": "smc21",
  "exceptionMessage": "Failed to mount tape for read/write access: vid=FA9585 slot=smc21: Failed to mount tape in SCSI tape-library for read/write access: vid=FA9585 librarySlot=smc21: Received error from rmcd: rmcRc=2204 rmcErrorStream=smc_mount: Asked for FA9585, got reply for FA9585\n smc_mount: Location: data transfer element (0x4)\n smc_mount: SR018 - mount of FA9585 on drive 21 failed : volume in use\n"
}

and

{
  "epoch_time": 1752510645.543802625,
  "local_time": "2025-07-14T11:30:45-0500",
  "hostname": "tpsrvf2201",
  "program": "cta-taped",
  "log_level": "INFO",
  "pid": 450022,
  "tid": 450022,
  "message": "Drive subprocess crashed. Will spawn a new one.",
  "drive_name": "F1_F4B2D2",
  "instance": "prd",
  "sched_backend": "cephUser",
  "capacityInBytes": "18000000000000",
  "logicalLibrary": "F1_LTO9",
  "mediaType": "LTO9",
  "mountAttempted": "1",
  "mountId": "139837",
  "mountType": "ArchiveForUser",
  "stillOpenFileForThread0": "root://cmsstor807:1095/0000367B58EE527445749619F70D655A7BA8?xrdcl.requuid=9146f910-8883-4d20-a4e9-dedad5ea56df",
  "stillOpenFileForThread4": "root://cmsdata321:1096/0000E008335098BA4E46B83E8ACACE18445B?xrdcl.requuid=24ab240f-3f33-4e33-b0a5-ff7f71166ba3",
  "stillOpenFileForThread5": "root://cmsstor801:1095/0000EDEDE752D5E243F8816549D2DA436C72?xrdcl.requuid=4725afa3-7892-439a-b23f-2b33b6ccd38f",
  "stillOpenFileForThread6": "root://cmsstor807:1094/0000B2104B4B604E40A9912A76AE7F0B8CCB?xrdcl.requuid=7e39bd20-6899-49b6-868c-b93e42f6b468",
  "stillOpenFileForThread7": "root://cmsstor807:1094/000032C2664722C640059DC3B51FE8FEB4B0?xrdcl.requuid=2ab359a5-7f7f-4ef0-995c-78a02a3457b5",
  "stillOpenFileForThread8": "root://cmsstor824:1094/0000773A856E883844E19513F42BA2BCEF85?xrdcl.requuid=ee8bdbb5-5018-49ad-bcdc-96a5f6d487c9",
  "stillOpenFileForThread9": "root://cmsstor820:1096/00001AC7CAD99FAE49CFA824F5FF8D9BC7AC?xrdcl.requuid=1c434280-fd1a-4813-945c-8a2d89a51d66",
  "tapePool": "cms.OORun2025RAW",
  "tapeVid": "FA9585",
  "vendor": "Fujifilm",
  "vo": "cms",
  "volReqId": "139837",
  "Error_tapeMountForWrite": "1",
  "wasTapeMounted": "0",
  "mountTime": "0.0",
  "positionTime": "0.0",
  "waitInstructionsTime": "0.640657",
  "waitFreeMemoryTime": "0.0",
  "waitDataTime": "0.0",
  "waitReportingTime": "0.789702",
  "checksumingTime": "0.0",
  "readWriteTime": "0.0",
  "flushTime": "0.0",
  "unloadTime": "0.0",
  "unmountTime": "0.0",
  "encryptionControlTime": "0.00223",
  "transferTime": "1.430359",
  "totalTime": "1186.162341",
  "deliveryTime": "1186.162341",
  "drainingTime": "0.0",
  "dataVolume": "0",
  "filesCount": "0",
  "headerVolume": "0",
  "payloadTransferSpeedMBps": "0.0",
  "driveTransferSpeedMBps": "0.0",
  "tapeDrive": "F1_F4B2D2",
  "subprocessPid": 338027,
  "IfSignaled": true,
  "TermSignal": 9,
  "CoreDump": 0
}

mdavis · 15 July 2025 15:17

So looks like it’s trying to mount tape FA9585 but is unable to do so as the tape is already mounted (or stuck somewhere).

First thing you should do is disable that tape, so CTA stops trying to mount it.

Second, check the logs to see if there are other tapes in this state, or just this one.

Third, use the tape library GUI to find out where the tape is stuck so you can solve the problem.

Once you have unstuck the tape you can re-enable it.

dmitrylitvintsev · 16 July 2025 14:25

These two might be meaningful with relation to volume FA9585:

{"epoch_time":1752504320.263886448,"local_time":"2025-07-14T09:45:20-0500","hostname":"tpsrvf2201","program":"cta-taped","log_level":"WARN","pid":450022,"tid":450022,"message":"In DriveHandler::processTimeout(): Killed subprocess.","drive_name":"F1_F4B2D2","instance":"prd","sched_backend":"cephUser","capacityInBytes":"18000000000000","logicalLibrary":"F1_LTO9","mediaType":"LTO9","mountAttempted":"1","mountId":"139643","mountType":"ArchiveForUser","stillOpenFileForThread0":"root://cmsdata303.fnal.gov:1097/00009497E7EB42DD48E1A4040437F579126F?xrdcl.requuid=1a9b616d-d13c-437f-a512-a9133f4b31ce","stillOpenFileForThread5":"root://cmsstor801.fnal.gov:1095/0000CC721CA0A7D641CE9BD9F7859EC2AA7D?xrdcl.requuid=4fd339c1-28c5-4535-b87b-bcf6bcc02dc1","stillOpenFileForThread6":"root://cmsstor806.fnal.gov:1094/00003749437251A2499EB94595780590E5AA?xrdcl.requuid=79bf7329-ab31-420b-b286-bbc22e189aad","stillOpenFileForThread7":"root://cmsstor817.fnal.gov:1096/0000CF90C382EF9E4F959D2C931E20788BEC?xrdcl.requuid=044cea47-4e45-4f45-96cb-47f347ed89d6","stillOpenFileForThread8":"root://cmsstor810.fnal.gov:1096/0000CC529C82CA6144E5B900AFFB712EB719?xrdcl.requuid=28f67959-98b0-47ab-8eee-287357f1c651","stillOpenFileForThread9":"root://cmsstor823.fnal.gov:1095/0000E90F90120AE041AE82125EC58E698AD9?xrdcl.requuid=271739ed-c2ea-43ee-a794-2da9e16e9446","tapePool":"cms.OORun2025RAW","tapeVid":"FA9585","vendor":"Fujifilm","vo":"cms","volReqId":"139643","Error_tapeMountForWrite":"1","wasTapeMounted":"0","mountTime":"0.0","positionTime":"0.0","waitInstructionsTime":"0.239448","waitFreeMemoryTime":"0.0","waitDataTime":"0.0","waitReportingTime":"0.776173","checksumingTime":"0.0","readWriteTime":"0.0","flushTime":"0.0","unloadTime":"0.0","unmountTime":"0.0","encryptionControlTime":"0.002224","transferTime":"1.015621","totalTime":"1186.168039","deliveryTime":"1186.168038","drainingTime":"0.0","dataVolume":"0","filesCount":"0","headerVolume":"0","payloadTransferSpeedMBps":"0.0","driveTransferSpeedMBps":"0.0","tapeDrive":"F1_F4B2D2","SessionState":"ShuttingDown","SessionType":"Retrieve","TimeoutType":"StateChange","SessionTypeWhenTimeoutDecided":"Retrieve","SessionStateWhenTimeoutDecided":"ShuttingDown","LastDataMovementTime":8190866,"LastHeartbeatTime":8192054,"LastStateChangeTime":8191168,"Now":8192068,"ThisTimeout":8192068,"BeforeStateChangeTimeout_s":-0.014593206,"SubprocessId":287944}

and for pid = 287944:

{"epoch_time":1752503120.176705685,"local_time":"2025-07-14T09:25:20-0500","hostname":"tpsrvf2201","program":"cta-taped","log_level":"ERROR","pid":287944,"tid":288036,"message":"Failed to mount the tape for read/write access","drive_name":"F1_F4B2D2","instance":"prd","sched_backend":"cephUser","thread":"TapeWrite","tapeDrive":"F1_F4B2D2","tapeVid":"FA9585","mountId":"139643","vo":"cms","tapePool":"cms.OORun2025RAW","mediaType":"LTO9","logicalLibrary":"F1_LTO9","mountType":"ArchiveForUser","vendor":"Fujifilm","capacityInBytes":18000000000000,"drive_Slot":"smc21","exceptionMessage":"Failed to mount tape for read/write access: vid=FA9585 slot=smc21: Failed to mount tape in SCSI tape-library for read/write access: vid=FA9585 librarySlot=smc21: Received error from rmcd: rmcRc=2204 rmcErrorStream=smc_mount: Asked for FA9585, got reply for FA9585\n smc_mount: Location: data transfer element (0x4)\n smc_mount: SR018 - mount of FA9585 on drive 21 failed : volume in use\n"}

Note that this is 9:45 timestamp which is way earlier than the "Drive subprocess crashed. Will spawn a new one.”

At this point it seems to me that this topic looks very similar to “slow drives” topic that Scarlett has created. So may be we do not need to pursue this one. The files listed in this topic got finally written overnight.

mdavis · 16 July 2025 15:34

Yes, Scarlett even mentioned the same VID.

These 2 messages show the timeout and subsequent consequences, but not the reason for the timeout. On the other thread we found that the timeout is during writing blocks to tape.

Let’s continue the discussion on the other thread.