Drives are running slowly

Summary

Since Thursday last week the tapes have been mounted and unmounted continuously. Normally the drives will hold on to a tape until all data is written now it seems to unmount and remount the tape all the time even when there is more data to write to that tape.

Details

CTA version: 5.11.9.0-1
Operating System and version: Alma Linus 9.6
Xrootd version: Using dCache
Objectstore backend:Ceph

Relevant logs and/or screenshots

{
“epoch_time”: 1752020946.612130255,
“local_time”: “2025-07-08T19:29:06-0500”,
“hostname”: “tpsrvf2205”,
“program”: “cta-taped”,
“log_level”: “ERROR”,
“pid”: 2765584,
“tid”: 2765584,
“message”: “In Agent::deleteAndUnregisterSelf: agent still owns objects. Here is a part of the list.”,
“drive_name”: “F1_X”,
“instance”: “prd”,
“sched_backend”: “cephUser”,
“agentObject”: “Maintenance-tpsrvf2205-2765584-20250628-07:25:38-0”,
“objects”: “RetrieveRequest-Frontend-ctax-3454028-20250618-15:54:03-0-3553295 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3554330 RetrieveRequest-Frontend-ctax-3454028-20250618-15:54:03-0-3554358 RetrieveRequest-Frontend-ctax-3454028-20250618-15:54:03-0-3554413 RetrieveRequest-Frontend-ctax-3454028-20250618-15:54:03-0-3554477 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3554616 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3554648 RetrieveRequest-Frontend-ctax-3454028-20250618-15:54:03-0-3554934 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3555007 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3555011 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3555100 RetrieveRequest-Frontend-ctax-3454028-20250618-15:54:03-0-3556555 RetrieveRequest-Frontend-ctax-3454028-20250618-15:54:03-0-3556860 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3556867 RetrieveRequest-Frontend-ctax-3454028-20250618-15:54:03-0-3556870 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3556913 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3559215 RetrieveRequest-Frontend-ctax-3454028-20250618-15:54:03-0-3559232 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3559365 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3561331 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3561333 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3561342 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3561347 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3561378 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3586491”,
“startIndex”: 1150,
“endIndex”: 1174,
“totalObjects”: 1290
}
{
“epoch_time”: 1752020946.612216583,
“local_time”: “2025-07-08T19:29:06-0500”,
“hostname”: “tpsrvf2205”,
“program”: “cta-taped”,
“log_level”: “ERROR”,
“pid”: 2765584,
“tid”: 2765584,
“message”: “In Agent::deleteAndUnregisterSelf: agent still owns objects. Here is a part of the list.”,
“drive_name”: “F1_xx”,
“instance”: “prd”,
“sched_backend”: “cephUser”,
“agentObject”: “Maintenance-tpsrvx-2765584-20250628-07:25:38-0”,
“objects”: “RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3587928 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3587959 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3587981 RetrieveRequest-Frontend-ctaxx-3454028-20250618-15:54:03-0-3588033 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3588068 RetrieveRequest-Frontend-ctaxx-3454028-20250618-15:54:03-0-3588157 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3588390 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3589441 RetrieveRequest-Frontend-ctaxx-3454028-20250618-15:54:03-0-3589443 RetrieveRequest-Frontend-ctaxx-3454028-20250618-15:54:03-0-3589663 RetrieveRequest-Frontend-ctaxx-3454028-20250618-15:54:03-0-3589679 RetrieveRequest-Frontend-ctaxx-3454028-20250618-15:54:03-0-3589855 RetrieveRequest-Frontend-ctaxx-3454028-20250618-15:54:03-0-3592652 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3593300 RetrieveRequest-Frontend-ctaxx-3454028-20250618-15:54:03-0-3593315 RetrieveRequest-Frontend-ctaxx-3454028-20250618-15:54:03-0-3593330 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3593352 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3595085 RetrieveRequest-Frontend-ctaxx-3454028-20250618-15:54:03-0-3595093 RetrieveRequest-Frontend-ctaxx-3454028-20250618-15:54:03-0-3595097 RetrieveRequest-Frontend-ctaxx-3454028-20250618-15:54:03-0-3595101 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3595102 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3596976 RetrieveRequest-Frontend-ctaxx-3454028-20250618-15:54:03-0-3596989 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3596996”,
“startIndex”: 1175,
“endIndex”: 1199,
“totalObjects”: 1290
}
{
“epoch_time”: 1752020946.612307238,
“local_time”: “2025-07-08T19:29:06-0500”,
“hostname”: “tpsrvf2205”,
“program”: “cta-taped”,
“log_level”: “ERROR”,
“pid”: 2765584,
“tid”: 2765584,
“message”: “In Agent::deleteAndUnregisterSelf: agent still owns objects. Here is a part of the list.”,
“drive_name”: “F1_xxx”,
“instance”: “prd”,
“sched_backend”: “cephUser”,
“agentObject”: “Maintenance-tpsrvf2205-2765584-20250628-07:25:38-0”,
“objects”: “RetrieveRequest-Frontend-ctaxx-3454028-20250618-15:54:03-0-3597044 RetrieveRequest-Frontend-ctaxx-3454028-20250618-15:54:03-0-3597124 RetrieveRequest-Frontend-ctaxx-3454028-20250618-15:54:03-0-3597190 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3597202 RetrieveRequest-Frontend-ctaxx-3454028-20250618-15:54:03-0-3597204 RetrieveRequest-Frontend-ctaxx-3454028-20250618-15:54:03-0-3663139 RetrieveRequest-Frontend-ctaxx-3454028-20250618-15:54:03-0-3663894 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3665562 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3693289 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3693606 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3704977 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3705518 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3705579 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3783350 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3783351 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3784940 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3785691 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3786215 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3787441 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3787443 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3787444 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3787445 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-3787448 RetrieveRequest-Frontend-ctaxx-3454028-20250618-15:54:03-0-3787449 RetrieveRequest-Frontend-ctaxx-3953274-20250618-15:58:28-0-308850”,
“startIndex”: 1200,
“endIndex”: 1224,
“totalObjects”: 1290
}
{
“epoch_time”: 1752020946.612408135,
“local_time”: “2025-07-08T19:29:06-0500”,
“hostname”: “tpsrvf2205”,
“program”: “cta-taped”,
“log_level”: “ERROR”,
“pid”: 2765584,
“tid”: 2765584,
“message”: “In Agent::deleteAndUnregisterSelf: agent still owns objects. Here is a part of the list.”,
“drive_name”: “F1_xxxx”,
“instance”: “prd”,
“sched_backend”: “cephUser”,
“agentObject”: “Maintenance-tpsrvf2205-2765584-20250628-07:25:38-0”,
“objects”: “RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-319324 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-319327 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-329565 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-330413 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-331946 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-332601 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-333843 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-335897 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-335899 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-340469 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-340475 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-340491 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-344336 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-344337 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-344927 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-350100 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-350171 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-350257 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-350483 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-350833 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-351068 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-357115 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-359094 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-359114 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-359177”,
“startIndex”: 1225,
“endIndex”: 1249,
“totalObjects”: 1290
}
{
“epoch_time”: 1752020946.612528872,
“local_time”: “2025-07-08T19:29:06-0500”,
“hostname”: “tpsrvf2205”,
“program”: “cta-taped”,
“log_level”: “ERROR”,
“pid”: 2765584,
“tid”: 2765584,
“message”: “In Agent::deleteAndUnregisterSelf: agent still owns objects. Here is a part of the list.”,
“drive_name”: “F1_xxxx”,
“instance”: “prd”,
“sched_backend”: “cephUser”,
“agentObject”: “Maintenance-tpsrvf2205-2765584-20250628-07:25:38-0”,
“objects”: “RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-359228 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-359266 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-359342 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-362085 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-364585 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-364590 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-365381 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-365392 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-365477 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-365485 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-371067 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-371101 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-371192 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-372136 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-372167 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-372189 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-372549 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-372606 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-372624 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-372635 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-372650 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-372659 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-374349 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-375986 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-377269”,
“startIndex”: 1250,
“endIndex”: 1274,
“totalObjects”: 1290
}
{
“epoch_time”: 1752020946.612654423,
“local_time”: “2025-07-08T19:29:06-0500”,
“hostname”: “tpsrvf2205”,
“program”: “cta-taped”,
“log_level”: “ERROR”,
“pid”: 2765584,
“tid”: 2765584,
“message”: “In Agent::deleteAndUnregisterSelf: agent still owns objects. Here is a part of the list.”,
“drive_name”: “F1_xxxx”,
“instance”: “prd”,
“sched_backend”: “cephUser”,
“agentObject”: “Maintenance-tpsrvf2205-2765584-20250628-07:25:38-0”,
“objects”: “RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-378097 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-383441 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-383468 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-383507 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-383518 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-383534 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-383542 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-383556 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-383899 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-383945 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-383956 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-383973 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-383975 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-383983 RetrieveRequest-Frontend-cta02-3953274-20250618-15:58:28-0-384077”,
“startIndex”: 1275,
“endIndex”: 1289,
“totalObjects”: 1290
}
{
“epoch_time”: 1752020946.617172154,
“local_time”: “2025-07-08T19:29:06-0500”,
“hostname”: “tpsrvf2205”,
“program”: “cta-taped”,
“log_level”: “CRIT”,
“pid”: 2765584,
“tid”: 2765584,
“message”: “In BackendPopulator::~BackendPopulator(): error deleting agent (cta::exception::Exception). Backtrace follows.”,
“drive_name”: “F1_xxxxx”,
“instance”: “prd”,
“sched_backend”: “cephUser”,
“errorMessage”: “In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-tpsrvf2205-2765584-20250628-07:25:38-0) still owns objects. Here’s the first few: RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1280460 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1280563 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1283258 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1940365 [… trimmed at 3 of 1290]”
}
{
“epoch_time”: 1752020946.617291622,
“local_time”: “2025-07-08T19:29:06-0500”,
“hostname”: “tpsrvf2205”,
“program”: “cta-taped”,
“log_level”: “INFO”,
“pid”: 2765584,
“tid”: 2765584,
“message”: “Stack trace”,
“drive_name”: “F1_xxxx”,
“instance”: “prd”,
“sched_backend”: “cephUser”,
“errorMessage”: “In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-tpsrvf2205-2765584-20250628-07:25:38-0) still owns objects. Here’s the first few: RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1280460 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1280563 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1283258 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1940365 [… trimmed at 3 of 1290]”,
“traceFrameNumber”: 0,
“traceFrame”: “/lib64/libctacommon.so.0(cta::exception::Backtrace::Backtrace(bool)+0x6b) [0x7f62b8ba9c49]”
}
{
“epoch_time”: 1752020946.617343915,
“local_time”: “2025-07-08T19:29:06-0500”,
“hostname”: “tpsrvf2205”,
“program”: “cta-taped”,
“log_level”: “INFO”,
“pid”: 2765584,
“tid”: 2765584,
“message”: “Stack trace”,
“drive_name”: “F1_xxxxx”,
“instance”: “prd”,
“sched_backend”: “cephUser”,
“errorMessage”: “In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-tpsrvf2205-2765584-20250628-07:25:38-0) still owns objects. Here’s the first few: RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1280460 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1280563 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1283258 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1940365 [… trimmed at 3 of 1290]”,
“traceFrameNumber”: 1,
“traceFrame”: “/lib64/libctacommon.so.0(cta::exception::Exception::Exception(std::basic_string_view<char, std::char_traits >, bool)+0x91) [0x7f62b8baad11]”
}
{
“epoch_time”: 1752020946.617406205,
“local_time”: “2025-07-08T19:29:06-0500”,
“hostname”: “tpsrvf2205”,
“program”: “cta-taped”,
“log_level”: “INFO”,
“pid”: 2765584,
“tid”: 2765584,
“message”: “Stack trace”,
“drive_name”: “F1_xxxxx”,
“instance”: “prd”,
“sched_backend”: “cephUser”,
“errorMessage”: “In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-tpsrvf2205-2765584-20250628-07:25:38-0) still owns objects. Here’s the first few: RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1280460 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1280563 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1283258 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1940365 [… trimmed at 3 of 1290]”,
“traceFrameNumber”: 2,
“traceFrame”: “/lib64/libctaobjectstore.so.0(cta::objectstore::Agent::AgentStillOwnsObjects::Exception(std::basic_string_view<char, std::char_traits >, bool)+0x4c) [0x7f62c04bb430]”
}
{
“epoch_time”: 1752020946.617452489,
“local_time”: “2025-07-08T19:29:06-0500”,
“hostname”: “tpsrvf2205”,
“program”: “cta-taped”,
“log_level”: “INFO”,
“pid”: 2765584,
“tid”: 2765584,
“message”: “Stack trace”,
“drive_name”: “F1_xxxxx”,
“instance”: “prd”,
“sched_backend”: “cephUser”,
“errorMessage”: “In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-tpsrvf2205-2765584-20250628-07:25:38-0) still owns objects. Here’s the first few: RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1280460 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1280563 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1283258 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1940365 [… trimmed at 3 of 1290]”,
“traceFrameNumber”: 3,
“traceFrame”: “/lib64/libctaobjectstore.so.0(cta::objectstore::Agent::removeAndUnregisterSelf(cta::log::LogContext&)+0x67c) [0x7f62c04b9984]”
}
{
“epoch_time”: 1752020946.617496075,
“local_time”: “2025-07-08T19:29:06-0500”,
“hostname”: “tpsrvf2205”,
“program”: “cta-taped”,
“log_level”: “INFO”,
“pid”: 2765584,
“tid”: 2765584,
“message”: “Stack trace”,
“drive_name”: “F1_xxxxx”,
“instance”: “prd”,
“sched_backend”: “cephUser”,
“errorMessage”: “In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-tpsrvf2205-2765584-20250628-07:25:38-0) still owns objects. Here’s the first few: RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1280460 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1280563 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1283258 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1940365 [… trimmed at 3 of 1290]”,
“traceFrameNumber”: 4,
“traceFrame”: “/lib64/libctaobjectstore.so.0(cta::objectstore::BackendPopulator::~BackendPopulator()+0x2d3) [0x7f62c060a001]”
}
{
“epoch_time”: 1752020946.617539342,
“local_time”: “2025-07-08T19:29:06-0500”,
“hostname”: “tpsrvf2205”,
“program”: “cta-taped”,
“log_level”: “INFO”,
“pid”: 2765584,
“tid”: 2765584,
“message”: “Stack trace”,
“drive_name”: “F1_xxxx”,
“instance”: “prd”,
“sched_backend”: “cephUser”,
“errorMessage”: “In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-tpsrvf2205-2765584-20250628-07:25:38-0) still owns objects. Here’s the first few: RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1280460 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1280563 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1283258 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1940365 [… trimmed at 3 of 1290]”,
“traceFrameNumber”: 5,
“traceFrame”: “/lib64/libctaobjectstore.so.0(cta::objectstore::BackendPopulator::~BackendPopulator()+0x27) [0x7f62c060a469]”
}
{
“epoch_time”: 1752020946.617587795,
“local_time”: “2025-07-08T19:29:06-0500”,
“hostname”: “tpsrvf2205”,
“program”: “cta-taped”,
“log_level”: “INFO”,
“pid”: 2765584,
“tid”: 2765584,
“message”: “Stack trace”,
“drive_name”: “F1_xxxx”,
“instance”: “prd”,
“sched_backend”: “cephUser”,
“errorMessage”: “In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-tpsrvf2205-2765584-20250628-07:25:38-0) still owns objects. Here’s the first few: RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1280460 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1280563 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1283258 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1940365 [… trimmed at 3 of 1290]”,
“traceFrameNumber”: 6,
“traceFrame”: “/usr/bin/cta-taped() [0x4c16b9]”
}
{
“epoch_time”: 1752020946.617629478,
“local_time”: “2025-07-08T19:29:06-0500”,
“hostname”: “tpsrvf2205”,
“program”: “cta-taped”,
“log_level”: “INFO”,
“pid”: 2765584,
“tid”: 2765584,
“message”: “Stack trace”,
“drive_name”: “F1_xxxx”,
“instance”: “prd”,
“sched_backend”: “cephUser”,
“errorMessage”: “In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-tpsrvf2205-2765584-20250628-07:25:38-0) still owns objects. Here’s the first few: RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1280460 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1280563 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1283258 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1940365 [… trimmed at 3 of 1290]”,
“traceFrameNumber”: 7,
“traceFrame”: “/usr/bin/cta-taped() [0x4bd99b]”
}
{
“epoch_time”: 1752020946.617670567,
“local_time”: “2025-07-08T19:29:06-0500”,
“hostname”: “tpsrvf2205”,
“program”: “cta-taped”,
“log_level”: “INFO”,
“pid”: 2765584,
“tid”: 2765584,
“message”: “Stack trace”,
“drive_name”: “F1_xxxx”,
“instance”: “prd”,
“sched_backend”: “cephUser”,
“errorMessage”: “In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-tpsrvf2205-2765584-20250628-07:25:38-0) still owns objects. Here’s the first few: RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1280460 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1280563 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1283258 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1940365 [… trimmed at 3 of 1290]”,
“traceFrameNumber”: 8,
“traceFrame”: “/usr/bin/cta-taped() [0x4c22a3]”
}
{
“epoch_time”: 1752020946.617711434,
“local_time”: “2025-07-08T19:29:06-0500”,
“hostname”: “tpsrvf2205”,
“program”: “cta-taped”,
“log_level”: “INFO”,
“pid”: 2765584,
“tid”: 2765584,
“message”: “Stack trace”,
“drive_name”: “F1_xxxx”,
“instance”: “prd”,
“sched_backend”: “cephUser”,
“errorMessage”: “In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-tpsrvf2205-2765584-20250628-07:25:38-0) still owns objects. Here’s the first few: RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1280460 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1280563 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1283258 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1940365 [… trimmed at 3 of 1290]”,
“traceFrameNumber”: 9,
“traceFrame”: “/usr/bin/cta-taped() [0x4eaf94]”
}
{
“epoch_time”: 1752020946.617752635,
“local_time”: “2025-07-08T19:29:06-0500”,
“hostname”: “tpsrvf2205”,
“program”: “cta-taped”,
“log_level”: “INFO”,
“pid”: 2765584,
“tid”: 2765584,
“message”: “Stack trace”,
“drive_name”: “F1_xxxx”,
“instance”: “prd”,
“sched_backend”: “cephUser”,
“errorMessage”: “In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-tpsrvf2205-2765584-20250628-07:25:38-0) still owns objects. Here’s the first few: RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1280460 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1280563 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1283258 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1940365 [… trimmed at 3 of 1290]”,
“traceFrameNumber”: 10,
“traceFrame”: “/usr/bin/cta-taped() [0x4e9778]”
}
{
“epoch_time”: 1752020946.617798997,
“local_time”: “2025-07-08T19:29:06-0500”,
“hostname”: “tpsrvf2205”,
“program”: “cta-taped”,
“log_level”: “INFO”,
“pid”: 2765584,
“tid”: 2765584,
“message”: “Stack trace”,
“drive_name”: “F1_xxxx”,
“instance”: “prd”,
“sched_backend”: “cephUser”,
“errorMessage”: “In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-tpsrvf2205-2765584-20250628-07:25:38-0) still owns objects. Here’s the first few: RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1280460 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1280563 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1283258 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1940365 [… trimmed at 3 of 1290]”,
“traceFrameNumber”: 11,
“traceFrame”: “/usr/bin/cta-taped() [0x4ee867]”
}
{
“epoch_time”: 1752020946.617850908,
“local_time”: “2025-07-08T19:29:06-0500”,
“hostname”: “tpsrvf2205”,
“program”: “cta-taped”,
“log_level”: “INFO”,
“pid”: 2765584,
“tid”: 2765584,
“message”: “Stack trace”,
“drive_name”: “F1_xxxx”,
“instance”: “prd”,
“sched_backend”: “cephUser”,
“errorMessage”: “In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-tpsrvf2205-2765584-20250628-07:25:38-0) still owns objects. Here’s the first few: RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1280460 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1280563 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1283258 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1940365 [… trimmed at 3 of 1290]”,
“traceFrameNumber”: 12,
“traceFrame”: “/usr/bin/cta-taped() [0x4ed811]”
}
{
“epoch_time”: 1752020946.617892623,
“local_time”: “2025-07-08T19:29:06-0500”,
“hostname”: “tpsrvf2205”,
“program”: “cta-taped”,
“log_level”: “INFO”,
“pid”: 2765584,
“tid”: 2765584,
“message”: “Stack trace”,
“drive_name”: “F1_xxxx”,
“instance”: “prd”,
“sched_backend”: “cephUser”,
“errorMessage”: “In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-tpsrvf2205-2765584-20250628-07:25:38-0) still owns objects. Here’s the first few: RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1280460 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1280563 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1283258 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1940365 [… trimmed at 3 of 1290]”,
“traceFrameNumber”: 13,
“traceFrame”: “/usr/bin/cta-taped() [0x4a69a0]”
}
{
“epoch_time”: 1752020946.617934091,
“local_time”: “2025-07-08T19:29:06-0500”,
“hostname”: “tpsrvf2205”,
“program”: “cta-taped”,
“log_level”: “INFO”,
“pid”: 2765584,
“tid”: 2765584,
“message”: “Stack trace”,
“drive_name”: “F1_xxxxx”,
“instance”: “prd”,
“sched_backend”: “cephUser”,
“errorMessage”: “In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-tpsrvf2205-2765584-20250628-07:25:38-0) still owns objects. Here’s the first few: RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1280460 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1280563 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1283258 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1940365 [… trimmed at 3 of 1290]”,
“traceFrameNumber”: 14,
“traceFrame”: “/usr/bin/cta-taped() [0x4a66e0]”
}
{
“epoch_time”: 1752020946.617975535,
“local_time”: “2025-07-08T19:29:06-0500”,
“hostname”: “tpsrvf2205”,
“program”: “cta-taped”,
“log_level”: “INFO”,
“pid”: 2765584,
“tid”: 2765584,
“message”: “Stack trace”,
“drive_name”: “F1_xxxx”,
“instance”: “prd”,
“sched_backend”: “cephUser”,
“errorMessage”: “In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-tpsrvf2205-2765584-20250628-07:25:38-0) still owns objects. Here’s the first few: RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1280460 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1280563 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1283258 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1940365 [… trimmed at 3 of 1290]”,
“traceFrameNumber”: 15,
“traceFrame”: “/usr/bin/cta-taped() [0x4a61a1]”
}
{
“epoch_time”: 1752020946.618022150,
“local_time”: “2025-07-08T19:29:06-0500”,
“hostname”: “tpsrvf2205”,
“program”: “cta-taped”,
“log_level”: “INFO”,
“pid”: 2765584,
“tid”: 2765584,
“message”: “Stack trace”,
“drive_name”: “F1_xxxx”,
“instance”: “prd”,
“sched_backend”: “cephUser”,
“errorMessage”: “In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-tpsrvf2205-2765584-20250628-07:25:38-0) still owns objects. Here’s the first few: RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1280460 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1280563 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1283258 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1940365 [… trimmed at 3 of 1290]”,
“traceFrameNumber”: 16,
“traceFrame”: “/usr/bin/cta-taped() [0x490f05]”
}
{
“epoch_time”: 1752020946.618064386,
“local_time”: “2025-07-08T19:29:06-0500”,
“hostname”: “tpsrvf2205”,
“program”: “cta-taped”,
“log_level”: “INFO”,
“pid”: 2765584,
“tid”: 2765584,
“message”: “Stack trace”,
“drive_name”: “F1_xxxx”,
“instance”: “prd”,
“sched_backend”: “cephUser”,
“errorMessage”: “In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-tpsrvf2205-2765584-20250628-07:25:38-0) still owns objects. Here’s the first few: RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1280460 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1280563 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1283258 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1940365 [… trimmed at 3 of 1290]”,
“traceFrameNumber”: 17,
“traceFrame”: “/usr/bin/cta-taped() [0x491ab8]”
}
{
“epoch_time”: 1752020946.618105806,
“local_time”: “2025-07-08T19:29:06-0500”,
“hostname”: “tpsrvf2205”,
“program”: “cta-taped”,
“log_level”: “INFO”,
“pid”: 2765584,
“tid”: 2765584,
“message”: “Stack trace”,
“drive_name”: “F1_xxxx”,
“instance”: “prd”,
“sched_backend”: “cephUser”,
“errorMessage”: “In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-tpsrvf2205-2765584-20250628-07:25:38-0) still owns objects. Here’s the first few: RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1280460 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1280563 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1283258 RetrieveRequest-Frontend-cta01-3454028-20250618-15:54:03-0-1940365 [… trimmed at 3 of 1290]”,
“traceFrameNumber”: 18,
“traceFrame”: “/lib64/libc.so.6(+0x295d0) [0x7f62b7a295d0]”
}

Possible causes

I was thinking based on the critical error message that possible ceph and cta had gotten out of synch maybe but I am not sure how. I know generally when which was round Wednesday/Thursday last week.

Hi Scarlet,

Your summary says “Normally the drives will hold on to a tape until all data is written” but the log entries you provided are for retrieve jobs. Is the problem with archive or retrieve or both?

The log entries are from the maintenance process. What is happening with the drive process?

My guess would be that retrieve jobs are failing and being requeued. You should investigate the cause of the failures. If drives are dismounting tapes for retrieve jobs too early, it could be that there is not enough disk space in the retrieve buffer?

Both sorry. We have Generally more writing so that is what I was focused on. I just took a random drive portion of log from the day I think this issue started happening. I did not check it closely enough so sent the wrong one.

Not sure if this helps but here is another portion of log:
Failed to mount tape for read/write access: vid=FA9585 slot=smc21: Failed to mount tape in SCSI tape-library for read/write access: vid=FA9585 librarySlot=smc21: Received error from rmcd: rmcRc=2204 rmcErrorStream=smc_mount: Asked for FA9585, got reply for FA9585\n smc_mount: Location: data transfer element (0x4)\n smc_mount: SR018 - mount of FA9585 on drive 21 failed : volume in use\n"

There is a problem with tape FA9585. You should disable the tape to unblock the system while you investigate. See my other reply: Help understanding what is happening on archive - #6 by mdavis

The CTA Tape Alerting System will detect this kind of problem and disable the tape automatically so the system doesn’t get snarled like you are seeing.

We’re seeing this for many, many tapes, so I don’t think it’s due to that. The tapeds keep getting killed while a tape is in the drive:

{"epoch_time":1752576467.015896940,"local_time":"2025-07-15T05:47:47-0500","hostname":"tpsrvf2205","program":"cta-taped","log_level":"WARN","pid":3153337,"tid":3153337,"mes
sage":"In DriveHandler::processTimeout(): Killed subprocess.","drive_name":"F1_F4B4D1","instance":"prd","sched_backend":"cephUser","capacityInBytes":"18000000000000","logic
alLibrary":"F1_LTO9","mediaType":"LTO9","mountAttempted":"1","mountId":"141740","mountType":"ArchiveForUser","stillOpenFileForThread0":"root://cmsstor820.fnal.gov:1094/0000
521895334A7B45FC80632CCF67208603?xrdcl.requuid=7a302b10-5ee9-49dc-9fc4-b849b2fbe69e","stillOpenFileForThread1":"root://cmsstor820.fnal.gov:1096/00009EACA8AE725F418094EDE99E
3F8E2E73?xrdcl.requuid=de7138e5-4c7e-4555-9765-c1ca56e5b899","stillOpenFileForThread2":"root://cmsstor801.fnal.gov:1095/0000C41694844C9B4272A5DD491C4CFCE20A?xrdcl.requuid=b
eb11389-d3dd-4d0f-9080-ee3a41abc361","stillOpenFileForThread3":"root://cmsstor825.fnal.gov:1096/00004196F7D7CED94619B3045CE9812536C7?xrdcl.requuid=a3379499-0ec2-4c9e-9b20-f
3017c33edd7","stillOpenFileForThread5":"root://cmsstor812.fnal.gov:1095/0000D8DC032996AE4917B5B67D6BC46816AF?xrdcl.requuid=13a0c100-aa54-49b9-b1f6-e1e0d4746eba","stillOpenF
ileForThread6":"root://cmsstor827.fnal.gov:1095/000041347D6C538E417FB2FB19DDB81EC369?xrdcl.requuid=9468ef79-5b32-4e9b-8362-a6e27c35a439","stillOpenFileForThread7":"root://c
msstor817.fnal.gov:1095/00009CCADEA968AA41E499A4DAC0FEE8205E?xrdcl.requuid=60f05086-7352-49a9-ad4e-b621caa7c7ab","stillOpenFileForThread8":"root://cmsstor827.fnal.gov:1094/
00006965B7C7DCB945D394917A4FE3A77DD1?xrdcl.requuid=9357084d-2256-4f7c-839c-8021cdba8c23","stillOpenFileForThread9":"root://cmsstor805.fnal.gov:1094/0000ACEE5C75CEB84E35B3EE
0424A92E7C70?xrdcl.requuid=25e285d4-0d10-4c84-89a1-d1255cde5d98","tapePool":"cms.OORun2025RAW","tapeVid":"FA9237","vendor":"Fujifilm","vo":"cms","volReqId":"141740","wasTap
eMounted":"1","mountTime":"33.382368","positionTime":"87.91624","waitInstructionsTime":"0.291528","waitFreeMemoryTime":"0.0","waitDataTime":"0.0242869999999998","waitReport
ingTime":"0.000519","checksumingTime":"11.21807","readWriteTime":"70.056489","flushTime":"0.0","unloadTime":"0.0","unmountTime":"0.0","encryptionControlTime":"0.004379","tr
ansferTime":"81.590893","totalTime":"1096.007277","deliveryTime":"1096.007277","drainingTime":"0.0","dataVolume":"22191234431","filesCount":"4","headerVolume":"1920","paylo
adTransferSpeedMBps":"20.2473422363965","driveTransferSpeedMBps":"20.2473439882097","tapeDrive":"F1_F4B4D1","SessionState":"Running","SessionType":"Archive","TimeoutType":"
DataMovement","SessionTypeWhenTimeoutDecided":"Archive","SessionStateWhenTimeoutDecided":"Running","LastDataMovementTime":2398718,"LastHeartbeatTime":2399604,"LastStateChan
geTime":2398629,"Now":2399618,"ThisTimeout":2399618,"BeforeDataMovementTimeout_s":-0.010985567,"BeforeHeartbeatTimeout_s":45.806903644,"SubprocessId":1202507}

As the day has progressed we have been still investigating but things seem to have improved and we are not sure why. When checking the logs we saw symptoms but were having trouble pinning down an actually issue.

{“epoch_time”:1752469198.186260641,“local_time”:“2025-07-13T23:59:58-0500”,“hostname”:“tpsrvf2205”,“program”:“cta-taped”,“log_level”:“INFO”,“pid”:3153337,“tid”:3153337,“message”:“Propagated SIGCHILD.”,“drive_name”:“F1_F4B4D1”,“instance”:“prd”,“sched_backend”:“cephUser”,“capacityInBytes”:“18000000000000”,“logicalLibrary”:“F1_LTO9”,“mediaType”:“LTO9”,“mountAttempted”:“1”,“mountId”:“138917”,“mountType”:“ArchiveForUser”,“stillOpenFileForThread1”:“root://cmsstor820.fnal.gov:1095/000068FD5E1A5BF74183B4C9553F2D722C31?xrdcl.requuid=e2af057b-ffb9-41f6-b38f-10aa6afb5f44”,“stillOpenFileForThread2”:“root://cmsstor807.fnal.gov:1094/0000A7BA7FB472F34B87BC253C2D4C541777?xrdcl.requuid=c4792d08-84d7-402b-9295-650d387c2f90”,“stillOpenFileForThread5”:“root://cmsstor815.fnal.gov:1095/0000D16542837BAB4466A2838BAC4E861115?xrdcl.requuid=316182e8-a854-4163-9e2b-00e263e16de1”,“stillOpenFileForThread6”:“root://cmsstor814.fnal.gov:1095/00008DA0F0856F434D1BAD8BC8F2F829A195?xrdcl.requuid=39148685-520e-4075-b10d-e03a3742e1f3”,“stillOpenFileForThread7”:“root://cmsstor814.fnal.gov:1094/00008326D99D476A4414ADC16F6FA16835C7?xrdcl.requuid=3baf7bb6-d199-4764-b8d0-56e203debafa”,“stillOpenFileForThread8”:“root://cmsstor810.fnal.gov:1095/0000B85F1F97B7044AD19FDCF72F1A456200?xrdcl.requuid=4aff05de-e7d9-4390-a697-5f1c81e2ec0a”,“tapePool”:“cms.OORun2025Prompt”,“tapeVid”:“FA9328”,“vendor”:“Fujifilm”,“vo”:“cms”,“volReqId”:“138917”,“stillOpenFileForThread3”:“root://cmsstor820.fnal.gov:1096/0000AD1B2C58EB56471BAAB67DD437F0F770?xrdcl.requuid=0df975e1-b52a-4e90-bdf7-49ab32523514”,“wasTapeMounted”:“1”,“mountTime”:“31.933763”,“positionTime”:“60.106393”,“waitInstructionsTime”:“0.315143”,“waitFreeMemoryTime”:“0.0”,“waitDataTime”:“1.920995”,“waitReportingTime”:“0.006894”,“checksumingTime”:“126.70649”,“readWriteTime”:“633.397198”,“flushTime”:“12.919832”,“unloadTime”:“0.0”,“unmountTime”:“0.0”,“encryptionControlTime”:“0.004267”,“transferTime”:“775.266552”,“totalTime”:“1531.527783”,“deliveryTime”:“1531.527783”,“drainingTime”:“0.0”,“dataVolume”:“246049419845”,“filesCount”:“70”,“headerVolume”:“33600”,“payloadTransferSpeedMBps”:“160.656190880868”,“driveTransferSpeedMBps”:“160.656212819745”,“stillOpenFileForThread4”:“root://cmsstor825.fnal.gov:1095/0000CE5F7C5844FC4923A548C173A4B4C34B?xrdcl.requuid=ca278060-35d2-454c-8fab-952d61d70794”,“stillOpenFileForThread9”:“root://cmsstor816.fnal.gov:1096/0000256CDE9A779E41D584C5C05E808E42B1?xrdcl.requuid=3a4061c4-7f5b-4f36-a232-2413654e81dc”,“SubprocessName”:“signalHandler”}
{“epoch_time”:1752469198.186389111,“local_time”:“2025-07-13T23:59:58-0500”,“hostname”:“tpsrvf2205”,“program”:“cta-taped”,“log_level”:“INFO”,“pid”:3153337,“tid”:3153337,“message”:“Propagated SIGCHILD.”,“drive_name”:“F1_F4B4D1”,“instance”:“prd”,“sched_backend”:“cephUser”,“capacityInBytes”:“18000000000000”,“logicalLibrary”:“F1_LTO9”,“mediaType”:“LTO9”,“mountAttempted”:“1”,“mountId”:“138917”,“mountType”:“ArchiveForUser”,“stillOpenFileForThread1”:“root://cmsstor820.fnal.gov:1095/000068FD5E1A5BF74183B4C9553F2D722C31?xrdcl.requuid=e2af057b-ffb9-41f6-b38f-10aa6afb5f44”,“stillOpenFileForThread2”:“root://cmsstor807.fnal.gov:1094/0000A7BA7FB472F34B87BC253C2D4C541777?xrdcl.requuid=c4792d08-84d7-402b-9295-650d387c2f90”,“stillOpenFileForThread5”:“root://cmsstor815.fnal.gov:1095/0000D16542837BAB4466A2838BAC4E861115?xrdcl.requuid=316182e8-a854-4163-9e2b-00e263e16de1”,“stillOpenFileForThread6”:“root://cmsstor814.fnal.gov:1095/00008DA0F0856F434D1BAD8BC8F2F829A195?xrdcl.requuid=39148685-520e-4075-b10d-e03a3742e1f3”,“stillOpenFileForThread7”:“root://cmsstor814.fnal.gov:1094/00008326D99D476A4414ADC16F6FA16835C7?xrdcl.requuid=3baf7bb6-d199-4764-b8d0-56e203debafa”,“stillOpenFileForThread8”:“root://cmsstor810.fnal.gov:1095/0000B85F1F97B7044AD19FDCF72F1A456200?xrdcl.requuid=4aff05de-e7d9-4390-a697-5f1c81e2ec0a”,“tapePool”:“cms.OORun2025Prompt”,“tapeVid”:“FA9328”,“vendor”:“Fujifilm”,“vo”:“cms”,“volReqId”:“138917”,“stillOpenFileForThread3”:“root://cmsstor820.fnal.gov:1096/0000AD1B2C58EB56471BAAB67DD437F0F770?xrdcl.requuid=0df975e1-b52a-4e90-bdf7-49ab32523514”,“wasTapeMounted”:“1”,“mountTime”:“31.933763”,“positionTime”:“60.106393”,“waitInstructionsTime”:“0.315143”,“waitFreeMemoryTime”:“0.0”,“waitDataTime”:“1.920995”,“waitReportingTime”:“0.006894”,“checksumingTime”:“126.70649”,“readWriteTime”:“633.397198”,“flushTime”:“12.919832”,“unloadTime”:“0.0”,“unmountTime”:“0.0”,“encryptionControlTime”:“0.004267”,“transferTime”:“775.266552”,“totalTime”:“1531.527783”,“deliveryTime”:“1531.527783”,“drainingTime”:“0.0”,“dataVolume”:“246049419845”,“filesCount”:“70”,“headerVolume”:“33600”,“payloadTransferSpeedMBps”:“160.656190880868”,“driveTransferSpeedMBps”:“160.656212819745”,“stillOpenFileForThread4”:“root://cmsstor825.fnal.gov:1095/0000CE5F7C5844FC4923A548C173A4B4C34B?xrdcl.requuid=ca278060-35d2-454c-8fab-952d61d70794”,“stillOpenFileForThread9”:“root://cmsstor816.fnal.gov:1096/0000256CDE9A779E41D584C5C05E808E42B1?xrdcl.requuid=3a4061c4-7f5b-4f36-a232-2413654e81dc”,“SubprocessName”:“drive:F1_F4B4D1”}
{“epoch_time”:1752469198.186494687,“local_time”:“2025-07-13T23:59:58-0500”,“hostname”:“tpsrvf2205”,“program”:“cta-taped”,“log_level”:“INFO”,“pid”:3153337,“tid”:3153337,“message”:“Propagated SIGCHILD.”,“drive_name”:“F1_F4B4D1”,“instance”:“prd”,“sched_backend”:“cephUser”,“capacityInBytes”:“18000000000000”,“logicalLibrary”:“F1_LTO9”,“mediaType”:“LTO9”,“mountAttempted”:“1”,“mountId”:“138917”,“mountType”:“ArchiveForUser”,“stillOpenFileForThread1”:“root://cmsstor820.fnal.gov:1095/000068FD5E1A5BF74183B4C9553F2D722C31?xrdcl.requuid=e2af057b-ffb9-41f6-b38f-10aa6afb5f44”,“stillOpenFileForThread2”:“root://cmsstor807.fnal.gov:1094/0000A7BA7FB472F34B87BC253C2D4C541777?xrdcl.requuid=c4792d08-84d7-402b-9295-650d387c2f90”,“stillOpenFileForThread5”:“root://cmsstor815.fnal.gov:1095/0000D16542837BAB4466A2838BAC4E861115?xrdcl.requuid=316182e8-a854-4163-9e2b-00e263e16de1”,“stillOpenFileForThread6”:“root://cmsstor814.fnal.gov:1095/00008DA0F0856F434D1BAD8BC8F2F829A195?xrdcl.requuid=39148685-520e-4075-b10d-e03a3742e1f3”,“stillOpenFileForThread7”:“root://cmsstor814.fnal.gov:1094/00008326D99D476A4414ADC16F6FA16835C7?xrdcl.requuid=3baf7bb6-d199-4764-b8d0-56e203debafa”,“stillOpenFileForThread8”:“root://cmsstor810.fnal.gov:1095/0000B85F1F97B7044AD19FDCF72F1A456200?xrdcl.requuid=4aff05de-e7d9-4390-a697-5f1c81e2ec0a”,“tapePool”:“cms.OORun2025Prompt”,“tapeVid”:“FA9328”,“vendor”:“Fujifilm”,“vo”:“cms”,“volReqId”:“138917”,“stillOpenFileForThread3”:“root://cmsstor820.fnal.gov:1096/0000AD1B2C58EB56471BAAB67DD437F0F770?xrdcl.requuid=0df975e1-b52a-4e90-bdf7-49ab32523514”,“wasTapeMounted”:“1”,“mountTime”:“31.933763”,“positionTime”:“60.106393”,“waitInstructionsTime”:“0.315143”,“waitFreeMemoryTime”:“0.0”,“waitDataTime”:“1.920995”,“waitReportingTime”:“0.006894”,“checksumingTime”:“126.70649”,“readWriteTime”:“633.397198”,“flushTime”:“12.919832”,“unloadTime”:“0.0”,“unmountTime”:“0.0”,“encryptionControlTime”:“0.004267”,“transferTime”:“775.266552”,“totalTime”:“1531.527783”,“deliveryTime”:“1531.527783”,“drainingTime”:“0.0”,“dataVolume”:“246049419845”,“filesCount”:“70”,“headerVolume”:“33600”,“payloadTransferSpeedMBps”:“160.656190880868”,“driveTransferSpeedMBps”:“160.656212819745”,“stillOpenFileForThread4”:“root://cmsstor825.fnal.gov:1095/0000CE5F7C5844FC4923A548C173A4B4C34B?xrdcl.requuid=ca278060-35d2-454c-8fab-952d61d70794”,“stillOpenFileForThread9”:“root://cmsstor816.fnal.gov:1096/0000256CDE9A779E41D584C5C05E808E42B1?xrdcl.requuid=3a4061c4-7f5b-4f36-a232-2413654e81dc”,“SubprocessName”:“maintenanceHandler”}
{“epoch_time”:1752469198.186867331,“local_time”:“2025-07-13T23:59:58-0500”,“hostname”:“tpsrvf2205”,“program”:“cta-taped”,“log_level”:“INFO”,“pid”:3153337,“tid”:3153337,“message”:“Handler received SIGCHILD. Propagating to all handlers.”,“drive_name”:“F1_F4B4D1”,“instance”:“prd”,“sched_backend”:“cephUser”,“capacityInBytes”:“18000000000000”,“logicalLibrary”:“F1_LTO9”,“mediaType”:“LTO9”,“mountAttempted”:“1”,“mountId”:“138917”,“mountType”:“ArchiveForUser”,“stillOpenFileForThread1”:“root://cmsstor820.fnal.gov:1095/000068FD5E1A5BF74183B4C9553F2D722C31?xrdcl.requuid=e2af057b-ffb9-41f6-b38f-10aa6afb5f44”,“stillOpenFileForThread2”:“root://cmsstor807.fnal.gov:1094/0000A7BA7FB472F34B87BC253C2D4C541777?xrdcl.requuid=c4792d08-84d7-402b-9295-650d387c2f90”,“stillOpenFileForThread5”:“root://cmsstor815.fnal.gov:1095/0000D16542837BAB4466A2838BAC4E861115?xrdcl.requuid=316182e8-a854-4163-9e2b-00e263e16de1”,“stillOpenFileForThread6”:“root://cmsstor814.fnal.gov:1095/00008DA0F0856F434D1BAD8BC8F2F829A195?xrdcl.requuid=39148685-520e-4075-b10d-e03a3742e1f3”,“stillOpenFileForThread7”:“root://cmsstor814.fnal.gov:1094/00008326D99D476A4414ADC16F6FA16835C7?xrdcl.requuid=3baf7bb6-d199-4764-b8d0-56e203debafa”,“stillOpenFileForThread8”:“root://cmsstor810.fnal.gov:1095/0000B85F1F97B7044AD19FDCF72F1A456200?xrdcl.requuid=4aff05de-e7d9-4390-a697-5f1c81e2ec0a”,“tapePool”:“cms.OORun2025Prompt”,“tapeVid”:“FA9328”,“vendor”:“Fujifilm”,“vo”:“cms”,“volReqId”:“138917”,“stillOpenFileForThread3”:“root://cmsstor820.fnal.gov:1096/0000AD1B2C58EB56471BAAB67DD437F0F770?xrdcl.requuid=0df975e1-b52a-4e90-bdf7-49ab32523514”,“wasTapeMounted”:“1”,“mountTime”:“31.933763”,“positionTime”:“60.106393”,“waitInstructionsTime”:“0.315143”,“waitFreeMemoryTime”:“0.0”,“waitDataTime”:“1.920995”,“waitReportingTime”:“0.006894”,“checksumingTime”:“126.70649”,“readWriteTime”:“633.397198”,“flushTime”:“12.919832”,“unloadTime”:“0.0”,“unmountTime”:“0.0”,“encryptionControlTime”:“0.004267”,“transferTime”:“775.266552”,“totalTime”:“1546.543352”,“deliveryTime”:“1546.543352”,“drainingTime”:“0.0”,“dataVolume”:“246049419845”,“filesCount”:“70”,“headerVolume”:“33600”,“payloadTransferSpeedMBps”:“159.096361267085”,“driveTransferSpeedMBps”:“159.096382992955”,“stillOpenFileForThread4”:“root://cmsstor825.fnal.gov:1095/0000CE5F7C5844FC4923A548C173A4B4C34B?xrdcl.requuid=ca278060-35d2-454c-8fab-952d61d70794”,“stillOpenFileForThread9”:“root://cmsstor816.fnal.gov:1096/0000256CDE9A779E41D584C5C05E808E42B1?xrdcl.requuid=3a4061c4-7f5b-4f36-a232-2413654e81dc”,“SubprocessName”:“signalHandler”}
{“epoch_time”:1752469198.186976913,“local_time”:“2025-07-13T23:59:58-0500”,“hostname”:“tpsrvf2205”,“program”:“cta-taped”,“log_level”:“INFO”,“pid”:3153337,“tid”:3153337,“message”:“Propagated SIGCHILD.”,“drive_name”:“F1_F4B4D1”,“instance”:“prd”,“sched_backend”:“cephUser”,“capacityInBytes”:“18000000000000”,“logicalLibrary”:“F1_LTO9”,“mediaType”:“LTO9”,“mountAttempted”:“1”,“mountId”:“138917”,“mountType”:“ArchiveForUser”,“stillOpenFileForThread1”:“root://cmsstor820.fnal.gov:1095/000068FD5E1A5BF74183B4C9553F2D722C31?xrdcl.requuid=e2af057b-ffb9-41f6-b38f-10aa6afb5f44”,“stillOpenFileForThread2”:“root://cmsstor807.fnal.gov:1094/0000A7BA7FB472F34B87BC253C2D4C541777?xrdcl.requuid=c4792d08-84d7-402b-9295-650d387c2f90”,“stillOpenFileForThread5”:“root://cmsstor815.fnal.gov:1095/0000D16542837BAB4466A2838BAC4E861115?xrdcl.requuid=316182e8-a854-4163-9e2b-00e263e16de1”,“stillOpenFileForThread6”:“root://cmsstor814.fnal.gov:1095/00008DA0F0856F434D1BAD8BC8F2F829A195?xrdcl.requuid=39148685-520e-4075-b10d-e03a3742e1f3”,“stillOpenFileForThread7”:“root://cmsstor814.fnal.gov:1094/00008326D99D476A4414ADC16F6FA16835C7?xrdcl.requuid=3baf7bb6-d199-4764-b8d0-56e203debafa”,“stillOpenFileForThread8”:“root://cmsstor810.fnal.gov:1095/0000B85F1F97B7044AD19FDCF72F1A456200?xrdcl.requuid=4aff05de-e7d9-4390-a697-5f1c81e2ec0a”,“tapePool”:“cms.OORun2025Prompt”,“tapeVid”:“FA9328”,“vendor”:“Fujifilm”,“vo”:“cms”,“volReqId”:“138917”,“stillOpenFileForThread3”:“root://cmsstor820.fnal.gov:1096/0000AD1B2C58EB56471BAAB67DD437F0F770?xrdcl.requuid=0df975e1-b52a-4e90-bdf7-49ab32523514”,“wasTapeMounted”:“1”,“mountTime”:“31.933763”,“positionTime”:“60.106393”,“waitInstructionsTime”:“0.315143”,“waitFreeMemoryTime”:“0.0”,“waitDataTime”:“1.920995”,“waitReportingTime”:“0.006894”,“checksumingTime”:“126.70649”,“readWriteTime”:“633.397198”,“flushTime”:“12.919832”,“unloadTime”:“0.0”,“unmountTime”:“0.0”,“encryptionControlTime”:“0.004267”,“transferTime”:“775.266552”,“totalTime”:“1546.543352”,“deliveryTime”:“1546.543352”,“drainingTime”:“0.0”,“dataVolume”:“246049419845”,“filesCount”:“70”,“headerVolume”:“33600”,“payloadTransferSpeedMBps”:“159.096361267085”,“driveTransferSpeedMBps”:“159.096382992955”,“stillOpenFileForThread4”:“root://cmsstor825.fnal.gov:1095/0000CE5F7C5844FC4923A548C173A4B4C34B?xrdcl.requuid=ca278060-35d2-454c-8fab-952d61d70794”,“stillOpenFileForThread9”:“root://cmsstor816.fnal.gov:1096/0000256CDE9A779E41D584C5C05E808E42B1?xrdcl.requuid=3a4061c4-7f5b-4f36-a232-2413654e81dc”,“SubprocessName”:“signalHandler”}
{“epoch_time”:1752469198.187098269,“local_time”:“2025-07-13T23:59:58-0500”,“hostname”:“tpsrvf2205”,“program”:“cta-taped”,“log_level”:“INFO”,“pid”:3153337,“tid”:3153337,“message”:“Propagated SIGCHILD.”,“drive_name”:“F1_F4B4D1”,“instance”:“prd”,“sched_backend”:“cephUser”,“capacityInBytes”:“18000000000000”,“logicalLibrary”:“F1_LTO9”,“mediaType”:“LTO9”,“mountAttempted”:“1”,“mountId”:“138917”,“mountType”:“ArchiveForUser”,“stillOpenFileForThread1”:“root://cmsstor820.fnal.gov:1095/000068FD5E1A5BF74183B4C9553F2D722C31?xrdcl.requuid=e2af057b-ffb9-41f6-b38f-10aa6afb5f44”,“stillOpenFileForThread2”:“root://cmsstor807.fnal.gov:1094/0000A7BA7FB472F34B87BC253C2D4C541777?xrdcl.requuid=c4792d08-84d7-402b-9295-650d387c2f90”,“stillOpenFileForThread5”:“root://cmsstor815.fnal.gov:1095/0000D16542837BAB4466A2838BAC4E861115?xrdcl.requuid=316182e8-a854-4163-9e2b-00e263e16de1”,“stillOpenFileForThread6”:“root://cmsstor814.fnal.gov:1095/00008DA0F0856F434D1BAD8BC8F2F829A195?xrdcl.requuid=39148685-520e-4075-b10d-e03a3742e1f3”,“stillOpenFileForThread7”:“root://cmsstor814.fnal.gov:1094/00008326D99D476A4414ADC16F6FA16835C7?xrdcl.requuid=3baf7bb6-d199-4764-b8d0-56e203debafa”,“stillOpenFileForThread8”:“root://cmsstor810.fnal.gov:1095/0000B85F1F97B7044AD19FDCF72F1A456200?xrdcl.requuid=4aff05de-e7d9-4390-a697-5f1c81e2ec0a”,“tapePool”:“cms.OORun2025Prompt”,“tapeVid”:“FA9328”,“vendor”:“Fujifilm”,“vo”:“cms”,“volReqId”:“138917”,“stillOpenFileForThread3”:“root://cmsstor820.fnal.gov:1096/0000AD1B2C58EB56471BAAB67DD437F0F770?xrdcl.requuid=0df975e1-b52a-4e90-bdf7-49ab32523514”,“wasTapeMounted”:“1”,“mountTime”:“31.933763”,“positionTime”:“60.106393”,“waitInstructionsTime”:“0.315143”,“waitFreeMemoryTime”:“0.0”,“waitDataTime”:“1.920995”,“waitReportingTime”:“0.006894”,“checksumingTime”:“126.70649”,“readWriteTime”:“633.397198”,“flushTime”:“12.919832”,“unloadTime”:“0.0”,“unmountTime”:“0.0”,“encryptionControlTime”:“0.004267”,“transferTime”:“775.266552”,“totalTime”:“1546.543352”,“deliveryTime”:“1546.543352”,“drainingTime”:“0.0”,“dataVolume”:“246049419845”,“filesCount”:“70”,“headerVolume”:“33600”,“payloadTransferSpeedMBps”:“159.096361267085”,“driveTransferSpeedMBps”:“159.096382992955”,“stillOpenFileForThread4”:“root://cmsstor825.fnal.gov:1095/0000CE5F7C5844FC4923A548C173A4B4C34B?xrdcl.requuid=ca278060-35d2-454c-8fab-952d61d70794”,“stillOpenFileForThread9”:“root://cmsstor816.fnal.gov:1096/0000256CDE9A779E41D584C5C05E808E42B1?xrdcl.requuid=3a4061c4-7f5b-4f36-a232-2413654e81dc”,“SubprocessName”:“drive:F1_F4B4D1”}
{“epoch_time”:1752469198.187207464,“local_time”:“2025-07-13T23:59:58-0500”,“hostname”:“tpsrvf2205”,“program”:“cta-taped”,“log_level”:“INFO”,“pid”:3153337,“tid”:3153337,“message”:“Propagated SIGCHILD.”,“drive_name”:“F1_F4B4D1”,“instance”:“prd”,“sched_backend”:“cephUser”,“capacityInBytes”:“18000000000000”,“logicalLibrary”:“F1_LTO9”,“mediaType”:“LTO9”,“mountAttempted”:“1”,“mountId”:“138917”,“mountType”:“ArchiveForUser”,“stillOpenFileForThread1”:“root://cmsstor820.fnal.gov:1095/000068FD5E1A5BF74183B4C9553F2D722C31?xrdcl.requuid=e2af057b-ffb9-41f6-b38f-10aa6afb5f44”,“stillOpenFileForThread2”:“root://cmsstor807.fnal.gov:1094/0000A7BA7FB472F34B87BC253C2D4C541777?xrdcl.requuid=c4792d08-84d7-402b-9295-650d387c2f90”,“stillOpenFileForThread5”:“root://cmsstor815.fnal.gov:1095/0000D16542837BAB4466A2838BAC4E861115?xrdcl.requuid=316182e8-a854-4163-9e2b-00e263e16de1”,“stillOpenFileForThread6”:“root://cmsstor814.fnal.gov:1095/00008DA0F0856F434D1BAD8BC8F2F829A195?xrdcl.requuid=39148685-520e-4075-b10d-e03a3742e1f3”,“stillOpenFileForThread7”:“root://cmsstor814.fnal.gov:1094/00008326D99D476A4414ADC16F6FA16835C7?xrdcl.requuid=3baf7bb6-d199-4764-b8d0-56e203debafa”,“stillOpenFileForThread8”:“root://cmsstor810.fnal.gov:1095/0000B85F1F97B7044AD19FDCF72F1A456200?xrdcl.requuid=4aff05de-e7d9-4390-a697-5f1c81e2ec0a”,“tapePool”:“cms.OORun2025Prompt”,“tapeVid”:“FA9328”,“vendor”:“Fujifilm”,“vo”:“cms”,“volReqId”:“138917”,“stillOpenFileForThread3”:“root://cmsstor820.fnal.gov:1096/0000AD1B2C58EB56471BAAB67DD437F0F770?xrdcl.requuid=0df975e1-b52a-4e90-bdf7-49ab32523514”,“wasTapeMounted”:“1”,“mountTime”:“31.933763”,“positionTime”:“60.106393”,“waitInstructionsTime”:“0.315143”,“waitFreeMemoryTime”:“0.0”,“waitDataTime”:“1.920995”,“waitReportingTime”:“0.006894”,“checksumingTime”:“126.70649”,“readWriteTime”:“633.397198”,“flushTime”:“12.919832”,“unloadTime”:“0.0”,“unmountTime”:“0.0”,“encryptionControlTime”:“0.004267”,“transferTime”:“775.266552”,“totalTime”:“1546.543352”,“deliveryTime”:“1546.543352”,“drainingTime”:“0.0”,“dataVolume”:“246049419845”,“filesCount”:“70”,“headerVolume”:“33600”,“payloadTransferSpeedMBps”:“159.096361267085”,“driveTransferSpeedMBps”:“159.096382992955”,“stillOpenFileForThread4”:“root://cmsstor825.fnal.gov:1095/0000CE5F7C5844FC4923A548C173A4B4C34B?xrdcl.requuid=ca278060-35d2-454c-8fab-952d61d70794”,“stillOpenFileForThread9”:“root://cmsstor816.fnal.gov:1096/0000256CDE9A779E41D584C5C05E808E42B1?xrdcl.requuid=3a4061c4-7f5b-4f36-a232-2413654e81dc”,“SubprocessName”:“maintenanceHandler”}
{“epoch_time”:1752469201.621937231,“local_time”:“2025-07-14T00:00:01-0500”,“hostname”:“tpsrvf2205”,“program”:“cta-taped”,“log_level”:“INFO”,“pid”:3153339,“tid”:3153339,“message”:“In MaintenanceHandler::exceptionThrowingRunChild(): About to do a maintenance pass.”,“drive_name”:“F1_F4B4D1”,“instance”:“prd”,“sched_backend”:“cephUser”,“SubprocessName”:“maintenanceHandler”}


This is to show the improvement from yesterday to today.

@snorberg1 Those log messages just tell you that the child processes were killed by the parent process. You need to look at the messages from the parent process to find out why it died. Was it because of a timeout, as in the log message from @rillingw ?

@rillingw The log line you sent says DriveHandler::processTimeout(): Killed subprocess. for "SubprocessId":1202507. You need to look at what SubprocessId 1202507 was doing that caused it to time out.

So far we have established the following chain of events:

  • Something is timing out
  • Drive process is killing child processes due to the timeout
  • Tape is still in the drive when the processes restart
  • Mounts are failing because the tape is stuck in a drive

Disabling tapes/unmounting them manually/re-enabling them will deal with the immediate problem of stuck tapes and failed mounts. Separate to that you need to determine what is timing out and causing the problem in the first place.

The last messages I can see for this pid are

{"epoch_time":1752575559.424171415,"local_time":"2025-07-15T05:32:39-0500","hostname":"tpsrvf2205","program":"cta-taped","log_level":"INFO","pid":1202507,"tid":1202600,"message":"File successfully transmitted to drive","drive_name":"F1_F4B4D1","instance":"prd","sched_backend":"cephUser","thread":"TapeWrite","tapeDrive":"F1_F4B4D1","tapeVid":"FA9237","mountId":"141740","vo":"cms","tapePool":"cms.OORun2025RAW","mediaType":"LTO9","logicalLibrary":"F1_LTO9","mountType":"ArchiveForUser","vendor":"Fujifilm","capacityInBytes":18000000000000,"fileId":4526758676,"fileSize":5401031488,"fSeq":2373,"diskURL":"root://cmsstor825.fnal.gov:1096/00004196F7D7CED94619B3045CE9812536C7","readWriteTime":13.260813,"checksumingTime":2.759233,"waitDataTime":0.00685899999999995,"waitReportingTime":0.000155,"transferTime":16.02706,"totalTime":16.028591,"dataVolume":5401031488,"headerVolume":480,"driveTransferSpeedMBps":336.962367309765,"payloadTransferSpeedMBps":336.962337363278,"reconciliationTime":0,"LBPMode":"LBP_On"}
...
{"epoch_time":1752576159.459243089,"local_time":"2025-07-15T05:42:39-0500","hostname":"tpsrvf2205","program":"cta-taped","log_level":"WARN","pid":1202507,"tid":1202599,"message":"No tape block movement for too long during archiving","drive_name":"F1_F4B4D1","instance":"prd","sched_backend":"cephUser","thread":"Watchdog","tapeDrive":"F1_F4B4D1","tapeVid":"FA9237","mountId":"141740","vo":"cms","tapePool":"cms.OORun2025RAW","TimeSinceLastBlockMove":600.041354,"TimeSinceLastBlockMoveReport":802.645097,"NoBlockMoveMaxSecs":600.0,"fileId":4526757870,"fSeq":2374}

It looks like the first is after writing the trailer of a file, so what block movement is taking too long? It doesn’t appear to have started writing a new file.

Although, as @snorberg1 mentioned, the situation appears to have spontaneously resolved itself, which is helpful, but concerning as we currently have no clear idea why it started or stopped.

Well that looks like it could be the cause of the timeouts: No tape block movement for too long during archiving. The timeout value is 10 minutes so something seems wrong.

Check back in your logs for that message and see if it occurs across multiple tape servers and how the error is distributed in time. If it happens in many tape servers at once, that points to a problem in the shared infrastructure (objectstore, database, disk buffer, hardware, … ?)

Thanks this was very helpful.

We are seeing those errors.
I expected to see them for this tape pool:
{“epoch_time”:1752595245.429060369,“local_time”:“2025-07-15T11:00:45-0500”,“hostname”:“tpsrvf2201”,“program”:“cta-taped”,“log_level”:“WARN”,“pid”:984557,“tid”:985467,“message”:“No tape block movement for too long during archiving”,“drive_name”:“F1_F4B2D2”,“instance”:“prd”,“sched_backend”:“cephUser”,“thread”:“Watchdog”,“tapeDrive”:“F1_F4B2D2”,“tapeVid”:“FA8633”,“mountId”:“141876”,“vo”:“prd_repack”,“tapePool”:“cms.ConsolidatedData”,“TimeSinceLastBlockMove”:600.092159,“TimeSinceLastBlockMoveReport”:674.067905,“NoBlockMoveMaxSecs”:600.0,“fileId”:4322464272,“fSeq”:128}
We know that the dCache pool went down for this one.
I was not expecting to see it for other tape pools
{“epoch_time”:1752205779.773900447,“local_time”:“2025-07-10T22:49:39-0500”,“hostname”:“tpsrvf2205”,“program”:“cta-taped”,“log_level”:“WARN”,“pid”:2646849,“tid”:2646955,“message”:“No tape block movement for too long during archiving”,“drive_name”:“F1_F4B4D1”,“instance”:“prd”,“sched_backend”:“cephUser”,“thread”:“Watchdog”,“tapeDrive”:“F1_F4B4D1”,“tapeVid”:“FA9196”,“mountId”:“131595”,“vo”:“cms”,“tapePool”:“cms.OORun2025RAW”,“TimeSinceLastBlockMove”:600.028514,“TimeSinceLastBlockMoveReport”:883.827896,“NoBlockMoveMaxSecs”:600.0,“fileId”:4526771907,“fSeq”:2625}
We are investigating some of the connections to CTA to see if anything pops up. We are at least recovered. I really appreciate you having a look and helping make suggestions.

What do you mean by pool in this context? You mean the dCache disk pool went down for transfers to the cms.ConsolidatedData tape pool?

Of course if the disk buffer is inaccessible the transfer will stall. Does your disk buffer monitoring show that the problems correlate in time to the No tape block movement for too long errors?

Sorry pool is like eos node. One of the dCache nodes went down after this issue started to occur so of course I expected to see those errors. For that node. No other node went down so it does not make sense for it to cause this wide spread issue.
None of the other nodes seem to be having any issues during that time.
Digging through the logs more for Thursday/Friday. I just see sources of the issue but not something that points to an actually problem.