Hello.
I have found following problem: while writing data to tape I have got:
Mar 08 13:30:12 arch01 cta-taped[36359]: LVL="INFO" PID="36359" TID="40776" MSG="In cta::ArchiveMount::reportJobsBatchTransferred(), archive job successful." thread="MainThread" tapeDrive="IBM-3592-E07-1_2" mountId="317" tapeVid="A01904" mountType="ARCHIVE_FOR_USER" fileId="66511" type="ReportSuccessful"
Mar 08 13:30:12 arch01 cta-taped[36359]: LVL="ERROR" PID="36359" TID="40776" MSG="In ArchiveMount::reportJobsBatchWritten(): got an exception" thread="MainThread" tapeDrive="IBM-3592-E07-1_2" mountId="317" exceptionMessageValue="Checksum type expected=NONE actual=ADLER32" fileId="66511" diskInstance="eosNBGI" diskFileId="86419" lastKnownDiskPath="/eos/NBGI/ceph/data/exp/bio/db/ncbi/pmc/2023-02-28/ncbi_pmc-2023-02-28-2720.tar"
reportURL="In ArchiveJob::exceptionThrowingReportURL(): job status NoReportRequired does not require reporting.
/lib64/libctacommon.so.0(cta::exception::Backtrace::Backtrace(bool)+0x69) [0x7fa415ef58d3]
/lib64/libctacommon.so.0(cta::exception::Exception::Exception(std::string const&, bool)+0x89) [0x7fa415ef7137]
/lib64/libctascheduler.so.0(cta::ArchiveJob::exceptionThrowingReportURL()+0x261) [0x7fa4188c2399]
/lib64/libctascheduler.so.0(cta::ArchiveJob::reportURL()+0x3d) [0x7fa4188c26a5]
/lib64/libctascheduler.so.0(cta::ArchiveMount::reportJobsBatchTransferred(std::queue<std::unique_ptr<cta::ArchiveJob, std::default_delete<cta::ArchiveJob> >, std::deque<std::unique_ptr<cta::ArchiveJob, std::default_delete<cta::ArchiveJob> >, std::allocator<std::unique_ptr<cta::ArchiveJob, std::default_delete<cta::ArchiveJob> > > > >&, std::queue<cta::catalogue::TapeItemWritten, std::deque<cta::catalogue::TapeItemWritten, std::allocator<cta::catalogue::TapeItemWritten> > >&, std::queue<std::unique_ptr<cta::SchedulerDatabase::ArchiveJob, std::default_delete<cta::SchedulerDatabase::ArchiveJob> >, std::deque<std::unique_ptr<cta::SchedulerDatabase::ArchiveJob, std::default_delete<cta::SchedulerDatabase::ArchiveJob> >, std::allocator<std::unique_ptr<cta::SchedulerDatabase::ArchiveJob, std::default_delete<cta::SchedulerDatabase::ArchiveJob> > > > >&, cta::log::LogContext&)+0xd47) [0x7fa4188d6555]
/usr/bin/cta-taped() [0x52e5f6] /usr/bin/cta-taped() [0x530752]
/lib64/libctacommon.so.0(cta::threading::Thread::pthread_runner(void*)+0xef) [0x7fa415f1d047]
/lib64/libpthread.so.0(+0x7ea5) [0x7fa415bd7ea5]
/lib64/libc.so.6(clone+0x6d) [0x7fa40fb1696d]"
Mar 08 13:30:12 arch01 cta-taped[36359]: LVL="INFO" PID="36359" TID="40776" MSG="In Helpers::getLockedAndFetchedQueue<ArchiveQueue>(): Successfully found and locked an archive queue." thread="MainThread" tapeDrive="IBM-3592-E07-1_2" mountId="317" attemptNb="1" queueObject="ArchiveQueueToTransferForUser-db-Maintenance-arch01-5459-20230307-14:27:42-0-0" rootFetchNoLockTime="0.000186" rootRelockExclusiveTime="0.000000" rootRefetchTime="0.000000" addOrGetQueueandCommitTime="0.000000" rootUnlockExclusiveTime="0.000000" queueLockTime="0.000076" queueFetchTime="0.000090"
Mar 08 13:30:12 arch01 cta-taped[36359]: LVL="DEBUG" PID="36359" TID="40775" MSG="Successfully opened the tape file for writing" thread="TapeWrite" tapeDrive="IBM-3592-E07-1_2" tapeVid="A01904" mountId="317" vo="nbgi" mediaType="JC" tapePool="db" logicalLibrary="NBGI" mountType="ArchiveForUser" vendor="IBM" capacityInBytes="4000000000000" fileId="66391" fileSize="4293365760" fSeq="67" diskURL="root://arch01.ceph.kiae.ru:1094//eos/NBGI/ceph/data/exp/bio/db/ncbi/pmc/2023-02-28/ncbi_pmc-2023-02-28-2724.tar?eos.lfn=fxid:15197"
Mar 08 13:30:12 arch01 cta-taped[36359]: LVL="DEBUG" PID="36359" TID="40776" MSG="In AgentReference::appyAction(): removed object from ownership (by batch)." agentObject="DriveProcess-IBM-3592-E07-1_2-arch01-36359-20230308-13:26:47-0" ownedObject="ArchiveRequest-Frontend-arch01-34397-20230124-19:27:46-0-71584"
Mar 08 13:30:12 arch01 cta-taped[36359]: LVL="INFO" PID="36359" TID="40776" MSG="In ContainerAlgorithms::referenceAndSwitchOwnership(): Requeued a batch of elements." thread="MainThread" tapeDrive="IBM-3592-E07-1_2" mountId="317" C="ArchiveQueueToTransferForUser" tapepool="db" containerAddress="ArchiveQueueToTransferForUser-db-Maintenance-arch01-5459-20230307-14:27:42-0-0" queueJobsBefore="6760" queueBytesBefore="4753193594741" queueJobsAfter="6761" queueBytesAfter="4757486837621" queueLockFetchTime="0.000758" queueProcessAndCommitTime="0.044424" asyncUpdateLaunchTime="0.000323" asyncUpdateCompletionTime="0.000955" requestsUpdatingTime="0.000035" queueUnlockTime="0.000068"
Mar 08 13:30:12 arch01 cta-taped[36359]: LVL="INFO" PID="36359" TID="40776" MSG="In ArchiveJob::failTransfer(): requeued job for (potentially in-mount) retry." thread="MainThread" tapeDrive="IBM-3592-E07-1_2" mountId="317" fileId="66511" copyNb="1" failureReason="In ArchiveMount::reportJobsBatchWritten(): got an exception" requestObject="ArchiveRequest-Frontend-arch01-34397-20230124-19:27:46-0-71584" retriesWithinMount="1" maxRetriesWithinMount="2" totalRetries="1" maxTotalRetries="2"
Mar 08 13:30:12 arch01 cta-taped[36359]: LVL="INFO" PID="36359" TID="40776" MSG="In OStoreDB::ArchiveJob::~ArchiveJob(): will leave the job owned after destruction." agentObject="DriveProcess-IBM-3592-E07-1_2-arch01-36359-20230308-13:26:47-0" jobObject="ArchiveRequest-Frontend-arch01-34397-20230124-19:27:46-0-71584"
Mar 08 13:30:12 arch01 cta-taped[36359]: LVL="ERROR" PID="36359" TID="40776" MSG="In MigrationReportPacker::WorkerThread::run(): Received a CTA exception while reporting archive mount results." thread="ReportPacker" tapeDrive="IBM-3592-E07-1_2" tapeVid="A01904" mountId="317" exceptionMSG="In ArchiveMount::reportJobsBatchWritten(): got an exception"
and then taped
continue writes data to tape, but without actual flushing to catalogue.
Mar 08 13:30:26 arch01 cta-taped[36359]: LVL="INFO" PID="36359" TID="40773" MSG="File successfully read from disk" thread="DiskRead" tapeDrive="IBM-3592-E07-1_2" tapeVid="A01904" mountId="317" threadID="9" path="root://arch01.ceph.kiae.ru:1094//eos/NBGI/ceph/data/exp/bio/db/ncbi/pmc/2023-02-28/ncbi_pmc-2023-02-28-2725.tar?eos.lfn=fxid:15198" actualURL="root://arch01.ceph.kiae.ru:1095//eos/NBGI/ceph/data/exp/bio/db/ncbi/pmc/2023-02-28/ncbi_pmc-2023-02-28-2725.tar?cap.msg=vUpJXMIslxqXVKGSNGLiug79oY7chlxem6XWbJjZeJvch/Mx6hAGZ47/4Bepb+SrLGa3AX24E0deesdC7s7Ba5TYgJ3mgnWGp6EnCRjKuIuQj46ZwNSYGDdThlq1QGPuOL8qS8Gg7CtJxfQryySHL6UhWTs3lodgsCPKsTlMic2huo4/htWJ2eXZyJSCHyDNqSE7jYpGZkUnqUOmKK5xOaryQF3Kc+pEPg3fjK/pQKFd7jNfULmLqwH/pLwqrFrwGbj1b2MsMpA57w3h/5z3KQrpkpXXJWicoPzdF479EQlPYL2XaPJ0YwvtvIF8Gn+aYERD2ZJVR4iLCCis6FaybjSBb4sFLE4wV8v6PP+S2QXYlFGF47QS+zj4owROyCFMQWrHoo1aIvjz5Kqpq67mALJKwtVrw+E39mhjKG7aAt7AfxiS3TfMjKbn+BqnZw1sR9eEsK3vQG7CeyE2tKFI/A==&cap.sym=0A57/ahIl9rqg5d9pPgN+WSZZ68=&eos.lfn=fxid:15198&mgm.id=00015198&mgm.logid=df13a510-bd9b-11ed-8f7a-0cc47a871188&mgm.mtime=1677715692&mgm.replicahead=0&mgm.replicaindex=0&xrdcl.requuid=f12b17cf-f7eb-41dd-bf58-f72000497c1d" fileId="66472" readWriteTime="9.631786" checksumingTime="0.000000" waitFreeMemoryTime="149.299136" waitDataTime="0.000000" waitReportingTime="0.000000" checkingErrorTime="0.002969" openingTime="0.017173" transferTime="158.951078" totalTime="158.951078" dataVolume="4292986880" globalPayloadTransferSpeedMBps="27.008228" diskPerformanceMBps="27.008228" openRWCloseToTransferTimeRatio="0.060704"
Mar 08 13:30:26 arch01 cta-taped[36357]: LVL="DEBUG" PID="36357" TID="36357" MSG="Computed new timeout" tapeDrive="IBM-3592-E07-1_2" TimeoutType="Heartbeat" LastStateChangeTime="4067783" LastHeartBeatTime="4067852" LastDataMovementTime="4067843" Now="4067852" Timeout="4067912"
Mar 08 13:30:26 arch01 cta-taped[36359]: LVL="INFO" PID="36359" TID="40773" MSG="Opened disk file for read" thread="DiskRead" tapeDrive="IBM-3592-E07-1_2" tapeVid="A01904" mountId="317" threadID="9" path="root://arch01.ceph.kiae.ru:1094//eos/NBGI/ceph/data/exp/bio/db/ncbi/pmc/2023-02-28/ncbi_pmc-2023-02-28-2290.tar?eos.lfn=fxid:15324" actualURL="root://arch01.ceph.kiae.ru:1095//eos/NBGI/ceph/data/exp/bio/db/ncbi/pmc/2023-02-28/ncbi_pmc-2023-02-28-2290.tar?cap.msg=vUpJXMIslxqXVKGSNGLiug79oY7chlxem6XWbJjZeJvch/Mx6hAGZ47/4Bepb+SrLGa3AX24E0deesdC7s7Ba5TYgJ3mgnWGp6EnCRjKuIuQj46ZwNSYGDdThlq1QGPuOL8qS8Gg7CtJxfQryySHL6UhWTs3lodgsCPKsTlMic2huo4/htWJ2eXZyJSCHyDNqSE7jYpGZkV6zzlFQGrOckHUwEbhshj0kV/UNXrEKF0vFBNNst5e6a80nplcbASXCQHTIz79e/zHYbUYfNlvU9Br+SdFjqetrezKmgLb0gwRfZqsSpAF8n7ckuGrYSjdquKWiCi5WZC18JOsC+SwqVOrRjrueMv+8Z1ZkZoIF5kZez6V1WgKBmE2MVrCMQdtNO2Ux6CTuHQs0HwBZLvQn2iWJnM7s73sJsSu3CsWbKx6E0MgKpyo6zZgenepMMj/2K5fug/rdjO6PsdpZAz9/Q==&cap.sym=0A57/ahIl9rqg5d9pPgN+WSZZ68=&eos.lfn=fxid:15324&mgm.id=00015324&mgm.logid=3dd1cb86-bd9c-11ed-abb6-0cc47a871188&mgm.mtime=1677720620&mgm.replicahead=0&mgm.replicaindex=0&xrdcl.requuid=ac224236-3142-4ec2-a4bb-20855dc2fbdb" fileId="66904"
Mar 08 13:30:26 arch01 cta-taped[36357]: LVL="DEBUG" PID="36357" TID="36357" MSG="Computed new timeout" tapeDrive="IBM-3592-E07-1_2" TimeoutType="Heartbeat" LastStateChangeTime="4067783" LastHeartBeatTime="4067852" LastDataMovementTime="4067843" Now="4067852" Timeout="4067912"
Mar 08 13:30:27 arch01 cta-taped[36359]: LVL="DEBUG" PID="36359" TID="40775" MSG="In MigrationReportPacker::reportCompletedJob(), pushing a report." thread="TapeWrite" tapeDrive="IBM-3592-E07-1_2" tapeVid="A01904" mountId="317" vo="nbgi" mediaType="JC" tapePool="db" logicalLibrary="NBGI" mountType="ArchiveForUser" vendor="IBM" capacityInBytes="4000000000000" fileId="66391" fileSize="4293365760" fSeq="67" diskURL="root://arch01.ceph.kiae.ru:1094//eos/NBGI/ceph/data/exp/bio/db/ncbi/pmc/2023-02-28/ncbi_pmc-2023-02-28-2724.tar?eos.lfn=fxid:15197" type="ReportSuccessful"
Mar 08 13:30:27 arch01 cta-taped[36359]: LVL="INFO" PID="36359" TID="40775" MSG="File successfully transmitted to drive" thread="TapeWrite" tapeDrive="IBM-3592-E07-1_2" tapeVid="A01904" mountId="317" vo="nbgi" mediaType="JC" tapePool="db" logicalLibrary="NBGI" mountType="ArchiveForUser" vendor="IBM" capacityInBytes="4000000000000" fileId="66391" fileSize="4293365760" fSeq="67" diskURL="root://arch01.ceph.kiae.ru:1094//eos/NBGI/ceph/data/exp/bio/db/ncbi/pmc/2023-02-28/ncbi_pmc-2023-02-28-2724.tar?eos.lfn=fxid:15197" readWriteTime="8.101073" checksumingTime="7.291491" waitDataTime="0.022846" waitReportingTime="0.000310" transferTime="15.415720" totalTime="15.415673" dataVolume="4293365760" headerVolume="480" driveTransferSpeedMBps="278.506572" payloadTransferSpeedMBps="278.506541" reconciliationTime="0" LBPMode="LBP_Off"
Mar 08 13:30:29 arch01 cta-taped[36359]: LVL="INFO" PID="36359" TID="40775" MSG="Normal flush because thresholds was reached" thread="TapeWrite" tapeDrive="IBM-3592-E07-1_2" tapeVid="A01904" mountId="317" vo="nbgi" mediaType="JC" tapePool="db" logicalLibrary="NBGI" mountType="ArchiveForUser" vendor="IBM" capacityInBytes="4000000000000" files="1" bytes="4293365760" flushTime="1.890982"
Mar 08 13:30:29 arch01 cta-taped[36359]: LVL="DEBUG" PID="36359" TID="40775" MSG="In MigrationReportPacker::reportFlush(), pushing a report." thread="TapeWrite" tapeDrive="IBM-3592-E07-1_2" tapeVid="A01904" mountId="317" vo="nbgi" mediaType="JC" tapePool="db" logicalLibrary="NBGI" mountType="ArchiveForUser" vendor="IBM" capacityInBytes="4000000000000" files="1" bytes="4293365760" flushTime="1.890982" type="ReportFlush"
[root@arch01 ~]# eos file info fxid:15197
File: '/eos/NBGI/ceph/data/exp/bio/db/ncbi/pmc/2023-02-28/ncbi_pmc-2023-02-28-2724.tar' Flags: 0640 Clock: 174a40e4f6db8908
Size: 4293365760
Modify: Thu Mar 2 03:07:10 2023 Timestamp: 1677715630.601220000
Change: Thu Mar 2 02:56:14 2023 Timestamp: 1677714974.837206918
Birth: Thu Mar 2 02:56:14 2023 Timestamp: 1677714974.832368443
CUid: 2 CGid: 2 Fxid: 00015197 Fid: 86423 Pid: 5499 Pxid: 0000157b
XStype: adler XS: c3 15 70 ea ETAGs: "23198997413888:c31570ea"
Layout: plain Stripes: 1 Blocksize: 4k LayoutId: 00100002 Redundancy: d1::t0
#Rep: 1
So all data written after exception is not marked as written to tape and I have to rewrite it again.
If such error occurs at the beginning of tape, then tape should be written two (or more) times.
After restarting taped
it takes same tape and may successfully write data (or may fails again after several files).
It is looks like tape/drive problem ( sys.forced.checksum="adler"
is setted for file catalog, fileinfo report XStype: adler
for file, eos-adler32
calculates same checksum after downloading as fileinfo, file is random), but taped behaviour is strange (if no data committed after fail, then why it continue writes? According to Checksums - EOSCTA Docs it should skip commit just for failed file).
Is any ideas how to normalize proces?
ps. affected versions: 4.8.6-1(from gitlab), 4.7-14, 4.7.13, 4.7.12 (from public repository)