Taped gets stuck after CRC errors

Hello.

I have found the following problem. While writing data to tape I got:

Mar 08 13:30:12 arch01 cta-taped[36359]: LVL="INFO" PID="36359" TID="40776" MSG="In cta::ArchiveMount::reportJobsBatchTransferred(), archive job successful." thread="MainThread" tapeDrive="IBM-3592-E07-1_2" mountId="317" tapeVid="A01904" mountType="ARCHIVE_FOR_USER" fileId="66511" type="ReportSuccessful"
Mar 08 13:30:12 arch01 cta-taped[36359]: LVL="ERROR" PID="36359" TID="40776" MSG="In ArchiveMount::reportJobsBatchWritten(): got an exception" thread="MainThread" tapeDrive="IBM-3592-E07-1_2" mountId="317" exceptionMessageValue="Checksum type expected=NONE actual=ADLER32" fileId="66511" diskInstance="eosNBGI" diskFileId="86419" lastKnownDiskPath="/eos/NBGI/ceph/data/exp/bio/db/ncbi/pmc/2023-02-28/ncbi_pmc-2023-02-28-2720.tar" 
  reportURL="In ArchiveJob::exceptionThrowingReportURL(): job status NoReportRequired does not require reporting. 
    /lib64/libctacommon.so.0(cta::exception::Backtrace::Backtrace(bool)+0x69) [0x7fa415ef58d3] 
    /lib64/libctacommon.so.0(cta::exception::Exception::Exception(std::string const&, bool)+0x89) [0x7fa415ef7137] 
    /lib64/libctascheduler.so.0(cta::ArchiveJob::exceptionThrowingReportURL()+0x261) [0x7fa4188c2399] 
    /lib64/libctascheduler.so.0(cta::ArchiveJob::reportURL()+0x3d) [0x7fa4188c26a5] 
    /lib64/libctascheduler.so.0(cta::ArchiveMount::reportJobsBatchTransferred(std::queue<std::unique_ptr<cta::ArchiveJob, std::default_delete<cta::ArchiveJob> >, std::deque<std::unique_ptr<cta::ArchiveJob, std::default_delete<cta::ArchiveJob> >, std::allocator<std::unique_ptr<cta::ArchiveJob, std::default_delete<cta::ArchiveJob> > > > >&, std::queue<cta::catalogue::TapeItemWritten, std::deque<cta::catalogue::TapeItemWritten, std::allocator<cta::catalogue::TapeItemWritten> > >&, std::queue<std::unique_ptr<cta::SchedulerDatabase::ArchiveJob, std::default_delete<cta::SchedulerDatabase::ArchiveJob> >, std::deque<std::unique_ptr<cta::SchedulerDatabase::ArchiveJob, std::default_delete<cta::SchedulerDatabase::ArchiveJob> >, std::allocator<std::unique_ptr<cta::SchedulerDatabase::ArchiveJob, std::default_delete<cta::SchedulerDatabase::ArchiveJob> > > > >&, cta::log::LogContext&)+0xd47) [0x7fa4188d6555] 
    /usr/bin/cta-taped() [0x52e5f6] /usr/bin/cta-taped() [0x530752] 
    /lib64/libctacommon.so.0(cta::threading::Thread::pthread_runner(void*)+0xef) [0x7fa415f1d047] 
    /lib64/libpthread.so.0(+0x7ea5) [0x7fa415bd7ea5]
    /lib64/libc.so.6(clone+0x6d) [0x7fa40fb1696d]"
Mar 08 13:30:12 arch01 cta-taped[36359]: LVL="INFO" PID="36359" TID="40776" MSG="In Helpers::getLockedAndFetchedQueue<ArchiveQueue>(): Successfully found and locked an archive queue." thread="MainThread" tapeDrive="IBM-3592-E07-1_2" mountId="317" attemptNb="1" queueObject="ArchiveQueueToTransferForUser-db-Maintenance-arch01-5459-20230307-14:27:42-0-0" rootFetchNoLockTime="0.000186" rootRelockExclusiveTime="0.000000" rootRefetchTime="0.000000" addOrGetQueueandCommitTime="0.000000" rootUnlockExclusiveTime="0.000000" queueLockTime="0.000076" queueFetchTime="0.000090"
Mar 08 13:30:12 arch01 cta-taped[36359]: LVL="DEBUG" PID="36359" TID="40775" MSG="Successfully opened the tape file for writing" thread="TapeWrite" tapeDrive="IBM-3592-E07-1_2" tapeVid="A01904" mountId="317" vo="nbgi" mediaType="JC" tapePool="db" logicalLibrary="NBGI" mountType="ArchiveForUser" vendor="IBM" capacityInBytes="4000000000000" fileId="66391" fileSize="4293365760" fSeq="67" diskURL="root://arch01.ceph.kiae.ru:1094//eos/NBGI/ceph/data/exp/bio/db/ncbi/pmc/2023-02-28/ncbi_pmc-2023-02-28-2724.tar?eos.lfn=fxid:15197"
Mar 08 13:30:12 arch01 cta-taped[36359]: LVL="DEBUG" PID="36359" TID="40776" MSG="In AgentReference::appyAction(): removed object from ownership (by batch)." agentObject="DriveProcess-IBM-3592-E07-1_2-arch01-36359-20230308-13:26:47-0" ownedObject="ArchiveRequest-Frontend-arch01-34397-20230124-19:27:46-0-71584"
Mar 08 13:30:12 arch01 cta-taped[36359]: LVL="INFO" PID="36359" TID="40776" MSG="In ContainerAlgorithms::referenceAndSwitchOwnership(): Requeued a batch of elements." thread="MainThread" tapeDrive="IBM-3592-E07-1_2" mountId="317" C="ArchiveQueueToTransferForUser" tapepool="db" containerAddress="ArchiveQueueToTransferForUser-db-Maintenance-arch01-5459-20230307-14:27:42-0-0" queueJobsBefore="6760" queueBytesBefore="4753193594741" queueJobsAfter="6761" queueBytesAfter="4757486837621" queueLockFetchTime="0.000758" queueProcessAndCommitTime="0.044424" asyncUpdateLaunchTime="0.000323" asyncUpdateCompletionTime="0.000955" requestsUpdatingTime="0.000035" queueUnlockTime="0.000068"
Mar 08 13:30:12 arch01 cta-taped[36359]: LVL="INFO" PID="36359" TID="40776" MSG="In ArchiveJob::failTransfer(): requeued job for (potentially in-mount) retry." thread="MainThread" tapeDrive="IBM-3592-E07-1_2" mountId="317" fileId="66511" copyNb="1" failureReason="In ArchiveMount::reportJobsBatchWritten(): got an exception" requestObject="ArchiveRequest-Frontend-arch01-34397-20230124-19:27:46-0-71584" retriesWithinMount="1" maxRetriesWithinMount="2" totalRetries="1" maxTotalRetries="2"
Mar 08 13:30:12 arch01 cta-taped[36359]: LVL="INFO" PID="36359" TID="40776" MSG="In OStoreDB::ArchiveJob::~ArchiveJob(): will leave the job owned after destruction." agentObject="DriveProcess-IBM-3592-E07-1_2-arch01-36359-20230308-13:26:47-0" jobObject="ArchiveRequest-Frontend-arch01-34397-20230124-19:27:46-0-71584"
Mar 08 13:30:12 arch01 cta-taped[36359]: LVL="ERROR" PID="36359" TID="40776" MSG="In MigrationReportPacker::WorkerThread::run(): Received a CTA exception while reporting archive mount results." thread="ReportPacker" tapeDrive="IBM-3592-E07-1_2" tapeVid="A01904" mountId="317" exceptionMSG="In ArchiveMount::reportJobsBatchWritten(): got an exception"

and then taped continues writing data to tape, but without actually flushing anything to the catalogue:

Mar 08 13:30:26 arch01 cta-taped[36359]: LVL="INFO" PID="36359" TID="40773" MSG="File successfully read from disk" thread="DiskRead" tapeDrive="IBM-3592-E07-1_2" tapeVid="A01904" mountId="317" threadID="9" path="root://arch01.ceph.kiae.ru:1094//eos/NBGI/ceph/data/exp/bio/db/ncbi/pmc/2023-02-28/ncbi_pmc-2023-02-28-2725.tar?eos.lfn=fxid:15198" actualURL="root://arch01.ceph.kiae.ru:1095//eos/NBGI/ceph/data/exp/bio/db/ncbi/pmc/2023-02-28/ncbi_pmc-2023-02-28-2725.tar?cap.msg=vUpJXMIslxqXVKGSNGLiug79oY7chlxem6XWbJjZeJvch/Mx6hAGZ47/4Bepb+SrLGa3AX24E0deesdC7s7Ba5TYgJ3mgnWGp6EnCRjKuIuQj46ZwNSYGDdThlq1QGPuOL8qS8Gg7CtJxfQryySHL6UhWTs3lodgsCPKsTlMic2huo4/htWJ2eXZyJSCHyDNqSE7jYpGZkUnqUOmKK5xOaryQF3Kc+pEPg3fjK/pQKFd7jNfULmLqwH/pLwqrFrwGbj1b2MsMpA57w3h/5z3KQrpkpXXJWicoPzdF479EQlPYL2XaPJ0YwvtvIF8Gn+aYERD2ZJVR4iLCCis6FaybjSBb4sFLE4wV8v6PP+S2QXYlFGF47QS+zj4owROyCFMQWrHoo1aIvjz5Kqpq67mALJKwtVrw+E39mhjKG7aAt7AfxiS3TfMjKbn+BqnZw1sR9eEsK3vQG7CeyE2tKFI/A==&cap.sym=0A57/ahIl9rqg5d9pPgN+WSZZ68=&eos.lfn=fxid:15198&mgm.id=00015198&mgm.logid=df13a510-bd9b-11ed-8f7a-0cc47a871188&mgm.mtime=1677715692&mgm.replicahead=0&mgm.replicaindex=0&xrdcl.requuid=f12b17cf-f7eb-41dd-bf58-f72000497c1d" fileId="66472" readWriteTime="9.631786" checksumingTime="0.000000" waitFreeMemoryTime="149.299136" waitDataTime="0.000000" waitReportingTime="0.000000" checkingErrorTime="0.002969" openingTime="0.017173" transferTime="158.951078" totalTime="158.951078" dataVolume="4292986880" globalPayloadTransferSpeedMBps="27.008228" diskPerformanceMBps="27.008228" openRWCloseToTransferTimeRatio="0.060704"
Mar 08 13:30:26 arch01 cta-taped[36357]: LVL="DEBUG" PID="36357" TID="36357" MSG="Computed new timeout" tapeDrive="IBM-3592-E07-1_2" TimeoutType="Heartbeat" LastStateChangeTime="4067783" LastHeartBeatTime="4067852" LastDataMovementTime="4067843" Now="4067852" Timeout="4067912"
Mar 08 13:30:26 arch01 cta-taped[36359]: LVL="INFO" PID="36359" TID="40773" MSG="Opened disk file for read" thread="DiskRead" tapeDrive="IBM-3592-E07-1_2" tapeVid="A01904" mountId="317" threadID="9" path="root://arch01.ceph.kiae.ru:1094//eos/NBGI/ceph/data/exp/bio/db/ncbi/pmc/2023-02-28/ncbi_pmc-2023-02-28-2290.tar?eos.lfn=fxid:15324" actualURL="root://arch01.ceph.kiae.ru:1095//eos/NBGI/ceph/data/exp/bio/db/ncbi/pmc/2023-02-28/ncbi_pmc-2023-02-28-2290.tar?cap.msg=vUpJXMIslxqXVKGSNGLiug79oY7chlxem6XWbJjZeJvch/Mx6hAGZ47/4Bepb+SrLGa3AX24E0deesdC7s7Ba5TYgJ3mgnWGp6EnCRjKuIuQj46ZwNSYGDdThlq1QGPuOL8qS8Gg7CtJxfQryySHL6UhWTs3lodgsCPKsTlMic2huo4/htWJ2eXZyJSCHyDNqSE7jYpGZkV6zzlFQGrOckHUwEbhshj0kV/UNXrEKF0vFBNNst5e6a80nplcbASXCQHTIz79e/zHYbUYfNlvU9Br+SdFjqetrezKmgLb0gwRfZqsSpAF8n7ckuGrYSjdquKWiCi5WZC18JOsC+SwqVOrRjrueMv+8Z1ZkZoIF5kZez6V1WgKBmE2MVrCMQdtNO2Ux6CTuHQs0HwBZLvQn2iWJnM7s73sJsSu3CsWbKx6E0MgKpyo6zZgenepMMj/2K5fug/rdjO6PsdpZAz9/Q==&cap.sym=0A57/ahIl9rqg5d9pPgN+WSZZ68=&eos.lfn=fxid:15324&mgm.id=00015324&mgm.logid=3dd1cb86-bd9c-11ed-abb6-0cc47a871188&mgm.mtime=1677720620&mgm.replicahead=0&mgm.replicaindex=0&xrdcl.requuid=ac224236-3142-4ec2-a4bb-20855dc2fbdb" fileId="66904"
Mar 08 13:30:26 arch01 cta-taped[36357]: LVL="DEBUG" PID="36357" TID="36357" MSG="Computed new timeout" tapeDrive="IBM-3592-E07-1_2" TimeoutType="Heartbeat" LastStateChangeTime="4067783" LastHeartBeatTime="4067852" LastDataMovementTime="4067843" Now="4067852" Timeout="4067912"
Mar 08 13:30:27 arch01 cta-taped[36359]: LVL="DEBUG" PID="36359" TID="40775" MSG="In MigrationReportPacker::reportCompletedJob(), pushing a report." thread="TapeWrite" tapeDrive="IBM-3592-E07-1_2" tapeVid="A01904" mountId="317" vo="nbgi" mediaType="JC" tapePool="db" logicalLibrary="NBGI" mountType="ArchiveForUser" vendor="IBM" capacityInBytes="4000000000000" fileId="66391" fileSize="4293365760" fSeq="67" diskURL="root://arch01.ceph.kiae.ru:1094//eos/NBGI/ceph/data/exp/bio/db/ncbi/pmc/2023-02-28/ncbi_pmc-2023-02-28-2724.tar?eos.lfn=fxid:15197" type="ReportSuccessful"
Mar 08 13:30:27 arch01 cta-taped[36359]: LVL="INFO" PID="36359" TID="40775" MSG="File successfully transmitted to drive" thread="TapeWrite" tapeDrive="IBM-3592-E07-1_2" tapeVid="A01904" mountId="317" vo="nbgi" mediaType="JC" tapePool="db" logicalLibrary="NBGI" mountType="ArchiveForUser" vendor="IBM" capacityInBytes="4000000000000" fileId="66391" fileSize="4293365760" fSeq="67" diskURL="root://arch01.ceph.kiae.ru:1094//eos/NBGI/ceph/data/exp/bio/db/ncbi/pmc/2023-02-28/ncbi_pmc-2023-02-28-2724.tar?eos.lfn=fxid:15197" readWriteTime="8.101073" checksumingTime="7.291491" waitDataTime="0.022846" waitReportingTime="0.000310" transferTime="15.415720" totalTime="15.415673" dataVolume="4293365760" headerVolume="480" driveTransferSpeedMBps="278.506572" payloadTransferSpeedMBps="278.506541" reconciliationTime="0" LBPMode="LBP_Off"
Mar 08 13:30:29 arch01 cta-taped[36359]: LVL="INFO" PID="36359" TID="40775" MSG="Normal flush because thresholds was reached" thread="TapeWrite" tapeDrive="IBM-3592-E07-1_2" tapeVid="A01904" mountId="317" vo="nbgi" mediaType="JC" tapePool="db" logicalLibrary="NBGI" mountType="ArchiveForUser" vendor="IBM" capacityInBytes="4000000000000" files="1" bytes="4293365760" flushTime="1.890982"
Mar 08 13:30:29 arch01 cta-taped[36359]: LVL="DEBUG" PID="36359" TID="40775" MSG="In MigrationReportPacker::reportFlush(), pushing a report." thread="TapeWrite" tapeDrive="IBM-3592-E07-1_2" tapeVid="A01904" mountId="317" vo="nbgi" mediaType="JC" tapePool="db" logicalLibrary="NBGI" mountType="ArchiveForUser" vendor="IBM" capacityInBytes="4000000000000" files="1" bytes="4293365760" flushTime="1.890982" type="ReportFlush"
[root@arch01 ~]# eos file info fxid:15197
  File: '/eos/NBGI/ceph/data/exp/bio/db/ncbi/pmc/2023-02-28/ncbi_pmc-2023-02-28-2724.tar'  Flags: 0640  Clock: 174a40e4f6db8908
  Size: 4293365760
Modify: Thu Mar  2 03:07:10 2023 Timestamp: 1677715630.601220000
Change: Thu Mar  2 02:56:14 2023 Timestamp: 1677714974.837206918
 Birth: Thu Mar  2 02:56:14 2023 Timestamp: 1677714974.832368443
  CUid: 2 CGid: 2 Fxid: 00015197 Fid: 86423 Pid: 5499 Pxid: 0000157b
XStype: adler    XS: c3 15 70 ea    ETAGs: "23198997413888:c31570ea"
Layout: plain Stripes: 1 Blocksize: 4k LayoutId: 00100002 Redundancy: d1::t0
  #Rep: 1

So none of the data written after the exception is marked as written to tape, and I have to rewrite it again.
If such an error occurs at the beginning of a tape, the whole tape has to be written two (or more) times.
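
To illustrate the gap, the catalogue contents of the tape can be compared against the fSeq values in the taped log, e.g. with something like this (just a sketch; I assume `cta-admin tapefile ls` accepts `--vid` on this version):

```
# List the files the CTA catalogue records for this tape and compare the
# highest fSeq with the fSeq values seen in the taped log (e.g. fSeq="67"
# above): everything written after the failed report is missing.
cta-admin tapefile ls --vid A01904
```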

After restarting taped, it takes the same tape and may successfully write data (or may fail again after a few files).

It looks like a tape/drive problem (sys.forced.checksum="adler" is set on the EOS directory, fileinfo reports XStype: adler for the file, eos-adler32 calculates the same checksum after downloading as fileinfo does, and the failing file is random), but the taped behaviour is strange: if no data is committed after the failure, why does it keep writing? According to Checksums - EOSCTA Docs it should skip the commit only for the failed file.
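
The cross-check mentioned above looks roughly like this (a sketch; the fxid and path are the example file from the fileinfo output above, not the failed one):

```
# Checksum stored in the EOS namespace:
eos file info fxid:15197 | grep XS
# Download a copy and recompute adler32 locally; in my case eos-adler32
# reports the same value that fileinfo shows:
xrdcp root://arch01.ceph.kiae.ru:1094//eos/NBGI/ceph/data/exp/bio/db/ncbi/pmc/2023-02-28/ncbi_pmc-2023-02-28-2724.tar /tmp/check.tar
eos-adler32 /tmp/check.tar
```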

Are there any ideas on how to get the process back to normal?

PS: affected versions: 4.8.6-1 (from GitLab), 4.7.14, 4.7.13, 4.7.12 (from the public repository).

Digging into the object store (cta-objectstore-dump-object) shows that the request has "checksumblob": "CgA=".
It looks like, under high load on FTS, the archive requests were created before the checksum calculation.
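
As a quick sanity check on that value: "CgA=" is just two bytes, which looks like an empty checksum entry and matches the "expected=NONE" part of the taped exception (sketch below):

```
# "CgA=" decodes to 0x0a 0x00, i.e. a length-delimited protobuf field of
# length zero -- an empty checksum blob, consistent with
# "Checksum type expected=NONE actual=ADLER32" in the taped error.
echo 'CgA=' | base64 -d | xxd
# 00000000: 0a00                                     ..
```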

I have recreated the requests (via cta-send-event) and all of them have a correct checksumblob field. Writing to tapes goes fine too.

Hello @igtkache,
it looks like this is related to the following issue on our GitLab project: Use correct fSeq after disk read error (#240) · Issues · cta / CTA · GitLab
So we are aware of the problem, but haven’t quite settled on the final way to address the issue.

I can confirm that we have experienced this problem ourselves, but not often enough to make it a high-priority issue. Is this a one-off on your end, or are you seeing this behaviour frequently?


As a side note, please be aware that it looks like you were very quick to deploy 4.8.6-1 - you even beat us to it. After tagging we usually need around half a day to run tests that simulate how the system will perform under load (a stress test), and then we run the new version on our preproduction instance for at least one day to perform manual tests before we move on to a production upgrade.
So in other words I’d recommend that you hold off on upgrading your production environment until the tagged version is at least a few days old.

@rbachman, hello.

Thank you for the link to the issue on GitLab.

I found this issue while writing 7 TB of data to tapes over about 3 days.
There were about 200 requests with an incorrect checksum in the object store (BTW, it looks like EOS has a race between [ checksum calculation at the FST / committing the CRC to QuarkDB ] and creating the archive requests for CTA) out of about 7000 requests in total. taped was lucky to write 5-7 files to the database per tape mount.
It was very annoying and completely blocked writing data to tapes. But after finding the source of the problem (the incorrect checksum in the object store) it is not a problem to check the object store and recreate (via cta-send-event) the bad requests. So this issue does not affect operations any more.
The previous 250 TB were written with such errors, but they did not block the writing process at all and I found just a few (~20) files left in the disk cache.
So it’s OK for me if this gets fixed as planned, without bumping the priority.

PS: I used 4.8.6-1 just for testing, to make sure that I don’t need to wait for a new release.

Hi @igtkache,

We’ve decided that the way forward with this is to diagnose the exact cause of the checksum comparison problem. The batch-writing behaviour of failing the whole batch when one or more files fail will be left in place as it is. We’re also not sure that we can rule out that this has something to do with the way CTA is configured.

In order to understand the issue better it would be helpful for us if you could provide the protocol buffer log from the CTA frontend when the issue occurs.
This can be gathered by setting cta.log.level PROTOBUF in your frontend-xrootd config file (/etc/cta/cta-frontend-xrootd.conf) and restarting the frontend.
Then the relevant output should be in /var/log/cta/cta-frontend-xrootd.log.
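
For example, the relevant line would look like this (just a sketch of the one setting; keep the rest of your configuration as it is):

```
# /etc/cta/cta-frontend-xrootd.conf
# Enable protocol buffer logging in the CTA frontend, then restart the
# frontend service (e.g. systemctl restart cta-frontend; the unit name
# may differ on your installation).
cta.log.level PROTOBUF
```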

In addition, could you also provide us with the output of the command in [1], so we can see the attribute settings of the EOS directory from which you archive the files?

Best regards,
Jaroslav & Richard

[1] `eos attr ls /eos/NBGI/ceph/data/exp/bio/db/ncbi/pmc/2023-02-28/`

@guenther, hello.

I finished the tape writing session a few days ago, so I can’t debug PROTOBUF now, sorry. I’ll turn on the debug output for the next session (though I have no idea when that will be) and will increase the IO load (after decreasing the number of parallel writes to EOS the problem is gone completely: 25 parallel writes to a single FTS work fine, but with 150 I had the problem).
The object store had a wrong checksum after writing the file to FTS and a correct checksum after recreating the object via cta-send-event (Taped gets stuck after CRC errors - #2 by igtkache), so to me, for now, it does not look like a problem on the CTA side.

And about the attributes (they are the same for all the other writes that failed with an incorrect checksum):

[root@arch01 cta]# eos attr ls /eos/NBGI/ceph/data/exp/bio/db/ncbi/pmc/2023-02-28/
sys.archive.storage_class="db"
sys.attr.link="/eos/dev/proc/cta/workflow"
sys.eos.btime="1674221979.536538902"
sys.forced.checksum="adler"
sys.link.workflow.sync::abort_prepare.default="proto"
sys.link.workflow.sync::archive_failed.default="proto"
sys.link.workflow.sync::archived.default="proto"
sys.link.workflow.sync::closew.default="proto"
sys.link.workflow.sync::closew.retrieve_written="proto"
sys.link.workflow.sync::create.default="proto"
sys.link.workflow.sync::delete.default="proto"
sys.link.workflow.sync::evict_prepare.default="proto"
sys.link.workflow.sync::prepare.default="proto"
sys.link.workflow.sync::retrieve_failed.default="proto"

PS: I set the batch size to 1 so that every file is committed to the DB individually, but it does not help: after the first CRC error no data was committed at all. So it may be a good idea to have a taped configuration option to put the drive down in such a case, to avoid unnecessary writes.
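
Until such an option exists, the manual equivalent I have in mind is roughly the following (a sketch; the drive name is taken from the logs above, and the exact cta-admin options may differ between versions):

```
# Take the drive out of service as soon as such an error is noticed, so
# taped stops writing data that will never be committed to the catalogue.
cta-admin drive down IBM-3592-E07-1_2 --reason "CRC mismatch, files not committed to catalogue"
```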