Cta-admin dr ls failed when a file is being transferred

Hi, there,
After I upgraded cta to 4.2-3 from 4.0-2, I notice that when I run cta-admin dr ls with a file being transferred, it will crash and the error is:

210916 08:33:36 256076 ssi_Pb::Request: pid:256068 tid:140486964729600 ProcessResponseData(): fatal error from XRootD framework
Unable to read stream /ctafrontend; GetBuff failed: basic_string::at
library drive host desired request status since vid tapepool vo files data MB/s session priority activity age reason disk_system_name reserved_bytes

Meanwhile, the cta-taped will throw an exception which may cause the backup fail. Some error log like

Sep 16 08:29:35.107969 laso-tape01 cta-taped: LVL="ERROR" PID="96932" TID="96932" MSG="Aborting cta-taped on uncaught exception. Stack trace follows." Message="Uncaught exception of type 'std::logic_error' in Thread.run(): >>>><<<< End of uncaught exception"
Sep 16 08:29:35.108079 laso-tape01 cta-taped: LVL="ERROR" PID="96932" TID="96932" MSG="Stack trace" traceFrameNumber="0" traceFrame="/lib64/libctacommon.so.0(cta::exception::Backtrace::Backtrace(bool)+0x6e) [0x7fb155fd73bc]"
Sep 16 08:29:35.108146 laso-tape01 cta-taped: LVL="ERROR" PID="96932" TID="96932" MSG="Stack trace" traceFrameNumber="1" traceFrame="/lib64/libctacommon.so.0(cta::exception::Exception::Exception(std::string const&, bool)+0x89) [0x7fb155fd8c53]"
Sep 16 08:29:35.108197 laso-tape01 cta-taped: LVL="ERROR" PID="96932" TID="96932" MSG="Stack trace" traceFrameNumber="2" traceFrame="/lib64/libctacommon.so.0(cta::threading::UncaughtExceptionInThread::UncaughtExceptionInThread(std::string const&)+0x37) [0x7fb155ff3517]"
Sep 16 08:29:35.108244 laso-tape01 cta-taped: LVL="ERROR" PID="96932" TID="96932" MSG="Stack trace" traceFrameNumber="3" traceFrame="/lib64/libctacommon.so.0(cta::threading::Thread::wait()+0x13b) [0x7fb155ff2e2d]"
Sep 16 08:29:35.108288 laso-tape01 cta-taped: LVL="ERROR" PID="96932" TID="96932" MSG="Stack trace" traceFrameNumber="4" traceFrame="/usr/bin/cta-taped(castor::tape::tapeserver::daemon::TaskWatchDog::stopAndWaitThread()+0x39) [0x594437]"
Sep 16 08:29:35.108334 laso-tape01 cta-taped: LVL="ERROR" PID="96932" TID="96932" MSG="Stack trace" traceFrameNumber="5" traceFrame="/usr/bin/cta-taped(castor::tape::tapeserver::daemon::DataTransferSession::executeWrite(cta::log::LogContext&, cta::ArchiveMount*)+0x66a) [0x58e4e0]"
Sep 16 08:29:35.108382 laso-tape01 cta-taped: LVL="ERROR" PID="96932" TID="96932" MSG="Stack trace" traceFrameNumber="6" traceFrame="/usr/bin/cta-taped(castor::tape::tapeserver::daemon::DataTransferSession::execute()+0xbd3) [0x58c4bb]"
Sep 16 08:29:35.108426 laso-tape01 cta-taped: LVL="ERROR" PID="96932" TID="96932" MSG="Stack trace" traceFrameNumber="7" traceFrame="/usr/bin/cta-taped(cta::tape::daemon::DriveHandler::runChild()+0x178b) [0x5502bb]"
Sep 16 08:29:35.108468 laso-tape01 cta-taped: LVL="ERROR" PID="96932" TID="96932" MSG="Stack trace" traceFrameNumber="8" traceFrame="/usr/bin/cta-taped(cta::tape::daemon::ProcessManager::runForkManagement()+0x3dd) [0x57293b]"
Sep 16 08:29:35.108510 laso-tape01 cta-taped: LVL="ERROR" PID="96932" TID="96932" MSG="Stack trace" traceFrameNumber="9" traceFrame="/usr/bin/cta-taped(cta::tape::daemon::ProcessManager::run()+0x7e) [0x571934]"
Sep 16 08:29:35.108552 laso-tape01 cta-taped: LVL="ERROR" PID="96932" TID="96932" MSG="Stack trace" traceFrameNumber="10" traceFrame="/usr/bin/cta-taped(cta::tape::daemon::TapeDaemon::mainEventLoop()+0x2fb) [0x53bd85]"
Sep 16 08:29:35.108594 laso-tape01 cta-taped: LVL="ERROR" PID="96932" TID="96932" MSG="Stack trace" traceFrameNumber="11" traceFrame="/usr/bin/cta-taped(cta::tape::daemon::TapeDaemon::exceptionThrowingMain()+0x1d9) [0x53b9a1]"
Sep 16 08:29:35.108636 laso-tape01 cta-taped: LVL="ERROR" PID="96932" TID="96932" MSG="Stack trace" traceFrameNumber="12" traceFrame="/usr/bin/cta-taped(cta::tape::daemon::TapeDaemon::main()+0x37) [0x53b379]"
Sep 16 08:29:35.108676 laso-tape01 cta-taped: LVL="ERROR" PID="96932" TID="96932" MSG="Stack trace" traceFrameNumber="13" traceFrame="/usr/bin/cta-taped() [0x52c73a]"
Sep 16 08:29:35.108714 laso-tape01 cta-taped: LVL="ERROR" PID="96932" TID="96932" MSG="Stack trace" traceFrameNumber="14" traceFrame="/usr/bin/cta-taped(main+0x334) [0x52ce23]"
Sep 16 08:29:35.108752 laso-tape01 cta-taped: LVL="ERROR" PID="96932" TID="96932" MSG="Stack trace" traceFrameNumber="15" traceFrame="/lib64/libc.so.6(__libc_start_main+0xf5) [0x7fb14f654555]"
Sep 16 08:29:35.108808 laso-tape01 cta-taped: LVL="ERROR" PID="96932" TID="96932" MSG="Stack trace" traceFrameNumber="16" traceFrame="/usr/bin/cta-taped() [0x52c3f9]"
Sep 16 08:29:44.171453 laso-tape01 cta-taped: LVL="ERROR" PID="95526" TID="95526" MSG="Aborting cta-taped on uncaught exception. Stack trace follows." Message="Uncaught exception of type 'std::logic_error' in Thread.run(): >>>><<<< End of uncaught exception"
Sep 16 08:29:44.171591 laso-tape01 cta-taped: LVL="ERROR" PID="95526" TID="95526" MSG="Stack trace" traceFrameNumber="0" traceFrame="/lib64/libctacommon.so.0(cta::exception::Backtrace::Backtrace(bool)+0x6e) [0x7fb155fd73bc]"
Sep 16 08:29:44.171677 laso-tape01 cta-taped: LVL="ERROR" PID="95526" TID="95526" MSG="Stack trace" traceFrameNumber="1" traceFrame="/lib64/libctacommon.so.0(cta::exception::Exception::Exception(std::string const&, bool)+0x89) [0x7fb155fd8c53]"
Sep 16 08:29:44.171753 laso-tape01 cta-taped: LVL="ERROR" PID="95526" TID="95526" MSG="Stack trace" traceFrameNumber="2" traceFrame="/lib64/libctacommon.so.0(cta::threading::UncaughtExceptionInThread::UncaughtExceptionInThread(std::string const&)+0x37) [0x7fb155ff3517]"
Sep 16 08:29:44.171847 laso-tape01 cta-taped: LVL="ERROR" PID="95526" TID="95526" MSG="Stack trace" traceFrameNumber="3" traceFrame="/lib64/libctacommon.so.0(cta::threading::Thread::wait()+0x13b) [0x7fb155ff2e2d]"
Sep 16 08:29:44.171916 laso-tape01 cta-taped: LVL="ERROR" PID="95526" TID="95526" MSG="Stack trace" traceFrameNumber="4" traceFrame="/usr/bin/cta-taped(castor::tape::tapeserver::daemon::TaskWatchDog::stopAndWaitThread()+0x39) [0x594437]"
Sep 16 08:29:44.171988 laso-tape01 cta-taped: LVL="ERROR" PID="95526" TID="95526" MSG="Stack trace" traceFrameNumber="5" traceFrame="/usr/bin/cta-taped(castor::tape::tapeserver::daemon::DataTransferSession::executeWrite(cta::log::LogContext&, cta::ArchiveMount*)+0x66a) [0x58e4e0]"
Sep 16 08:29:44.172067 laso-tape01 cta-taped: LVL="ERROR" PID="95526" TID="95526" MSG="Stack trace" traceFrameNumber="6" traceFrame="/usr/bin/cta-taped(castor::tape::tapeserver::daemon::DataTransferSession::execute()+0xbd3) [0x58c4bb]"
Sep 16 08:29:44.172136 laso-tape01 cta-taped: LVL="ERROR" PID="95526" TID="95526" MSG="Stack trace" traceFrameNumber="7" traceFrame="/usr/bin/cta-taped(cta::tape::daemon::DriveHandler::runChild()+0x178b) [0x5502bb]"
Sep 16 08:29:44.172202 laso-tape01 cta-taped: LVL="ERROR" PID="95526" TID="95526" MSG="Stack trace" traceFrameNumber="8" traceFrame="/usr/bin/cta-taped(cta::tape::daemon::ProcessManager::runForkManagement()+0x3dd) [0x57293b]"
Sep 16 08:29:44.172269 laso-tape01 cta-taped: LVL="ERROR" PID="95526" TID="95526" MSG="Stack trace" traceFrameNumber="9" traceFrame="/usr/bin/cta-taped(cta::tape::daemon::ProcessManager::run()+0x7e) [0x571934]"
Sep 16 08:29:44.172333 laso-tape01 cta-taped: LVL="ERROR" PID="95526" TID="95526" MSG="Stack trace" traceFrameNumber="10" traceFrame="/usr/bin/cta-taped(cta::tape::daemon::TapeDaemon::mainEventLoop()+0x2fb) [0x53bd85]"
Sep 16 08:29:44.172411 laso-tape01 cta-taped: LVL="ERROR" PID="95526" TID="95526" MSG="Stack trace" traceFrameNumber="11" traceFrame="/usr/bin/cta-taped(cta::tape::daemon::TapeDaemon::exceptionThrowingMain()+0x1d9) [0x53b9a1]"
Sep 16 08:29:44.172478 laso-tape01 cta-taped: LVL="ERROR" PID="95526" TID="95526" MSG="Stack trace" traceFrameNumber="12" traceFrame="/usr/bin/cta-taped(cta::tape::daemon::TapeDaemon::main()+0x37) [0x53b379]"
Sep 16 08:29:44.172541 laso-tape01 cta-taped: LVL="ERROR" PID="95526" TID="95526" MSG="Stack trace" traceFrameNumber="13" traceFrame="/usr/bin/cta-taped() [0x52c73a]"
Sep 16 08:29:44.172599 laso-tape01 cta-taped: LVL="ERROR" PID="95526" TID="95526" MSG="Stack trace" traceFrameNumber="14" traceFrame="/usr/bin/cta-taped(main+0x334) [0x52ce23]"
Sep 16 08:29:44.172661 laso-tape01 cta-taped: LVL="ERROR" PID="95526" TID="95526" MSG="Stack trace" traceFrameNumber="15" traceFrame="/lib64/libc.so.6(__libc_start_main+0xf5) [0x7fb14f654555]"
Sep 16 08:29:44.172722 laso-tape01 cta-taped: LVL="ERROR" PID="95526" TID="95526" MSG="Stack trace" traceFrameNumber="16" traceFrame="/usr/bin/cta-taped() [0x52c3f9]"

I lookuped the files on tapes, and these files are archived successfully,

[cta@cta ~]$ cta tf ls -l -v B02705
archive id copy no    vid fseq block id instance disk fxid size checksum type checksum value storage class owner group    creation time path
         2       2 B02705    3       21  eoslaso        3e  158       ADLER32       9c6d36f9      laso-raw 12142   600 2021-09-15 23:47 /eos/laso/test/nn
        21       2 B02705    4       31  eoslaso        3f  158       ADLER32       9c6d36f9      laso-raw 12142   600 2021-09-16 06:52 /eos/laso/raw/test/hosts
        22       2 B02705    5       41  eoslaso        40  158       ADLER32       9c6d36f9      laso-raw 12142   600 2021-09-16 06:57 /eos/laso/raw/test/aa
        23       2 B02705    6       51  eoslaso        41  158       ADLER32       9c6d36f9      laso-raw 12142   600 2021-09-16 07:08 /eos/laso/test/oo
        24       2 B02705    7       61  eoslaso        42  158       ADLER32       9c6d36f9      laso-raw 12142   600 2021-09-16 07:21 /eos/laso/raw/test/bb
        26       2 B02705    8       71  eoslaso        43  158       ADLER32       9c6d36f9      laso-raw 12142   600 2021-09-16 08:01 /eos/laso/raw/test/cc
        27       2 B02705    9       81  eoslaso        44 2.1G       ADLER32       baf4aec3      laso-raw 12142   600 2021-09-16 08:09 /eos/laso/raw/test/test.img
        28       2 B02705   10     8282  eoslaso        45 2.1G       ADLER32       baf4aec3      laso-raw 12142   600 2021-09-16 08:17 /eos/laso/raw/test/2.img
        29       2 B02705   11    16483  eoslaso        46 2.1G       ADLER32       baf4aec3      laso-raw 12142   600 2021-09-16 08:23 /eos/laso/test/2.img
        30       2 B02705   12    24684  eoslaso        47 2.1G       ADLER32       baf4aec3      laso-raw 12142   600 2021-09-16 08:28 /eos/laso/raw/test/3.img
        31       2 B02705   13    32885  eoslaso        48 2.1G       ADLER32       baf4aec3      laso-raw 12142   600 2021-09-16 08:32 /eos/laso/test/3.img
        32       2 B02705   14    41086  eoslaso        49 2.1G       ADLER32       baf4aec3      laso-raw 12142   600 2021-09-16 08:38 /eos/laso/test/4.img

But the eos don’t recognize the archive result as Files archived but eoscta disagrees.

We don’t know what the problem is. Did anyone else have such problems? Thanks.

Cheers,

Yujiang

And cta-admin fr ls shows that cta failed to archive these files:

[root@cta ~]# cta fr ls
                                                           object id request type copy no tapepool/vid requester group path
 ArchiveRequest-Frontend-cta.ihep.ac.cn-215383-20210915-23:30:12-0-0      archive       1    laso-pool      biyj   u07 /eos/laso/test/kk
 ArchiveRequest-Frontend-cta.ihep.ac.cn-215383-20210915-23:30:12-0-0      archive       2   laso-pool2      biyj   u07 /eos/laso/test/kk
 ArchiveRequest-Frontend-cta.ihep.ac.cn-218148-20210915-23:53:20-0-0      archive       2   laso-pool2      biyj   u07 /eos/laso/raw/test/hosts
 ArchiveRequest-Frontend-cta.ihep.ac.cn-218148-20210915-23:53:20-0-5      archive       2   laso-pool2      biyj   u07 /eos/laso/raw/test/aa
ArchiveRequest-Frontend-cta.ihep.ac.cn-218148-20210915-23:53:20-0-25      archive       2   laso-pool2      biyj   u07 /eos/laso/raw/test/test.img
ArchiveRequest-Frontend-cta.ihep.ac.cn-218148-20210915-23:53:20-0-30      archive       2   laso-pool2      biyj   u07 /eos/laso/raw/test/2.img
ArchiveRequest-Frontend-cta.ihep.ac.cn-218148-20210915-23:53:20-0-45      archive       2   laso-pool2      biyj   u07 /eos/laso/test/3.img
ArchiveRequest-Frontend-cta.ihep.ac.cn-218148-20210915-23:53:20-0-50      archive       2   laso-pool2      biyj   u07 /eos/laso/test/4.img
ArchiveRequest-Frontend-cta.ihep.ac.cn-218148-20210915-23:53:20-0-55      archive       2   laso-pool2      biyj   u07 /eos/laso/test/5.img
ArchiveRequest-Frontend-cta.ihep.ac.cn-218148-20210915-23:53:20-0-60      archive       2   laso-pool2      biyj   u07 /eos/laso/test/pp
ArchiveRequest-Frontend-cta.ihep.ac.cn-218148-20210915-23:53:20-0-65      archive       2   laso-pool2      biyj   u07 /eos/laso/test/zz

Hi Yujiang,

Thanks for letting us know. This is a recent regression which will be fixed in the next release. More details in the following ticket;

Oliver.

OK. I noticed the bug. And a new 4.2-2 was released yesterday. I tried the new version, but have a similar problem of running cta-admin dr ls when an archivial is running:

[root@cta ~]# cta dr ls laso.*
210917 10:53:09 411311 ssi_Pb::Request: pid:411303 tid:140446396131072 ProcessResponseData(): fatal error from XRootD framework
[ERROR] Socket error
210917 10:53:09 411311 ssi_Shutdown: Unprovision: /ctafrontend@cta.ihep.ac.cn:10955 error; 14 [ERROR] Invalid session
library drive host desired request status since vid tapepool vo files data MB/s session priority activity age reason disk_system_name reserved_bytes
[root@cta ~]# cta dr ls laso.*
210917 10:54:08 412001 ssi_Pb::Request: pid:411993 tid:140388258789120 ProcessResponseData(): fatal error from XRootD framework
[ERROR] Socket error
210917 10:54:08 412001 ssi_Shutdown: Unprovision: /ctafrontend@cta.ihep.ac.cn:10955 error; 14 [ERROR] Invalid session
library drive host desired request status since vid tapepool vo files data MB/s session priority activity age reason disk_system_name reserved_bytes

The archivial request is finished successfully, however.

Hi,

Tag 4.2-2 was made before the fix was committed. We are running 4.1-1 in production and will not deploy 4.2-2.

We are contemplating making a public binary release which would contain what we are running in production here at CERN, would that be useful to you?

Oliver.

Yeah. A stable public release would make it clear to us, and we can keep up to the latest stable release. Thanks.