Cta-admin dr ls delegateToImpl failed: Column SESSION_ELAPSED_TIME

Summary

cta-admin dr ls
delegateToImpl failed: Column SESSION_ELAPSED_TIME contains the value -236959

Details

CTA version: 5.11.11.0-1.
Operating System and version: el9 5.14.0-611.13.1.el9_7.x86_64
Xrootd version: -5.9.1-1.el9.x86_64
Objectstore backend:

Steps to reproduce

[root@cta03 ~]# cta-admin dr ls
delegateToImpl failed: Column SESSION_ELAPSED_TIME contains the value -236959 which is not a valid unsigned integer

What is the current bug behaviour?

What is the expected correct behaviour?

Relevant logs and/or screenshots

Possible causes

the time was skewed on one taped server.
the time was 65 hours in the past.

Would a better error message make sense?

Also, reading our Slack it seems this caused knock-on effects with other tape servers.

Dear Eric and Terry,

Please let us understand the issue here:

Are you saying that your tape servers do not synchronise the time against an NTP server (i.e. the difference between them can reach 65 hours) and you want CTA to be protected against that?

Can you please clarify?

Thank you.

Vladimir
CERN

NTP ,or the ntpd , failed.

Mar 22 14:47:29 tpsrvg1824 chronyd[1137]: Backward time jump detected!
Mar 22 14:47:29 tpsrvg1824 chronyd[1137]: Can’t synchronise: no selectable sources (0 unreachable sources

And didn't restart until
Mar 25 20:44:34 tpsrvg1824 chronyd[3730837]: System clock wrong by 257431.224621 seconds
Mar 25 20:44:34 tpsrvg1824 chronyd[3730837]: System clock was stepped by 257431.224621 seconds

The server was not rebooted.  and the time seemed to reset OK.

We can’t see exactly why that happened.
the database did not like a negative number for the “SESSION_ELAPSED_TIME”.
Did that answer your question.

Thanks
Terry

Terrance Jones
jonest@fnal.gov, ssa-group@fnal.gov
Phone; 630.840.5200

OK, we will discuss internally and get back to you on this one.

Vladimir

Hello,

I took a look inside the code, and it seems to confirm that this was caused by the server going back in time.

The column SESSION_ELAPSED_TIME is calculated within a SQL UPDATE as REPORT_TIME - SESSION_START_TIME. It expects REPORT_TIME > SESSION_START_TIME to always be true, but it was not the case here which lead to a negative value being inserted in the catalogue.

This eventually leads to the error that you see. Tbh, these are such rare cases that I think the error message is broad enough as it is and should not be changed.

Best,
Joao

Hello,

I believe this should be logged as a warning. The problem is that the error made cta-adminunusable and propagated to several tape servers that did not have a time skew, since the SQL query was producing the same CTAException. Ultimately, we had to restore the drives that were not affected by the time skew too.

Hi Leopoldo,

Ok, I see. I thought that the problem as only on the cta-admin dr ls command, but it probably also affected all drives every time they tried to get the next mount…

Is this the problem you were referring to?

Fixing this is problem (or identifying & logging it every time it happens) is not so simple as it sounds. CTA always trusts that the time provided by the operating system is correct and it has no other sources of truth to compare it to.

One solution could be to tighten the constraints in the DB, so that it fails before a negative value is inserted on a unsigned column. This would keep the catalogue consistent (no negative values where we do not want them) and prevent other drives from also failing.

I will look into this and see if it can be fixed. It will probably require a catalogue update, which means that this fix will need to wait for the next catalogue release.

1 Like

I detected this negative value in table drive_state and column session_elapsed_time.

After the mentioned drive (host) time was fixed, we started to see these on our monitoring

Additional Info

CRIT - Drive State:DOWN Desired:DOWN [cta-taped] ERROR delegateToImpl failed: Column SESSION_ELAPSED_TIME contains the value -235243 which is not a valid unsigned integer tpsrvf2205 NO_MOUNT Last Updated:1 Write,FB8769,2026-03-25 08:04:41(!!)

The following stack trace was in the frontend logs

In RequestProc::ExecuteAction(): RSP_ERR_CTA: delegateToImpl failed: Column SESSION_ELAPSED_TIME contains the value -240391 which is not a valid unsigned integer\n/lib64/libctacommon.so.0(cta::exception::Backtrace::Backtrace(bool)+0x6b) [0x7f285bda8f4d]\n/lib64/libctacommon.so.0(cta::exception::Exception::Exception(std::basic_string_view<char, std::char_traits >, bool)+0x91) [0x7f285bdaa015]\n/lib64/libctardbmspostgres.so.0(cta::rdbms::wrapper::PostgresRset::columnOptionalUint64(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) const+0x223) [0x7f285b7b8a69]\n/lib64/libctardbms.so.0(decltype(auto) cta::rdbms::Rset::delegateToImpl<&(cta::rdbms::wrapper::RsetWrapper::columnOptionalUint64(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) const), std::unique_ptr<cta::rdbms::wrapper::RsetWrapper, std::default_deletecta::rdbms::wrapper::RsetWrapper > const>(std::unique_ptr<cta::rdbms::wrapper::RsetWrapper, std::default_deletecta::rdbms::wrapper::RsetWrapper > const&, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) const+0xfd) [0x7f285d5eaa91]\n/lib64/libctardbms.so.0(cta::rdbms::Rset::columnOptionalUint64(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) const+0x36) [0x7f285d5e78c8]\n/lib64/libctacatalogue.so.0(cta::catalogue::RdbmsDriveStateCatalogue::gettingSqlTapeDriveValues(cta::rdbms::Rset*) const+0x4f2) [0x7f285cb271c2]\n/lib64/libctacatalogue.so.0(cta::catalogue::RdbmsDriveStateCatalogue::getTapeDrives[abi:cxx11]() const+0x116) [0x7f285cb2937c]\n/lib64/libctacatalogue.so.0(+0x3d452e) [0x7f285cbd452e]\n/lib64/libctacatalogue.so.0(+0x3d5729) [0x7f285cbd5729]\n/lib64/libctacatalogue.so.0(cta::catalogue::DriveStateCatalogueRetryWrapper::getTapeDrives[abi:cxx11]() const+0x46) [0x7f285cbd458e]\n/lib64/libctascheduler.so.0(cta::OStoreDB::fetchMountInfo(cta::SchedulerDatabase::TapeMountDecisionInfo&, cta::objectstore::RootEntry&, cta::SchedulerDatabase::PurposeGetMountInfo, cta::log::LogContext&)+0x218f) [0x7f285db5059f]\n/lib64/libctascheduler.so.0(cta::OStoreDB::getMountInfoNoLock(cta::SchedulerDatabase::PurposeGetMountInfo, cta::log::LogContext&)+0x10d) [0x7f285db533c3]\n/lib64/libctascheduler.so.0(cta::Scheduler::getQueuesAndMountSummaries[abi:cxx11](cta::log::LogContext&)+0x8e) [0x7f285dab9e90]\n/lib64/libXrdSsiCta.so(cta::xrd::ShowQueuesStream::ShowQueuesStream(cta::frontend::AdminCmdStream const&, cta::catalogue::Catalogue&, cta::Scheduler&, cta::log::LogContext&)+0x70) [0x7f285e051a74]\n/lib64/libXrdSsiCta.so(cta::frontend::AdminCmdStream::processShowQueues(cta::xrd::Response&)+0x62) [0x7f285e0571e8]\n/lib64/libXrdSsiCta.so(cta::frontend::AdminCmdStream::process()+0x3d2) [0x7f285e055a48]\n/lib64/libXrdSsiCta.so(cta::xrd::RequestMessage::process(cta::xrd::Request const&, cta::xrd::Response&, XrdSsiStream*&)+0x194) [0x7f285e047ad8]\n/lib64/libXrdSsiCta.so(XrdSsiPb::RequestProc<cta::xrd::Request, cta::xrd::Response, cta::xrd::Alert>::ExecuteAction()+0x169) [0x7f285e043c9d]\n/lib64/libXrdSsiCta.so(XrdSsiPb::RequestProc<cta::xrd::Request, cta::xrd::Response, cta::xrd::Alert>::Execute()+0xd0) [0x7f285e041252]\n/lib64/libXrdSsiCta.so(XrdSsiPb::Service<cta::xrd::Request, cta::xrd::Response, cta::xrd::Alert>::ProcessRequest(XrdSsiRequest&, XrdSsiResource&)+0x8f) [0x7f285e040129]\n/lib64/libXrdUtils.so.3(XrdScheduler::Run()+0x14a) [0x7f2860fc20aa]\n/lib64/libXrdUtils.so.3(XrdStartWorking(void*)+0xd) [0x7f2860fc21ad]\n/lib64/libXrdUtils.so.3(XrdSysThread_Xeq+0x3c) [0x7f286100134c]\n/lib64/libc.so.6(+0x8b2ea) [0x7f286088b2ea]\n/lib64/libc.so.6(+0x1103d0) [0x7f28609103d0]\n

I understand it is not an easy fix but I suggest to do a validation on the frontend code so the cta-admin command doesn’t break.

Hello,

Thank you for the extra information.
Is the problem still affecting you?

In principle, I would expect this negative value to go away as soon as a new session starts on that tape server, but I could be wrong. If not, manually updating that value to zero or to a small value should unblock it.
Can you please confirm if this is the case?

Regarding the CTA Frontend, it seems to be behaving as excepted when faced with an exception of this type: it sends an error message back to cta-admin and logs the stack trace for further analysis.

Unfortunatelly, validating these errors on the CTA Frontend code is complicated, because they are generated deep into the CTA Catalogue code. By the point that they reach the outer layer of the CTA Frontend, it has no way to identify and validate the original cause without the intervention of an operator.

The best way to avoid this next time, is to prevent the CTA Catalogue from ever writting negative values to the DB where they are not allowed. That is a bug indeed.

Yes. When this happened I did put a value of 0 manually in the session_elapsed_time column. That allowed me to put that drive down but it lost effect seconds after the drive updated that value again. The real fix was to deal with the server’s skewed time.