Are you saying that your tape servers do not synchronise the time against an NTP server (i.e. the difference between them can reach 65 hours) and you want CTA to be protected against that?
Mar 22 14:47:29 tpsrvg1824 chronyd[1137]: Backward time jump detected!
Mar 22 14:47:29 tpsrvg1824 chronyd[1137]: Can’t synchronise: no selectable sources (0 unreachable sources
And didn't restart until
Mar 25 20:44:34 tpsrvg1824 chronyd[3730837]: System clock wrong by 257431.224621 seconds
Mar 25 20:44:34 tpsrvg1824 chronyd[3730837]: System clock was stepped by 257431.224621 seconds
The server was not rebooted, and the time seemed to reset OK.
We can’t see exactly why that happened.
The database did not like a negative number for the “SESSION_ELAPSED_TIME”.
Did that answer your question?
I took a look inside the code, and it seems to confirm that this was caused by the server going back in time.
The column SESSION_ELAPSED_TIME is calculated within a SQL UPDATE as REPORT_TIME - SESSION_START_TIME. It expects REPORT_TIME > SESSION_START_TIME to always be true, but that was not the case here, which led to a negative value being inserted into the catalogue.
This eventually leads to the error that you see. Tbh, these are such rare cases that I think the error message is broad enough as it is and should not be changed.
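To make the failure mode concrete, here is a minimal sketch of that subtraction with a guard around it. The function name and the clamp-to-zero policy are my own assumptions for illustration, not what the CTA code does today.

```python
def session_elapsed_time(report_time: int, session_start_time: int) -> tuple[int, bool]:
    """Return (elapsed_seconds, skew_detected).

    Hypothetical helper: it mirrors REPORT_TIME - SESSION_START_TIME, but
    refuses to return a negative value. Clamping to zero is only one
    possible policy, chosen here so the result always fits an unsigned column.
    """
    elapsed = report_time - session_start_time
    if elapsed < 0:
        # The reporting clock is behind the clock that stamped the session start.
        return 0, True
    return elapsed, False


# Normal case: the report comes 500 seconds after the session started.
print(session_elapsed_time(1_500, 1_000))

# Skewed case: the report time is 235243 seconds "before" the session start,
# the same magnitude as the value seen in the monitoring alert.
print(session_elapsed_time(1_000, 1_000 + 235_243))
```

The returned flag is what a caller could use to log a warning instead of writing the negative value.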
I believe this should be logged as a warning. The problem is that the error made cta-admin unusable and propagated to several tape servers that did not have a time skew, since the SQL query was producing the same CTAException. Ultimately, we had to restore the drives that were not affected by the time skew too.
Ok, I see. I thought that the problem was only on the cta-admin dr ls command, but it probably also affected all drives every time they tried to get the next mount…
Is this the problem you were referring to?
Fixing this problem (or identifying & logging it every time it happens) is not as simple as it sounds. CTA always trusts that the time provided by the operating system is correct and it has no other source of truth to compare it to.
One solution could be to tighten the constraints in the DB, so that the write fails before a negative value is inserted into an unsigned column. This would keep the catalogue consistent (no negative values where we do not want them) and prevent other drives from also failing.
I will look into this and see if it can be fixed. It will probably require a catalogue update, which means that this fix will need to wait for the next catalogue release.
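As a rough sketch of the constraint idea, the snippet below uses an in-memory SQLite database as a stand-in for the real catalogue backend. The table name DRIVE_STATE, the drive name and the helper functions are assumptions made for the example; only the column names SESSION_START_TIME and SESSION_ELAPSED_TIME come from this thread.

```python
import sqlite3


def make_catalogue() -> sqlite3.Connection:
    """Create a toy catalogue with a CHECK constraint on the elapsed time."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE DRIVE_STATE (              -- hypothetical table name
            DRIVE_NAME           TEXT PRIMARY KEY,
            SESSION_START_TIME   INTEGER NOT NULL,
            SESSION_ELAPSED_TIME INTEGER NOT NULL DEFAULT 0
                CHECK (SESSION_ELAPSED_TIME >= 0)
        )
        """
    )
    conn.execute(
        "INSERT INTO DRIVE_STATE (DRIVE_NAME, SESSION_START_TIME) VALUES ('drive0', 1000)"
    )
    conn.commit()
    return conn


def report(conn: sqlite3.Connection, drive_name: str, report_time: int) -> bool:
    """Apply the elapsed-time update; return False if the constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "UPDATE DRIVE_STATE "
                "SET SESSION_ELAPSED_TIME = ? - SESSION_START_TIME "
                "WHERE DRIVE_NAME = ?",
                (report_time, drive_name),
            )
        return True
    except sqlite3.IntegrityError:
        # The negative value never reaches the table, so later reads still work.
        return False


def elapsed(conn: sqlite3.Connection, drive_name: str) -> int:
    row = conn.execute(
        "SELECT SESSION_ELAPSED_TIME FROM DRIVE_STATE WHERE DRIVE_NAME = ?",
        (drive_name,),
    ).fetchone()
    return row[0]
```

With this in place, a report from a host whose clock has jumped backwards is rejected at write time, and the row keeps its last valid value, so only the skewed drive sees an error.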
After the time on the mentioned drive (host) was fixed, we started to see these in our monitoring:
Additional Info
CRIT - Drive State:DOWN Desired:DOWN [cta-taped] ERROR delegateToImpl failed: Column SESSION_ELAPSED_TIME contains the value -235243 which is not a valid unsigned integer tpsrvf2205 NO_MOUNT Last Updated:1 Write,FB8769,2026-03-25 08:04:41(!!)
The following stack trace was in the frontend logs
Thank you for the extra information.
Is the problem still affecting you?
In principle, I would expect this negative value to go away as soon as a new session starts on that tape server, but I could be wrong. If not, manually updating that value to zero or to a small value should unblock it.
Can you please confirm if this is the case?
Regarding the CTA Frontend, it seems to be behaving as expected when faced with an exception of this type: it sends an error message back to cta-admin and logs the stack trace for further analysis.
Unfortunately, validating these errors in the CTA Frontend code is complicated, because they are generated deep inside the CTA Catalogue code. By the time they reach the outer layer of the CTA Frontend, it has no way to identify and validate the original cause without the intervention of an operator.
The best way to avoid this next time is to prevent the CTA Catalogue from ever writing negative values to the DB where they are not allowed. That is a bug indeed.
Yes. When this happened I manually put a value of 0 in the session_elapsed_time column. That allowed me to put that drive down, but the effect was lost seconds later, when the drive updated that value again. The real fix was to deal with the server’s skewed time.