Timeout values for slow Spectra library response?

To be clear, I don’t necessarily consider this a CTA bug, but I’m providing information and looking for suggestions.

Summary

cta-rmcd returns 0 even when a mount or dismount times out

Details

CTA version: 5.11.11.0-1
Operating System and version: AlmaLinux 9.7
Xrootd version: 5.9.1-1
Objectstore backend: PostgreSQL

Steps to reproduce

Mount or dismount tape when library is busy. This has been difficult to reproduce outside of a busy production environment.

What is the current bug behaviour?

rmcd returns 0 even when tape has not dismounted. Tape remains mounted, then CTA has a cleaner error and taped goes offline.

What is the expected correct behaviour?

I’m not exactly sure. Return code should be something other than 0, perhaps.

Relevant logs and/or screenshots

02/05 12:39:15 2440104 rmc_srv_unmount: RMC92 - unmount request by 1000,33 from localhost.localdomain
02/05 12:39:15 2440104 rmc_srv_unmount: RMC98 - unmount FAA269 8 0
02/05 12:44:52 2440104 rmc_srv_unmount: returns 0
02/05 12:51:28 2440104 rmc_srv_mount: RMC92 - mount request by 1000,33 from localhost.localdomain
02/05 12:51:28 2440104 rmc_srv_mount: RMC98 - mount FAA199/0 on drive 8
02/05 12:56:36 2440104 rmc_srv_mount: returns 0
02/05 13:03:19 2440104 rmc_srv_unmount: RMC92 - unmount request by 1000,33 from localhost.localdomain
02/05 13:03:19 2440104 rmc_srv_unmount: RMC98 - unmount FAA199 8 0
02/05 13:07:35 2440104 rmc_srv_unmount: returns 0
{“epoch_time”:1770318441.586744138,“local_time”:“2026-02-05T13:07:21-0600”,“hostname”:“tpsrvf2101”,“program”:“cta-taped”,“log_level”:“INFO”,“pid”:2950234,“tid”:2950234,“message”:“In Scheduler::setDesiredDriveState(): success.”,“drive_name”:“F1_F9B4D1”,“instance”:“prd”,“sched_backend”:“cephUser”,“drive”:“F1_F9B4D1”,“up”:“down”,“force”:“no”,“reason”:“[cta-taped] ERROR Cleaner failed. Cleaner failed to dismount tape: Failed to dismount tape: vid=FAA199 slot=smc8: Failed to dismount tape in SCSI tape-library: vid=FAA199 librarySlot=smc8: Failed to read message header: In io::readBytes: timeout”,“comment”:“”,“schedulerDbTime”:0.001207}

Possible causes

This occurs when the Spectra library (running BlueScale) is busy and may be slow to respond. The timeout values we are using on the taped may not be ideal. Spectra has suggested 12-20 minutes per their spec.
taped TapeLoadTimeout 300
taped WatchdogMountMaxSecs 600
taped WatchdogUnmountMaxSecs 600

Hi everyone,

Still looking for some input here. What are other sites using for tape server timeout values?

Hi Tim,

In the absence of a reply from anybody else, let me give you what we see at CERN for our Spectra Logic TFinity tape libraries running BlueScale.

We are using the standard timeout values that you mention above:

[root@tpsrv601 cta]# grep TapeLoadTimeout /etc/cta/cta-taped-SPECTRALIB2-LTO9-F05B2S?.conf
/etc/cta/cta-taped-SPECTRALIB2-LTO9-F05B2S3.conf:taped TapeLoadTimeout 300
/etc/cta/cta-taped-SPECTRALIB2-LTO9-F05B2S4.conf:taped TapeLoadTimeout 300

[root@tpsrv601 cta]# grep Watchdog /etc/cta/cta-taped-SPECTRALIB2-LTO9-F05B2S?.conf
[root@tpsrv601 cta]#

We do not see any particular timeouts as long as the robotics of the libraries are physically working fine.

As you can see from the logs below, the mount / dismount operations could exceed 5 minutes, still the session finishes with a success:

[root@tpsrv601 cta]# grep 'sion finished' cta-taped-SPECTRALIB2-LTO9-F05B2S*|grep success | grep mountTime|tr ',' '\n'|grep '"mountTime'|tr ':' ' '|tr -d '"' | sort -k 2 -n -r | head -5
mountTime 484.56015
mountTime 420.028635
mountTime 401.000624
mountTime 387.482968
mountTime 332.809072

[root@tpsrv601 cta]# grep 'sion finished' cta-taped-SPECTRALIB2-LTO9-F05B2S*|grep success | grep mountTime|tr ',' '\n'|grep '"unmountTime'|tr ':' ' '|tr -d '"' | sort -k 2 -n -r | head -5
unmountTime 353.45748
unmountTime 210.729228
unmountTime 210.627866
unmountTime 201.691003
unmountTime 182.124809

[root@tpsrv601 cta]# grep 'sion finished' cta-taped-SPECTRALIB2-LTO9-F05B2S*|grep success | grep mountTime|tr ',' '\n'|grep '"unloadTime'|tr ':' ' '|tr -d '"' | sort -k 2 -n -r | head -5
unloadTime 247.260356
unloadTime 245.031349
unloadTime 244.363445
unloadTime 237.853739
unloadTime 236.865485

This morning, one of the libraries had a problem and the mounts failed / timedout. This is what cta-rmcd reported:

03/04 06:46:22  2767 rmc_srv_mount: RMC92 - mount request by 1000,33 from localhost
03/04 06:46:22  2767 rmc_srv_mount: RMC98 - mount L63990/0 on drive 2
03/04 06:58:32  2767 lasterror: Function entered: asc=8 ascq=1 save_errno=5 rc=-4 sensekey=11 skvalid=1
03/04 06:58:32  2767 lasterror: Entry found in scsierr_acttbl: action_str=RBT_NORETRY
03/04 06:58:32  2767 rmc_sendrep: smc_mount: SR017 - find_cartridge L63990 failed : Logical Unit Communication Time-out
03/04 06:58:32  2767 rmc_srv_mount: returns 2203
03/04 06:58:32  2767 rmc_sendrep: RMC03 - illegal function 4
03/04 06:58:32  2767 rmc_sendrep: RMC02 - send error : Input/output error
03/04 06:58:32  2767 rmc_sendrep: Call to netwrite() failed: rep_type=MSG_ERR neterror=Input/output error
03/04 06:58:32  2767 rmc_sendrep: RMC02 - send error : Input/output error
03/04 06:58:32  2767 rmc_sendrep: Call to netwrite() failed: rep_type=RMC_RC neterror=Input/output error

As you can see, the return value was 2203, i.e. not 0.

I am not sure if this helps you, but this is all I have at this point. Let me know if you need more.

Best regards,

Vladimir

One more add-on - here is example of failed unmount from this night:

03/04 02:58:51  8759 rmc_srv_unmount: RMC92 - unmount request by 1001,33 from localhost
03/04 02:58:51  8759 rmc_srv_unmount: RMC98 - unmount L53440 14 0
03/04 03:11:01  8759 lasterror: Function entered: asc=8 ascq=1 save_errno=5 rc=-4 sensekey=11 skvalid=1
03/04 03:11:01  8759 lasterror: Entry found in scsierr_acttbl: action_str=RBT_NORETRY
03/04 03:11:01  8759 rmc_sendrep: smc_dismount: SR020 - read_elem_status failed : Logical Unit Communication Time-out
03/04 03:11:01  8759 rmc_srv_unmount: returns 2203
03/04 03:11:01  8759 rmc_sendrep: RMC03 - illegal function 5
03/04 03:11:01  8759 rmc_sendrep: RMC02 - send error : Input/output error
03/04 03:11:01  8759 rmc_sendrep: Call to netwrite() failed: rep_type=MSG_ERR neterror=Input/output error
03/04 03:11:01  8759 rmc_sendrep: RMC02 - send error : Input/output error
03/04 03:11:01  8759 rmc_sendrep: Call to netwrite() failed: rep_type=RMC_RC neterror=Input/output error

Again, the return value is 2203.

Vladimir