Archive Jobs Stalled with reportJobsBatchTransferred Exception

atefeh · 2 April 2026 12:28

We had some files belonging to three tapepools on the disk buffer. The EOS redundancy was d1::t0, but some of these files were already on tape while others were not.

We tried to resubmit closew for all files, but we got the following error:
“In ArchiveMount::reportJobsBatchTransferred(): got an exception.”

Then we removed sys.archive.error and resubmitted closew, but that did not help.

Next, we removed sys.archive.error and sys.cta.archive.objectstore.id (set to null) and resubmitted closew, but again there was no result.
I also checked cta-admin fr ls --log, and apart from the reportJobs exception, there were no other exceptions or errors.

We dedicated one tape server to one of these tapepools, and in the last attempt I changed the taped ArchiveFlushBytesFiles parameter from
32000000000,1000
to
32000000000,1,
and then restarted cta-taped.

After this change, all files went to tape, and the EOS redundancy of the files already on tape was updated to d0::t1.

An interesting point is that other files that were stuck on disk but belonged to other tapepools also went to tape (from a different tape server).

I am wondering: when we changed this parameter to 1, what exactly happened?

I would appreciate any help in understanding what happened and any clues about the root cause.

guenther · 7 April 2026 20:28

Dear Atefeh,

ArchiveFlushBytesFiles controls the maximum number of bytes and number of files that will be written to tape before a tape session flushes its completed work to the CTA catalogue and EOS (synchronising the tape mark with the catalogue and disk buffer info). In your case, the original setting was 32 GB, 1000 files.

By changing it to 32GB, 1 file, CTA is forced to flush and report after every single file. This suggests the issue may be related to the batch reporting phase (Catalogue/EOS), rather than the tape writing itself. Based on the information provided, it is possible that files were successfully written to tape, but not reported correctly (remaining stuck with EOS bits d1::t0).
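As a rough illustration of the two thresholds described above, the following Python sketch groups written files into flush batches. It is not CTA's implementation: `FlushPolicy` and `batches` are hypothetical names, the workload is made up, and the sketch assumes that a flush is triggered as soon as either the byte limit or the file limit is reached.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class FlushPolicy:
    """Hypothetical model of the two ArchiveFlushBytesFiles values."""
    max_bytes: int
    max_files: int


def batches(file_sizes: Iterable[int], policy: FlushPolicy) -> List[List[int]]:
    """Group file sizes into flush batches (assumed: flush when either limit is hit)."""
    result: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0
    for size in file_sizes:
        current.append(size)
        current_bytes += size
        if current_bytes >= policy.max_bytes or len(current) >= policy.max_files:
            result.append(current)
            current, current_bytes = [], 0
    if current:  # end of session: flush whatever is left
        result.append(current)
    return result


sizes = [1_000_000] * 2500  # 2500 files of 1 MB each (made-up workload)
print(len(batches(sizes, FlushPolicy(32_000_000_000, 1000))))  # 3 batches
print(len(batches(sizes, FlushPolicy(32_000_000_000, 1))))     # 2500 batches
```

With small files the file-count limit dominates, so lowering it from 1000 to 1 turns a few large reporting batches into one report per file.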

To better understand the root cause, could you please provide detailed debugging information for one affected file (ideally one that was stuck before the change)? In particular, it would be helpful to have:

logs from the tape server (cta-taped)
corresponding logs from the frontend
relevant EOS logs and metadata
any related entries in failed requests after the failure: `cta-admin fr ls`

Additionally, could you share the cta-taped configuration?
Is there a way to reproduce the issue?

Best regards,
Jaroslav

atefeh · 13 April 2026 16:22

Hi Guenther,

Thanks for your reply.

Please find the log file at the following link:
https://s3.echo.stfc.ac.uk/si-lib-logs/log-10042026.txt?AWSAccessKeyId=0a90c8639b7645d1be8414819a0e6d0d&Signature=4VDldYVzdUuYmcNHwSiTR5IHKes%3D&Expires=1776700523

I have provided the CTA tape configuration file, frontend logs, and tapeserver logs. All EOS information belongs to this sample file. I also used cta-admin fr ls --log to filter out this file.

Unfortunately, I cannot reproduce this issue because I do not know the reason for it.

Thank you,
Atefeh

guenther · 14 April 2026 12:29

Hello Atefeh,

Thank you very much for the additional information.

It would be helpful to have the full logs as well as the exact timestamp of the reconfiguration. If this is not possible, we can try to track a single file; however, in that case, we would need to filter based on a specific archive file ID, fxid, fid, file name, or any other identifying parameter to build a more complete picture of the situation.

At the very least, could you please unmask the full "archiveId": "43885*****" and include log lines related to it?

From the log files, I can see that CTA successfully wrote the file to tape at 2026-03-26 12:50 and the archive mount process tried to queue it for reporting, but this operation failed. I suspect there may have been a problematic report that caused the entire batch to fail, or possibly an EOS connectivity issue.
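To make this hypothesis concrete, here is a minimal Python sketch of how one bad report can hold back a whole batch when reporting is all-or-nothing. It is an illustration only, not CTA code: `report_batch`, `run`, `ReportError` and the file names are invented, and the sketch assumes that an exception anywhere in a batch leaves every file in that batch unreported.

```python
from typing import Callable, List, Set


class ReportError(Exception):
    """Hypothetical stand-in for a failed report to the catalogue or EOS."""


def report_batch(batch: List[str], report_one: Callable[[str], None]) -> None:
    """All-or-nothing (assumed): the first failure aborts the whole batch."""
    for name in batch:
        report_one(name)


def run(files: List[str], batch_size: int, bad: Set[str]) -> List[str]:
    """Return the files whose batch was reported without an exception."""
    def report_one(name: str) -> None:
        if name in bad:
            raise ReportError(name)

    reported: List[str] = []
    for i in range(0, len(files), batch_size):
        batch = files[i:i + batch_size]
        try:
            report_batch(batch, report_one)
        except ReportError:
            continue  # the whole batch stays unreported
        reported.extend(batch)
    return reported


files = [f"file{i}" for i in range(10)]
print(len(run(files, batch_size=10, bad={"file3"})))  # 0: one bad report blocks all ten
print(len(run(files, batch_size=1, bad={"file3"})))   # 9: only the bad file is held back
```

Under this assumption, a batch size of 1 confines a failure to the problematic file, which would be consistent with the other files completing once the file limit was lowered.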

Could you please provide tape server log lines around the following messages? I do not see them in the filtered logs you shared:

"In ArchiveMount::reportJobsBatchTransferred(): got an exception"
"In EOSReporter::asyncReportArchiveFullyComplete(): failed to XrdCl::FileSystem::Query() [FATAL] Invalid address code:101 errNo:0 status:3"

Please also note the following warning (minor issue only):

Mar 26, 2026 @ 12:52:36.299 TapedConfiguration::constructProcessName - short drivename 'asterix_ts1160_06' exceeds max length of 8; truncating

Currently, the maximum process name length is 16 characters. However, we reserve:

1 character for the null terminator,
1 character for the hyphen,
up to 6 characters for the postfix (e.g., “parent” or “drive”).

This leaves 16 - 1 - 1 - 6 = 8 characters available for the drive name.
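The same arithmetic can be written as a short Python sketch. The constant and function names are hypothetical, and the thread does not say which characters are kept when the name is truncated; the sketch assumes the leading ones.

```python
MAX_PROCESS_NAME_LEN = 16  # includes the null terminator
NULL_TERMINATOR = 1
HYPHEN = 1
MAX_POSTFIX_LEN = 6        # e.g. "parent" or "drive"

MAX_DRIVE_NAME_LEN = MAX_PROCESS_NAME_LEN - NULL_TERMINATOR - HYPHEN - MAX_POSTFIX_LEN


def short_drive_name(drive_name: str) -> str:
    """Assumed truncation: keep the leading characters that fit."""
    return drive_name[:MAX_DRIVE_NAME_LEN]


print(MAX_DRIVE_NAME_LEN)                     # 8
print(short_drive_name("asterix_ts1160_06"))  # asterix_
```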

Best regards,

Jaroslav

atefeh · 5 May 2026 16:25

Hi,
Thank you for your response.

I am sending all MGM log files related to this particular file “/eos/antares/prod/na62.vo.gridpp.ac.uk/disk/prod/v4.2.0/Kch2pipi0g_ib-161/logs/runlog_c2_dr013599_r15544746.log”, along with the tape server log files for the first two attempts, which failed. I have also attached the log file from the last attempt, which was successful (cta-ts06).

Apologies if reviewing all these files is a hassle. Please let me know if you need any additional details or further filtering based on specific parameters.

thanks,

Atefeh

guenther · 1 June 2026 11:11

Hello Atefeh,

Thank you very much for all this additional input. At first glance, I see several exceptions and critical messages on the tape servers related to the objectstore backend and repack requests. Are you doing repack operations at the same time?

I will try to have a deeper look as soon as I find more time.

Best regards,
Jaroslav

```
{"epoch_time":1774502906.614798643,"local_time":"2026-03-26T05:28:26+0000","hostname":"cta-ts07","program":"cta-taped","log_level":"ERROR","pid":1115617,"tid":1115617,"message":"In Agent::deleteAndUnregisterSelf: agent still owns objects. Here is a part of the list.","drive_name":"asterix_ts1160_06","instance":"antares","sched_backend":"cephUser","agentObject":"Maintenance-cta-ts07.scd.rl.ac.uk-1115617-20260325-22:25:39-0","objects":"RepackSubRequest-Maintenance-getafix-ts23.scd.rl.ac.uk-1661740-20250414-00:21:35-0-82 RepackSubRequest-Maintenance-getafix-ts23.scd.rl.ac.uk-1661740-20250414-00:21:35-0-9","startIndex":75,"endIndex":76,"totalObjects":77}
{"epoch_time":1774502906.617964572,"local_time":"2026-03-26T05:28:26+0000","hostname":"cta-ts07","program":"cta-taped","log_level":"CRIT","pid":1115617,"tid":1115617,"message":"In BackendPopulator::~BackendPopulator(): error deleting agent (cta::exception::Exception). Backtrace follows.","drive_name":"asterix_ts1160_06","instance":"antares","sched_backend":"cephUser","errorMessage":"In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-cta-ts07.scd.rl.ac.uk-1115617-20260325-22:25:39-0) still owns objects. Here's the first few: RepackSubRequest-Maintenance-getafix-ts23.scd.rl.ac.uk-1661740-20250414-00:21:35-0-10 RepackSubRequest-Maintenance-getafix-ts23.scd.rl.ac.uk-1661740-20250414-00:21:35-0-11 RepackSubRequest-Maintenance-getafix-ts23.scd.rl.ac.uk-1661740-20250414-00:21:35-0-12 RepackSubRequest-Maintenance-getafix-ts23.scd.rl.ac.uk-1661740-20250414-00:21:35-0-13 [… trimmed at 3 of 77]"}
```
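For anyone wanting to pull such entries out of a larger cta-taped log, a minimal Python sketch follows. It assumes one JSON object per line with the field names seen in the entries above (`log_level`, `hostname`, `local_time`, `message`); the function name `severe_entries` and the second sample line are invented for illustration.

```python
import json
from typing import Iterable, Iterator, Tuple


def severe_entries(lines: Iterable[str],
                   levels: Tuple[str, ...] = ("ERROR", "CRIT")) -> Iterator[dict]:
    """Yield parsed log entries whose log_level is in `levels`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if isinstance(entry, dict) and entry.get("log_level") in levels:
            yield entry


sample = [
    '{"local_time": "2026-03-26T05:28:26+0000", "hostname": "cta-ts07", '
    '"log_level": "ERROR", "message": "In Agent::deleteAndUnregisterSelf: '
    'agent still owns objects. Here is a part of the list."}',
    # Made-up informational line, only here to show that it is filtered out.
    '{"local_time": "2026-03-26T05:28:27+0000", "hostname": "cta-ts07", '
    '"log_level": "INFO", "message": "made-up informational line"}',
]
for e in severe_entries(sample):
    print(e["local_time"], e["hostname"], e["log_level"], e["message"])
```

Lines copied from a forum post may have had their quotes converted to typographic ones, which `json.loads` rejects; such lines are silently skipped here, so it is safer to run this against the original log files.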

atefeh · 2 June 2026 13:32

Hi Guenther,

Thank you for your reply.
Yes, we were doing repack operations at the same time.

Thank you for the update.