Archive Jobs Stalled with reportJobsBatchTransferred Exception

We had some files belonging to three tapepools on the disk buffer. The EOS redundancy was d1::t0, but some of these files were already on tape while others were not.

We tried to resubmit closew for all files, but we got the following error:
“In ArchiveMount::reportJobsBatchTransferred(): got an exception.”

Then we tried removing sys.archive.error and resubmitting closew, but this did not help.

Next, we removed sys.archive.error and sys.cta.archive.objectstore.id (set to null) and resubmitted closew, but again there was no result.
I also checked cta-admin fr ls --log, and apart from the reportJobsBatchTransferred exception, there were no other exceptions or errors.

We dedicated one tape server to one of these tapepools, and in the last attempt I changed the taped ArchiveFlushBytesFiles parameter from
32000000000,1000
to
32000000000,1
and then restarted cta-taped.

After this change, all files went to tape, and the EOS redundancy of the files already on tape was updated to d0::t1.

An interesting point is that other files that were stuck on disk but belonged to other tapepools also went to tape (from a different tape server).

I am wondering: when we changed this parameter to 1, what exactly happened?

I would appreciate any help in understanding what happened and any clues about the root cause.

Dear Atefeh,

ArchiveFlushBytesFiles controls the maximum number of bytes and the maximum number of files that will be written to tape before a tape session flushes its completed work to the CTA catalogue and EOS (synchronising the tape mark with the catalogue and disk buffer info). In your case, the original setting was 32 GB, 1000 files.

Changing it to 32 GB, 1 file forces CTA to flush and report after every single file. This suggests the issue may be related to the batch reporting phase (Catalogue/EOS), rather than the tape writing itself. Based on the information provided, it is possible that files were successfully written to tape, but not reported correctly (remaining stuck with EOS bits d1::t0).
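
To illustrate the difference between the two settings, here is a minimal Python sketch of the threshold logic described above. It is a simplified model and not the actual cta-taped code: the names FlushPolicy and flush_batches are hypothetical, and it assumes that a flush is triggered as soon as either limit is reached, with a final flush of whatever remains at the end of the session.

```python
from dataclasses import dataclass


@dataclass
class FlushPolicy:
    """Hypothetical model of an ArchiveFlushBytesFiles value."""
    max_bytes: int
    max_files: int

    @classmethod
    def parse(cls, value: str) -> "FlushPolicy":
        # The value has the form "<bytes>,<files>", e.g. "32000000000,1000".
        max_bytes, max_files = (int(part) for part in value.split(","))
        return cls(max_bytes, max_files)


def flush_batches(file_sizes, policy):
    """Return the number of files contained in each flush/report batch.

    Assumption: a flush happens once the pending bytes OR the pending
    file count reaches its limit; leftovers are flushed at session end.
    """
    batches = []
    pending_bytes = 0
    pending_files = 0
    for size in file_sizes:
        pending_bytes += size
        pending_files += 1
        if pending_bytes >= policy.max_bytes or pending_files >= policy.max_files:
            batches.append(pending_files)
            pending_bytes = 0
            pending_files = 0
    if pending_files:
        batches.append(pending_files)  # final flush at end of session
    return batches


if __name__ == "__main__":
    sizes = [1_000_000] * 2500  # 2500 files of 1 MB each (made-up workload)
    for setting in ("32000000000,1000", "32000000000,1"):
        batches = flush_batches(sizes, FlushPolicy.parse(setting))
        print(setting, "->", len(batches), "batches, largest:", max(batches))
```

In this model, the original setting reports the 2500 files in three batches (1000, 1000 and 500 files), while the new setting produces 2500 batches of one file each, so every report covers a single job.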

To better understand the root cause, could you please provide detailed debugging information for one affected file (ideally one that was stuck before the change)? In particular, it would be helpful to have:

  • logs from the tape server (cta-taped)
  • corresponding logs from the frontend
  • relevant EOS logs and metadata
  • any related entries in failed requests after the failure: `cta-admin fr ls`

Additionally, could you share the cta-taped configuration?
Is there a way to reproduce the issue?

Best regards,
Jaroslav