CTA 4.7.3 | object store | NFS | archiving | could not parse header

ViktorKotliar · 4 June 2025 13:41

Hi all!
We have run into a strange problem with archiving and the object store. Since we are on an old CTA version, it may be a known issue. Is there any way to make archiving work for these files?
Maybe we can clean up the ObjectStore or just remove files (objects) from it. Any help is appreciated.

Cheers,
Victor

Summary follows.

Summary

Details

CTA version: 4.7.3-1, xrd-ssi v1.1, catalogue schema 10.0 (PostgreSQL)
Operating System and version: CentOS Linux release 7.9.2009 (Core), Docker containers
Xrootd version: 4.12.6-1.
Objectstore backend: NFS SHARE

What is the current problem behavior?

Tape servers cannot prepare some files for migration, while other files appear to be going to tape in other tape pools.

Relevant logs and/or screenshots

Here is the ERROR from the tape server log (the base64 payload is truncated here):

Jun  4 12:19:40.546411 tape-1-3-1 cta-taped: LVL="ERROR" PID="832940" TID="832940" MSG="Failed to getFilesToMigrate" thread="MainThread" tapeDrive="LTO8-1-3-1-0" tapeVid="ST6510" mountId="602461" transactionId="602461" byteSizeThreshold="80000000000" maxFiles="4000" message="In ObjectOps<N3cta11objectstore11serializers17ArchiveQueueShardE>::getHeaderFromObjectData(): could not parse header:  size=24645 data(b64)='CFoQABpY..........
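
When many such lines need to be compared (for example to see which drives and tapes are affected), it can help to split them into fields. The following is a minimal Python sketch, assuming the log keeps the KEY="VALUE" layout seen in the line above. The `parse_fields` helper and the shortened `sample` line are illustrative only and are not part of CTA.

```python
import re

# Matches KEY="VALUE" pairs as they appear in the cta-taped line above.
FIELD_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_fields(line: str) -> dict:
    """Return the KEY="VALUE" pairs of one log line as a dict."""
    return dict(FIELD_RE.findall(line))


# Shortened copy of the log line above (the long message field is left out).
sample = (
    'Jun  4 12:19:40.546411 tape-1-3-1 cta-taped: LVL="ERROR" PID="832940" '
    'TID="832940" MSG="Failed to getFilesToMigrate" thread="MainThread" '
    'tapeDrive="LTO8-1-3-1-0" tapeVid="ST6510" mountId="602461"'
)

fields = parse_fields(sample)
print(fields["LVL"], fields["MSG"], fields["tapeDrive"], fields["tapeVid"])
```

A value that is cut off before its closing quote, as the message field is in the paste above, is simply skipped by this pattern.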

Possible causes

Probably many simultaneous requests or a heavy archiving load on the NFS-backed object store.

jleduc · 4 June 2025 14:31

Hello Viktor,
I guess the problem could get worse quickly, as it looks like your tape servers cannot consume objects from one of the tapepool queue shards. That would mean they can no longer dequeue (from the start of the FIFO queue) while users are still queueing on the tapepool queue (at the end of the queue) → an archive backlog builds up.
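
To illustrate why a blocked head matters, here is a toy Python model of a FIFO queue. It is only a sketch of the reasoning above, not of how CTA implements its queues; the `simulate` function and its parameters are made up for this illustration.

```python
from collections import deque


def simulate(rounds: int, arrivals_per_round: int, head_blocked: bool) -> int:
    """Return the queue length after `rounds` of enqueue/dequeue activity."""
    queue = deque()
    for r in range(rounds):
        # Users keep appending new requests at the tail of the queue.
        for i in range(arrivals_per_round):
            queue.append((r, i))
        # A drive pops from the head, unless the head cannot be read.
        if not head_blocked:
            for _ in range(min(arrivals_per_round, len(queue))):
                queue.popleft()
    return len(queue)


print(simulate(10, 5, head_blocked=False))  # 0: the healthy queue drains
print(simulate(10, 5, head_blocked=True))   # 50: the backlog grows each round
```

With the head blocked, nothing behind it can be consumed, so the queue length grows by the full arrival rate in every round.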

If this is the case, then you need to clean this up, as it won’t recover and the backlog will pile up.

You could try to delete the problematic archive request objects in the objectstore, but you are running a very old version of CTA, so I am not sure that will be sufficient.

If you want to clean this tapepool archive queue you can:

Set the number of write drives to 0 for the tapepool's VO, to prevent the drives from messing with the queue.
Rename the tapepool queue object and run cta-objectstore-dereference-removed-queues to dereference the queue from the root object.

At this point new user requests will recreate the queue for archival and re-reference it in the root object.

You then need to requeue the files that were in that queue: look for files stuck in the archive buffer, inspect the objects referenced in their file attributes, delete those objects from the objectstore and requeue the files for archival.

Let me know if this is relevant to your current situation or if I missed the point.

Best regards,
Julien Leduc

ViktorKotliar · 5 June 2025 07:23

Hi Julien,
I decoded the base64 data from the log for the ArchiveQueueShard: it contains 200+ archive requests. I looked for these requests in the ObjectStore and did not find the corresponding objects. For the moment I have just removed the broken ArchiveQueueShard, and files for this tapepool started archiving again.
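
For anyone who wants to repeat this kind of inspection, below is a minimal Python sketch that base64-decodes a payload and walks its top-level fields. It assumes the object data is protobuf wire-format, which is an assumption here and not confirmed in this thread. The `read_varint` and `walk_fields` helpers are hypothetical, they handle only varint and length-delimited fields, and they do not interpret what the fields mean. Only the eight base64 characters visible in the log are used, so the walk stops at the truncation.

```python
import base64


def read_varint(buf: bytes, pos: int):
    """Decode one base-128 varint starting at pos; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def walk_fields(buf: bytes):
    """Yield (field_number, wire_type, value) for top-level fields.

    Raises ValueError on truncated input or an unsupported wire type.
    """
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field, wire = key >> 3, key & 0x07
        if wire == 0:  # varint
            value, pos = read_varint(buf, pos)
        elif wire == 2:  # length-delimited (bytes, strings, nested messages)
            length, pos = read_varint(buf, pos)
            if pos + length > len(buf):
                raise ValueError(f"field {field}: needs {length} bytes, truncated")
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"unsupported wire type {wire}")
        yield field, wire, value


# Only the prefix visible in the log; the real payload is much longer.
prefix = base64.b64decode("CFoQABpY")
try:
    for field, wire, value in walk_fields(prefix):
        print(field, wire, value)
except ValueError as err:
    print("stopped:", err)
```

On the full payload, each length-delimited value could be passed to `walk_fields` again to look one level deeper, provided it really is a nested message.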

If there is more info I will add it here.

many thx!
Cheers
Victor

jleduc · 5 June 2025 08:30

Hi Viktor,
Happy to hear that archival is back!
Indeed, it is always good to clean up the failed requests listed in cta-admin fr ls, to limit noise in the objectstore requests before messing with objects.
In your current state you may just need to babysit the queue a bit and make sure that it is deleted and dereferenced once it has been fully consumed.

Cheers,
Julien