SSS keytab/auth issue in EOS+CTA lab

Summary

Hello,
I am debugging an SSS/keytab authentication issue in a small EOS+CTA Vagrant lab.
At the moment, the initial sync::create request from the MGM to the CTA frontend succeeds, but the workflow later fails on the FST side during the sync::closew follow-up. The client upload ends with a socket timeout, and I also see FST→MGM authentication failures.
The confusing part is that, depending on how I split the client and server keytabs on the MGM, I can make either the MGM→CTA CREATE path or internal EOS communication work, but not both at the same time.

Details

CTA version: 5.11.14.0
Operating System and version: Rocky Linux 9 (Vagrant lab)
Xrootd version: 5.9.1
Objectstore backend: VFS / file:///var/lib/cta/objectstore/

Steps to reproduce

My lab currently has:

  • 1 MGM
  • 1 FST
  • 1 CTA frontend/server
  • 1 client

Current keytabs:

MGM

  • /etc/eos.keytab

contains:


eosmaster daemon daemon

  • /etc/eos.cta.sss.keytab

contains:


cta_eosdev+ eosdev cta

Current MGM config:


sec.protocol sss -c /etc/eos.keytab -s /etc/eos.cta.sss.keytab

mgmofs.qdbpassword_file /etc/eos.keytab

CTA frontend

Config:


sec.protocol sss -s /etc/cta/eos.sss.keytab

Keytab:


cta_eosdev+ eosdev cta

FST

Config:


sec.protocol sss -s /etc/eos.keytab

fstofs.qdbpassword_file /etc/eos.keytab

Keytab:


eosmaster daemon daemon

Reproducer

From the client:

eosmcp /tmp/cta_test.txt /eos/dev/tokenlab/cta_test_$(date +%s).txt

eosmcp is a wrapper.

What is the current bug behaviour?

The file write starts, but the client ends with:


Run: [ERROR] Socket timeout: (destination)

On the CTA frontend, I can see that the CREATE event is received and processed successfully:


MSG="In WorkflowEvent::WorkflowEvent(): received event." user="eosdev@cta" eventType="CREATE" eosInstance="eosdev" diskFilePath="/eos/dev/tokenlab/cta_test_1773139026.txt" diskFileId="12"

MSG="In Scheduler::checkAndGetNextArchiveFileId(): Checked request and got next archive file ID" instanceName="eosdev" username="vagrant" usergroup="vagrant" storageClass="sc1" fileId="4294967300"

MSG="In WorkflowEvent::processCREATE(): assigning new archive file ID." diskFileId="12" diskFilePath="/eos/dev/tokenlab/cta_test_1773139026.txt" fileId="4294967300"

On the MGM, I see the workflow request being sent successfully:


SYNC::CREATE /eos/dev/tokenlab/cta_test_1773139026.txt 192.168.56.50:10955 fxid=0000000c

protoWFEndPoint="192.168.56.50:10955" protoWFResource="/ctafrontend" fullPath="/eos/dev/tokenlab/cta_test_1773139026.txt" event="sync::create" msg="Sent protocol buffer request"

However, later I see authentication failures between the FST and the MGM:

On the MGM


XrootdXeq: User authentication failed; Decryption key not found.

XrootdXeq: daemon.27092:460@fst1 disc 0:00:00

On the FST


msg="failed query request" request="/?mgm.pcmd=query2delete&mgm.target.nodename=/eos/fst1:1095/fst" status="[FATAL] Auth failed: No protocols left to try"

func=NotifyProtoWfEndPointClosew level=ERROR msg="Could not send request to outside service. Reason: [FATAL] Auth failed: No protocols left to try"

func=SendArchiveFailedToManager msg="sending error message to manager"

func=CallManager level=ERROR msg="MGM query failed"

func=QueueForArchiving level=CRIT msg="Failed to send archive failed event to manager"
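
To see quickly which FST code paths are hitting the authentication error, the log lines above can be filtered by their func= field. The following is only a rough sketch: auth_failures is a hypothetical helper name, and it assumes the log format matches the excerpts quoted in this post (a func=Name token on the same line as the "Auth failed" message).

```shell
# auth_failures: read FST log lines on stdin and print the distinct func=
# values of the lines that mention "Auth failed".
# Assumption: each relevant line carries a func=<Name> token, as in the
# excerpts above. Lines without func= (e.g. the query2delete line) are skipped.
auth_failures() {
  grep 'Auth failed' \
    | sed -n 's/.*func=\([A-Za-z0-9_]*\).*/\1/p' \
    | sort -u
}

# Example (the log path is an assumption, adjust to your setup):
#   auth_failures < /var/log/eos/fst/xrdlog.fst
```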

The FST log around the client close shows that the file already has archive metadata and sync::closew is being attempted:


mgm.attributes=...sys.archive.file_id=4294967302...&mgm.event=sync::closew&mgm.instance=eosdev...

What is the expected correct behaviour?

I would expect:

  1. client upload to complete without timeout,

  2. MGM to send sync::create successfully to CTA,

  3. FST to send the follow-up sync::closew / related workflow events successfully,

  4. no FST→MGM authentication failures,

  5. the archive workflow to continue normally instead of ending in archive_failed / cleanup.

Relevant logs and/or screenshots

MGM config


sec.protocol sss -c /etc/eos.keytab -s /etc/eos.cta.sss.keytab

mgmofs.qdbpassword_file /etc/eos.keytab

MGM keytabs


xrdsssadmin list /etc/eos.keytab

1 32 ... eosmaster daemon daemon

xrdsssadmin list /etc/eos.cta.sss.keytab

1 32 ... cta_eosdev+ eosdev cta

MGM attributes

[root@mgm etc]#  sudo eos attr ls /eos/dev/tokenlab/
sys.archive.storage_class="sc1"
sys.attr.link="/eos/dev/proc/cta/workflow"
sys.eos.btime="1773136666.725268848"
sys.forced.checksum="adler"
sys.link.workflow.closew.default="proto"
sys.link.workflow.create.default="proto"
sys.link.workflow.delete.default="proto"
sys.link.workflow.prepare.default="proto"
sys.link.workflow.sync::abort_prepare.default="proto"
sys.link.workflow.sync::archive_failed.default="proto"
sys.link.workflow.sync::archived.default="proto"
sys.link.workflow.sync::closew.default="proto"
sys.link.workflow.sync::closew.retrieve_written="proto"
sys.link.workflow.sync::create.default="proto"
sys.link.workflow.sync::delete.default="proto"
sys.link.workflow.sync::evict_prepare.default="proto"
sys.link.workflow.sync::prepare.default="proto"
sys.link.workflow.sync::retrieve_failed.default="proto"

MGM virtual identities

[root@mgm etc]# sudo eos vid ls
publicaccesslevel: => 1024
sss:"<pwd>":gid => root
sss:"<pwd>":uid => root
sss:"daemon":gid => root
sss:"daemon":uid => root
sudoer                 => uids()
tokensudo              => always
unix:"vagrant":gid => vagrant
unix:"vagrant":uid => vagrant

CTA frontend config


sec.protocol sss -s /etc/cta/eos.sss.keytab

CTA keytab


xrdsssadmin list /etc/cta/eos.sss.keytab

1 32 ... cta_eosdev+ eosdev cta

FST config


sec.protocol sss -s /etc/eos.keytab

fstofs.qdbpassword_file /etc/eos.keytab

FST keytab


xrdsssadmin list /etc/eos.keytab

1 32 ... eosmaster daemon daemon

Client error


[vagrant@client ~]$ eosmcp /tmp/cta_test.txt /eos/dev/tokenlab/cta_test_$(date +%s).txt

[41B/41B][100%][==================================================][0B/s]

Run: [ERROR] Socket timeout: (destination)

Possible causes

I think I need a combined SSS client keytab on the FST side, because different communication paths seem to require different credentials:

  • MGM → CTA needs the cta_eosdev key

  • FST → MGM internal EOS communication needs the normal eosmaster/daemon key

  • FST → CTA may also require the cta_eosdev key for the sync::closew path

So it seems likely that the FST-side keytab would need to contain both keys. However, in my setup it currently looks as if only the most recently added key is actually being used, while the other one is ignored.
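
One way to check whether a given keytab really contains both keys is to parse the output of xrdsssadmin list. The sketch below is not an official tool: keytab_has_key is a hypothetical helper, and it assumes the listing ends each row with the three columns "Keyname User Group", as in the xrdsssadmin output quoted in this thread.

```shell
# keytab_has_key NAME: read `xrdsssadmin list <keytab>` output on stdin and
# succeed (exit 0) if some key row has NAME in the Keyname column.
# Assumption: key rows start with a numeric key number and end with
# "<keyname> <user> <group>", so the key name is the third field from the end.
keytab_has_key() {
  awk -v want="$1" '
    $1 ~ /^[0-9]+$/ && NF >= 3 && $(NF-2) == want { found = 1 }
    END { exit !found }
  '
}

# Example (run as root on the node being checked):
#   xrdsssadmin list /etc/eos.keytab | keytab_has_key eosmaster   && echo "eosmaster present"
#   xrdsssadmin list /etc/eos.keytab | keytab_has_key cta_eosdev+ && echo "cta_eosdev+ present"
```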

If anyone has a reference configuration for this MGM/FST/CTA SSS split, that would be very helpful.

I was able to solve the problem by concatenating the two separate keytabs into a new shared SSS keytab with the following command. I’m not sure whether this is the cleanest solution, but it worked reliably in my lab.

sudo sh -c '{
  cat /etc/eos.keytab
  printf "\n"
  cat /etc/eos.cta.sss.keytab
  printf "\n"
} > /tmp/eos.keytab.new'
[vagrant@mgm etc]$ sudo xrdsssadmin list /tmp/eos.keytab.new
     Number Len Date/Time Created Expires  Keyname User & Group
     ------ --- --------- ------- -------- -------
          1  32 03/11/26 06:50:29 -------- cta_eosdev+ eosdev cta
          1  32 03/11/26 06:51:06 -------- eosmaster daemon daemon
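
For anyone repeating this, a slightly more defensive version of the concatenation above might look like the sketch below. It assumes, as the working cat-based merge suggests, that an SSS keytab is a plain-text file with one key per line. merge_keytabs is a hypothetical helper name, not part of EOS, CTA, or XRootD. Ownership of the result is left untouched, so it may still need to be adjusted for the daemon user.

```shell
# merge_keytabs OUT IN1 [IN2 ...]
# Write the non-blank, de-duplicated lines of the input keytabs to OUT.
# Assumption: keytabs are line-based text files (one key per line).
merge_keytabs() {
  out=$1
  shift
  # mktemp creates the file with mode 0600, so the keys are never
  # group- or world-readable while the merged file is being written.
  tmp=$(mktemp "${out}.XXXXXX") || return 1
  # NF skips blank lines; !seen[$0]++ keeps only the first copy of a line.
  if awk 'NF && !seen[$0]++' "$@" > "$tmp"; then
    mv "$tmp" "$out"
  else
    rm -f "$tmp"
    return 1
  fi
}

# Example (paths from this thread; run as root on the MGM):
#   merge_keytabs /tmp/eos.keytab.new /etc/eos.keytab /etc/eos.cta.sss.keytab
#   xrdsssadmin list /tmp/eos.keytab.new
```

Writing to a temporary file first and renaming it means a failed merge never leaves a half-written keytab at the target path.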

Hi @aminekarabila, glad you found a solution.

Debugging SSS authentication is indeed a pain. The new gRPC Frontend will have better authentication options, which we are currently finalising and testing. We hope to have this in production in the next couple of months.

Hello,

Thank you for the explanation. I’m really looking forward to the new feature, especially if it makes this process easier.

Thanks again!