Summary
Hello,
I am debugging an SSS/keytab authentication issue in a small EOS+CTA Vagrant lab.
At the moment, the initial sync::create request from the MGM to the CTA frontend is successful, but the workflow later fails during the FST-side closew/follow-up path. The client upload ends with a timeout, and I also see FST→MGM authentication failures.
The confusing part is that depending on how I split the client/server keytabs on the MGM, I can either make the MGM→CTA CREATE path work or I can preserve internal EOS communication, but not both at the same time.
Details
CTA version: 5.11.14.0
Operating System and version: Rocky Linux 9 (Vagrant lab)
Xrootd version: 5.9.1
Objectstore backend: VFS / file:///var/lib/cta/objectstore/
Steps to reproduce
My lab currently has:
- 1 MGM
- 1 FST
- 1 CTA frontend/server
- 1 client
Current keytabs:
MGM
/etc/eos.keytab
contains:
eosmaster daemon daemon
/etc/eos.cta.sss.keytab
contains:
cta_eosdev+ eosdev cta
Current MGM config:
sec.protocol sss -c /etc/eos.keytab -s /etc/eos.cta.sss.keytab
mgmofs.qdbpassword_file /etc/eos.keytab
CTA frontend
Config:
sec.protocol sss -s /etc/cta/eos.sss.keytab
Keytab:
cta_eosdev+ eosdev cta
FST
Config:
sec.protocol sss -s /etc/eos.keytab
fstofs.qdbpassword_file /etc/eos.keytab
Keytab:
eosmaster daemon daemon
Reproducer
From the client:
eosmcp /tmp/cta_test.txt /eos/dev/tokenlab/cta_test_$(date +%s).txt
eosmcp is a wrapper.
What is the current bug behaviour?
The file write starts, but the client ends with:
Run: [ERROR] Socket timeout: (destination)
On the CTA frontend, I can see that the CREATE event is received and processed successfully:
MSG="In WorkflowEvent::WorkflowEvent(): received event." user="eosdev@cta" eventType="CREATE" eosInstance="eosdev" diskFilePath="/eos/dev/tokenlab/cta_test_1773139026.txt" diskFileId="12"
MSG="In Scheduler::checkAndGetNextArchiveFileId(): Checked request and got next archive file ID" instanceName="eosdev" username="vagrant" usergroup="vagrant" storageClass="sc1" fileId="4294967300"
MSG="In WorkflowEvent::processCREATE(): assigning new archive file ID." diskFileId="12" diskFilePath="/eos/dev/tokenlab/cta_test_1773139026.txt" fileId="4294967300"
On the MGM, I see the workflow request being sent successfully:
SYNC::CREATE /eos/dev/tokenlab/cta_test_1773139026.txt 192.168.56.50:10955 fxid=0000000c
protoWFEndPoint="192.168.56.50:10955" protoWFResource="/ctafrontend" fullPath="/eos/dev/tokenlab/cta_test_1773139026.txt" event="sync::create" msg="Sent protocol buffer request"
However, later I see authentication failures between the FST and the MGM:
On the MGM
XrootdXeq: User authentication failed; Decryption key not found.
XrootdXeq: daemon.27092:460@fst1 disc 0:00:00
On the FST
msg="failed query request" request="/?mgm.pcmd=query2delete&mgm.target.nodename=/eos/fst1:1095/fst" status="[FATAL] Auth failed: No protocols left to try"
func=NotifyProtoWfEndPointClosew level=ERROR msg="Could not send request to outside service. Reason: [FATAL] Auth failed: No protocols left to try"
func=SendArchiveFailedToManager msg="sending error message to manager"
func=CallManager level=ERROR msg="MGM query failed"
func=QueueForArchiving level=CRIT msg="Failed to send archive failed event to manager"
The FST log around the client close shows that the file already has archive metadata and sync::closew is being attempted:
mgm.attributes=...sys.archive.file_id=4294967302...&mgm.event=sync::closew&mgm.instance=eosdev...
What is the expected correct behaviour?
I would expect:
- client upload to complete without timeout,
- MGM to send sync::create successfully to CTA,
- FST to send the follow-up sync::closew / related workflow events successfully,
- no FST→MGM authentication failures,
- the archive workflow to continue normally instead of ending in archive_failed / cleanup.
Relevant logs and/or screenshots
MGM config
sec.protocol sss -c /etc/eos.keytab -s /etc/eos.cta.sss.keytab
mgmofs.qdbpassword_file /etc/eos.keytab
MGM keytabs
xrdsssadmin list /etc/eos.keytab
1 32 ... eosmaster daemon daemon
xrdsssadmin list /etc/eos.cta.sss.keytab
1 32 ... cta_eosdev+ eosdev cta
MGM attributes
[root@mgm etc]# sudo eos attr ls /eos/dev/tokenlab/
sys.archive.storage_class="sc1"
sys.attr.link="/eos/dev/proc/cta/workflow"
sys.eos.btime="1773136666.725268848"
sys.forced.checksum="adler"
sys.link.workflow.closew.default="proto"
sys.link.workflow.create.default="proto"
sys.link.workflow.delete.default="proto"
sys.link.workflow.prepare.default="proto"
sys.link.workflow.sync::abort_prepare.default="proto"
sys.link.workflow.sync::archive_failed.default="proto"
sys.link.workflow.sync::archived.default="proto"
sys.link.workflow.sync::closew.default="proto"
sys.link.workflow.sync::closew.retrieve_written="proto"
sys.link.workflow.sync::create.default="proto"
sys.link.workflow.sync::delete.default="proto"
sys.link.workflow.sync::evict_prepare.default="proto"
sys.link.workflow.sync::prepare.default="proto"
sys.link.workflow.sync::retrieve_failed.default="proto"
MGM virtual identities
[root@mgm etc]# sudo eos vid ls
publicaccesslevel: => 1024
sss:"<pwd>":gid => root
sss:"<pwd>":uid => root
sss:"daemon":gid => root
sss:"daemon":uid => root
sudoer => uids()
tokensudo => always
unix:"vagrant":gid => vagrant
unix:"vagrant":uid => vagrant
[root@mgm etc]#
CTA frontend config
sec.protocol sss -s /etc/cta/eos.sss.keytab
CTA keytab
xrdsssadmin list /etc/cta/eos.sss.keytab
1 32 ... cta_eosdev+ eosdev cta
FST config
sec.protocol sss -s /etc/eos.keytab
fstofs.qdbpassword_file /etc/eos.keytab
FST keytab
xrdsssadmin list /etc/eos.keytab
1 32 ... eosmaster daemon daemon
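To compare the keytabs across hosts without reading the listings by eye, the key lines can be reduced to (key name, user, group) triples. The following is a minimal Python sketch, not part of my setup: `KeyEntry` and `parse_key_listing` are hypothetical names, and it assumes that every key line starts with the numeric key number and ends with key name, user and group, as in the listings quoted in this post.

```python
from typing import List, NamedTuple


class KeyEntry(NamedTuple):
    name: str
    user: str
    group: str


def parse_key_listing(text: str) -> List[KeyEntry]:
    """Extract (name, user, group) from lines like '1 32 ... eosmaster daemon daemon'.

    Assumption: a key line starts with the numeric key number and its last
    three whitespace-separated fields are key name, user and group.
    """
    entries = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, header lines and anything not starting with a key number.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        name, user, group = fields[-3:]
        entries.append(KeyEntry(name, user, group))
    return entries


# Key lines copied from the listings in this post (middle columns elided as "...").
fst_keys = parse_key_listing("1 32 ... eosmaster daemon daemon")
cta_keys = parse_key_listing("1 32 ... cta_eosdev+ eosdev cta")

if __name__ == "__main__":
    print("FST:", fst_keys)
    print("CTA:", cta_keys)
```

Feeding it the saved output of `xrdsssadmin list` from each host gives one list per keytab that can be compared directly.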
Client error
[vagrant@client ~]$ eosmcp /tmp/cta_test.txt /eos/dev/tokenlab/cta_test_$(date +%s).txt
[41B/41B][100%][==================================================][0B/s]
Run: [ERROR] Socket timeout: (destination)
Possible causes
I think I need a combined SSS client keytab on the FST side, because the different communication paths seem to require different credentials:
- MGM → CTA needs the cta_eosdev key
- FST → MGM internal EOS communication needs the normal eosmaster/daemon key
- FST → CTA may also require the cta_eosdev key for the sync::closew path
So it seems likely that the FST-side keytab would need to contain both keys. However, in my setup it currently looks as if only the most recently added key is actually being used, while the other one is ignored.
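To make this hypothesis concrete, here is a small Python sketch that checks, for each FST-originated path, whether the server-side keytab contains the key name the FST would present. It models only one rule, which is my reading of the MGM's "Decryption key not found" message: an sss server can accept credentials only if the presented key is in its own keytab. It does not model which key the client actually picks. The key names are copied from the listings in this post; `FST_KEYS_COMBINED` is a hypothetical combined keytab, and all function names are illustrative.

```python
# Key names per server-side keytab, copied from the xrdsssadmin listings in this post.
SERVER_KEYTABS = {
    "MGM": {"cta_eosdev+"},  # sec.protocol sss ... -s /etc/eos.cta.sss.keytab
    "CTA": {"cta_eosdev+"},  # sec.protocol sss -s /etc/cta/eos.sss.keytab
}

FST_KEYS_CURRENT = ["eosmaster"]                  # /etc/eos.keytab on the FST today
FST_KEYS_COMBINED = ["eosmaster", "cta_eosdev+"]  # hypothetical combined keytab


def server_has_key(server: str, key_name: str) -> bool:
    """True if the server's keytab contains the key name the client presents."""
    return key_name in SERVER_KEYTABS[server]


def check_paths(fst_keys):
    """For each FST -> server path and each FST key, report whether the server knows the key."""
    return {
        (f"FST -> {server}", key): server_has_key(server, key)
        for server in SERVER_KEYTABS
        for key in fst_keys
    }


if __name__ == "__main__":
    for label, keys in (("current", FST_KEYS_CURRENT), ("combined", FST_KEYS_COMBINED)):
        print(f"FST keytab: {label}")
        for (path, key), known in check_paths(keys).items():
            state = "known" if known else "unknown"
            print(f"  {path:10} presenting {key:12} -> {state} to server")
```

Under this model the eosmaster key is unknown to the MGM's `-s` keytab in both cases, so a combined FST keytab alone might not be enough for the FST → MGM path. I have not confirmed that this is how the key lookup really works.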
If anyone has a reference configuration for this MGM/FST/CTA SSS split, that would be very helpful.