MGM and cta-frontend

We have installed the system we’ll use for some scale and operations testing. However, data is not moving from EOS to tape. One thing that occurs to me: the instructions we followed seem to assume that the CTA frontend and the EOS MGM are on the same machine. Is this a requirement? If so, is it possible to have multiple frontends, one to handle cta-admin connections and another for the EOS workflow?

CTA frontend and EOS MGM do not have to be on the same machine: in CTA CI the frontend runs in a dedicated pod distinct from the EOS pod.

You can launch as many CTA frontend services as you need.
A setup with two CTA frontends is a good starting point:

  • one frontend as a target for your EOS tape WFEs
  • one frontend for operators as a target for cta-admin commands (aka ctafrontendops)

A third, optional one could be deployed for debugging purposes: attaching gdb to either of the other two frontends is disruptive for production tape traffic. This third one can also serve as a cold standby for your primary CTA frontend.

A virtual machine is enough to run the cta-frontend service: it uses very little network and disk and should not need much memory either.

TL;DR I think we may have missed a part of the EOS setup telling it how to connect the workflow engine to CTA. We’re checking. Here’s a dump of what we have so far in case you see any problems once we verify that EOS workflow is configured correctly. I think we might have missed:

Configure the CTA Frontend endpoint and resources:

mgmofs.protowfendpoint ctafrontend.cern.ch:10955
mgmofs.protowfresource /ctafrontend

Ok. That’s great. Then we seem to have something missing in the CTA/EOS setup. Let me paste a few things here and then comment at the bottom.

> eos ls /eos/ctaeos/ctacms/functional_test/dec22b/
minikube-linux-amd64
minikube-linux-amd64.test

> eos attr ls /eos/ctaeos/ctacms/functional_test/dec22b/
sys.archive.storage_class="cms.cms11@cta"
sys.attr.link="/eos/cta/proc/cta/workflow"
sys.eos.btime="1702071109.991372727"
sys.forced.blocksize="4k"
sys.forced.checksum="adler"
sys.forced.layout="replica"
sys.forced.nstripes="2"
sys.forced.space="default"
sys.link.workflow.sync::abort_prepare.default="proto"
sys.link.workflow.sync::archive_failed.default="proto"
sys.link.workflow.sync::archived.default="proto"
sys.link.workflow.sync::closew.default="proto"
sys.link.workflow.sync::closew.retrieve_written="proto"
sys.link.workflow.sync::create.default="proto"
sys.link.workflow.sync::delete.default="proto"
sys.link.workflow.sync::evict_prepare.default="proto"
sys.link.workflow.sync::prepare.default="proto"
sys.link.workflow.sync::retrieve_failed.default="proto"

> eos info /eos/ctaeos/ctacms/functional_test/dec22b/minikube-linux-amd64.test
  File: '/eos/ctaeos/ctacms/functional_test/dec22b/minikube-linux-amd64.test'  Flags: 0640
  Size: 93670363
Status: healthy
Modify: Fri Dec 22 10:56:44 2023 Timestamp: 1703264204.502785000
Change: Fri Dec 22 10:56:44 2023 Timestamp: 1703264204.191418690
Access: Wed Dec 31 18:00:00 1969 Timestamp: 0.000000000
 Birth: Fri Dec 22 10:56:44 2023 Timestamp: 1703264204.191418690
  CUid: 0 CGid: 0 Fxid: 0000000a Fid: 10 Pid: 29 Pxid: 0000001d
XStype: adler    XS: ca ac 00 e4    ETAGs: "2684354560:caac00e4"
Layout: replica Stripes: 2 Blocksize: 4k LayoutId: 00100112 Redundancy: d2::t0 
  #Rep: 2
┌───┬──────┬─────────────────────────┬────────────────┬────────────────┬──────────┬──────────────┬────────────┬────────┬────────────────────────┐
│no.│ fs-id│                     host│      schedgroup│            path│      boot│  configstatus│       drain│  active│                  geotag│
└───┴──────┴─────────────────────────┴────────────────┴────────────────┴──────────┴──────────────┴────────────┴────────┴────────────────────────┘
 0     1002 cmscta-eosfst-01.fnal.gov        default.0   /storage/data2     booted             rw      nodrain   online             fcc::2::1388 
 1     1003 cmscta-eosfst-02.fnal.gov        default.0   /storage/data1     booted             rw      nodrain   online             fcc::2::1388 

*******

> eos ls /eos/cta/proc/cta/workflow

[no content]

The storage class cms.cms11@cta is registered in CTA. There is no indication from CTA that it notices files arriving in this directory.

Hi Eric and Happy New Year!

Do you have mgmofs.tapeenabled true in your configuration? None of the tape features work if that is not enabled.
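
A quick way to rule out a missing directive is to scan the MGM configuration for the three settings mentioned in this thread. The following is only a minimal sketch: the function names are made up for illustration, and the config path in the usage comment (/etc/xrd.cf.mgm) is an assumption that may differ on your installation.

```python
# Directives named in this thread; their values are site-specific.
REQUIRED = (
    "mgmofs.tapeenabled",
    "mgmofs.protowfendpoint",
    "mgmofs.protowfresource",
)


def parse_directives(config_text):
    """Return {directive: value} for every mgmofs.* line, ignoring '#' comments."""
    found = {}
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0].startswith("mgmofs."):
            found[parts[0]] = parts[1].strip()
    return found


def check_tape_config(config_text):
    """Return a list of problems; an empty list means none were spotted."""
    found = parse_directives(config_text)
    problems = [f"missing directive: {name}" for name in REQUIRED if name not in found]
    # A missing tapeenabled line is already reported above, so default to "true" here.
    if found.get("mgmofs.tapeenabled", "true") != "true":
        problems.append("mgmofs.tapeenabled is not 'true'")
    return problems


# Hypothetical usage (the path is an assumption):
#   with open("/etc/xrd.cf.mgm") as handle:
#       for problem in check_tape_config(handle.read()):
#           print(problem)
```

This only checks that the lines are present; it says nothing about whether the endpoint is reachable or whether authentication between EOS and CTA is set up correctly.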

Another common misconfiguration is the authentication between EOS and CTA.

The easiest way to pin down the cause of the problem is to copy a file to EOS and then grep for the filename in the various logs. On the EOS server:

/var/eos/report/2024/01/20240108.eosreport
/var/log/eos/mgm/xrdlog.mgm
/var/log/eos/mgm/WFE.log

On the CTA server:

/var/log/cta/cta-frontend.log
/var/log/cta/cta-frontend-xrootd.log

Once you can pinpoint where the file is getting stuck, we can give more specific advice.
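
If you prefer a script to running grep by hand, the search over those logs could be sketched as below. The helper name find_in_logs is hypothetical, and the script has to run on the host that holds the logs (EOS server or CTA server) with permission to read them.

```python
def find_in_logs(filename, log_paths):
    """Return {log_path: [matching lines]}; a log that cannot be read maps to None."""
    hits = {}
    for path in log_paths:
        try:
            with open(path, errors="replace") as handle:
                hits[str(path)] = [
                    line.rstrip("\n") for line in handle if filename in line
                ]
        except OSError:
            # Missing file or insufficient permissions: record it instead of failing.
            hits[str(path)] = None
    return hits


# Hypothetical usage on the EOS server, with the log paths listed above:
#   logs = [
#       "/var/eos/report/2024/01/20240108.eosreport",
#       "/var/log/eos/mgm/xrdlog.mgm",
#       "/var/log/eos/mgm/WFE.log",
#   ]
#   for path, lines in find_in_logs("minikube-linux-amd64.test", logs).items():
#       print(path, "unreadable" if lines is None else f"{len(lines)} matching lines")
```

The last log in which the filename still appears is usually the place where the file got stuck.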

Cheers,

Michael

Thanks, we did figure that out. Now I’m at the point where things seem to be very close.

The first issue is that we are using EOS 5.1.29, which I gather has a problem, so we are going to revert.

The second is that I was writing files into EOS in such a way that they are owned by root. It seems like CTA has an explicit protection against that, but it also seems that the file gets written to tape BEFORE the exception is thrown. Am I correct? Would that cause a problem with the contents of the tapes and the DB getting out of sync?

I hope, by the end of the day, to have passed our functional test.

Thanks for the list of logs to look at. The EOS report is not something I was aware of.

Hi Eric,

Yes, please downgrade to EOS 5.1.28 (this is what we are currently using in production). We expect to upgrade to EOS 5.2 shortly.

CTA does have an explicit protection against files owned by root. You are right, this check is not made in the Frontend, but it would be trivial to add it. Could you please post the error message you see, so that I can reference where in the code this check is made? Thanks.

At the bottom is what I see from the tape server log. So this looks to me like things were written to tape and at the last minute it decides it should not have been written.

Conversely, when I look at cta-admin tape ls for that tape, this is part of what I get:

  {
    "vid": "FL0378",
    "mediaType": "LTO9",
    "vendor": "Unknown",
    "logicalLibrary": "TS4500G1_CTACMS",
    "tapepool": "cms.cms11",
    "vo": "cms",
    "encryptionKeyName": "-",
    "capacity": "18000000000000",
    "occupancy": "0",
    "lastFseq": "0",
    "full": false,
    "fromCastor": false,
    "readMountCount": "0",
    "writeMountCount": "9",

Tape server log:

Jan 5 17:34:04.013539 gmv18017 cta-taped: LVL="**INFO**" PID="30037" TID="30095" MSG="File successfully transmitted to drive" thread="TapeWrite" tapeDrive="LTO8D0" tapeVid="FL0378" mountId="234" vo="cms" mediaType="LTO9" tapePool="cms.cms11" logicalLibrary="TS4500G1_CTACMS" mountType="ArchiveForUser" vendor="Unknown" capacityInBytes="18000000000000" fileId="4294967328" fileSize="1073741824" fSeq="4" diskURL="root://cmscta-eosmgm-01.fnal.gov//eos/ctaeos/ctacms/functional_test/jan05/1G.4?eos.lfn=fxid:5c" readWriteTime="3.735466" checksumingTime="1.353655" waitDataTime="0.003148" waitReportingTime="0.000306" transferTime="5.092575" totalTime="5.092538" dataVolume="1073741824" headerVolume="480" driveTransferSpeedMBps="210.846204" payloadTransferSpeedMBps="210.846109" reconciliationTime="0" LBPMode="LBP_On"

Jan 5 17:34:08.582907 gmv18017 cta-taped: LVL="**INFO**" PID="30037" TID="30095" MSG="File successfully transmitted to drive" thread="TapeWrite" tapeDrive="LTO8D0" tapeVid="FL0378" mountId="234" vo="cms" mediaType="LTO9" tapePool="cms.cms11" logicalLibrary="TS4500G1_CTACMS" mountType="ArchiveForUser" vendor="Unknown" capacityInBytes="18000000000000" fileId="4294967329" fileSize="1073741824" fSeq="5" diskURL="root://cmscta-eosmgm-01.fnal.gov//eos/ctaeos/ctacms/functional_test/jan05/1G.?eos.lfn=fxid:5d" readWriteTime="3.404474" checksumingTime="1.161382" waitDataTime="0.002712" waitReportingTime="0.000301" transferTime="4.568869" totalTime="4.568828" dataVolume="1073741824" headerVolume="480" driveTransferSpeedMBps="235.014823" payloadTransferSpeedMBps="235.014718" reconciliationTime="0" LBPMode="LBP_On"

Jan 5 17:34:13.203443 gmv18017 cta-taped: LVL="**INFO**" PID="30037" TID="30095" MSG="File successfully transmitted to drive" thread="TapeWrite" tapeDrive="LTO8D0" tapeVid="FL0378" mountId="234" vo="cms" mediaType="LTO9" tapePool="cms.cms11" logicalLibrary="TS4500G1_CTACMS" mountType="ArchiveForUser" vendor="Unknown" capacityInBytes="18000000000000" fileId="4294967330" fileSize="1073741824" fSeq="6" diskURL="root://cmscta-eosmgm-01.fnal.gov//eos/ctaeos/ctacms/functional_test/jan05/1G.5?eos.lfn=fxid:5e" readWriteTime="3.436824" checksumingTime="1.180283" waitDataTime="0.002759" waitReportingTime="0.000259" transferTime="4.620125" totalTime="4.620089" dataVolume="1073741824" headerVolume="480" driveTransferSpeedMBps="232.407277" payloadTransferSpeedMBps="232.407173" reconciliationTime="0" LBPMode="LBP_On"

Jan 5 17:34:15.398118 gmv18017 cta-taped: LVL="**INFO**" PID="30037" TID="30095" MSG="No more data to write on tape, unconditional flushing to the client" thread="TapeWrite" tapeDrive="LTO8D0" tapeVid="FL0378" mountId="234" vo="cms" mediaType="LTO9" tapePool="cms.cms11" logicalLibrary="TS4500G1_CTACMS" mountType="ArchiveForUser" vendor="Unknown" capacityInBytes="18000000000000" files="6" bytes="6442450944" flushTime="2.194307"

Jan 5 17:34:15.400413 gmv18017 cta-taped: LVL="**INFO**" PID="30037" TID="30096" MSG="In cta::ArchiveMount::reportJobsBatchTransferred(), archive job successful." thread="MainThread" tapeDrive="LTO8D0" tapeVid="FL0378" mountId="234" mountType="ARCHIVE_FOR_USER" fileId="4294967325" type="ReportSuccessful"

Jan 5 17:34:15.400905 gmv18017 cta-taped: LVL="**INFO**" PID="30037" TID="30096" MSG="In cta::ArchiveMount::reportJobsBatchTransferred(), archive job successful." thread="MainThread" tapeDrive="LTO8D0" mountId="234" tapeVid="FL0378" mountType="ARCHIVE_FOR_USER" fileId="4294967326" type="ReportSuccessful"

Jan 5 17:34:15.401101 gmv18017 cta-taped: LVL="**INFO**" PID="30037" TID="30096" MSG="In cta::ArchiveMount::reportJobsBatchTransferred(), archive job successful." thread="MainThread" tapeDrive="LTO8D0" mountId="234" tapeVid="FL0378" mountType="ARCHIVE_FOR_USER" fileId="4294967327" type="ReportSuccessful"

Jan 5 17:34:15.401253 gmv18017 cta-taped: LVL="**INFO**" PID="30037" TID="30096" MSG="In cta::ArchiveMount::reportJobsBatchTransferred(), archive job successful." thread="MainThread" tapeDrive="LTO8D0" mountId="234" tapeVid="FL0378" mountType="ARCHIVE_FOR_USER" fileId="4294967328" type="ReportSuccessful"

Jan 5 17:34:15.401387 gmv18017 cta-taped: LVL="**INFO**" PID="30037" TID="30096" MSG="In cta::ArchiveMount::reportJobsBatchTransferred(), archive job successful." thread="MainThread" tapeDrive="LTO8D0" mountId="234" tapeVid="FL0378" mountType="ARCHIVE_FOR_USER" fileId="4294967329" type="ReportSuccessful"

Jan 5 17:34:15.401517 gmv18017 cta-taped: LVL="**INFO**" PID="30037" TID="30096" MSG="In cta::ArchiveMount::reportJobsBatchTransferred(), archive job successful." thread="MainThread" tapeDrive="LTO8D0" mountId="234" tapeVid="FL0378" mountType="ARCHIVE_FOR_USER" fileId="4294967330" type="ReportSuccessful"

Jan 5 17:34:15.416142 gmv18017 cta-taped: LVL="**INFO**" PID="30037" TID="30095" MSG="Logging mount general statistics" thread="TapeWrite" tapeDrive="LTO8D0" tapeVid="FL0378" mountId="234" driveManufacturer="IBM" driveType="ULT3580-TD8" firmwareVersion="P380" serialNumber="00078D2BA4" mountTotalNonMediumErrorCounts="0"

Jan 5 17:34:15.425767 gmv18017 cta-taped: LVL="**ERROR**" PID="30037" TID="30096" MSG="In ArchiveMount::reportJobsBatchWritten(): got an exception" thread="MainThread" tapeDrive="LTO8D0" mountId="234" exceptionMessageValue="filesWrittenToTape: filesWrittenToTape failed: TapeFileWrittenEvent is invalid: diskFileOwnerUid is 0"
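
Log lines like the ones above are a sequence of KEY="value" pairs, so they are easy to filter programmatically. This is a rough sketch, not a CTA tool: the function names are invented, it assumes values never contain an escaped double quote, and it strips the asterisks around the level because they appear that way in the paste above.

```python
import re

# One KEY="value" pair; values are assumed not to contain double quotes.
FIELD = re.compile(r'(\w+)="([^"]*)"')


def parse_taped_line(line):
    """Split one cta-taped log line into a dict of its KEY="value" fields."""
    fields = dict(FIELD.findall(line))
    if "LVL" in fields:
        # The level is pasted as "**ERROR**" in this thread; drop the asterisks.
        fields["LVL"] = fields["LVL"].strip("*")
    return fields


def errors_only(lines):
    """Return the parsed fields of every line whose level is ERROR."""
    parsed = (parse_taped_line(line) for line in lines)
    return [fields for fields in parsed if fields.get("LVL") == "ERROR"]


# Hypothetical usage:
#   for err in errors_only(open("cta-taped.log")):
#       print(err.get("mountId"), err.get("exceptionMessageValue"))
```

Applied to the excerpt above, only the final line is returned, with mountId 234 and the "diskFileOwnerUid is 0" message.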

Thanks, I have created issue #562 to fix this, by detecting owner_uid=0 in the Frontend, before the file is queued for archival.
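
Until that fix is available, a client-side pre-check can catch root-owned files before they are archived. The sketch below parses the text printed by eos info (the output earlier in this thread shows CUid: 0 for the test file). It is an illustration only and is not how issue #562 is implemented; the function names are hypothetical, and treating the CUid field as the owner that CTA checks is an assumption.

```python
import re

# Matches the "CUid: <number>" field in the text printed by `eos info`.
CUID = re.compile(r"\bCUid:\s*(\d+)")


def owner_uid(eos_info_text):
    """Return the CUid from `eos info` output as an int, or None if it is absent."""
    match = CUID.search(eos_info_text)
    return int(match.group(1)) if match else None


def is_root_owned(eos_info_text):
    """True if the file is owned by uid 0, the case rejected in the log above."""
    return owner_uid(eos_info_text) == 0


# Hypothetical usage, feeding in the saved output of `eos info <path>`:
#   if is_root_owned(info_text):
#       print("file is owned by root; change the owner before archiving")
```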