Bandwidth not showing in "cta sq" and "cta --json dr ls"

Hello team,

We are building some monitoring scripts, and I just noticed that when I run cta-admin sq or cta-admin --json sq, the MB/s column in the human-readable table and the bytesPerSecond field in the JSON output are always zero.

# cta-admin sq
          type tapepool    vo library vid files queued data queued oldest youngest priority min age read max drives write max drives cur. mounts cur. files cur. data MB/s tapes capacity files on tapes data on tapes full tapes writable tapes
ArchiveForUser    dteam dteam       -   -           39      399.4G   2751      219        1      60               1                1           1         46    471.0G    0          24.0T            195        909.2G          1              1
# cta-admin --json sq | jq
[
  {
    "mountType": "ARCHIVE_FOR_USER",
    "tapepool": "dteam",
    "logicalLibrary": "-",
    "vid": "-",
    "queuedFiles": "39",
    "queuedBytes": "399360000000",
    "oldestAge": "2759",
    "priority": "1",
    "minAge": "60",
    "curMounts": "1",
    "curFiles": "46",
    "curBytes": "471040000000",
    "bytesPerSecond": "0",
    "tapesCapacity": "24000000000000",
    "tapesFiles": "195",
    "tapesBytes": "909233524568",
    "fullTapes": "1",
    "writableTapes": "1",
    "sleepingForSpace": false,
    "sleepStartTime": "0",
    "diskSystemSleptFor": "",
    "vo": "dteam",
    "readMaxDrives": "1",
    "writeMaxDrives": "1",
    "mountPolicies": [
      "ctamp"
    ],
    "youngestAge": "227",
    "highestPriorityMountPolicy": "ctamp",
    "lowestRequestAgeMountPolicy": "ctamp"
  }
]

Likewise, cta-admin dr ls shows a value in the MB/s column, but in the JSON output latestBandwidth is zero:

[root@ctamon01 ~]# cta-admin dr ls
library drive      host desired        request   status since    vid tapepool    vo files   data MB/s session priority activity age reason
    cta   257 ctatps001      Up ArchiveForUser Transfer  6047 V03650    dteam dteam    47 481.3G 79.1     195        0        -  14 -
    cta   258 ctatps002      Up              -     Free  5352      -        -     -     -      -    -       -        0        -   5 -
[root@ctamon01 ~]# cta --json dr ls | jq -r '.[] | del(.driveConfig) | select(.vo=="dteam" and .tapepool=="dteam")'
{
  "logicalLibrary": "cta",
  "driveName": "257",
  "host": "ctatps001",
  "desiredDriveState": "UP",
  "mountType": "ARCHIVE_FOR_USER",
  "driveStatus": "TRANSFERRING",
  "driveStatusSince": "6099",
  "vid": "V03650",
  "tapepool": "dteam",
  "filesTransferredInSession": "47",
  "bytesTransferredInSession": "481280000000",
  "latestBandwidth": "0",
  "sessionId": "195",
  "timeSinceLastUpdate": "6",
  "currentPriority": "0",
  "currentActivity": "",
  "ctaVersion": "5.7.14-1",
  "devFileName": "/dev/nst0",
  "rawLibrarySlot": "smc1",
  "comment": "",
  "reason": "",
  "vo": "dteam",
  "diskSystemName": "",
  "reservedBytes": "0",
  "sessionElapsedTime": "6144",
  "logicalLibraryDisabled": false
}

I guess this is a bug; it would be much appreciated if it could be fixed in a future version.

Thank you!

Hello Jordi,

This seems to be a bug indeed, and it’s also affecting us.

Fortunately, it seems to be easy to fix.

I have created a dev issue.

Thank you,
Joao

Jordi,

if I may add something, I would avoid basing long-term CTA monitoring on the MB/s value from the output of dr ls. That value is a moving number, simply recalculated by dividing the amount of data transferred by the number of seconds. It was implemented just so that operators could see that the drive is actually doing something …

Instead, you should look in the cta-taped log files on all tape servers for the message "Tape session finished" and extract the session statistics from there. That is what we use for long-term statistics.

Best regards,

Vladimir

Just to reinforce what Vladimir said: this instantaneous bandwidth number is not meant to be consumed by monitoring…

You can sample bytesTransferredInSession and derive the bandwidth to get some nice graphs.
But really, forget about latestBandwidth: after so many years in tape performance monitoring, I still have no idea what it was trying to display or what it means…
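
As a rough illustration of that sampling approach, here is a minimal Bash sketch that turns two samples of bytesTransferredInSession into a rate. The function name rate_bps and its argument order are made up for this example and are not part of CTA. It assumes the counter restarts with each tape session, as its name suggests, so a counter that goes backwards is treated as a new session rather than a negative rate.

```shell
#!/usr/bin/env bash
# Sketch only: rate_bps is a hypothetical helper, not a CTA command.

# rate_bps PREV_BYTES PREV_EPOCH CUR_BYTES CUR_EPOCH
# Prints the average rate in bytes per second between two samples.
# Prints 0 when no time has elapsed, or when the counter went backwards
# (assumed here to mean that a new session started between the samples).
rate_bps() {
  local prev_bytes=$1 prev_ts=$2 cur_bytes=$3 cur_ts=$4
  local dt=$(( cur_ts - prev_ts ))
  local db=$(( cur_bytes - prev_bytes ))
  if (( dt <= 0 || db < 0 )); then
    echo 0
    return
  fi
  # Integer division: the result is truncated to whole bytes per second.
  echo $(( db / dt ))
}

# Illustrative values: 6 GB transferred in 60 seconds.
rate_bps 1000000000 100 7000000000 160   # prints 100000000
```

The two samples would come from successive runs of cta-admin --json dr ls, storing bytesTransferredInSession per driveName together with the time of the run. Comparing sessionId between the two samples is probably a more robust way to detect a new session than checking for a decreasing counter.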

Best regards,
Julien Leduc

Hi Joao, Vladimir and Julien,

Thanks for your answers. I wanted to use the JSON output to build our monitoring because I have read several times here in the forum, in posts from the CTA team, that it is the best option for building a monitoring system.

I am not keen on the option of checking the logs once the sessions have finished, because we want to know the bandwidth used every minute (we send metrics every minute), and I am not sure the log-derived metrics would be aligned to the same time ranges.

I will try to work out a way to get the bandwidth, even if the values are approximate, by following Julien's suggestion and deriving it from bytesTransferredInSession.

Honestly, we like the JSON output because it makes gathering metrics for Graphite much easier, and we will try as far as possible to use it as our only data source.

Just for information, the monitoring we are putting in place at the moment is based on Bash scripts that query the cta-admin CLI with JSON output and send the metrics to a Graphite backend, which we then plot with Grafana.
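
To make that pipeline a bit more concrete, here is a small sketch of the last step: formatting one data point for Graphite's plaintext protocol, which expects "metric.path value timestamp" on a single line. The function name graphite_line and the metric path used below are placeholders for illustration, not an established naming scheme.

```shell
#!/usr/bin/env bash
# Sketch only: graphite_line and the metric path are hypothetical.

# graphite_line PATH VALUE [EPOCH]
# Prints one data point in Graphite plaintext format.
# EPOCH defaults to the current time when omitted.
graphite_line() {
  local path=$1 value=$2 ts=${3:-$(date +%s)}
  printf '%s %s %s\n' "$path" "$value" "$ts"
}

# Example with a fixed timestamp so the output is reproducible.
graphite_line cta.drive.257.bytes_per_second 100000000 1700000000
# prints: cta.drive.257.bytes_per_second 100000000 1700000000
```

The resulting lines can be written to the Carbon plaintext listener (port 2003 by default) with whatever tool the site already uses for that, for example netcat. Passing the sampling time explicitly, rather than relying on the default, keeps every metric from one cta-admin run on the same timestamp.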

Thank you so much for your answers. I will surely have more questions!

Jordi

Hi Jordi,

We have decided to remove MB/s (table output) and bytesPerSecond (JSON output) from the cta-admin [--json] sq command. It does not make much sense for us to keep this metric in the sq command, since it is mostly influenced by the tape drive and not by the queue itself.

Instead, as Julien mentioned, the best option is to monitor the bytesTransferredInSession value of the cta --json dr ls command.

You are correct that checking the --json output is the way to go when monitoring any of the data provided by cta-admin. It’s just that this specific field bytesPerSecond has been broken for a while…

Cheers,
Joao

Hi Joao,

Yes, I followed your first suggestion and got some values by taking the derivative of bytesTransferredInSession, converting it to a per-second rate, and then summarizing and averaging to get what I think should be a reliable number, but we will have to check with people here to confirm.
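
For anyone wanting to do the averaging step on the script side, a minimal sketch could look like the following. This is only a guess at the general idea, not the exact processing described above; mean_rate is a hypothetical helper name, and Graphite's own functions can apply the same kind of smoothing at query time instead.

```shell
#!/usr/bin/env bash
# Sketch only: mean_rate is a hypothetical helper.

# mean_rate RATE...
# Prints the integer mean of the given per-interval rates (bytes per second),
# or 0 when called without arguments.
mean_rate() {
  local sum=0 n=0 r
  for r in "$@"; do
    sum=$(( sum + r ))
    n=$(( n + 1 ))
  done
  if (( n == 0 )); then
    echo 0
  else
    # Integer division: the mean is truncated to whole bytes per second.
    echo $(( sum / n ))
  fi
}

# Illustrative values: three one-minute rates averaging 100 MB/s.
mean_rate 90000000 100000000 110000000   # prints 100000000
```

Averaging over a few intervals smooths out samples that happen to fall during a mount, a position change, or a session change, at the cost of reacting more slowly to real changes in throughput.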

I’m glad that at least my first question was useful to identify values that were not working :)

Thanks again for your help.

Jordi
