Bad response from nameserver when querying namespace

Hello friends,

We have just upgraded to eos5 (previously 4.8.75), running cta 4.7.8-1.
The upgrade worked smoothly, except namespace lookups stopped working:

[root@ctafrontend-0 yum.repos.d]# cta-admin --json tf ls -v A00760 -l | jq .[0]
{
  "af": {
    "archiveId": "561391",
    "storageClass": "single-copy-backup",
    "creationTime": "1637099602",
    "checksum": [
      {
        "type": "ADLER32",
        "value": "bf1ff879"
      }
    ],
    "size": "1050064779"
  },
  "df": {
    "diskId": "909637",
    "diskInstance": "cta",
    "ownerId": {
      "uid": 48,
      "gid": 48
    },
    "path": "Bad response from nameserver"
  },
  "tf": {
    "vid": "A00760",
    "copyNb": 1,
    "blockId": "0",
    "fSeq": "1"
  }
}
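
To gauge how many files on a tape are affected, the JSON above can be filtered on the `df.path` field. This is only a minimal sketch: it assumes the output is a JSON array of entries shaped like the one shown, and the helper name `split_by_path_status` and the embedded sample are illustrative, not part of CTA.

```python
import json

# The literal string cta-admin printed in place of the path (copied from the output above).
BAD_PATH = "Bad response from nameserver"


def split_by_path_status(entries):
    """Split tape-file entries into (resolved, unresolved) by their df.path value."""
    resolved, unresolved = [], []
    for entry in entries:
        path = entry.get("df", {}).get("path", "")
        # Treat a missing/empty path the same as the explicit error string.
        if not path or path == BAD_PATH:
            unresolved.append(entry)
        else:
            resolved.append(entry)
    return resolved, unresolved


# Sample input: the first entry is trimmed from the output above; the second is a
# hypothetical entry with a made-up path, included only to exercise the other branch.
sample = json.loads("""
[
  {"af": {"archiveId": "561391"},
   "df": {"diskId": "909637", "path": "Bad response from nameserver"},
   "tf": {"vid": "A00760", "fSeq": "1"}},
  {"af": {"archiveId": "0"},
   "df": {"diskId": "0", "path": "/eos/cta/hypothetical-file"},
   "tf": {"vid": "A00760", "fSeq": "2"}}
]
""")

resolved, unresolved = split_by_path_status(sample)
print(f"{len(unresolved)} of {len(sample)} entries have no usable path")
for entry in unresolved:
    print("archiveId", entry["af"]["archiveId"], "diskId", entry["df"]["diskId"])
```

For real data, the sample could be replaced with `json.load(sys.stdin)` and the script fed from `cta-admin --json tf ls -v A00760 -l`.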

I have verified that our grpc link is working fine. I can see the vid mapping is taking effect:

client : apache                   := {TOKENREDACTED}@10.54.1.95                 ( grpc) [ 10.54.1.95                               ] { grpc     } 27s idle time

The only thing that really changed is the major eos version upgrade.

I had a quick look at the cta source code, and it appears to throw this error when the path in the response from the mgm is empty: eos_grpc_client/GrpcEndpoint.cpp (v4.7.8-1, cta/CTA on GitLab)

Is it possible eos5 uses a different response format?

To rule out actual DNS issues, I ran tcpdump on all outgoing DNS traffic from both the cta frontend and the mgm, and neither appears to be doing anything unusual.

Will paste all software versions below for completeness:

CTA Frontend:

[root@ctafrontend-0 yum.repos.d]# rpm -qa | egrep "eos|xrootd|cta"
xrootd-server-4.12.4-1.el7.x86_64
libmicrohttpd-0.9.38-eos.yves.el7.cern.x86_64
xrootd-4.12.4-1.el7.x86_64
xrootd-client-4.12.4-1.el7.x86_64
xrootd-debuginfo-4.12.4-1.el7.x86_64
cta-lib-4.7.8-1.el7_9.x86_64
cta-common-4.7.8-1.el7_9.x86_64
eos-xrootd-4.12.8-1.el7.cern.x86_64
eos-folly-2019.11.11.00-1.el7.cern.x86_64
eos-protobuf3-3.5.1-5.el7.cern.eos.x86_64
cta-systemtests-4.7.8-1.el7_9.x86_64
cta-release-4.7.8-1.el7_9.x86_64
cta-frontend-4.7.8-1.el7_9.x86_64
cta-readtp-4.7.8-1.el7_9.x86_64
cta-tape-label-4.7.8-1.el7_9.x86_64
cta-systemtest-helpers-4.7.8-1.el7_9.x86_64
cta-cli-4.7.8-1.el7_9.x86_64
cta-smc-4.7.8-1.el7_9.x86_64
eos-test-4.8.87-1.el7.cern.x86_64
xrootd-client-libs-4.12.4-1.el7.x86_64
xrootd-server-libs-4.12.4-1.el7.x86_64
xrootd-selinux-4.12.4-1.el7.noarch
cta-lib-common-4.7.8-1.el7_9.x86_64
cta-lib-catalogue-4.7.8-1.el7_9.x86_64
eos-folly-deps-2019.11.11.00-1.el7.cern.x86_64
cta-taped-4.7.8-1.el7_9.x86_64
cta-fst-gcd-4.7.8-1.el7_9.x86_64
cta-rmcd-4.7.8-1.el7_9.x86_64
cta-frontend-grpc-4.7.8-1.el7_9.x86_64
cta-objectstore-tools-4.7.8-1.el7_9.x86_64
cta-catalogueutils-4.7.8-1.el7_9.x86_64
cta-immutable-file-test-4.7.8-1.el7_9.x86_64
cta-debuginfo-4.7.8-1.el7_9.x86_64
eos-client-4.8.87-1.el7.cern.x86_64
eos-server-4.8.87-1.el7.cern.x86_64
xrootd-libs-4.12.4-1.el7.x86_64

MGM:

[root@eos-mgm-0 /]# rpm -qa | egrep "eos|xrootd"
eos-xrootd-debuginfo-5.5.5-1.el7.cern.x86_64
eos-test-5.1.9-1.el7.cern.x86_64
eos-protobuf3-3.17.3-1.el7.cern.eos.x86_64
xrootd-client-libs-5.5.1-1.el7.x86_64
eos-grpc-1.41.0-1.el7.x86_64
eos-folly-deps-2019.11.11.00-1.el7.cern.x86_64
eos-client-5.1.8-1.el7.cern.x86_64
eos-server-5.1.8-1.el7.cern.x86_64
xrootd-client-5.5.1-1.el7.x86_64
xrootd-libs-5.5.1-1.el7.x86_64
eos-grpc-devel-1.41.0-1.el7.x86_64
eos-folly-2019.11.11.00-1.el7.cern.x86_64
eos-xrootd-5.5.5-1.el7.cern.x86_64
eos-libmicrohttpd-0.9.38-eos.el7.cern.x86_64
eos-ns-inspect-5.1.8-1.el7.cern.x86_64
eos-librichacl-1.12-14.el7.cern.x86_64

Thank you as always :slight_smile:

Denis

By the way, I can do this from the frontend server without any issues:

[root@ctafrontend-0 yum.repos.d]# eos-grpc-md --endpoint eos-mgm-0.mgm.cta.svc.cluster.archive:50051 --token {TOKENREDACTED} -l /eos/
{
 "type": "CONTAINER",
 "cmd": {
  "id": "2",
  "parentId": "1",
  "uid": "0",
  "gid": "0",
  "treeSize": "5152810716748769",
  "mode": 16893,
  "flags": 0,
  "name": "ZW9z",
  "ctime": {
   "sec": "1634011886",
   "nSec": "919833620"
  },
  "mtime": {
   "sec": "0",
   "nSec": "140145143686310"
  },
  "stime": {
   "sec": "0",
   "nSec": "140145143686310"
  },
  "xattrs": {
   "sys.forced.checksum": "YWRsZXI=",
   "sys.forced.layout": "cmVwbGljYQ=="
  },
  "path": "L2Vvcy8=",
  "etag": "2:0.140145143"
 }
}

{
 "type": "CONTAINER",
 "cmd": {
  "id": "3",
  "parentId": "2",
  "uid": "0",
  "gid": "0",
  "treeSize": "5152810716748769",
  "mode": 16893,
  "flags": 0,
  "name": "Y3Rh",
  "ctime": {
   "sec": "1634011886",
   "nSec": "919965798"
  },
  "mtime": {
   "sec": "1675126965",
   "nSec": "966450275"
  },
  "stime": {
   "sec": "1675126965",
   "nSec": "966450275"
  },
  "xattrs": {
   "sys.forced.checksum": "YWRsZXI=",
   "sys.forced.layout": "cmVwbGljYQ=="
  },
  "path": "L2Vvcy9jdGEv",
  "etag": "3:1675126965.966"
 }
}

request took 3103 micro seconds
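
The `name`, `path` and `xattrs` values in this output are base64 strings, which is how protobuf `bytes` fields are rendered in JSON. A small sketch for decoding them, to confirm the MGM does return a non-empty path over gRPC here (the helper name `decode_bytes_field` is illustrative; the values are copied from the output above):

```python
import base64


def decode_bytes_field(value: str) -> str:
    """Decode a base64-encoded bytes field from the JSON output into text."""
    return base64.b64decode(value).decode("utf-8")


# Values copied verbatim from the eos-grpc-md output above.
fields = {
    "name (id 2)": "ZW9z",
    "path (id 2)": "L2Vvcy8=",
    "name (id 3)": "Y3Rh",
    "path (id 3)": "L2Vvcy9jdGEv",
    "sys.forced.checksum": "YWRsZXI=",
    "sys.forced.layout": "cmVwbGljYQ==",
}

decoded = {key: decode_bytes_field(value) for key, value in fields.items()}
for key, text in decoded.items():
    print(f"{key}: {text}")
# The two path fields decode to /eos/ and /eos/cta/.
```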

Hi Denis, we are trying to reproduce the behaviour now :slight_smile:

Hi Lasse,

Thank you very much :slight_smile:

Warm Regards,

Denis

Hi Denis,

We have now reproduced the problem and are looking at possible fixes.

Hello Lasse,

Thank you very much again for the update and for looking into the issue :slight_smile:

Warm Regards,

Denis

Hi Denis,

To summarize the impact of this problem, the tools currently affected are:

  • cta-admin tf ls -l
  • cta-eos-namespace-inject
  • cta-restore-deleted-files
  • cta-change-storage-class

These all use gRPC for some of the communication with EOS.

Hello Denis,

How important is this fix for you for the moment?

We have reproduced it, and are looking into it, but it may take a few days to sort it out :slight_smile:

Hi Joao,

Thank you very much for asking. We don’t currently heavily rely on this function, so on a scale of 1-10, maybe a 1? I just wanted to report it more than anything.

Warm Regards,

Denis

Hi Denis,

Ok! Then, thank you for reporting this :slight_smile:
We will write here once we have more details.

Best regards,
Joao