Hello everyone! I wonder if anyone else has run into this issue with the CTA frontend xrootd binary crashing. I am not 100% sure of the best way to collect debug data / diagnostics, so please let me know if you need more logs / info.
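In case it helps, this is what I was planning to set up so that the next crash leaves a core file to debug. A rough sketch; the core_pattern path is just an example from my setup:

# On the container host (kernel.core_pattern is host-wide, so it also applies inside containers)
[root@crlt-v4 ~]# sysctl -w kernel.core_pattern=/var/crash/core.%e.%p

# Inside the frontend container, before xrootd starts: lift the core file size limit
[root@ctafrontend-0 /]# ulimit -c unlimited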
We’ve got CTA running in production, and the archive queue is generally fairly short. I have recently been testing retrieves (restic restores), and I found that when the retrieve queue got fairly large (around 2000 requests), the frontend container crashed 5 times in a row (the container restarts upon crash) before recovering. It could be unrelated to the retrieve testing, but it seemed worth mentioning; the frontend container had been 100% stable up until that point.
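For reference, this is how I have been watching the queue depth from the frontend (assuming cta-admin showqueues is the right tool for this):

# Inside the frontend container: summary of the archive and retrieve queues
[root@ctafrontend-0 /]# cta-admin showqueues    # short form: cta-admin sq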
Here is the error from the container host syslog:
Nov 11 02:16:02 crlt-v4 kernel: traps: xrootd[92282] general protection fault ip:7f9eab6ccbc7 sp:7f9e7a9fb9f0 error:0 in libc-2.17.so[7f9eab695000+1c4000]
Nov 11 02:17:01 crlt-v4 kernel: traps: xrootd[153458] general protection fault ip:7f0ae675fbc7 sp:7f0ab9bfc9f0 error:0 in libc-2.17.so[7f0ae6728000+1c4000]
Nov 11 02:18:01 crlt-v4 kernel: traps: xrootd[235328] general protection fault ip:7f7ed0298bc7 sp:7f7eac9fb9f0 error:0 in libc-2.17.so[7f7ed0261000+1c4000]
Nov 11 02:19:02 crlt-v4 kernel: traps: xrootd[101964] general protection fault ip:7f2167962bc7 sp:7f21445f49f0 error:0 in libc-2.17.so[7f216792b000+1c4000]
Nov 11 02:21:02 crlt-v4 kernel: traps: xrootd[178779] general protection fault ip:7f79e54cebc7 sp:7f79b9df49f0 error:0 in libc-2.17.so[7f79e5497000+1c4000]
There are entries in between the above, but they simply show the container crashing and restarting, so I am not sure they are relevant.
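One thing I did notice: the faulting ip minus the libc mapping base is the same offset in every crash (e.g. 0x7f9eab6ccbc7 - 0x7f9eab695000 = 0x37bc7), so it is always dying at the same spot in libc. I was going to try resolving that offset to a symbol like this (a sketch; I have not verified this works on this host):

# Resolve the file-relative offset against libc's symbol table
[root@ctafrontend-0 /]# gdb -batch -ex 'info symbol 0x37bc7' /lib64/libc.so.6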
I also have a brief snippet from the container logs (stderr), right before the last crash, in case it is relevant:
{"log":"pure virtual method called\n","stream":"stderr","time":"2021-11-11T02:21:01.988288772Z"}
{"log":"terminate called without an active exception\n","stream":"stderr","time":"2021-11-11T02:21:01.988302418Z"}
We run a slightly out-of-date version, 4.0-2:
[root@ctafrontend-0 /]# rpm -qa | egrep 'cta|xrootd'
xrootd-client-4.12.8-1.el7.x86_64
xrootd-debuginfo-4.12.8-1.el7.x86_64
cta-lib-0-1.el7_9.x86_64
cta-common-0-1.el7_9.x86_64
cta-taped-0-1.el7_9.x86_64
cta-fst-gcd-0-1.el7_9.x86_64
cta-rmcd-0-1.el7_9.x86_64
cta-objectstore-tools-0-1.el7_9.x86_64
cta-systemtest-helpers-0-1.el7_9.x86_64
cta-immutable-file-test-0-1.el7_9.x86_64
cta-smc-0-1.el7_9.x86_64
xrootd-client-libs-4.12.8-1.el7.x86_64
xrootd-server-libs-4.12.8-1.el7.x86_64
xrootd-selinux-4.12.8-1.el7.noarch
xrootd-server-4.12.8-1.el7.x86_64
xrootd-4.12.8-1.el7.x86_64
cta-lib-common-0-1.el7_9.x86_64
cta-lib-catalogue-0-1.el7_9.x86_64
eos-xrootd-4.12.8-1.el7.cern.x86_64
cta-systemtests-0-1.el7_9.x86_64
cta-frontend-0-1.el7_9.x86_64
cta-catalogueutils-0-1.el7_9.x86_64
cta-tape-label-0-1.el7_9.x86_64
cta-cli-0-1.el7_9.x86_64
cta-debuginfo-0-1.el7_9.x86_64
xrootd-libs-4.12.8-1.el7.x86_64
Any idea what may have caused it? It is currently running production workloads, so I am hesitant to resume retrieve testing.
Thanks as always
Warm Regards,
Denis