Frontend crash with general protection fault

Hello everyone! I wonder if anyone else has run into this issue with the CTA frontend xrootd binary crashing. I am not 100% sure of the best way to collect debug data and diagnostics, so please let me know if you need more logs or info.

Basically, we’ve got CTA running in production, and the archive queue is generally fairly short. I have recently been testing retrieves (restic restores), and I found that when the retrieve queue got fairly large (around 2000 requests), the frontend container crashed 5 times in a row (the container restarts upon crash) before recovering. It could be unrelated to the retrieve testing, but it is definitely worth mentioning: the frontend container had been 100% stable up until that point.

Here is the error from the container host syslog:

Nov 11 02:16:02 crlt-v4 kernel: traps: xrootd[92282] general protection fault ip:7f9eab6ccbc7 sp:7f9e7a9fb9f0 error:0 in libc-2.17.so[7f9eab695000+1c4000]
Nov 11 02:17:01 crlt-v4 kernel: traps: xrootd[153458] general protection fault ip:7f0ae675fbc7 sp:7f0ab9bfc9f0 error:0 in libc-2.17.so[7f0ae6728000+1c4000]
Nov 11 02:18:01 crlt-v4 kernel: traps: xrootd[235328] general protection fault ip:7f7ed0298bc7 sp:7f7eac9fb9f0 error:0 in libc-2.17.so[7f7ed0261000+1c4000]
Nov 11 02:19:02 crlt-v4 kernel: traps: xrootd[101964] general protection fault ip:7f2167962bc7 sp:7f21445f49f0 error:0 in libc-2.17.so[7f216792b000+1c4000]
Nov 11 02:21:02 crlt-v4 kernel: traps: xrootd[178779] general protection fault ip:7f79e54cebc7 sp:7f79b9df49f0 error:0 in libc-2.17.so[7f79e5497000+1c4000]

There are entries in between the above, but they simply show the container crashing and restarting, so I am not sure they are relevant.

I also have a brief snippet from the container stdout, right before the last crash (in case it is relevant):

{"log":"pure virtual method called\n","stream":"stderr","time":"2021-11-11T02:21:01.988288772Z"}
{"log":"terminate called without an active exception\n","stream":"stderr","time":"2021-11-11T02:21:01.988302418Z"}

We run a slightly out-of-date version, 4.0-2:

[root@ctafrontend-0 /]# rpm -qa | egrep 'cta|xrootd'
xrootd-client-4.12.8-1.el7.x86_64
xrootd-debuginfo-4.12.8-1.el7.x86_64
cta-lib-0-1.el7_9.x86_64
cta-common-0-1.el7_9.x86_64
cta-taped-0-1.el7_9.x86_64
cta-fst-gcd-0-1.el7_9.x86_64
cta-rmcd-0-1.el7_9.x86_64
cta-objectstore-tools-0-1.el7_9.x86_64
cta-systemtest-helpers-0-1.el7_9.x86_64
cta-immutable-file-test-0-1.el7_9.x86_64
cta-smc-0-1.el7_9.x86_64
xrootd-client-libs-4.12.8-1.el7.x86_64
xrootd-server-libs-4.12.8-1.el7.x86_64
xrootd-selinux-4.12.8-1.el7.noarch
xrootd-server-4.12.8-1.el7.x86_64
xrootd-4.12.8-1.el7.x86_64
cta-lib-common-0-1.el7_9.x86_64
cta-lib-catalogue-0-1.el7_9.x86_64
eos-xrootd-4.12.8-1.el7.cern.x86_64
cta-systemtests-0-1.el7_9.x86_64
cta-frontend-0-1.el7_9.x86_64
cta-catalogueutils-0-1.el7_9.x86_64
cta-tape-label-0-1.el7_9.x86_64
cta-cli-0-1.el7_9.x86_64
cta-debuginfo-0-1.el7_9.x86_64
xrootd-libs-4.12.8-1.el7.x86_64

Any idea what may have caused it? The instance is currently running production workloads, so I am hesitant to resume retrieve testing.

Thanks as always :slight_smile:

Warm Regards,

Denis

“pure virtual method called” can occur when a C++ object is deleted while a dangling pointer to it remains. During destruction, the object’s vtable reverts to that of the abstract superclass, so a virtual call made through the dangling pointer at that moment dispatches to a pure virtual method, which of course should never be called, and the runtime aborts with this message.
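
As an illustration, here is a minimal sketch (hypothetical class names, not taken from the CTA code) that reproduces the message deterministically by letting a virtual call reach the abstract base class while the object is being destroyed, which is essentially what a racing dangling pointer does:

#include <iostream>

struct Task {
    virtual void step() = 0;        // pure virtual
    void finish() { step(); }       // virtual dispatch via a helper
    virtual ~Task() { finish(); }   // during ~Task the vtable is Task's again
};

struct PrintTask : Task {
    void step() override { std::cout << "working\n"; }
};

int main() {
    Task* t = new PrintTask();
    t->step();  // fine: dispatches to PrintTask::step
    delete t;   // ~PrintTask runs first, then ~Task calls finish() -> step();
                // step() is pure at that point, so the runtime prints
                // "pure virtual method called" and aborts, matching the two
                // lines seen in the container stdout above.
}

In a multithreaded server the same state is reached when one thread deletes an object while another thread still calls through a stale pointer to it.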

Your CTA packages have a strange version number (0-1); presumably you built these yourself? If you always set the version to 0-1, there is no way to validate that the packages and libraries are all compatible versions.

If you build the packages using the git CI jobs, they should be properly versioned. I would recommend that you fix that first, rebuild the RPMs with proper version numbers, and deploy those everywhere.
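
As a hedged sketch of the idea (the tag format, spec file name and macros here are illustrative; the authoritative recipe is CTA's .gitlab-ci.yml):

# derive the RPM version/release from the latest git tag, e.g. "4.1-1"
TAG=$(git describe --tags --abbrev=0)
VERSION=${TAG%-*}   # everything before the last dash  -> 4.1
RELEASE=${TAG#*-}   # everything after the first dash  -> 1
rpmbuild -ba cta.spec --define "version ${VERSION}" --define "release ${RELEASE}"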

I would also recommend upgrading to the latest tagged release, 4.1-1, as it contains some bug fixes (though those are probably not what is causing this problem).

If you do all of that and still see the problem, please obtain a core dump or at least a stack trace from the process which is crashing and we can take a look.
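
For example, and purely as a sketch (the paths and PID here are illustrative, and the core_pattern change has to be made on the container host, since it is kernel-wide):

# on the host: send cores to a writable location and remove the size limit
echo '/var/cores/core.%e.%p' > /proc/sys/kernel/core_pattern
ulimit -c unlimited            # in the environment that starts xrootd
# after the next crash, open the core against the matching binary
gdb /usr/bin/xrootd /var/cores/core.xrootd.92282
(gdb) thread apply all bt      # backtrace of every thread

The xrootd-debuginfo and cta-debuginfo packages you already have installed should give gdb proper symbol names in the backtrace.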

Hi Michael,

Thanks very much for that! I’ve got some homework to do now and some upgrades :slight_smile:

Cheers,

Denis

Hi Denis,

I’d like to add that we hope to deploy a new CTA version here at CERN in the next couple of weeks, and we’ll make a public release of the binaries. It may be of interest to switch to this (in any case, we’d be happy for someone to test it).

Cheers,

Oliver.

Hi Oliver,

Thanks very much for your reply. We are still getting a dev instance of CTA up and running, but once it is up, I will be more than happy to test the new version.

Regards,

Denis

Hi Guys,

Do these versions look better?

cta-lib-catalogue-0-v4.6.3.el7_9.x86_64
cta-common-0-v4.6.3.el7_9.x86_64
cta-systemtests-0-v4.6.3.el7_9.x86_64
cta-release-0-v4.6.3.el7_9.x86_64
cta-rmcd-0-v4.6.3.el7_9.x86_64
cta-objectstore-tools-0-v4.6.3.el7_9.x86_64
cta-tape-label-0-v4.6.3.el7_9.x86_64
cta-cli-0-v4.6.3.el7_9.x86_64
cta-debuginfo-0-v4.6.3.el7_9.x86_64
cta-lib-common-0-v4.6.3.el7_9.x86_64
cta-lib-0-v4.6.3.el7_9.x86_64
cta-taped-0-v4.6.3.el7_9.x86_64
cta-fst-gcd-0-v4.6.3.el7_9.x86_64
cta-frontend-0-v4.6.3.el7_9.x86_64
cta-readtp-0-v4.6.3.el7_9.x86_64
cta-systemtest-helpers-0-v4.6.3.el7_9.x86_64
cta-catalogueutils-0-v4.6.3.el7_9.x86_64
cta-immutable-file-test-0-v4.6.3.el7_9.x86_64
cta-smc-0-v4.6.3.el7_9.x86_64

Cheers,

Denis

Hi @denis.lujanski,

Not quite: you are prepending 0-v to the version number, which is not needed. Also note the fourth number in our tagging scheme (e.g. 4.6.3-1), which indicates an update to the packaging but not to the software source itself. It should look more like:

[root@ctaproductionfrontend02 ~]# rpm -qa | egrep 'cta|xrootd'
cta-catalogueutils-4.6.3-1.el7.cern.x86_64
ctaops-data-volume-monitoring-0.4-52.noarch
cta-lib-catalogue-4.6.3-1.el7.cern.x86_64
ctaops-tape-alerting-system-0.4-52.noarch
ctaops-admin-0.4-52.noarch
cta-frontend-4.6.3-1.el7.cern.x86_64
xrootd-client-libs-4.12.6-1.el7.x86_64
eos-xrootd-4.12.8-1.el7.cern.x86_64
ctaops-it-overview-generator-0.4-52.noarch
cta-common-4.6.3-1.el7.cern.x86_64
cta-cli-4.6.3-1.el7.cern.x86_64
xrootd-libs-4.12.6-1.el7.x86_64
xrootd-server-libs-4.12.6-1.el7.x86_64
ctaops-queues-monitoring-0.4-52.noarch
ctaops-tape-namespace-generator-0.4-52.noarch
ctaops-pool-supply-0.4-52.noarch
xrootd-debuginfo-4.12.6-1.el7.x86_64
cta-lib-4.6.3-1.el7.cern.x86_64
xrootd-server-4.12.6-1.el7.x86_64
ctaops-rd-helpers-0.4-52.noarch
ctaops-generate-CERN-tape-usage-json-0.4-52.noarch
cta-objectstore-tools-4.6.3-1.el7.cern.x86_64
ctaops-availability-generator-0.4-52.noarch
cta-lib-common-4.6.3-1.el7.cern.x86_64
cta-debuginfo-4.6.3-1.el7.cern.x86_64
xrootd-client-4.12.6-1.el7.x86_64
ctaops-tsmod-daily-report-0.4-52.noarch
ctaops-cern-tape-utils-0.4-52.noarch

The versioning convention is explained in the CTA docs here: Tagging a new CTA release - EOSCTA Docs.
You can use our gitlab-ci file as a reference for extracting the version numbers from the repo tags: .gitlab-ci.yml · master · cta / CTA.

Are you still experiencing the crashes mentioned back in November? Did the upgrade to the new version help at all?
Cheers,
-Richard