Downgrading just gRPC frontend?

We seem to have found an issue with the gRPC frontend in 5.11. From what we can tell, the problem is in the logging at high rates, suggesting a threading problem.

My thought is that we could downgrade just the gRPC frontend to 5.11.0. It wouldn’t complain about the new schema, and it also wouldn’t actually require anything from it, since those changes (IIRC) are mostly about normalizing the database when it comes to supply pools.

Would this work, or do we need to attempt a full downgrade to 5.11.0, or even roll back the schema and go back to 5.10?

Hi Eric, sorry for the late reply. To answer your question: yes, you should be able to downgrade just the CTA Frontend to 5.11.0, since the protobuf changes for gRPC were added after 5.11.0. The catalogue version is also 15.0, so you don’t need to roll back the catalogue changes.

Regarding the particular problem you are facing, could you provide some more details about the situation?
What behavior do you observe and what are the relevant logs?

Regards,
Konstantina

Thanks for your reply. We will try and let you know.

Generally, what we see is that if we rapidly issue something like 500-1000 retrieve requests, the gRPC server core dumps. The actual backtrace varies but always involves the logging system, which leads us to believe it’s not thread safe. Here is a sample backtrace.

           PID: 2619008 (cta-frontend-gr)
           UID: 1000 (cta)
           GID: 33 (tape)
        Signal: 11 (SEGV)
     Timestamp: Fri 2025-03-07 15:47:39 CST (2min 4s ago)
  Command Line: /usr/bin/cta-frontend-grpc -c /etc/cta/cta-frontend-grpc.conf
    Executable: /usr/bin/cta-frontend-grpc
 Control Group: /system.slice/cta-frontend-grpc.service
          Unit: cta-frontend-grpc.service
         Slice: system.slice
       Boot ID: dc872fcf6a2442a9a259425b69e37631
    Machine ID: e5b9b85bafa646228dc0f31bfe9096cd
      Hostname: ctaitb01.fnal.gov
       Storage: /var/lib/systemd/coredump/core.cta-frontend-gr.1000.dc872fcf6a2442a9a259425b69e37631.2619008.1741384059000000.zst (present)
  Size on Disk: 11.0M
       Message: Process 2619008 (cta-frontend-gr) of user 1000 dumped core.
                
                Stack trace of thread 2631569:
                #0  0x00007f4be754d1de _ZNKSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE7compareERKS4_ (libstdc++.so.6 + 0x14d1de)
                #1  0x000000000047a6a0 _ZStltIcSt11char_traitsIcESaIcEEbRKNSt7__cxx1112basic_stringIT_T0_T1_EESA_ (cta-frontend-grpc + 0x7a6a0)
                #2  0x000000000047b5b0 _ZNKSt4lessINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEclERKS5_S8_ (cta-frontend-grpc + 0x7b5b0)
                #3  0x000000000048b5f7 _ZNKSt8_Rb_treeINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES5_St9_IdentityIS5_ESt4lessIS5_ESaIS5_EE14_M_lower_boundEPKSt13_Rb_tree_nodeIS5_EPKSt18_Rb_tree_node_baseRKS5_ (cta-frontend-grpc + 0x8b5f7)
                #4  0x0000000000489c14 _ZNKSt8_Rb_treeINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES5_St9_IdentityIS5_ESt4lessIS5_ESaIS5_EE4findERKS5_ (cta-frontend-grpc + 0x89c14)
                #5  0x00007f4becbc498a _ZNKSt3setINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt4lessIS5_ESaIS5_EE4findERKS5_ (libctacommon.so.0 + 0x1c498a)
                #6  0x00007f4becbc4533 _ZNK3cta3log10LogContext16ParamNameMatcherclERKNS0_5ParamE (libctacommon.so.0 + 0x1c4533)
                #7  0x00007f4becbc6dc9 _ZN9__gnu_cxx5__ops10_Iter_predIN3cta3log10LogContext16ParamNameMatcherEEclISt14_List_iteratorINS3_5ParamEEEEbT_ (libctacommon.so.0 + 0x1c6dc9)
                #8  0x00007f4becbc6232 _ZSt11__remove_ifISt14_List_iteratorIN3cta3log5ParamEEN9__gnu_cxx5__ops10_Iter_predINS2_10LogContext16ParamNameMatcherEEEET_SB_SB_T0_ (libctacommon.so.0 + 0x1c6232)
                #9  0x00007f4becbc50ec _ZSt9remove_ifISt14_List_iteratorIN3cta3log5ParamEENS2_10LogContext16ParamNameMatcherEET_S7_S7_T0_ (libctacommon.so.0 + 0x1c50ec)
                #10 0x00007f4becbc3b34 _ZN3cta3log10LogContext5eraseERKSt3setINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt4lessIS8_ESaIS8_EE (libctacommon.so.0 + 0x1c3b34)
                #11 0x00000000004795b7 _ZN3cta3log20ScopedParamContainerD2Ev (cta-frontend-grpc + 0x795b7)
                #12 0x00000000004787cb _ZN3cta8frontend4grpc10CtaRpcImpl8RetrieveEPN4grpc13ServerContextEPKNS_3xrd7RequestEPNS6_8ResponseE (cta-frontend-grpc + 0x787cb)
                #13 0x00000000004fbf5e _ZZN3cta3xrd6CtaRpc7ServiceC4EvENKUlPS2_PN4grpc13ServerContextEPKNS0_7RequestEPNS0_8ResponseEE2_clES3_S6_S9_SB_ (cta-frontend-grpc + 0xfbf5e)
                #14 0x0000000000500ec0 __invoke_impl<grpc::Status, cta::xrd::CtaRpc::Service::Service()::<lambda(cta::xrd::CtaRpc::Service*, grpc::ServerContext*, const cta::xrd::Request*, cta::xrd::Response*)>&, cta::xrd::CtaRpc::Service*, grpc::ServerContext*, const cta::xrd::Request*, cta::xrd::Response*> (cta-frontend-grpc + 0x100ec0)
                #15 0x00000000004ffa6b __invoke_r<grpc::Status, cta::xrd::CtaRpc::Service::Service()::<lambda(cta::xrd::CtaRpc::Service*, grpc::ServerContext*, const cta::xrd::Request*, cta::xrd::Response*)>&, cta::xrd::CtaRpc::Service*, grpc::ServerContext*, const cta::xrd::Request*, cta::xrd::Response*> (cta-frontend-grpc + 0xffa6b)
                #16 0x00000000004fe6b6 _M_invoke (cta-frontend-grpc + 0xfe6b6)
                #17 0x000000000052d2b1 _ZNKSt8functionIFN4grpc6StatusEPN3cta3xrd6CtaRpc7ServiceEPNS0_13ServerContextEPKNS3_7RequestEPNS3_8ResponseEEEclES6_S8_SB_SD_ (cta-frontend-grpc + 0x12d2b1)
                #18 0x0000000000524773 _ZZN4grpc8internal16RpcMethodHandlerIN3cta3xrd6CtaRpc7ServiceENS3_7RequestENS3_8ResponseEN6google8protobuf11MessageLiteESA_E10RunHandlerERKNS0_13MethodHandler16HandlerParameterEENKUlvE_clEv (cta-frontend-grpc + 0x124773)
                #19 0x000000000052d309 _ZN4grpc8internal23CatchingFunctionHandlerIZNS0_16RpcMethodHandlerIN3cta3xrd6CtaRpc7ServiceENS4_7RequestENS4_8ResponseEN6google8protobuf11MessageLiteESB_E10RunHandlerERKNS0_13MethodHandler16HandlerParameterEEUlvE_EENS_6StatusEOT_ (cta-frontend-grpc + 0x12d309)
                #20 0x0000000000524839 _ZN4grpc8internal16RpcMethodHandlerIN3cta3xrd6CtaRpc7ServiceENS3_7RequestENS3_8ResponseEN6google8protobuf11MessageLiteESA_E10RunHandlerERKNS0_13MethodHandler16HandlerParameterE (cta-frontend-grpc + 0x124839)
                #21 0x00007f4bef281c1d _ZN4grpc6Server11SyncRequest28ContinueRunAfterInterceptionEv (libgrpc++.so.1.46 + 0xb1c1d)
                #22 0x00007f4bef2884dc _ZN4grpc13ThreadManager12MainWorkLoopEv (libgrpc++.so.1.46 + 0xb84dc)
                #23 0x00007f4bef288610 _ZN4grpc13ThreadManager12WorkerThread3RunEv (libgrpc++.so.1.46 + 0xb8610)
                #24 0x00007f4bef173f81 _ZZN9grpc_core12_GLOBAL__N_120ThreadInternalsPosixC4EPKcPFvPvES4_PbRKNS_6Thread7OptionsEENUlS4_E_4_FUNES4_ (libgpr.so.24 + 0x11f81)
                #25 0x00007f4be7089c02 start_thread (libc.so.6 + 0x89c02)
                #26 0x00007f4be710ec40 __clone3 (libc.so.6 + 0x10ec40)
                
                Stack trace of thread 2619011:
                #0  0x00007f4be710e21e epoll_wait (libc.so.6 + 0x10e21e)
                #1  0x00007f4be63fe748 _ZN11EpollDriver10event_waitERSt6vectorI14FiredFileEventSaIS1_EEP7timeval (libceph-common.so.2 + 0x3fe748)
                #2  0x00007f4be63fcda6 _ZN11EventCenter14process_eventsEjPNSt6chrono8durationImSt5ratioILl1ELl1000000000EEEE (libceph-common.so.2 + 0x3fcda6)
                #3  0x00007f4be63fd916 _ZNSt17_Function_handlerIFvvEZN12NetworkStack10add_threadEP6WorkerEUlvE_E9_M_invokeERKSt9_Any_data (libceph-common.so.2 + 0x3fd916)
                #4  0x00007f4be74dbad4 execute_native_thread_routine (libstdc++.so.6 + 0xdbad4)
                #5  0x00007f4be7089c02 start_thread (libc.so.6 + 0x89c02)
                #6  0x00007f4be710ec40 __clone3 (libc.so.6 + 0x10ec40)

See frames #9 and #10 in particular:

#9  0x00007f4becbc50ec in std::_List_iterator<cta::log::Param> std::remove_if<std::_List_iterator<cta::log::Param>, cta::log::LogContext::ParamNameMatcher>(std::_List_iterator<cta::log::Param>, std::_List_iterator<cta::log::Param>, cta::log::LogContext::ParamNameMatcher) () from /usr/lib64/libctacommon.so.0
#10 0x00007f4becbc3b34 in cta::log::LogContext::erase(std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) () from /usr/lib64/libctacommon.so.0

As far as I can tell, the fault occurs in varying places, which points to a lack of thread safety. Here is another example stack trace:

(gdb) bt
#0  0x00007f654454d1de in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::compare(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const () from /usr/lib64/libstdc++.so.6
#1  0x000000000047a6a0 in std::operator< <char, std::char_traits<char>, std::allocator<char> > (__lhs="groupname", __rhs=<error reading variable: Cannot access memory at address 0x7f676a0832e0>) at /usr/include/c++/11/bits/basic_string.h:6343
#2  0x000000000047b5b0 in std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >::operator() (this=0x7f6101bb4930, __x="groupname", 
    __y=<error reading variable: Cannot access memory at address 0x7f676a0832e0>) at /usr/include/c++/11/bits/stl_function.h:400
#3  0x000000000048b5f7 in std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::_Identity<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::_M_lower_bound (this=0x7f6101bb4930, __x=0x7f609c01ce80, __y=0x7f6101bb4938, __k=<error reading variable: Cannot access memory at address 0x7f676a0832e0>) at /usr/include/c++/11/bits/stl_tree.h:1921
#4  0x0000000000489c14 in std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::_Identity<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::find (this=0x7f6101bb4930, __k=<error reading variable: Cannot access memory at address 0x7f676a0832e0>) at /usr/include/c++/11/bits/stl_tree.h:2536
#5  0x00007f6549bc498a in std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::find(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const () from /usr/lib64/libctacommon.so.0
#6  0x00007f6549bc4533 in cta::log::LogContext::ParamNameMatcher::operator()(cta::log::Param const&) const () from /usr/lib64/libctacommon.so.0
#7  0x00007f6549bc6dc9 in bool __gnu_cxx::__ops::_Iter_pred<cta::log::LogContext::ParamNameMatcher>::operator()<std::_List_iterator<cta::log::Param> >(std::_List_iterator<cta::log::Param>) () from /usr/lib64/libctacommon.so.0
#8  0x00007f6549bc6bec in std::_List_iterator<cta::log::Param> std::__find_if<std::_List_iterator<cta::log::Param>, __gnu_cxx::__ops::_Iter_pred<cta::log::LogContext::ParamNameMatcher> >(std::_List_iterator<cta::log::Param>, std::_List_iterator<cta::log::Param>, __gnu_cxx::__ops::_Iter_pred<cta::log::LogContext::ParamNameMatcher>, std::input_iterator_tag) () from /usr/lib64/libctacommon.so.0
#9  0x00007f6549bc5ceb in std::_List_iterator<cta::log::Param> std::__find_if<std::_List_iterator<cta::log::Param>, __gnu_cxx::__ops::_Iter_pred<cta::log::LogContext::ParamNameMatcher> >(std::_List_iterator<cta::log::Param>, std::_List_iterator<cta::log::Param>, __gnu_cxx::__ops::_Iter_pred<cta::log::LogContext::ParamNameMatcher>) () from /usr/lib64/libctacommon.so.0
#10 0x00007f6549bc4c0a in std::_List_iterator<cta::log::Param> std::find_if<std::_List_iterator<cta::log::Param>, cta::log::LogContext::ParamNameMatcher>(std::_List_iterator<cta::log::Param>, std::_List_iterator<cta::log::Param>, cta::log::LogContext::ParamNameMatcher) () from /usr/lib64/libctacommon.so.0
#11 0x00007f6549bc37f3 in cta::log::LogContext::pushOrReplace(cta::log::Param const&) () from /usr/lib64/libctacommon.so.0
#12 0x000000000047a963 in cta::log::ScopedParamContainer::add<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > (this=0x7f6101bb4be0, s="groupname", t="eosusers")
    at /usr/src/debug/cta-5.11.2.0-1.el9.x86_64/common/log/LogContext.hpp:148
#13 0x00000000004785df in cta::frontend::grpc::CtaRpcImpl::Retrieve (this=0x7ffcf7e03ea0, context=0x7f609c014998, request=0x7f609c0037f0, response=0x7f6101bb5040)
    at /usr/src/debug/cta-5.11.2.0-1.el9.x86_64/frontend/grpc/FrontendGrpcService.cpp:139
#14 0x00000000004fbf5e in operator() (__closure=0x203f608, service=0x7ffcf7e03ea0, ctx=0x7f609c014998, req=0x7f609c0037f0, resp=0x7f6101bb5040) at /usr/src/debug/cta-5.11.2.0-1.el9.x86_64/build/eos_cta/cta_frontend.grpc.pb.cc:251
#15 0x0000000000500ec0 in std::__invoke_impl<grpc::Status, cta::xrd::CtaRpc::Service::Service()::<lambda(cta::xrd::CtaRpc::Service*, grpc::ServerContext*, const cta::xrd::Request*, cta::xrd::Response*)>&, cta::xrd::CtaRpc::Service*, grpc::ServerContext*, const cta::xrd::Request*, cta::xrd::Response*>(std::__invoke_other, struct {...} &) (__f=...) at /usr/include/c++/11/bits/invoke.h:61
#16 0x00000000004ffa6b in std::__invoke_r<grpc::Status, cta::xrd::CtaRpc::Service::Service()::<lambda(cta::xrd::CtaRpc::Service*, grpc::ServerContext*, const cta::xrd::Request*, cta::xrd::Response*)>&, cta::xrd::CtaRpc::Service*, grpc::ServerContext*, const cta::xrd::Request*, cta::xrd::Response*>(struct {...} &) (__fn=...) at /usr/include/c++/11/bits/invoke.h:116
#17 0x00000000004fe6b6 in std::_Function_handler<grpc::Status(cta::xrd::CtaRpc::Service*, grpc::ServerContext*, const cta::xrd::Request*, cta::xrd::Response*), cta::xrd::CtaRpc::Service::Service()::<lambda(cta::xrd::CtaRpc::Service*, grpc::ServerContext*, const cta::xrd::Request*, cta::xrd::Response*)> >::_M_invoke(const std::_Any_data &, cta::xrd::CtaRpc::Service *&&, grpc::ServerContext *&&, const cta::xrd::Request *&&, cta::xrd::Response *&&) (__functor=..., 
    __args#0=@0x7f6101bb4e78: 0x7ffcf7e03ea0, __args#1=@0x7f6101bb4e70: 0x7f609c014998, __args#2=@0x7f6101bb4e68: 0x7f609c0037f0, __args#3=@0x7f6101bb4e60: 0x7f6101bb5040) at /usr/include/c++/11/bits/std_function.h:291
#18 0x000000000052d2b1 in std::function<grpc::Status (cta::xrd::CtaRpc::Service*, grpc::ServerContext*, cta::xrd::Request const*, cta::xrd::Response*)>::operator()(cta::xrd::CtaRpc::Service*, grpc::ServerContext*, cta::xrd::Request const*, cta::xrd::Response*) const (this=0x203f608, __args#0=0x7ffcf7e03ea0, __args#1=0x7f609c014998, __args#2=0x7f609c0037f0, __args#3=0x7f6101bb5040) at /usr/include/c++/11/bits/std_function.h:590
#19 0x0000000000524773 in grpc::internal::RpcMethodHandler<cta::xrd::CtaRpc::Service, cta::xrd::Request, cta::xrd::Response, google::protobuf::MessageLite, google::protobuf::MessageLite>::RunHandler(grpc::internal::MethodHandler::HandlerParameter const&)::{lambda()#1}::operator()() const (__closure=0x7f6101bb4f80) at /usr/include/grpcpp/impl/codegen/method_handler.h:116
#20 0x000000000052d309 in grpc::internal::CatchingFunctionHandler<grpc::internal::RpcMethodHandler<cta::xrd::CtaRpc::Service, cta::xrd::Request, cta::xrd::Response, google::protobuf::MessageLite, google::protobuf::MessageLite>::RunHandler(grpc::internal::MethodHandler::HandlerParameter const&)::{lambda()#1}>(grpc::internal::RpcMethodHandler<cta::xrd::CtaRpc::Service, cta::xrd::Request, cta::xrd::Response, google::protobuf::MessageLite, google::protobuf::MessageLite>::RunHandler(grpc::internal::MethodHandler::HandlerParameter const&)::{lambda()#1}&&) (handler=...) at /usr/include/grpcpp/impl/codegen/method_handler.h:44
#21 0x0000000000524839 in grpc::internal::RpcMethodHandler<cta::xrd::CtaRpc::Service, cta::xrd::Request, cta::xrd::Response, google::protobuf::MessageLite, google::protobuf::MessageLite>::RunHandler (this=0x203f600, param=...)
    at /usr/include/grpcpp/impl/codegen/method_handler.h:113
#22 0x00007f654bfc2c1d in grpc::Server::SyncRequest::ContinueRunAfterInterception() () from /usr/lib64/libgrpc++.so.1.46
#23 0x00007f654bfc94dc in grpc::ThreadManager::MainWorkLoop() () from /usr/lib64/libgrpc++.so.1.46
#24 0x00007f654bfc9610 in grpc::ThreadManager::WorkerThread::Run() () from /usr/lib64/libgrpc++.so.1.46
#25 0x00007f654c365f81 in grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*, grpc_core::Thread::Options const&)::{lambda(void*)#1}::_FUN(void*) ()
   from /usr/lib64/libgpr.so.24
#26 0x00007f6544089c02 in start_thread () from /usr/lib64/libc.so.6
#27 0x00007f654410ec40 in clone3 () from /usr/lib64/libc.so.6

Frame #13 is in the Retrieve handler:

#13 0x00000000004785df in cta::frontend::grpc::CtaRpcImpl::Retrieve (this=0x7ffcf7e03ea0, context=0x7f609c014998, request=0x7f609c0037f0, response=0x7f6101bb5040)
    at /usr/src/debug/cta-5.11.2.0-1.el9.x86_64/frontend/grpc/FrontendGrpcService.cpp:139

On line 139:

  sp.add("groupname", request->notification().cli().user().groupname());

where sp is a ScopedParamContainer, as frame #12 shows:

#12 0x000000000047a963 in cta::log::ScopedParamContainer::add<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > (this=0x7f6101bb4be0, s="groupname", t="eosusers")
    at /usr/src/debug/cta-5.11.2.0-1.el9.x86_64/common/log/LogContext.hpp:148

line 148 of /usr/src/debug/cta-5.11.2.0-1.el9.x86_64/common/log/LogContext.hpp:

  ScopedParamContainer& add(const std::string& s, const T& t) {
    m_context.pushOrReplace(Param(s,t)); <--- line 148

I think you can reproduce the problem by hitting the gRPC frontend with multiple simultaneous Retrieve requests.
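
For anyone reading along, here is a minimal sketch of the pattern the backtraces suggest, and of one way to avoid the hazard. It assumes (this is an inference from the traces, not confirmed from the CTA source) that several gRPC worker threads add and erase parameters on one shared log context whose parameter list is a plain std::list. The types SketchLogContext and SketchScopedParams and the functions handleRetrieve and runConcurrentRetrieves are hypothetical stand-ins, not the real cta::log classes, and this is not necessarily how the actual fix works. The sketch gives every request its own context so that no list is ever touched by two threads.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <list>
#include <set>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-ins for illustration only, NOT the real cta::log classes.
struct Param {
  std::string name;
  std::string value;
};

class SketchLogContext {
public:
  // Replace the value of an existing parameter, or append a new one.
  void pushOrReplace(const Param& p) {
    auto it = std::find_if(m_params.begin(), m_params.end(),
                           [&](const Param& q) { return q.name == p.name; });
    if (it != m_params.end()) {
      it->value = p.value;
    } else {
      m_params.push_back(p);
    }
  }

  // Remove every parameter whose name appears in `names`.
  void erase(const std::set<std::string>& names) {
    m_params.remove_if([&](const Param& q) { return names.count(q.name) != 0; });
  }

  std::size_t size() const { return m_params.size(); }

private:
  // Not synchronized: safe only while a single thread uses this object.
  std::list<Param> m_params;
};

// RAII helper: parameters added through it are erased again on scope exit.
class SketchScopedParams {
public:
  explicit SketchScopedParams(SketchLogContext& ctx) : m_ctx(ctx) {}
  ~SketchScopedParams() { m_ctx.erase(m_names); }

  SketchScopedParams& add(const std::string& name, const std::string& value) {
    m_ctx.pushOrReplace(Param{name, value});
    m_names.insert(name);
    return *this;
  }

private:
  SketchLogContext& m_ctx;
  std::set<std::string> m_names;
};

// Simulated request handler: the context lives on the calling thread's stack,
// so concurrent calls never share a parameter list.
std::size_t handleRetrieve(const std::string& group) {
  SketchLogContext lc;
  {
    SketchScopedParams sp(lc);
    sp.add("groupname", group);
    sp.add("username", "someuser");  // placeholder value
    assert(lc.size() == 2);
  }
  return lc.size();  // parameters left over after the scope closed
}

// Run n handlers concurrently and return the total number of leftover params.
std::size_t runConcurrentRetrieves(std::size_t n) {
  std::vector<std::size_t> leftovers(n, 1);
  std::vector<std::thread> workers;
  for (std::size_t i = 0; i < n; ++i) {
    workers.emplace_back([&leftovers, i] { leftovers[i] = handleRetrieve("eosusers"); });
  }
  for (auto& w : workers) w.join();
  std::size_t total = 0;
  for (std::size_t v : leftovers) total += v;
  return total;
}
```

If the context instead has to be shared across requests, guarding pushOrReplace and erase with a mutex would keep the list itself intact, but one request could still erase a parameter that another request had just added under the same name. That is why the sketch uses a per-request context.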

Hope this helps,
Dmitry

Hello Eric and Dmitry:

You should find a fix for this issue in the new patch release 5.11.2.1-1:

Functionality-wise, it should be the same as version 5.11.2.0-1 in our stable repo, with only this fix added.

Please test it with dCache and let us know if you find any new issues.