Problems with containerized instructions

I’ve tried following the instructions here: A CTA instance built from source in a standalone VM - EOSCTA Docs but it’s been unsuccessful.

Most significantly, none of the containers in the k8s install start correctly. For instance the front-end container complains:
Creating symlinks for CTA binaries.
find: ‘/home/cta/CTA-build’: No such file or directory
ln: missing file operand
Try ‘ln --help’ for more information.
Creating symlinks for CTA libraries.
find: ‘/home/cta/CTA-build’: No such file or directory
Creating symlink for frontend configuration file.
/opt/run/bin/mkSymlinks.sh: line 12: /home/cta/CTA-build/CMakeCache.txt: No such file or directory

and the reason is that there is nothing in /home/cta/
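
You can see that directly with something like this (the pod name is a placeholder for whatever the front-end pod is called in the namespace):

kubectl --namespace=cta exec <frontend-pod> -- ls -la /home/cta/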

Similarly the tape server pods complain that
tail: cannot open ‘/var/log/cta/cta-taped.log’ for reading: No such file or directory
bash: /usr/bin/cta-taped: No such file or directory
taped died

because they, of course, don’t contain that executable in /usr/bin.

Further, I can’t figure out how/why the container WOULD have these files available. What am I doing wrong?

Hello Eric,

If /home/cta is empty, it is because the script ./bootstrapSystem.sh cta in CTA/continuousintegration/buildtree_runner/vmBootstrap failed.

Could you try to run it again? If the folder /home/cta/ is still empty, could you paste the output of the script here?

If it’s successful, try to run:

su - cta
cd ~/CTA/continuousintegration/buildtree_runner/vmBootstrap
./bootstrapCTA.sh cta

And tell me if the last output you get is: CTA setup finished successfully. If it wasn’t successful, could you paste the complete output of the script here?

Best regards

It is completing successfully. It’s not that /home/cta is empty on my VM, of course, but that it’s empty in the container/pod when starting up Kubernetes on this VM.

Hi,
When you run create_instance.sh you should be invoking it with the -b ~ switch, which defines which buildtree path to map into the container (in this case the home dir). Can you check you’re doing this?
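
For example, something along these lines (the database yaml is whichever one you’re using; the other switches here are just an example set):

./create_instance.sh -n cta -b ~ -B CTA-build -D -O -d internal_postgres.yaml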
Oliver.

Thanks Oliver. That did, indeed, help. After that I couldn’t get the various containers to all start at the same time, so I broke things down a little into a number of different scripts. I can now get all of the Kubernetes pods to at least start.

However, the ctaeos pod never becomes “ready”

(kubectl --namespace=cta exec ctaeos -- bash -c '[ -f /EOSOK ] && echo -n Ready || echo -n Not ready') never succeeds (peeking into the pod I can see that the EOSOK file never appears).
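
(By “peeking” I just mean something like kubectl --namespace=cta exec ctaeos -- ls -l /EOSOK.)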

The last messages I see in the log file for the ctaeos pod are these:

unix:"<pwd>":gid => nobody
unix:"<pwd>":uid => nobody
success: mode of file/directory /eos/grpctest is now '777'
success: mode of file/directory /eos/ctaeos/cta is now '555'
success: mode of file/directory /eos/ctaeos/tmp is now '777'
Waiting for the EOS disk filesystem using /fst to boot and come on-line

Where should I look for a hint to what’s going wrong?

Hi,

We’ll need to check the output of

kubectl exec ctaeos eos fs ls

and look in the contents of the log

kubectl exec ctaeos cat /var/log/eos/fst/xrdlog.fst

for clues. Probably worth throwing in

kubectl exec ctaeos ps auxw

while you’re at it.

Oliver.

[cta@ewv-cta orchestration (master)]$ kubectl -n cta exec ctaeos eos fs ls                                                                                        
┌────────────────────────┬────┬──────┬────────────────────────────────┬────────────────┬────────────────┬────────────┬──────────────┬────────────┬────────┬────────────────┐
│host                    │port│    id│                            path│      schedgroup│          geotag│        boot│  configstatus│       drain│  active│          health│
└────────────────────────┴────┴──────┴────────────────────────────────┴────────────────┴────────────────┴────────────┴──────────────┴────────────┴────────┴────────────────┘
 localhost                1234  65535                  /does_not_exist           tape.0                                          off      nodrain  offline                  

[cta@ewv-cta orchestration (master)]$ kubectl -n cta exec ctaeos ps auxw
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0  11688  1316 ?        Ss   17:05   0:00 /bin/bash /opt/run/bin/ctaeos-mgm-log-wrapper.sh none
root         8  0.0  0.0  11692  1584 ?        S    17:05   0:00 /bin/bash /opt/run/bin/ctaeos-mgm.sh
root         9  0.0  0.0   4364   352 ?        S    17:05   0:00 tee -a /var/log/ctaeos-mgm.log
daemon     277  0.3  0.8 259852 62148 ?        SLl  17:06   0:00 /opt/eos/xrootd/bin/xrootd -n mq -c /etc/xrd.cf.mq -l /var/log/eos/xrdlog.mq -b -Rdaemon
daemon     295  0.9  2.0 1614380 147540 ?      SLl  17:06   0:01 /opt/eos/xrootd/bin/xrootd -n mgm -c /etc/xrd.cf.mgm -m -l /var/log/eos/xrdlog.mgm -b -Rdaemon
daemon     405  0.0  0.5 1011768 42892 ?       S    17:06   0:00 /opt/eos/xrootd/bin/xrootd -n mgm -c /etc/xrd.cf.mgm -m -l /var/log/eos/xrdlog.mgm -b -Rdaemon
root       444  0.3  0.1 608628 13736 ?        Sl   17:06   0:00 eos -b console log _MGMID_
daemon     487  0.2  1.4 589672 103328 ?       SLl  17:06   0:00 /opt/eos/xrootd/bin/xrootd -n fst -c /etc/xrd.cf.fst -l /var/log/eos/xrdlog.fst -b -Rdaemon
daemon     550  0.0  1.1 543572 81244 ?        S    17:06   0:00 /opt/eos/xrootd/bin/xrootd -n fst -c /etc/xrd.cf.fst -l /var/log/eos/xrdlog.fst -b -Rdaemon
root      2715  0.0  0.0  16844   568 ?        S    17:08   0:00 sleep 1
root      2716  0.0  0.0  51732  1720 ?        Rs   17:08   0:00 ps auxw

And I’ll find a way to attach the log, which is quite long. I haven’t investigated why yet; I wanted to perhaps catch you before the end of the day.

Here’s the log. Is there a better way to share this? (First time I’m using Discourse.) xrdlog.fst - Google Drive

Hi,

Well, the FST appears to be up and running, and it even knows how to find its MGM, but in the end it doesn’t get registered (it’s not there on eos fs ls).
Next thing to look at is what’s happening at exactly that time (your last xrdlog.fst log entry) on the MGM

kubectl exec ctaeos cat /var/log/eos/mgm/xrdlog.mgm

Also, was there any indication of a problem in the output of ./create_instance.sh?

Oliver.

Hi Oliver,

Yes, I had a few problems with the create_instance script, like not being able to touch “OKTOSTART” in the ctaeos pod. So I copied all the pod definition files from /tmp to somewhere I could kubectl apply -f them one at a time, and then broke the script into a few logical parts so that I can start doing what “create_instance” did “in the middle”. So it’s this which I am trying to debug.
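
Roughly this sort of thing, with the directory being wherever I copied the generated yaml files to (the path here is just an example):

for f in ~/cta-pods/*.yaml; do kubectl apply -f "$f" --namespace=cta; done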

Here’s the MGM log file: xrdlog.mgm - Google Drive

Hi,
I’m afraid I don’t see any smoking guns here. To continue digging, I’d suggest checking the log of the last major EOS component

kubectl exec ctaeos cat /var/log/eos/mq/xrdlog.mq

Trying to see if the FST node is registered

eos node ls

And if so, trying to register the filesystem to see if this triggers an error

eosfstregister /fst default:1

All that said, you doubtless noticed that the ./create_instance.sh script does more than just instantiate containers. Despite its fragility, this script does work when used as indicated in the instructions. I think it might be more profitable to wipe your installation and then recheck your steps, and when you get to ./create_instance.sh run it as

sudo bash -xv ./create_instance.sh -n cta -b ~ -B CTA-build -D -O -d <db.yaml>

to see what’s going on.

Here are those outputs:

[cta@ewv-cta orchestration (master)]$ kubectl exec -n cta ctaeos -- eos node ls
┌──────────┬─────────────────────────────────┬────────────────┬──────────┬────────────┬──────┬──────────┬────────┬────────┬────────────────┬─────┐
│type      │                         hostport│          geotag│    status│   activated│  txgw│ gw-queued│  gw-ntx│ gw-rate│  heartbeatdelta│ nofs│
└──────────┴─────────────────────────────────┴────────────────┴──────────┴────────────┴──────┴──────────┴────────┴────────┴────────────────┴─────┘
 nodesview  ctaeos.cta.svc.cluster.local:1095             flat     online           on    off          0       10      120                1     0 
 nodesview                     localhost:1234              ???    unknown          ???    off          0       10      120                ~     1 

and

[cta@ewv-cta orchestration (master)]$ kubectl exec -n cta ctaeos -- eosfstregister /fst default:1
###########################
# <eosfstregister> v1.0.0
###########################
error: cannot automatically determine to which MGM I should connect - set it via EOS_MGM_URL in /etc/sysconfig/eos or CDB_CLUSTER variable in /etc/quattor_install_info!
error: you have to provide a manager name <host>[:<port>]
usage: eosfstregister [-i] [-r] [--force] [--port <port>] [<host[:port]>] <mount-prefix> [<space1>:<nfilesystems1>] [<space2>:<nfilesystems2>] [-h|--help]
                       -r : allows to register file systems mounted on the root partition
                       -i : ignore if one of the filesystems has already a filesystem id stored and continue
                  --force : delete old filesystem files and force registration of the new filesystem
                   --port : the FST port
 hint: if <mount-prefix> ends with '/' subdirectories are scanned in this directory, 
       if <mount-prefix> does not end with '/' directories starting with <mount-prefix> are scanned ( e.g. /data* )
       if <space>='spare' all filesystems will be registered without scheduling group and you can move them in to production via 'fs mv spare <prod-space>'

The EOS_MGM_URL value does not appear in /etc/sysconfig/eos, and the Quattor file does not exist at all.
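
I checked with something along these lines:

kubectl exec -n cta ctaeos -- grep EOS_MGM_URL /etc/sysconfig/eos
kubectl exec -n cta ctaeos -- ls -l /etc/quattor_install_info

where the grep finds nothing and the ls reports the file is missing.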

I will try again with the script as you suggest. I left alone the parts which I thought were successful and tried to split out the rest to see if I could figure out where it went off the rails, but maybe I missed something further up.

Thanks for your help!

Eric

[edited: if you got an earlier e-mail with wildly different content, ignore it. I was running the script as root instead of with sudo and the definition of ~ is very different]

So this is what happened when I wiped my server and restarted. First, in ./bootstrapMHVTL.sh (second time through, after a reboot) I get:

depmod: ERROR: fstatat(5, openafs.ko): No such file or directory

Is that a problem? The script seems to continue happily without it.

OK, then I ran sudo bash -xv ./create_instance.sh -n cta -b ~ -B CTA-build -D -O -d database.yaml as suggested. The first time through, a DNS error on Postgres although the pod exists (database.yaml is internal_postgres.yaml):

Wiping database
Aborting: create failed: PostgresConn connection failed: could not translate host name "postgres" to address: Name or service not known
ERROR: Could not wipe database. cta-catalogue-schema-drop /etc/cta/cta-catalogue.conf FAILED
+ die 'ERROR: init pod in ErERROR: init pod in Error state. Initialization failed.'
+ echo 'ERROR: init pod in ErERROR: init pod in Error state. Initialization failed.'
ERROR: init pod in ErERROR: init pod in Error state. Initialization failed.
+ exit 1
[cta@ewv-cta2 orchestration (master)]$ kubectl get pods -n cta
NAME       READY     STATUS    RESTARTS   AGE
postgres   1/1       Running   0          19s

Second time through:

Creating cta instance
kubectl create namespace ${instance} || die "FAILED"
+ kubectl create namespace cta
Error from server (AlreadyExists): namespaces "cta" already exists
+ die FAILED
+ echo FAILED
FAILED

Third time through, back to the same thing with Postgres not being able to be found.

This was the place, before, where I decided to split the script up into pieces.

However, if I create a pod based on init but which just sleeps, the postgres name resolves fine:

[root@ewv-cta2 tmp.I8NldScMDX]# kubectl exec -n cta -it sleep -- /bin/bash
[root@sleep /]# ping postgres
PING postgres.cta.svc.cluster.local (10.254.30.3) 56(84) bytes of data.
64 bytes from 10.254.30.3 (10.254.30.3): icmp_seq=1 ttl=64 time=0.317 ms
64 bytes from 10.254.30.3 (10.254.30.3): icmp_seq=2 ttl=64 time=0.089 ms
64 bytes from 10.254.30.3 (10.254.30.3): icmp_seq=3 ttl=64 time=0.089 ms
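
The sleep pod itself is nothing special - something equivalent to this would do, with the image placeholder standing in for whatever image the init pod uses:

kubectl run sleep --namespace=cta --restart=Never --image=<init-pod-image> --command -- sleep infinity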

Hi Eric,

First thing - if you still have the setup we were initially discussing, or if you ever reach this point again, please try the fs registration again as follows

kubectl exec ctaeos -- bash -c "EOS_MGM_URL=localhost /usr/sbin/eosfstregister /fst default:1"

and check

kubectl exec ctaeos -- ls -al /fst/

We should also look earlier in the kubectl logs ctaeos output to see what’s happening when eosfstregister was run.
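
Something like this should pull out the relevant window (the grep options are just an example):

kubectl logs ctaeos --namespace=cta | grep -i -B 5 -A 20 eosfstregister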

Second thing - to clean and redo your k8s setup in general, you really have to clean things up:

kubectl delete all --all --namespace=cta
kubectl delete pv --all
kubectl delete pvc claimlibrary --namespace=cta
kubectl delete ns cta
~/CTA/continuousintegration/buildtree_runner/recreate_buildtree_running_environment.sh

before running create_instance.sh again.
Then - I recommend editing create_instance to put a sleep 10 after kubectl create -f ${config_database} --namespace=${instance} in case your problems are related to the db not being ready.
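
In other words, the relevant spot in create_instance.sh would look roughly like this afterwards (only the sleep line is new):

kubectl create -f ${config_database} --namespace=${instance}
sleep 10   # crude wait for the database to come up before the init pod tries to reach it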

Hi Oliver. I think that sleep might actually have been the key. I had reinstalled my node a couple of times yesterday to get a fresh setup, but now, after your reset instructions, just adding the “sleep” seems to have helped.

The various diagnostic things you asked for above show the following:

[cta@ewv-cta2 orchestration (master)]$ kubectl exec -n cta ctaeos eos fs ls
┌────────────────────────────┬────┬──────┬────────────────────────────────┬────────────────┬────────────────┬────────────┬──────────────┬────────────┬────────┬────────────────┐
│host                        │port│    id│                            path│      schedgroup│          geotag│        boot│  configstatus│       drain│  active│          health│
└────────────────────────────┴────┴──────┴────────────────────────────────┴────────────────┴────────────────┴────────────┴──────────────┴────────────┴────────┴────────────────┘
 ctaeos.cta.svc.cluster.local 1095      1                             /fst        default.0             flat       booted             rw      nodrain   online              N/A 
 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 
 localhost                    1234  65535                  /does_not_exist           tape.0                                          off      nodrain  offline                  

[cta@ewv-cta2 orchestration (master)]$ kubectl exec -n cta ctaeos eos node ls
┌──────────┬─────────────────────────────────┬────────────────┬──────────┬────────────┬──────┬──────────┬────────┬────────┬────────────────┬─────┐
│type      │                         hostport│          geotag│    status│   activated│  txgw│ gw-queued│  gw-ntx│ gw-rate│  heartbeatdelta│ nofs│
└──────────┴─────────────────────────────────┴────────────────┴──────────┴────────────┴──────┴──────────┴────────┴────────┴────────────────┴─────┘
 nodesview  ctaeos.cta.svc.cluster.local:1095             flat     online           on    off          0       10      120                3     1 
 nodesview                     localhost:1234              ???    unknown          ???    off          0       10      120                ~     1 

[cta@ewv-cta2 orchestration (master)]$ kubectl exec -n cta ctaeos -- ls -al /fst/
total 6152
drwxr-xr-x. 5 daemon daemon     248 May 28 17:28 .
drwxr-xr-x. 1 root   root       138 May 28 17:23 ..
-rwxr-xr-x. 1 daemon daemon       1 May 28 17:23 .eosfsid
-rw-r--r--. 1 daemon daemon      37 May 28 17:23 .eosfsuuid
drwxr-xr-x. 2 daemon daemon       6 May 28 17:23 .eosorphans
drwxr-xr-x. 2 daemon daemon       6 May 28 17:23 .eostransaction
drwxr-xr-x. 2 daemon daemon      22 May 28 17:23 00000000
-rwx------. 1 daemon daemon 1048576 May 28 17:28 scrub.re-write.1
-rwx------. 1 daemon daemon 1048576 May 28 17:28 scrub.re-write.2
-rwx------. 1 daemon daemon 1048576 May 28 17:28 scrub.re-write.3
-rwx------. 1 daemon daemon 1048576 May 28 17:28 scrub.write-once.1
-rwx------. 1 daemon daemon 1048576 May 28 17:28 scrub.write-once.2
-rwx------. 1 daemon daemon 1048576 May 28 17:28 scrub.write-once.3

Does this all look good to you?

Next problem, though:

./archive_retrieve.sh -n cta

gives the following (in fact it comes from prepare_tests):

05/28 17:53:39   114 rmc_srv_mount: RMC92 - mount request by 0,0 from localhost
05/28 17:53:39   114 rmc_srv_mount: RMC98 - mount V01001/0 on drive 0
05/28 17:53:39   114 lasterror: Function entered: asc=4 ascq=3 save_errno=5 rc=-4 sensekey=4 skvalid=1
05/28 17:53:39   114 lasterror: No matching entry in scsierr_acttbl
05/28 17:53:39   114 rmc_sendrep: smc_mount: SR018 - mount of V01001 on drive 0 failed : /dev/smc : scsi error : Hardware error ASC=4 ASCQ=3
05/28 17:53:39   114 rmc_srv_mount: returns 2203
05/28 17:53:39   114 rmc_sendrep: RMC03 - illegal function 4

in the tape server #1 log.

Should I be looking at problems in the MHVTL setup? BTW, I’m doing the VM setup with the non-CERN RPMs. I hope that’s supported.

Hi,
The EOS side looks healthier now, but yes, something’s probably up with MHVTL.
You should see something like the following

$ kubectl exec tpsrv01 -- mtx -f /dev/smc status
Defaulting container name to rmcd.
Use 'kubectl describe pod/tpsrv01' to see all of the containers in this pod.
  Storage Changer /dev/smc:3 Drives, 11 Slots ( 1 Import/Export )
Data Transfer Element 0:Empty
Data Transfer Element 1:Empty
Data Transfer Element 2:Empty
      Storage Element 1:Full :VolumeTag=V01001TA                            
      Storage Element 2:Full :VolumeTag=V01002TA                            
      Storage Element 3:Full :VolumeTag=V01003TA                            
      Storage Element 4:Full :VolumeTag=V01004TA                            
      Storage Element 5:Full :VolumeTag=V01005TA                            
      Storage Element 6:Full :VolumeTag=V01006TA                            
      Storage Element 7:Full :VolumeTag=V01007TA                            
      Storage Element 8:Empty
      Storage Element 9:Empty
      Storage Element 10:Empty
      Storage Element 11 IMPORT/EXPORT:Empty

and then try

kubectl exec tpsrv01 -- mtx -f /dev/smc load 1 0
kubectl exec tpsrv01 -- mtx -f /dev/smc status
kubectl exec tpsrv01 -- mtx -f /dev/smc unload 1 0

MHVTL is running on your host VM, so check it’s up and give it a restart

systemctl status mhvtl
systemctl restart mhvtl

I guess I’ll make the same comment as before, though - this whole k8s system is designed for continuous integration: it expects to have a clean system, be run once, and then be discarded. You could do worse than blowing everything away and starting from scratch; there’s certainly no guarantee that re-running bits and pieces from time to time won’t lead to inconsistencies. In this case, both ./recreate_buildtree_running_environment.sh and ./bootstrapMHVTL.sh need to have finished successfully to get MHVTL going.

Oliver.

I think I found the problem:

[root@ewv-cta2 tests]# systemctl restart mhvtl
Failed to restart mhvtl.service: Unit not found.

I essentially did do everything from scratch since I reinstalled my VM and ran the series of bootstrap scripts. However, I ran the MHVTL script twice (rebooting in between as suggested if a problem was seen).

The one thing that doesn’t quite look right in the bootstrap MHVTL script is that the first time through I see this error:

install: cannot stat ‘mhvtl.ko’: No such file or directory

I reboot, try again, and the second time I see this:

depmod: ERROR: fstatat(5, openafs.ko): No such file or directory

but it continues without complaint after that. (This is a CC7 CERN VM and is using AFS)

Hi,

We need to check how far this mhvtl installation got.

I guess you’re running ./bootstrapMHVTL.sh without the cern argument, right?

We need to check if the build worked. You should be able to follow ./bootstrapMHVTL.sh, which does a basic checkout and build, to see if this step is failing.
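
For example, a quick way to see whether the kernel module actually ended up installed on the host (nothing CTA-specific, just standard module tooling):

find /lib/modules/$(uname -r) -type f -name 'mhvtl*'
modinfo mhvtl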

If this is OK, some diagnostics:

systemctl list-units | grep vtl
systemctl status mhvtl.target
lsmod | grep mhvtl
lsscsi -g
mtx -f `lsscsi -g | awk '$2~/mediumx/{print $7}' | head -1` status
ps auxw | grep vtl

Oliver.