CASTOR to CTA migration (CTADirsHelper view)

george_patargias · 10 January 2022 16:47

Hello,

We are currently testing the CASTOR to CTA migration protocol by injecting a very small part of the CASTOR namespace to EOS and importing a small tape pool of 5 tapes.

When we run export_production_tapepool_to_cta.sh, eos-import-files crashes because of the presence of NULL values in the CTADirsHelper view (https://gitlab.cern.ch/cta/CTA/-/blob/master/migration/gRPC/EosImportFiles.cpp#L611)

In line 276 of castorns_ctamigration_schema.sql (line 276) we see this:
DELETE FROM CTADeltaDirsHelper WHERE path IS NULL; as per the comment -- drop null paths due to the above recursive query going to the top /castor level. Then, a few lines below, you merge CTADeltaDirsHelper to Dirs_Full_Path which updates the CTADirsHelper view which, therefore, does not have rows with NULL path values

However, because we are working with a static CASTOR NS snapshot (imported to the Oracle CTA DB), the CTADeltaDirsHelper table is (and always be) empty, so the above DELETE is not executed and as result we end up with NULL path values in the CTADirsHelper view.

So, we think we need to add a DELETE statement somewhere in castorns_ctamigration_schema.sql to remove NULL path values
from the CTADirsHelper view.

Can you please confirm if our understanding as described above is correct and whether we need to add an extra DELETE statement?

Many thanks,

George

mdavis · 11 January 2022 16:49

Hi George,

No I don’t think your understanding is correct. CTADeltaDirsHelper will contain any new directories that were created after CTADirsHelper was created, which should be added to the import. If no new directories were created, it will be empty and that is fine.

CTADirsHelper is a view on Dirs_Full_Path. Did you check that Dirs_Full_Path does not contain any NULL path values before you started?

Michael

george_patargias · 11 January 2022 17:41

Hi Michael,

Thanks. Yes we checked that Dirs_Full_Path does not contain any NULL path values. CTADirsHelper contains NULL path entries. The creation of the view has this expression
substr(D.path, length('/castor/ads.rl.ac.uk/test/')) as path,

so for all those entries in which the path as string is shorter than
‘/castor/ads.rl.ac.uk/test/’, the substr() function will return NULL
For example

substr(/, length(’/castor/ads.rl.ac.uk/test’))
substr(/castor, length(’/castor/ads.rl.ac.uk/test’))
…

and this will cause eos-import-files to crash.

We thought that the way to get round this was to delete these NULL path entries from the CTADirsHelper. Is this correct and if not what we need to do?

Best,

George

mdavis · 12 January 2022 15:00

So, the reason it is giving you a NULL value is that you have some paths in Dirs_Full_Path which do not begin with /castor/ads.rl.ac.uk/test/. The prefix specified in the view should exist in all paths.

I suggest you remove /test/ from the end of the prefix and change it to /castor/ads.rl.ac.uk/.

If you then want to strip off /test from the beginning of the files in the EOS namespace, you can add this line to /etc/cta/castor-migration.conf:

castor.prefix          /test

tom.byrne · 13 January 2022 11:56

Hi Michael,

Thanks for your comments, we will try your suggestion.

Just to check, our Dirs_Full_Path contains the top level directories:

file_id	parent	name	path	depth
2	0		/	0
1284	2	castor	/castor	1
1285	1284	ads.rl.ac.uk	/castor/ads.rl.ac.uk	2
170899825	1285	dteam	/castor/ads.rl.ac.uk/dteam	3
224001180	1285	snoplus	/castor/ads.rl.ac.uk/snoplus	3

From your comment, it sounds like you don’t expect this to be the case. Should we be revisiting what castorns_build_dirsfullpath.sql is doing in our case?

Cheers,
Tom

tom.byrne · 17 January 2022 17:37

Resetting and trying with the prefix set to the directory above (/castor/ads.rl.ac.uk/) yielded the same error.

Dropping the rows with null paths from the CTADIRSHELPER view (which in turn deleted the rows from the cns_file_metadata table on our test copy of the castor namespace) allowed eos-import-files to run, and we have castor files in the EOS namespace now.

I’m still not happy that I have actually ‘fixed’ the issue here, and open to suggestions for better things to do to avoid this error.

Cheers,
Tom

mdavis · 18 January 2022 13:21

Hi Tom,

Deleting rows from CTADIRSHELPER will solve your immediate problem but clearly you cannot do this in a production environment as you would be deleting rows from your CASTOR production database.

I should modify my statement above slightly to say that the prefix should exist in all paths, except the 3 top-level paths that you will never import. We are stripping off /castor/cern.ch from the front of all paths, so in the view, the paths /, /castor and /castor/cern.ch are indeed NULL. However, this does not matter as we never tried to import ALL directories into CTA. We did our migration in many stages, so we always imported one top-level directory at a time. For example, eos-import-dirs /atlas works fine.

Hope that helps,

Michael

tom.byrne · 19 January 2022 15:14

Hi Michael,

Agreed, and thanks for clarifying.

After a bit more poking, the issue seems to be that the tape pool we were using for migration testing had a few stray files too high up in the directory structure, so the join eos-import-files performs between the CTAFILESHELPER table and CTADIRSHELPER view ended up with null paths for those files, and so we saw the resulting crash.

Deleting the top level directories (/, /castor etc.) from the castor namespace (and therefore CTADIRSHELPER) masked the issue, because those files were no longer returned by the join.

I believe the correct fix is making sure the CASTOR namespace and tapepools are sane before migrating - i.e. the tape pool only contains files within the ‘top level domain’ we have imported. We have been cleaning up the namespace already, but there are a few oddities that slipped through the net. Now we know what we’re looking for, we can deal with the edge cases. I’ll follow up when we have confirmed this fixes the issue for us.

Thanks for your input on this, and as always, any comments would be appreciated.

Cheers,
Tom

mdavis · 21 January 2022 15:07

Hi Tom,

Glad you figured it out and it all makes sense now.

As you are migrating to CASTOR all in one go, you can do as many test runs as you need beforehand. If you run the migration scripts with the --dry-run flag, nothing will be changed in CASTOR, so you can check everything works.

If there is an error, you can wipe the CTA Catalogue/EOS namespace, fix the problem and run the whole thing again.

Just one hint: when you are wiping EOS, it is not sufficient to delete the files, as there can be leftover “tombstones” in QuarkDB, which can cause problems if you try to re-create the file with the same fileid at a later stage. So when you wipe EOS after a test import, you really need to reinitialise QuarkDB to start with a clean system.

Good luck!

Michael

tom.byrne · 24 January 2022 15:35

Hi Michael,

A clean-up of the stray files in the tape pool worked, we’ve now followed the whole process through without issue, and have now tested staging and retrieval of migrated CASTOR files in a testing tape pool via CTA, which is very heartening to see working!

Thanks for your comments, it’s encouraging that they are not completely alien to us - we’ve been through “a few” wipes of the CTA Catalogue/EOS namespace now while working through issues. We also came across the deletion tombstone problem early on, and had fun figuring out the correct incantation of RocksDB compactions and QuarkDB restarts to get around it, but settled on wiping the QuarkDBs and recreating EOS each time for cleanliness reasons.

I have a query about the actual migration - the CERN migration scripts are set up to work directly against the CASTOR instance so they can disable user access, set migrated flags in the DB’s etc. As we are planning on migrating everything in one go - is there a reason for us not to follow a modified procedure?

Turn CASTOR off at the start of our migration (disable user access etc).
Take an up to date snapshot of the CASTOR namespace, and put it on our CTA DB.
Import everything from the snapshot into EOS/CTA (as we have been doing in testing).
Assuming things go well, never turn CASTOR back on again, and everyone starts using CTA. If things don’t go well, we can wipe CTA and turn CASTOR back on, and it will be like nothing happened from CASTOR’s perspective.

There are a few reasons why this appeals to us; the main one being it is most similar to what we can test, so (possibly) least likely to throw up any surprises. However, we might be overlooking something here, and it would be interesting to hear your thoughts on this.

Cheers,
Tom

mdavis · 28 January 2022 08:31

Hi Tom,

Very happy to hear that you have performed a successful migration!

Yes, wiping EOS/QuarkDB between test migrations is the way to go. Cleaning up a partial/failed migration on a live instance is a messy business.

Yes, as you are migrating all in one go, you can follow a simplified migration process as you outlined. In fact I suggested something similar to Alison a while back.

A few comments:

We were under the constraint of having to disable parts of CASTOR while leaving the rest still running. You don’t have that constraint, so you don’t need to change any permissions in CASTOR before the migration. Simply disable access to the nameservers and you are ready to migrate.
Yes, exactly. Once the nameservers are off, no further metadata operations are possible. Recreate DIRS_FULL_PATH and you are ready to go.
Yes, you can do the import in --dry-run mode. There is no difference between --dry-run and --doit on the EOS/CTA side. In CASTOR, with --doit, the files are marked as exported to CTA and the tapes are disabled in CASTOR. As you are not going to switch CASTOR back on, you don’t need that step.
Yes. Tapes imported from CASTOR to CTA are marked as read-only in CTA, so you can always switch back to CASTOR. (Caveat: If you do this, any files deleted in CTA will reappear in CASTOR. Files/tapes created in CTA will of course not be known to CASTOR. We don’t have a procedure to import tapes from CTA to CASTOR, so this would have to be done manually.)

Cheers,

Michael

george_patargias · 27 June 2022 12:43

Hello MIchael,

Just wanted to quickly check with you: is the following script that we need to run

in order to clean up the intermeridate CASTOR tables created during CASTOR to CTA migration. When we run cta-catalogue-schema-verify we get the following warnings

cta-catalogue-schema-verify /etc/cta/cta-catalogue.conf
Schema version : 4.6
Checking indexes…
SUCCESS
Checking tables, columns and constraints…
ERROR: TABLE TEMP_REMOVE_CASTOR_METADATA is missing in the schema but is defined in the catalogue database.
FAILED
Status of the checking : FAILED
WARNING: PROCEDURE IMPORTTAPEPOOL exists in the catalogue database
WARNING: PROCEDURE POPULATECTAFILESFROMCASTOR exists in the catalogue database
WARNING: PROCEDURE IMPORTFROMCASTOR exists in the catalogue database
WARNING: PROCEDURE REMOVECASTORMETADATA exists in the catalogue database
WARNING: SYNONYM CNS_CLASS_METADATA exists in the catalogue database
WARNING: SYNONYM CNS_CTADIRSHELPER exists in the catalogue database
WARNING: SYNONYM CNS_CTAFILES2NDCOPYHELPER exists in the catalogue database
WARNING: SYNONYM CNS_CTAFILESHELPER exists in the catalogue database
WARNING: SYNONYM CNS_CTALOG exists in the catalogue database
WARNING: SYNONYM CNS_CTAMIGRATIONLOG exists in the catalogue database
WARNING: SYNONYM CNS_DIRSFORCTAEXPORT exists in the catalogue database
WARNING: SYNONYM CNS_DIRS_FULL_PATH exists in the catalogue database
WARNING: SYNONYM CNS_FILESFORCTAEXPORT exists in the catalogue database
WARNING: SYNONYM CNS_GETTIME exists in the catalogue database
WARNING: SYNONYM CNS_ZEROBYTEFILESFORCTAEXPORT exists in the catalogue database
WARNING: SYNONYM VMGR_TAPE_DGNMAP exists in the catalogue database
WARNING: SYNONYM VMGR_TAPE_INFO exists in the catalogue database
WARNING: SYNONYM VMGR_TAPE_SIDE exists in the catalogue database
WARNING: TYPE NUMLIST exists in the catalogue database
WARNING: Error logging table ERR$_TAPE_FILE exists in the catalogue database

mdavis · 29 June 2022 14:19

Yes, you should run that script to remove the DB objects related to the CASTOR to CTA migration.

There is only one object causing the verification to fail, the table TEMP_REMOVE_CASTOR_METADATA. The other objects are just causing warnings, but anyway if the migration from CASTOR is complete you should remove them.