Upgrade from Ops perspective (practical)

ewv · 23 January 2025 16:11

I’m curious how CERN actually does upgrades from one version to the next, even for minor point upgrades. I know the "upgrade to 5.n+1.0, run the schema upgrade, upgrade to 5.n+1.1" dance.

What I mean is: do you bring tape drives DOWN and let them finish everything they need to before upgrading their software? Do you bring all the drives DOWN and upgrade everything, or do you do a rolling upgrade? Same sort of question for the frontend(s). And I presume ONLY the machine/container doing the major upgrades needs to worry about the x.y.0 version, and everything else goes straight to the final version?

And finally, do you do different things for major vs. minor upgrades? I guess a major upgrade does require a full shutdown of the entire system, correct?

Thanks!

Eric

rbachman · 23 January 2025 16:39

Hi Eric,
thank you for showing interest in this topic!

The CTA team has internal documentation for this, but it is very CERN-specific and therefore likely not so useful to the community at large.
I had actually started writing a more general-purpose page for the public docs, but did not manage to finish it before my time with the team was up, my bad.
On my side I can only finish it on a best-effort basis, but once someone finishes it, there is a spot in the wiki set aside: Upgrading CTA - CTA Docs

Cheers!

jleduc · 24 January 2025 16:57

Hi Eric,
we were thinking of using the CTA upgrade use case as a hands-on session for the next CTA workshop.
Our new Helm-based CI should allow us to go through an upgrade and turn the hands-on text into a practical guide for operators upgrading a CTA instance.

When you type cta-admin dr down ... the drive's desired state is set to DOWN, but the drive finishes whatever it is currently doing: this does not interrupt an ongoing tape session, but ensures that the drive is DOWN once it has finished its transfers.

Then the principle is to upgrade the software stack on machines that have not been upgraded yet and that have all their drives in the DOWN state. This allows the CTA upgrade to take place between user sessions without interrupting anything: this is what we mean by a rolling upgrade.
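
To make the selection rule above concrete, here is a minimal Python sketch. It is only an illustration of the principle, not how CERN's tooling works: the Drive and TapeServer classes, the in_session flag and the host and drive names are all hypothetical, and nothing here calls cta-admin.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Drive:
    name: str
    desired_state: str  # "UP" or "DOWN", as requested by the operator
    in_session: bool    # hypothetical flag: True while a tape session is still running


@dataclass
class TapeServer:
    hostname: str
    upgraded: bool
    drives: List[Drive]


def is_idle_and_down(drive: Drive) -> bool:
    # A drive asked to go DOWN still finishes its current session first.
    return drive.desired_state == "DOWN" and not drive.in_session


def ready_for_upgrade(servers: List[TapeServer]) -> List[str]:
    # Eligible hosts: not upgraded yet, and every drive is DOWN and idle.
    return [
        s.hostname
        for s in servers
        if not s.upgraded and all(is_idle_and_down(d) for d in s.drives)
    ]


# Hypothetical inventory, for illustration only.
servers = [
    TapeServer("tpsrv01", False, [Drive("D1", "DOWN", True), Drive("D2", "DOWN", False)]),
    TapeServer("tpsrv02", False, [Drive("D3", "DOWN", False)]),
    TapeServer("tpsrv03", True, [Drive("D4", "DOWN", False)]),
    TapeServer("tpsrv04", False, [Drive("D5", "UP", True)]),
]

print(ready_for_upgrade(servers))
```

In this made-up inventory only tpsrv02 qualifies: tpsrv01 still has a drive finishing a session, tpsrv03 is already upgraded, and tpsrv04 has a drive that is still UP.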

Putting all our 180 tape drives DOWN at once would be too painful: we cannot afford to have no bandwidth to/from tape like this, and as Service Manager I would not have any good excuse to justify it.

For the frontend upgrades: the service will restart, and we are still missing an HA model for VMs/physical machines to avoid a service interruption when we do so. This part should be much simpler in the containers/Kubernetes world. Restarting the frontends only takes a few seconds: not a big deal for the service.

We’ll prepare a companion document to the workshop hands-on session to explain all this: please add your questions there.

Cheers,
Julien

ewv · 24 January 2025 17:22

This is very helpful. One question.

For a major upgrade where the schema changes, do the taped/rmcd processes segfault if the schema changes underneath them? Or do you take care of backwards compatibility?

E.g. our system is running 5.10. If I ask the drive to go down (and I assume it finishes a large amount of work which was in the queue), can I upgrade the schema to the one for 5.11 (schema 15, I believe) without causing a problem for that taped?

afonso · 5 February 2025 12:44

Hi Eric,

To upgrade the CTA Catalogue schema to version 15.0, you should indeed first upgrade the RPMs to the next major (aka “pivot”) version (5.11.0.x).

This “pivot” version works with both CTA Catalogue versions 14.0 and 15.0, allowing the schema to be upgraded underneath without impact on the rest of the system.

After the CTA Catalogue is upgraded to 15.0, you can install newer RPM versions, such as 5.11.2.0-1.
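
The ordering described above can be checked with a small Python sketch. It is a toy model under stated assumptions, not a CTA tool: the compatibility table only encodes what is said in this thread for the pivot version, while the entries for 5.10 and 5.11.2 (one schema each) are assumptions, and the function name first_bad_step is made up for the example.

```python
# Which catalogue schema versions each software version can run against.
SUPPORTED_SCHEMAS = {
    "5.10": {"14.0"},            # assumption: pre-pivot release, old schema only
    "5.11.0": {"14.0", "15.0"},  # pivot version: works with both (from this thread)
    "5.11.2": {"15.0"},          # assumption: post-pivot release, new schema only
}


def first_bad_step(software, schema, steps):
    """Apply the steps in order; return the index of the first step that
    leaves software and schema incompatible, or None if the plan is fine."""
    for index, (kind, value) in enumerate(steps):
        if kind == "software":
            software = value
        elif kind == "schema":
            schema = value
        else:
            raise ValueError(f"unknown step kind: {kind}")
        if schema not in SUPPORTED_SCHEMAS[software]:
            return index
    return None


# Pivot first, then the schema, then the newer RPMs.
good_plan = [("software", "5.11.0"), ("schema", "15.0"), ("software", "5.11.2")]
# Skipping the pivot: the schema changes under software that does not know it.
bad_plan = [("schema", "15.0"), ("software", "5.11.2")]

print(first_bad_step("5.10", "14.0", good_plan))  # None
print(first_bad_step("5.10", "14.0", bad_plan))   # 0
```

Under these assumptions, the plan that goes through the pivot version never leaves the system in an incompatible state, while upgrading the schema directly under 5.10 fails at the very first step.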