Upgrade from 2018 #101

Merged

reidmv merged 17 commits into master from upgrade-from-2018 on Jun 24, 2020

Conversation

@reidmv (Contributor) commented Jun 5, 2020

This PR adds the ability to upgrade from PE 2018.1 to PE 2019.7.

reidmv added 13 commits June 2, 2020 16:13
This will allow the convert plan to be used to update trusted
certificate extensions without enforcing node group changes. Such a
capability is useful for upgrading from 2018.1 to 2019.7
Puppet 5 doesn't have the `puppet ssl` command (see the sketch after these commit messages).
New style compilers don't have a peadm_role cert extension anymore; they
only have a pp_auth_role.
So that compilers classify successfully and can run Puppet
When using the orchestrator transport there is a problem with the
built-in service task when the orchestrator is upgraded but the
pxp-agents are not. Switching to run_command and `systemctl stop` during
this time avoids the problem.
Otherwise, the certs potentially can't be signed due to having
authorization extensions.
If it was stopped before, it should still be stopped after.
We now use this plan at a time when the master is 2019.x but agents
could be 2018.x. So, make it compatible.
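The `puppet ssl` caveat above matters because the upgrade has to operate against Puppet 5 (2018.1) agents that lack the command. A minimal, hypothetical sketch of the kind of version branch involved; $agent_version, $target, and the fallback command are illustrative assumptions, not code from this PR:

# Hypothetical sketch only: use `puppet ssl` where it exists (Puppet 6+).
# $agent_version and $target are assumed to be resolved elsewhere in the plan.
if versioncmp($agent_version, '6.0.0') >= 0 {
  run_command('/opt/puppetlabs/bin/puppet ssl download_cert', $target)
} else {
  # Puppet 5 agents pick up their signed certificate on a normal agent run
  run_command('/opt/puppetlabs/bin/puppet agent --onetime --no-daemonize', $target)
}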
@reidmv requested a review from a team as a code owner on June 5, 2020 02:20
@timidri (Contributor) commented Jun 5, 2020

@reidmv there is 1 rubocop issue (single quotes preferred over double quotes) breaking the syntax check.

@timidri (Contributor) commented Jun 5, 2020

@reidmv Can you recommend a straightforward way of testing this?
I tried to set up a 2018.1 architecture using peadm but received error messages saying peadm doesn't support that version and suggesting I check out older versions. I reverted to 1.2.0 to no avail (similar message), then to 0.4.2, which didn't work for other reasons, and then I gave up :-)

@vchepkov (Contributor) commented Jun 5, 2020

I don't think you can install PE 2018.1 using this module at all, only upgrade from it.
It uses the Deferred type during installation, which is not available in Puppet 5.

@reidmv (Contributor, Author) commented Jun 5, 2020

@timidri the 0.4.x branch, in which the module is named "pe_xl" rather than "peadm", is the only one that can be used to actually install 2018.1. I've been using the autope project to deploy test stacks, modifying the plan to change out "peadm" for "pe_xl".

The 2.x version of peadm supports installing 2019.7 (nothing older), and upgrading from 2018.1.x or from 2019.1.0 and newer.

@timidri (Contributor) commented Jun 5, 2020

@reidmv I did the same, but using the AWS provider pe_xl fails with an error we have since fixed:

  "kind": "bolt/plan-failure",
  "msg": "Hostname / DNS name mismatch: target ec2-3-120-158-94.eu-central-1.compute.amazonaws.com reports 'ip-10-138-1-58.eu-central-1.compute.internal'",
  "details": {
  }
}```

I'll try the GCP provider now.

@timidri (Contributor) commented Jun 5, 2020

And btw, I've created a symlink from peadm to pe_xl and the plan worked under its old name

@reidmv (Contributor, Author) commented Jun 5, 2020

Yeah, the 0.4.x version insisted on hostnames exactly matching the inventory names used to connect. I can confirm that in GCP that condition is met, so the deployment seems to go smoothly.

@logicminds (Contributor)

I use the docker examples in this repo to create the 2018 stack. Will need to switch to the 0.4.x stack to do so. @vchepkov @timidri

See docker examples

@@ -13,6 +13,9 @@
   # Common Configuration
   String $compiler_pool_address = $master_host,
   Array[String] $dns_alt_names = [ ],
+
+  # Options
+  Boolean $configure_node_groups = true,
Review comment (Contributor):

I think the $configure_node_groups should be dynamically generated rather than relying on a human. Create a task to find the value of PE version in order to make the boolean.

@vchepkov (Contributor) commented Jun 14, 2020

> I think the $configure_node_groups should be dynamically generated rather than relying on a human. Create a task to find the value of PE version in order to make the boolean.

I think the option not to create additional classification is very useful. There is no need for it in a standard configuration with only a primary and a replica. Also, the module doesn't provide a plan to promote a replica, and one would have trouble removing the classifications added by the module when the primary is not available to run the standard promote procedure.

  $pe_version = run_task('peadm::read_file', $master_target,
    path => '/opt/puppetlabs/server/pe_version',
  )[0][content].chomp

Review comment (Contributor):

Oh we have the PE version here already. So my comment above should use this value.
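A sketch of what the dynamic computation could look like, using the $pe_version read just above; the '2019.7.0' comparison boundary is an assumption for illustration, not a decision made in this PR:

# Illustration only: derive the flag from the detected PE version instead of
# asking the operator; the version threshold here is assumed.
$configure_node_groups = versioncmp($pe_version, '2019.7.0') >= 0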

)
# Shut down PuppetDB on CMs that use the PM's PDB PG. Use run_command instead
# of run_task(service, ...) so that upgrading from 2018.1 works over PCP.
run_command('systemctl stop pe-puppetdb', $compiler_m1_targets)
Review comment (Contributor):

This assumes systemctl is available on the host. Since 2018.1 supports RHEL 6, this wouldn't work in all cases.

https://puppet.com/docs/pe/2018.1/supported_operating_systems.html#supported_operating_systems
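A hedged sketch of one way to cover non-systemd hosts such as RHEL 6, falling back to the SysV service wrapper when systemctl is absent; this is an illustration, not code proposed in the PR:

# Illustration only: prefer systemctl when present, otherwise the SysV wrapper.
$stop_puppetdb = @(SHELL)
  if command -v systemctl >/dev/null 2>&1; then
    systemctl stop pe-puppetdb
  else
    service pe-puppetdb stop
  fi
  | SHELL
run_command($stop_puppetdb, $compiler_m1_targets)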

@@ -13,6 +13,9 @@
   # Common Configuration
   String $compiler_pool_address = $master_host,
   Array[String] $dns_alt_names = [ ],
+
+  # Options
+  Boolean $configure_node_groups = true,
Review comment (Contributor):

Would like to see this dynamically calculated instead. We can't trust humans to figure this out.

# Shut down PuppetDB on CMs that use the replica's PDB PG. Use run_command
# instead of run_task(service, ...) so that upgrading from 2018.1 works
# over PCP.
run_command('systemctl stop pe-puppetdb', $compiler_m2_targets)
Review comment (Contributor):

Same systemctl issue here

@vchepkov (Contributor) commented Jun 6, 2020

I don't want to muddy the water here, and I can open a ticket if v0.4 is still supported, but the module doesn't work for me. I use Vagrant and this JSON:

$ python -m json.tool < provision.json 
{
    "console_password": "puppet2018",
    "dns_alt_names": [
        "puppet.localdomain"
    ],
    "master_host": "primary.localdomain",
    "master_replica_host": "replica.localdomain",
    "pe_conf_data": {
        "pe_install::disable_mco": false,
        "puppet_enterprise::profile::console::display_local_time": true,
        "puppet_enterprise::profile::master::check_for_updates": false,
        "puppet_enterprise::send_analytics_data": false
    },
    "stagingdir": "/opt/staging",
    "version": "2018.1.15"
}

$ cat Puppetfile 
#
# Forge modules
#
forge 'http://forge.puppetlabs.com'

mod 'WhatsARanjit/node_manager', :latest
mod 'puppetlabs/stdlib', :latest

# To upgrade PE2018
mod 'peadm',
  :git => 'https://github.com/puppetlabs/puppetlabs-peadm.git',
  :ref => '320b60e8404c85b6cf3a78fed5149201f98c3e6b'
# To install PE2018
mod 'pe_xl',
  :git => 'https://github.com/puppetlabs/puppetlabs-peadm.git',
  :tag => '0.4.2'

Plan fails:

$ bolt plan run pe_xl::provision --params @provision.json 
Starting: plan pe_xl::provision
Starting: plan pe_xl::unit::install
Starting: task pe_xl::hostname on primary.localdomain, replica.localdomain
Finished: task pe_xl::hostname with 0 failures in 1.05 sec
Starting: file upload from /var/folders/nn/3h090sqs4g58grz5h4bhvpsr0000gn/T/pe_xl20200606-13813-1irj15t to /tmp/pe.conf on primary.localdomain
Finished: file upload from /var/folders/nn/3h090sqs4g58grz5h4bhvpsr0000gn/T/pe_xl20200606-13813-1irj15t to /tmp/pe.conf with 0 failures in 1.11 sec
Starting: plan pe_xl::util::retrieve_and_upload
Starting: task pe_xl::filesize on local://localhost
Finished: task pe_xl::filesize with 0 failures in 0.12 sec
Starting: task pe_xl::filesize on primary.localdomain
Finished: task pe_xl::filesize with 0 failures in 0.92 sec
Finished: plan pe_xl::util::retrieve_and_upload in 1.07 sec
Starting: task pe_xl::mkdir_p_file on primary.localdomain
Finished: task pe_xl::mkdir_p_file with 0 failures in 1.34 sec
Starting: task pe_xl::pe_install on primary.localdomain
Finished: task pe_xl::pe_install on primary.localdomain
Starting: task pe_xl::mkdir_p_file on primary.localdomain
Finished: task pe_xl::mkdir_p_file with 1 failure in 0.83 sec
Finished: plan pe_xl::unit::install in 6.38 sec
Finished: plan pe_xl::provision in 6.45 sec
Failed on primary.localdomain:
  The task failed with exit code 1 and no stdout, but stderr contained:
  chown: invalid user: ‘pe-puppet’
Failed on 1 target: primary.localdomain
Ran on 1 target

@timidri (Contributor) commented Jun 8, 2020

I had the same result as @vchepkov when using autope in GCP.
However, in my case it was due to the pe_xl::filesize task assuming it's executed on Linux - the stat flags on Darwin are different. The current version of the task takes the differences into account. I replaced the task and the plan succeeded for me.
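For reference, the Darwin/Linux difference comes down to stat's flags: GNU stat prints a file size with --format=%s, BSD stat with -f %z. A minimal sketch of handling both (not the actual pe_xl::filesize task; $tarball is a placeholder variable):

# Illustration only: pick the stat variant the local system actually has.
# $tarball is a hypothetical variable holding the path being measured.
$size_cmd = @("SHELL")
  if stat --version >/dev/null 2>&1; then
    stat --format=%s '${tarball}'   # GNU coreutils stat (Linux)
  else
    stat -f %z '${tarball}'         # BSD stat (macOS/Darwin)
  fi
  | SHELL
run_command($size_cmd, 'localhost')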

@reidmv (Contributor, Author) commented Jun 8, 2020

@vchepkov I talked to @timidri and he found one issue on 0.4.x which might be the same one you're running into.

When the plan fails, it is failing right after pe_xl::pe_install:

...
Starting: task pe_xl::pe_install on primary.localdomain
Finished: task pe_xl::pe_install on primary.localdomain
Starting: task pe_xl::mkdir_p_file on primary.localdomain
Finished: task pe_xl::mkdir_p_file with 1 failure in 0.83 sec

Because the PE installer is expected to fail on first install (Puppet can't run successfully before the database node is installed as well), the plan doesn't halt there. In 0.4.x particularly, any failure is ignored, and the plan proceeds. It looks like the pe_xl::mkdir_p_file is failing, but the real problem will have been pe_xl::pe_install failing. One way it could fail is if the PE tarball doesn't exist in /tmp at all, and so the PE installer can't be run. If the PE installer can't be run, then the pe-puppet user won't be created. If the pe-puppet user doesn't exist, an attempt to chown files to it will fail (causing pe_xl::mkdir_p_file to fail).
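For context on how a plan can tolerate an expected failure without ignoring everything, Bolt's _catch_errors option hands the failing result back to the plan instead of halting it. A minimal sketch under that assumption ($master_target and the omitted installer parameters are placeholders); this is not the actual 0.4.x code, which simply proceeds past any failure:

# Illustration only: let the expected first-install failure through, then report it.
$install_result = run_task('pe_xl::pe_install', $master_target,
  '_catch_errors' => true,
  # installer parameters omitted
)
unless $install_result.ok {
  out::message("pe_install reported a failure (expected on first install): ${install_result.first.error}")
}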

The bug @timidri found in 0.4.x is that pe_xl::filesize doesn't work correctly on Mac OSX, which causes the PE installer file not to be uploaded to target systems and pe_xl::pe_install to fail. I was able to successfully provision in GCP because I was running pe_xl::provision from a Linux machine. Dimitri saw this failure because he was running from his Macbook.

If you are running from a Mac OSX machine you may have seen the same failure. If you are not, try running with the --verbose flag to get more detailed information about what is going wrong.

bolt plan run pe_xl::provision --params @provision.json --verbose

@vchepkov (Contributor) commented Jun 8, 2020

Yes, I am indeed using a Mac, and I discovered that the installation file isn't being copied, so I transferred it manually; the plan then failed when provisioning the replica instead.
I am not that concerned about it, though. It's highly unlikely I will provision a new 2018 PE infrastructure anytime soon; I'm waiting for a new LTS at this point. Thanks for following up. As I offered earlier, I can submit a separate issue if you still want to support PE 2018 deployments at this point.

@reidmv (Contributor, Author) commented Jun 8, 2020

@vchepkov ah, understood. If you don't need to deploy 2018.1 yourself I would say then let's not worry about it. I definitely don't have any unsolved need to deploy it; the only reason it's coming up is @timidri trying to help me test the ability to upgrade from it.

New deployments of 2018.1 are not considered to be supported. 😄

@timidri (Contributor) commented Jun 15, 2020

@reidmv Unfortunately I couldn't complete my test run (with a 2018 environment created by pe_xl v0.4.2). The famous last words from my peadm::upgrade plan before hanging forever were:

Starting: task peadm::puppet_infra_upgrade on pe-master-6ec0c0-0.europe-west4-a.c.dbt-gcp-project.internal
Starting: task peadm::puppet_infra_upgrade on pe-master-6ec0c0-0.europe-west4-a.c.dbt-gcp-project.internal
Running task peadm::puppet_infra_upgrade with '{"type":"compiler","targets":["pe-compiler-6ec0c0-0.europe-west4-a.c.dbt-gcp-project.internal","pe-compiler-6ec0c0-2.europe-west4-c.c.dbt-gcp-project.internal"],"_task":"peadm::puppet_infra_upgrade"}' on ["pe-master-6ec0c0-0.europe-west4-a.c.dbt-gcp-project.internal"]
Running task 'peadm::puppet_infra_upgrade' on pe-master-6ec0c0-0.europe-west4-a.c.dbt-gcp-project.internal
Initializing ssh connection to pe-master-6ec0c0-0.europe-west4-a.c.dbt-gcp-project.internal
Opened session
Running '/Users/dimitri.tischenko/git/puppetlabs-autope/modules/peadm/tasks/puppet_infra_upgrade.rb' with {"type":"compiler","targets":["pe-compiler-6ec0c0-0.europe-west4-a.c.dbt-gcp-project.internal","pe-compiler-6ec0c0-2.europe-west4-c.c.dbt-gcp-project.internal"],"_task":"peadm::puppet_infra_upgrade"}
Executing: mkdir -m 700 /tmp/c73f2498-ccb3-47d4-a566-23f3e108c0d2
Command returned successfully
Uploading /Users/dimitri.tischenko/git/puppetlabs-autope/modules/peadm/tasks/puppet_infra_upgrade.rb, to /tmp/c73f2498-ccb3-47d4-a566-23f3e108c0d2/puppet_infra_upgrade.rb
Executing: chmod u\+x /tmp/c73f2498-ccb3-47d4-a566-23f3e108c0d2/puppet_infra_upgrade.rb
Command returned successfully
Executing: chmod u\+x /tmp/c73f2498-ccb3-47d4-a566-23f3e108c0d2/wrapper.sh
Command returned successfully
Executing: id -g root
Command returned successfully
Executing: sudo -S -H -u root -p \[sudo\]\ Bolt\ needs\ to\ run\ as\ another\ user,\ password:\  sh -c cd\;\ chown\ -R\ root:0\ /tmp/c73f2498-ccb3-47d4-a566-23f3e108c0d2
Command returned successfully
Executing: sudo -S -H -u root -p \[sudo\]\ Bolt\ needs\ to\ run\ as\ another\ user,\ password:\  sh -c cd\;\ /tmp/c73f2498-ccb3-47d4-a566-23f3e108c0d2/wrapper.sh

reidmv added 2 commits June 15, 2020 15:26
Otherwise it seems there's a chance it'll re-create files we need to be
absent.
@reidmv merged commit 5653b66 into master on Jun 24, 2020
@reidmv deleted the upgrade-from-2018 branch on July 7, 2020 20:40