Moritz Onken edited this page Aug 8, 2014 · 59 revisions

SysAdmin FAQ

Manual maintenance issues

How to reindex a missing module?

cd api.metacpan.org
bin/metacpan release http://cpan.metacpan.org/authors/id/X/XS/XSAWYERX/MetaCPAN-API-0.33.tar.gz --latest
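
The tarball URL follows the usual CPAN author-directory layout (first letter, first two letters, then the full author ID). As a minimal sketch, a helper like the one below can build that URL from an author ID and a file name. The function name `cpan_url` is hypothetical and not part of the metacpan tooling.

```shell
# Hypothetical helper (not part of bin/metacpan): build the mirror URL
# for a release from the author's PAUSE ID and the tarball file name.
cpan_url() {
    author=$1
    file=$2
    first=$(printf '%s' "$author" | cut -c1)    # e.g. X
    two=$(printf '%s' "$author" | cut -c1-2)    # e.g. XS
    printf 'http://cpan.metacpan.org/authors/id/%s/%s/%s/%s\n' \
        "$first" "$two" "$author" "$file"
}

# Example usage:
#   bin/metacpan release "$(cpan_url XSAWYERX MetaCPAN-API-0.33.tar.gz)" --latest
```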

How to index all unindexed dists uploaded over the last 24 hours?

bin/metacpan release --skip --age 24 --latest ~/CPAN/authors/id/

How to index the latest Perl release

bin/metacpan release http://cpan.metacpan.org/authors/id/R/RJ/RJBS/perl-5.16.0.tar.bz2
bin/metacpan release --status latest http://cpan.metacpan.org/authors/id/R/RJ/RJBS/perl-5.16.1.tar.bz2

The above syntax will force the status bit to "latest", which we need to do manually only for new, latest Perl releases. Make sure you reindex the predecessor as well.

Restarting services

The following services are set up:

metacpan-www
metacpan-api
metacpan-rrr
metacpan-watcher
elasticsearch

Each of those services can be restarted by calling service $name restart (as superuser).
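
If several services need a restart, it may help to print the commands first and review them before running anything. The sketch below only prints the commands. The function name `print_restart_commands` is hypothetical, and the order shown is simply the order of the list above, not a verified dependency order.

```shell
# Service names copied from the list above.
SERVICES="metacpan-www metacpan-api metacpan-rrr metacpan-watcher elasticsearch"

# Hypothetical helper: print one restart command per service (dry run only).
print_restart_commands() {
    for name in $SERVICES; do
        printf 'service %s restart\n' "$name"
    done
}

# Example usage (review the output, then run the lines you need as superuser):
#   print_restart_commands
```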

How to deploy a new version of MetaCPAN-API and MetaCPAN-Web?

Log in as the metacpan user (which loads perlbrew automatically), go to the appropriate folder (~/metacpan.org or ~/api.metacpan.org), pull from GitHub and restart the service as root (rcmetacpan-www restart or rcmetacpan-api restart).

For now prereqs can be installed manually with sudo /home/metacpan/bin/install_modules Foo::Bar.

How to format filesystem for metacpan.org/tmp folder

That folder contains the unpacked tarballs and as such consists of millions of small files. We will run out of inodes faster than we run out of disk space. When formatting that filesystem, ensure that the bytes-per-inode ratio (the -i option) equals the block size, so that there is one inode per block:

mkfs.ext4 -i 4096 /dev/mapper/$lv # assuming block size of that volume is 4k
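
To get a feel for what the ratio means, the inode count is roughly the filesystem size divided by the bytes-per-inode value. The sketch below does that arithmetic. The function name `estimate_inodes` and the 100 GiB size are illustrative assumptions, and the 16384 figure is only a commonly seen default ratio, so check the mke2fs configuration on the actual host.

```shell
# Rough estimate: mke2fs creates about one inode per bytes-per-inode of space.
# $1 = filesystem size in bytes, $2 = bytes-per-inode ratio (the -i value).
estimate_inodes() {
    echo $(( $1 / $2 ))
}

# Example for a hypothetical 100 GiB volume (107374182400 bytes):
#   estimate_inodes 107374182400 4096     # prints 26214400
#   estimate_inodes 107374182400 16384    # prints 6553600
# On a live system, compare against actual inode usage with: df -i
```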

How to increase storage space for ElasticSearch and the CPAN mirror?

The CPAN mirror and the ElasticSearch data are stored in /var/cpan and /var/elasticsearch, respectively. Those are filesystems on top of the LVM LVs /dev/mapper/vg0-cpan and /dev/mapper/vg0-elasticsearch.

To increase the space available on one of them, adapt the following example, which adds an additional 100 MB for the CPAN mirror. There's no need to unmount anything.

# Show current usage (and what is free, see note below)
pvscan
# Grow the LVM volume
lvextend -L +100M /dev/mapper/vg0-cpan
# Extend the filesystem to fit the new LV size
resize2fs /dev/mapper/vg0-cpan

Do NOT allocate all the unused space to logical volumes. We need some free space for LVM snapshots during the backup process. We haven't actually checked how much spare space we need for that, so let's play it safe and say that at least 1.5 GiB should be left alone.
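
One way to respect that reserve is to check the arithmetic before calling lvextend. The sketch below assumes the volume group is named vg0 (inferred from the device names above). The function name `can_extend` and the `RESERVE_MIB` variable are hypothetical.

```shell
# Reserve to leave unallocated for snapshots: 1.5 GiB, expressed in MiB.
RESERVE_MIB=1536

# Hypothetical helper: succeed only if extending by $2 MiB still leaves
# at least RESERVE_MIB free. $1 = currently free MiB in the volume group.
can_extend() {
    free_mib=$1
    request_mib=$2
    [ $(( free_mib - request_mib )) -ge "$RESERVE_MIB" ]
}

# Example usage on the server (vg0 is an assumption; run as root):
#   free=$(vgs --noheadings --nosuffix --units m -o vg_free vg0 | tr -d ' ' | cut -d. -f1)
#   can_extend "$free" 100 && lvextend -L +100M /dev/mapper/vg0-cpan
```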

Need to clean out /var/tmp/metacpan/ ?

find /var/tmp/metacpan/source/ -maxdepth 2 -type d -mtime +215 | head -5000 | xargs sudo rm -rf

This doesn't solve the underlying problem, but it cleans up files that haven't been modified or extracted in a long while.
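
Before deleting anything, it can be worth listing what the find expression would match. The sketch below is a dry run under the same age and depth limits. The function name `old_source_dirs` is hypothetical, and `-mindepth 1` is an addition (not in the command above) that keeps the source root itself out of the results.

```shell
# List candidate directories without deleting anything.
# $1 = source root, $2 = minimum age in days.
old_source_dirs() {
    find "$1" -mindepth 1 -maxdepth 2 -type d -mtime +"$2"
}

# Example usage:
#   old_source_dirs /var/tmp/metacpan/source/ 215 | wc -l     # how many matches
#   old_source_dirs /var/tmp/metacpan/source/ 215 | head -5   # spot-check a few
```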

Network/high level issues

Managing DNS

We are currently sponsored by Dyn with a DynECT Managed DNS Lite account.

URL: https://manage.dynect.net

When delegating your domain names, please use the following nameservers:

ns1.p24.dynect.net

ns2.p24.dynect.net

ns3.p24.dynect.net

ns4.p24.dynect.net

The best place to get started is to look at the DynECT Managed DNS Lite User Manual located at: https://manage.dynect.net/help

Our contact at Dyn is Chris Gonyea [email protected]. Also, alh in #metacpan works for Dyn and can help with technical issues.

ByteMark mirror

Specs:

What is the procedure if the server is unreachable?

ByteMark

We should be able to fix most stuff because we have console access (see above)

Booking.com

Contact the booking staff either by email ([email protected]) or in emergencies by phone (+31207153409). Most problems are better solved on IRC. Our contact on irc.perl.org is Seveas.

Where are log files stored?

ElasticSearch logs can be found in /opt/elasticsearch-0.20.2/logs

Where are system monitoring reports stored?

The ElasticSearch status can be queried from within the box:

$ curl localhost:9200/cpan/_status?pretty

$ curl localhost:9200/_cluster/health/cpan_v1?level=shards
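
For a quick scripted check, the cluster health response contains a "status" field (green, yellow or red) that can be pulled out with sed. This is a minimal sketch that assumes a single-line response from the plain `_cluster/health` endpoint, without `level=shards`. The function name `es_status` is hypothetical.

```shell
# Extract the "status" value from cluster-health JSON read on stdin.
es_status() {
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p' | head -n 1
}

# Example usage on the box:
#   curl -s 'localhost:9200/_cluster/health' | es_status
```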

http://munin.bm-n2.metacpan.org/metacpan.org/bm-n2.metacpan.org/

http://nagios.omega.pqpq.de/

web api watcher

Where are backups stored?

[22:03:08]  <mo>	 [20:12:26] sudo -i -u metacpan # become metacpan user
[22:03:08]  <mo>	 [20:12:32] we don't have root on that box
...
[22:03:09]  <mo>	 [20:26:53] $ mount /mnt/backup
[22:03:09]  <mo>	 [20:27:22] and backups are in /home/metacpan/api.metacpan.org/var/backup
[22:03:09]  <mo>	 [20:27:33] run bin/metacpan backup to restore

SSL certificates

Certificates are currently minted by StartSSL using their free Class 1 level process. The Class 1 level certs are good for one year, cover a top-level domain as well as a subdomain, and only require minimal personal identity information (name, email, physical address, phone). Best of all, they're free. The only caveat is that you cannot mint a cert if a cert already exists for the same DNS name and its expiry is more than two weeks out. Minting a new one in that case would first require paying a $25 revocation fee for the original cert.

Where are certs stored?

Certificates are stored in /etc/puppet/private/bm-n2/ssl/<dns-name>. Each DNS name directory contains:

  • server.key - Copy of the 2048-bit RSA private key
  • server.csr - Certificate Signing Request sent to StartSSL
  • server.pub - Public cert provided by StartSSL
  • server.crt - Combined public cert + intermediate cert + StartSSL CA root

The combined server.crt file is generated using the /etc/puppet/private/bm-n2/ssl/chain-to-startssl script and files in /etc/puppet/private/bm-n2/ssl/startssl-ca.

The server.key for api.metacpan.org, cpan.metacpan.org, and metacpan.org is a copy of 2014-01-09.key. Sharing a private key makes renewing certs easier. Note that currently www.metacpan.org has a different key.

The containing directory, ssl/, is a local git repository. This aids in not losing our keys or certs, which may be hard or impossible to replace easily if overwritten during a botched update.

Current certificates

Only four vhosts currently use SSL. You can check which vhosts expect SSL with a grep like:

cd /etc/puppet
git grep -P '(nginx::vhost|ssl)' modules/metacpan/manifests/web

api.metacpan.org

Good for api.metacpan.org and metacpan.org. Only used by api. Minted by trs.

cpan.metacpan.org

Good for cpan.metacpan.org and metacpan.org. Used by both DNS names. Minted by trs.

metacpan.org

Copy of cpan.metacpan.org files. See above.

www.metacpan.org

Good for www.metacpan.org and metacpan.org. Only used by www. before redirection. Minted by Olaf.

Renewing certificates

Renewing the certs from StartSSL requires:

  1. Re-validate control over metacpan.org via their process. Email to hostmaster@ or [email protected] will both go to [email protected] where you can see it.
  2. Skip their private key generation step and supply a CSR directly. You can and should reuse the existing CSRs to avoid problems. If you regenerate a CSR, make sure it matches the key! Compare the output of:
    openssl rsa -noout -modulus < server.key | sha1sum
    openssl req -noout -modulus < new.csr | sha1sum
  3. Save the new public cert to <dns-name>/server.pub
  4. Run chain-to-startssl <dns-name> to generate a server.crt with the appropriate certificate chain.
  5. Install new certs and restart services by running puppet: /etc/puppet/run.sh
  6. Manually verify that everything worked!
  7. Commit your changes to the local git repo:
    cd /etc/puppet/private/bm-n2/ssl
    sudo git add -A
    sudo git commit --author='Your Name <[email protected]>'

Notes from attempted upgrade on 2014-02-21

[15:55:46]  <rafl>	 how do i get a serial console on bm-n1?
[15:56:44] ranguard	 doesn't know - anyone else ?
[15:56:58]  <rwstauner>	 oalders?
[15:57:03]  <oalders>	 no idea
[15:57:15]  <oalders>	 mo is mostly the person who messed around on bm-n1
[15:57:19]  <rwstauner>	 i thought i remembered hearing about someone doing it before
[15:57:22]  <oalders>	 and it's 4 AM where he is
[15:57:26]  <rwstauner>	 d'oh
[15:57:35]  <rafl>	 oh well
[15:59:07]  <rwstauner>	 this page doesn't say "how" just that it exists http://www.bytemark.co.uk/support/technical_documents/consoleshell
[15:59:10] ranguard	 finds an email saying there is one!
[16:00:31]  <rwstauner>	 The password for the console shell will be the same as the root password we
[16:00:31]  <rwstauner>	 originally set for your VM. If you can't remember this it will be in an email that we sent when you signed up.
[16:00:31]  <oalders>	 i didn't realize we could actually power cycle the machine
[16:01:51]  <oalders>	 i found that email from mo as well "We also have access to a remote console"
[16:01:58]  <oalders>	 but no email on how to get to this console
[16:03:03]  <ranguard>	 mo: ^^ could you send instructions to [email protected] please (for future reference) :)
[16:03:34]  <rafl>	 alright. backup done
[16:03:38]  <rafl>	 let's try this new kernel
[16:06:19]  <rafl>	 did i break shit on the first step already? :)
[16:06:24]  <rafl>	 seems like it's taking awfully long to boot
[16:06:58]  <ranguard>	 probably not fschecked in a long time?
[16:07:25]  <ranguard>	 does lvm have that issue?
[16:08:47]  <rafl>	 doesn't look like there's a lot of io going on
[16:09:08]  <ranguard>	 so maybe you did break shit :)
[16:09:51]  <rafl>	 yay
[16:12:04] rafl	 reboots back to the old kernel
[16:14:03]  <oalders>	 so no joy on the kernel?
[16:14:21]  <rafl>	 not yet apparently
[16:15:08]  <rafl>	 nor on the old one it seems
[16:18:19]  <ranguard>	 I never liked the site that much in anycase :)
[16:18:31]  <rafl>	 heh
[16:23:16]  <rafl>	 *sigh*
[16:23:27]  <rafl>	 and of course the volume groups on n1 and n2 have the same names
[16:39:38]  <rafl>	 alright.. looks like the old kernel is finally booting again
[16:39:50]  <rafl>	 at least there's lots of cpu usage from n2
[16:39:50]  <ranguard>	 rafl++
[16:39:54] rafl	 gives it a couple of minutes
[16:39:57]  <rwstauner>	 :-)
[16:41:02]  <rafl>	 it's really annoying that the serial console to n2 doesn't work
[16:41:13]  <rafl>	 no idea what's actually going on in the vm as it boots
[16:42:37]  <rafl>	 man. this wasn't how i thought i'd spend my friday evening :-/
[16:42:49]  <rwstauner>	 :-/
[16:42:52]  <rwstauner>	 :-(
[16:42:56]  <ranguard>	 :(
[16:43:02]  <rwstauner>	 i can't remember how to log into n1
[16:43:56]  <ranguard>	 rwstauner: 46.43.35.68 port 22
[16:44:28]  <rwstauner>	 oh silly me
[16:44:37]  <rwstauner>	 i tried 2201 and 4010
[16:45:05]  <oalders>	 rafl: anything you need help with?
[16:46:29]  <rafl>	 fixing the thing? :)
[16:48:16]  <oalders>	 :D
[16:52:27]  <rafl>	 ah. finally.. an early boot console
[16:55:24]  <rafl>	 it appears as if the initrd doesn't set up the logical volumes anymore, even though both the initrd and the kernel have the same checksum in /boot as they have in yesterday's backup before we touched anything
[16:55:52]  <rafl>	 i think i'm just gonna bring it up manually
[16:59:23]  <rafl>	 it appears the last reboot of n2 was in october last year. does that sound about right?
[17:00:23]  <oalders>	 that was probably when the box got moved physically
[17:01:06]  <rafl>	 alright.. let's see what backups i got. something in the early boot must've changed since then and today before we started
[17:04:05]  <ranguard>	 woo, ssh access :)
[17:04:42]  <rafl>	 http://gist.github.com/2d6ee6504916813975c8
[17:05:00]  <rafl>	 the grub.cfg changes are irrelevant
[17:05:34]  <rafl>	 but someone seems to have overwritten the initrd and kernel with something that doesn't actually boot without problems anymore
[17:05:50]  <rwstauner>	 oh dear
[17:06:49]  <rafl>	 alright
[17:06:58]  <ranguard>	 I upgraded: apache2-utils base-files file libmagic1 tzdata tzdata-java
[17:07:12]  <rafl>	 so we're not gonna do any system upgrade this evening
[17:07:41]  <rafl>	 i wanna take some time to go through those two initrds and find what's causing the trouble we're seeing from just trying to reboot into the kernel we had running for months
[17:07:50]  <ranguard>	 fair enough
[17:08:06]  <ranguard>	 we can also get the console info from mo
[17:08:29]  <rafl>	 we could do it today, but i'm afraid that'd kinda late and i'd start making dumb mistakes
[17:08:42]  <ranguard>	 there is no rush
[17:08:49]  <rafl>	 so until we give upgrading another try, could you guys please have a look at n2 and see what's going on?
[17:09:13]  <rafl>	 ES seems like it's spinning
[17:09:26]  <rafl>	 but maybe it's just still spinning up or something?
[17:09:47] ranguard	 takes down the www
[17:09:54]  <oalders>	 ES takes a while to warm up
[17:10:35]  <ranguard>	 that's what was thrassing cpu, starting up again...
[17:11:33]  <ranguard>	 home page loading
[17:12:21]  <rafl>	 alright. looks like stuff is pretty much back to normal, albeit somewhat slow while the caches are warming up
[17:12:39]  <ranguard>	 thanks for trying rafl, really glad we had you looking at it, think we'd have been stuffed otherwise!
[17:12:58]  <ranguard>	 this makes it really clear we MUST get a second machine
[17:13:50]  <oalders>	 yeah, if things go badly we've got a huge problem to deal with
[17:14:12]  <ranguard>	 also upgrade ES to version 1 so we can take backups to S3
[17:14:42]  <oalders>	 right
[17:15:07]  <rafl>	 http://gist.github.com/a432c6a0ecfa49f3d5e3
[17:15:38]  <rafl>	 when did we ever have mdraid on bm-n2?
[17:17:50]  <ranguard>	 mo and you setup all the disk stuff
[17:19:11]  <rafl>	 to my knowledge, we never had any dmraid on that machine
[17:19:54]  <rafl>	 http://gist.github.com/98049516e57dd8328aa0 is the only other change i can find in the early boot process
[17:20:26]  <rafl>	 that should be fine, though, as the UUID and the device path refer to the same device
[17:21:16]  <rafl>	 unless perhaps that difference changes how the initramfs enables the lvm volumes?
[17:21:24]  <rafl>	 very much doubt that
[17:21:24] ranguard	 needs to head to bed now
[17:22:03]  <rafl>	 night, ranguard. thanks for your help
[17:22:06]  <ranguard>	 we'll have to try again another day
[17:22:16]  <ranguard>	 heh, I just watched - you were to superstar :)
[17:22:48] ranguard	 waves - ttfn
[17:25:11]  <rafl>	 o/
[17:25:30]  <rafl>	 hope we figure out what's wrong with the initramfs soon so we can try again before upgrading becomes urgent
[17:28:41]  <rafl>	 got it, i think
[17:28:46]  <rafl>	 https://gist.github.com/rafl/ac28fec130afbf969ceb
[17:29:08]  <rafl>	 the way activate_vg is implemented, it's not gonna do anything useful with a block device uuid
[17:29:41]  <rafl>	 so reverting that grub configuration change should do the trick
[17:29:55]  <rafl>	 just gotta figure out what caused the generated grub.cfg to change
[17:30:08]  <rafl>	 gonna do that over the weekend, i think. do feel free to beat me to it :)
[17:31:02]  <oalders>	 rafl: thanks for doing this. i have no clue about this stuff
[17:36:31]  <rwstauner>	 ralf+++++
[17:51:07]  <rafl>	 http://gist.github.com/b01aa952a6b9aa6da44b # in case you guys gotta restart that machine before i got things back to normal
[18:00:19]  <oalders>	 thanks. let's hope it doesn't come to that :)

Meta
