Running an Ubuntu mirror with Juju!

Thu 31 August 2017

Running a mirror for your favorite distribution is much easier than it sounds when leveraging the right tools (and the work of others, including yours truly).

You might be interested to see what we (the Public Cloud team and Web Operations team at Canonical) developed over the years to mirror the Ubuntu archives for our cloud partners (taking a hybrid cache approach).

It's been available as free software from the start, but it perhaps lacked visibility since it isn't a top-level project on Launchpad.

A rear view mirror

Traditional approaches to building local archive mirrors

There are generally two main strategies when it comes to archive mirroring: an actual mirror of everything, or a cache of everything.

1. Mirroring everything

The go-to solution for this particular strategy is either to roll your own sync scripts, or use something like ubumirror as I've written about in a previous blog post, to rsync all of the archive's contents to a local disk.

This might be a good solution for places with very limited bandwidth (the initial sync being done with a good old hard drive, for example), but has a few drawbacks:

It's very big. The whole Ubuntu archive (for amd64 and i386) sits at around 2.5 TB. That's a lot of disk space and bandwidth.
A simple rsync script won't usually check the consistency of the metadata, and since the metadata transfer can take some time, it's likely to introduce inconsistencies. These typically manifest as "hashsum mismatch" errors on the apt client side.
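
To make that inconsistency concrete: a Release file lists the SHA256 of every index it covers, and apt complains when an index it downloads doesn't match. Here's a minimal sketch of the same check run on the mirror side. The function name check_release is made up for this post, and it assumes an uncompressed Release file with a standard "SHA256:" section; it's an illustration, not what ubumirror or the charm actually run.

```shell
# Hypothetical helper: verify that every index listed in the SHA256 section
# of <dist_dir>/Release and present on disk matches its recorded hash.
# The body runs in a subshell, so "exit" only ever leaves the function.
check_release() (
    dist_dir=$1
    awk '
        /^SHA256:/ { in_section = 1; next }   # start of the SHA256 section
        /^[^ ]/    { in_section = 0 }         # any other field ends it
        in_section { print $1, $3 }           # emit "hash path"
    ' "$dist_dir/Release" |
    while read -r want file; do
        # A mirror rarely carries every listed file; skip the absent ones.
        [ -f "$dist_dir/$file" ] || continue
        got=$(sha256sum "$dist_dir/$file" | cut -d' ' -f1)
        if [ "$got" != "$want" ]; then
            echo "hashsum mismatch: $file" >&2
            exit 1
        fi
    done
)
```

You would call it on one dists/<series> directory of your mirror and treat a non-zero exit status as "do not serve this yet".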

2. Caching everything

This strategy usually means running squid or another caching proxy specifically configured to be aware of Debian packages and metadata layout. One such project is squid-deb-proxy, which will happily cache Debian packages in a sensible way.

This is more viable than a full mirror, but a plain caching proxy still exhibits the same metadata inconsistency problem.

3. Our strategy: a hybrid

In the charm's case, a hybrid approach was taken: the metadata (roughly, this is everything except pool/) is mirrored locally on a schedule, and only served to clients once its internal consistency has been established. Thus the package indices never produce "hashsum mismatch" errors.
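
The "only serve it once it's consistent" idea can be sketched as a check followed by an atomic symlink swap. This is my illustration of the general technique, not a copy of the charm's code: the function name publish_snapshot and the directory layout are assumptions, and mv -T is specific to GNU coreutils.

```shell
# Hypothetical helper: point the served symlink at a freshly synced metadata
# snapshot, but only if the given check command accepts that snapshot.
publish_snapshot() {
    snapshot=$1    # absolute path of the directory holding the new metadata
    live=$2        # symlink that the web server serves from
    shift 2        # remaining arguments: the consistency check to run

    if ! "$@" "$snapshot"; then
        echo "snapshot $snapshot is inconsistent, keeping the old one" >&2
        return 1
    fi
    # Build the new link beside the live one, then rename it into place.
    # A rename is atomic, so clients see either the old tree or the new one.
    ln -sfn "$snapshot" "$live.new" &&
    mv -T "$live.new" "$live"
}
```

The check is passed as a command, so anything that exits non-zero on a bad tree will do; the snapshot path is appended as its last argument.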

The pool/ part of the archive is then cached using squid3. With default settings, disk space requirements are much lower than for a full mirror, and the disk is spent on packages that clients actually request rather than on everything under the sun (most packages have a very small chance of ever being used, even more so in a cloud environment).

As an added benefit, running the charm means you're running the exact same software that our infrastructure team runs in the biggest public clouds in the world, and you benefit from the same vigilance and expertise that we apply to our own systems. For free.

Deploying your own caching mirror with juju

For this deployment we'll first need to configure Juju. If it's not your first time playing with Juju you can skip right ahead to the juicy bits :)

This is intended to kick-start a deployment on a local LXD (as an example), but of course will work on any other cloud supported by juju. For a more detailed introduction to juju please refer to the juju documentation.

To make sure we're up-to-date, let's install juju from the snap package:

sudo snap install --classic juju

Juju comes with a "localhost" cloud that leverages LXD. Since it is free (both as in speech and as in beer), we'll use it as our reference cloud, but the instructions should work for any other supported cloud provider (see juju list-clouds).

# This gives you a local LXD backed juju environment
juju bootstrap localhost

Creating a charm configuration file

Let's ask our test deployment to only care about xenial, in order to speed up initial sync with the upstream archives and save disk space:

cat > cache.yaml << EOF
ubuntu-repository-cache:
  mirror-series: xenial
EOF

Each series will download around 2.4 GB of metadata on creation, and then download it again every hour. At most two copies of the metadata are kept on disk, so you should plan for about 5 GB per series.
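
If you plan to mirror several series, the arithmetic is simple enough to script. The helper below is hypothetical and only encodes the rough 5 GB per series figure from this post; real usage will vary with the series.

```shell
# Rough planning figure from the text: about 5 GB of metadata per mirrored
# series (two copies of roughly 2.4 GB each).
metadata_disk_gb() {
    per_series_gb=5
    echo $(( $# * per_series_gb ))
}

metadata_disk_gb xenial                 # one series
metadata_disk_gb trusty xenial zesty    # three series
```

This covers the metadata only; the squid package cache needs its own space on top.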

Actually deploying the charms

As usual in Juju land, the actual deployment couldn't be easier!

juju deploy --config=cache.yaml ubuntu-repository-cache
juju deploy haproxy
juju add-relation ubuntu-repository-cache haproxy
juju expose haproxy

If you're using a local deployment, you can see that juju spawned 3 LXD containers: one juju controller, an HAproxy machine and the archive charm itself.

Scaling our deployment

You can then scale the archive cluster up by simply running

juju add-unit ubuntu-repository-cache

Using our fresh new archive

Simply pointing apt at the newly exposed HAproxy (public) IP address should just work!

Here's an example snippet to add to your sources.list configuration file:

deb http://<HAproxy's IP address>/ubuntu/ xenial main universe
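
If you'd rather generate those lines than type them, something like the following works. It's a sketch: the function name mirror_sources is made up, and the choice of pockets (-updates, -security) and components assumes your mirror carries the same ones as the upstream archive.

```shell
# Hypothetical helper: print sources.list entries for a mirror address.
# The pockets and components chosen here are assumptions.
mirror_sources() {
    host=$1      # HAproxy's public IP address or hostname
    series=$2    # for example: xenial
    for pocket in "" -updates -security; do
        echo "deb http://$host/ubuntu/ $series$pocket main universe"
    done
}
```

Redirect the output of, say, mirror_sources 10.0.0.5 xenial (a placeholder address) into a file under /etc/apt/sources.list.d/ and run apt update.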

Our deployments in the clouds

This is the charm that serves the Ubuntu archives for most of the cloud instances you boot on the major cloud providers. We have one deployment per cloud region, and make sure cloud-init sets the default ubuntu archive's address to them when relevant.

For a production deployment we use at least 2 ubuntu-repository-cache instances (juju deploy -n 2 ubuntu-repository-cache) behind 2 HAproxy instances (juju deploy -n 2 haproxy) that are balanced with DNS round-robin.
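
Put together, the production layout uses the same commands as the test deployment, just with more units. The wrapper below is a hypothetical convenience of mine, not something we ship: it prints the commands by default so you can review them, and only runs them when DRY_RUN=0 is set. The DNS round-robin part is left to your DNS provider.

```shell
# Print (DRY_RUN=1, the default) or execute (DRY_RUN=0) a command.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper around the commands quoted in this post.
deploy_production() {
    units=${1:-2}
    run juju deploy -n "$units" ubuntu-repository-cache
    run juju deploy -n "$units" haproxy
    run juju add-relation ubuntu-repository-cache haproxy
    run juju expose haproxy
}

deploy_production 2
```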

A diagram of our deployments in the clouds

Disk and memory sizing for the ubuntu-repository-cache units

The squid cache space is computed based on available memory and disk space.

On a machine with 12 GB of RAM and a 300 GB root disk, we observe the following usage: about 200 GB of disk space dedicated to package caching, plus about 22 GB of disk for the metadata of the default series (that is, what you get when deploying without a configuration file).

You can replicate this setup easily with the following deployment command:

juju deploy --constraints "mem=12G root-disk=300G" ubuntu-repository-cache

This unfortunately doesn't work with the local LXD provider right now, but should work with most other cloud options offered by juju.

A note on disk size for the local provider

At the time of writing, the local substrate unfortunately does not honour disk size constraints, so all LXD containers are created with a root disk of 10 GB regardless of what is specified. This only applies to the LXD substrate, however, and I'm sure the problem will be fixed in a future version.

More information

More information about the many deployment options for monitoring and general configuration can be found on the charm's store page or by browsing the code.

Comments? Questions?

Don't hesitate to leave a comment on Reddit!