Running a mirror for your favorite distribution is much easier than it sounds when leveraging the right tools (and the work of others, including yours truly).
You might be interested to see what we (the Public Cloud team and Web Operations team at Canonical) developed over the years to mirror the Ubuntu archives for our cloud partners (taking a hybrid cache approach).
It's been available as free software from the start but perhaps lacked a little visibility, not being a top-level project on Launchpad.
Traditional approaches to building local archive mirrors
There are generally two main strategies when it comes to archive mirroring: an actual mirror of everything, or a cache of everything.
1. Mirroring everything
The go-to solution for this particular strategy is either to roll your own sync scripts or to use something like ubumirror (which I've written about in a previous blog post) to rsync all of the archive's contents to a local disk.
This might be a good solution for places with very limited bandwidth (the initial sync being done with a good old hard drive, for example), but has a few drawbacks:
- It's very big. The whole Ubuntu archive (for amd64 and i386) sits at around 2.5 TB. That's a lot of disk space and bandwidth.
- A simple rsync script won't usually check the consistency of the metadata, and since the metadata transfer can take some time, clients are likely to see a half-updated set of indices. This typically manifests as "hashsum mismatch" errors on the apt client side.
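To make the "hashsum mismatch" problem a bit more concrete, here's a minimal sketch of the kind of check a plain rsync script skips: comparing an index file against the SHA256 recorded for it in the Release file. The verify_index function name and its arguments are made up for this example, it is not what the charm or ubumirror actually run, and it deliberately ignores signature verification.

```shell
#!/bin/sh
# Illustrative only: verify_index is a hypothetical helper, not part of any tool.
# Usage: verify_index RELEASE_FILE DIST_DIR RELATIVE_PATH
# Returns 0 if the file's SHA256 matches the Release entry, 1 on mismatch,
# 2 if the path is not listed in the SHA256 section at all.
verify_index() {
    release=$1
    dist_dir=$2
    rel_path=$3

    # Entries in the "SHA256:" section look like: " <hash> <size> <path>"
    expected=$(awk -v p="$rel_path" '
        /^SHA256:/ { in_sha = 1; next }   # start of the SHA256 section
        /^[^ ]/    { in_sha = 0 }         # any unindented line ends a section
        in_sha && $3 == p { print $1; exit }
    ' "$release")

    [ -n "$expected" ] || return 2

    actual=$(sha256sum "$dist_dir/$rel_path" | cut -d" " -f1)
    [ "$expected" = "$actual" ]
}

# Example (paths are assumptions):
#   verify_index dists/xenial/Release dists/xenial main/binary-amd64/Packages.gz \
#       || echo "index does not match Release, do not publish yet"
```

A sync job that ran this over every listed index before publishing would catch the half-transferred state that apt otherwise trips over.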
2. Caching everything
This strategy usually means running Squid or another caching proxy specifically configured to be aware of Debian packages and metadata layout. One such project is squid-deb-proxy, which will happily cache Debian packages in a sensible way.
More viable than the "full mirror" case, this currently still exhibits the same metadata inconsistency problem as the full mirror option.
3. Our strategy: a hybrid
For our charm (ubuntu-repository-cache), a hybrid approach was taken: the metadata (roughly, this is everything except pool/) is mirrored locally on a schedule, and only served to clients once its internal consistency has been established. Thus the package indices never produce "hashsum mismatch" errors.
The pool/ part of the archive is then cached using squid3. This means that, with default settings, disk space requirements are much lower than for a full mirror, and the disk is spent on packages that actually get requested rather than on everything under the sun (most packages have a very small chance of ever being used, even more so in a cloud environment).
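The "only serve metadata once it's consistent" idea can be sketched with a symlink swap: sync into a fresh directory, run a check, and only then repoint the directory the web server exposes. This is a sketch under assumptions, not the charm's actual code; publish_if_consistent and the directory layout are invented for illustration, and mv -T is a GNU coreutils option.

```shell
#!/bin/sh
# Illustrative only: publish_if_consistent is a hypothetical helper.
# Usage: publish_if_consistent STAGING_DIR LIVE_LINK CHECK_COMMAND [ARGS...]
# Runs CHECK_COMMAND with STAGING_DIR appended; on success, atomically
# repoints LIVE_LINK at STAGING_DIR. On failure the old snapshot stays live.
publish_if_consistent() {
    staging=$1
    live=$2
    shift 2

    # Consistency check failed: keep serving the previous snapshot.
    "$@" "$staging" || return 1

    # Create the new symlink under a temporary name, then rename it over
    # the live one. rename(2) is atomic, so clients never see a gap.
    tmp_link="$live.tmp.$$"
    ln -s "$staging" "$tmp_link" || return 1
    mv -T "$tmp_link" "$live"
}

# Example (paths and the check script are assumptions):
#   publish_if_consistent /srv/meta/2016-10-01T12 /srv/www/ubuntu/dists ./check-dists.sh
```

Keeping the previous snapshot around until the new one is live would also fit the "at most two copies on disk" behaviour, though the charm may well implement it differently.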
As an added benefit, running the charm means you're running the exact same software that our infrastructure team runs in the biggest public clouds in the world, and you benefit from the same vigilance and expertise that we apply to our own systems. For free.
Deploying your own caching mirror with juju
For this deployment we'll first need to configure Juju. If it's not your first time playing with Juju you can skip right ahead to the juicy bits :)
This is intended to kick-start a deployment on a local LXD (as an example), but of course will work on any other cloud supported by juju. For a more detailed introduction to juju please refer to the juju documentation.
To make sure we're up-to-date, let's install juju from the snap package:
sudo snap install --classic juju
Juju comes with a "localhost" cloud leveraging LXD. Since that is free
(both as in speech and as in beer), we'll use it as a reference cloud, but
the instructions should work for any other supported cloud provider (see
# This gives you a local LXD backed juju environment
juju bootstrap localhost
Creating a charm configuration file
Let's ask our test deployment to only care about xenial, in order to speed up initial sync with the upstream archives and save disk space:
cat > cache.yaml << EOF
ubuntu-repository-cache:
  mirror-series: xenial
EOF
Each series will download around 2.4 GB of metadata on creation, and then download it again every hour. The charm keeps at most two copies of the metadata on disk, so you should plan for about 5 GB per series.
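If you want a quick back-of-the-envelope figure before picking a disk size, the rule of thumb above (roughly 5 GB per mirrored series) is easy to script. The metadata_disk_gb helper is made up for this post, and the series names passed to it are only counted, not validated.

```shell
#!/bin/sh
# Illustrative only: rough metadata disk estimate, using the ~5 GB per
# series figure quoted above (two copies of ~2.4 GB each, rounded up).
# Usage: metadata_disk_gb SERIES [SERIES...]
metadata_disk_gb() {
    per_series_gb=5
    echo $(( $# * per_series_gb ))
}

# Example:
#   echo "Plan for about $(metadata_disk_gb xenial) GB of metadata"
```

This only covers the metadata; the squid package cache comes on top of it.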
Actually deploying the charms
As usual in Juju land, the actual deployment couldn't be easier!
juju deploy --config=cache.yaml ubuntu-repository-cache
juju deploy haproxy
juju add-relation ubuntu-repository-cache haproxy
juju expose haproxy
If you're using a local deployment, you can see that juju spawned 3 LXD containers: one juju controller, an HAproxy machine and the archive charm itself.
Scaling our deployment
You can then scale the archive cluster up by simply running
juju add-unit ubuntu-repository-cache
Using our fresh new archive
Simply pointing apt at the newly exposed HAproxy (public) IP address should just work!
Here's an example snippet to add to your sources.list configuration file:
deb http://<HAproxy's IP address>/ubuntu/ xenial main universe
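Before touching a real machine's apt configuration, it can be handy to generate the line and check that the mirror actually answers. The sketch below assumes the /ubuntu/ prefix from the snippet above and the standard dists/<series>/Release layout that apt expects; the two helper functions, the 203.0.113.10 placeholder address and the local-mirror.list file name are all invented for this example.

```shell
#!/bin/sh
# Illustrative only: both helpers are hypothetical, and the address is a
# documentation placeholder to be replaced with your HAproxy IP.
MIRROR_HOST=${MIRROR_HOST:-203.0.113.10}

# Print a sources.list line for the given mirror host and series.
sources_line() {
    printf 'deb http://%s/ubuntu/ %s main universe\n' "$1" "$2"
}

# Print the URL of the Release file apt would look for on that mirror.
release_url() {
    printf 'http://%s/ubuntu/dists/%s/Release\n' "$1" "$2"
}

# Example:
#   curl -fsSI "$(release_url "$MIRROR_HOST" xenial)"   # does the mirror answer?
#   sources_line "$MIRROR_HOST" xenial | sudo tee /etc/apt/sources.list.d/local-mirror.list
#   sudo apt-get update
```

If the curl check fails right after deployment, the initial metadata sync may simply not have finished yet.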
Our deployments in the clouds
This is the charm that serves the Ubuntu archives for most of the cloud instances you boot on the major cloud providers. We have one deployment per cloud region, and make sure cloud-init sets the default ubuntu archive's address to them when relevant.
For a production deployment we use at least 2 ubuntu-repository-cache instances (juju deploy -n 2 ubuntu-repository-cache) behind 2 HAproxy instances (juju deploy -n 2 haproxy) that are balanced with DNS round-robin.
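Put together, the production layout described above could look something like the following. This is a sketch built from the commands already shown in this post, not our actual deployment tooling; the deploy_production wrapper and the JUJU variable are only there so the sequence can be dry-run (for example with JUJU=echo), and the DNS round-robin part has to be set up separately with your DNS provider.

```shell
#!/bin/sh
# Illustrative only: deploy_production is a hypothetical wrapper.
# Set JUJU=echo to print the commands instead of running them.
JUJU=${JUJU:-juju}

deploy_production() {
    # Two cache units behind two HAproxy units, stopping at the first failure.
    "$JUJU" deploy -n 2 ubuntu-repository-cache &&
    "$JUJU" deploy -n 2 haproxy &&
    "$JUJU" add-relation ubuntu-repository-cache haproxy &&
    "$JUJU" expose haproxy
}

# Example:
#   deploy_production
```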
Disk and memory sizing for the ubuntu-repository-cache units
The squid cache space is computed based on available memory and disk space.
On a machine with 12 GB of RAM and 300 GB of root disk, the following usage is observed: ~200 GB of disk space dedicated to package caching, plus about 22 GB of disk for the default series metadata (the default being what you get when deploying without a configuration file).
You can replicate this setup easily with the following deployment command:
juju deploy --constraints "mem=12G root-disk=300G" ubuntu-repository-cache
This unfortunately doesn't work with the local LXD provider right now, but should work with most other cloud options offered by juju.
A note on disk size for the local provider
At the time of writing, the local substrate unfortunately does not honour disk size constraints, so all LXD containers are created with a root disk of 10 GB regardless of what is specified. This only applies to the LXD substrate, however, and I'm sure the problem will be fixed in a future version.
Don't hesitate to leave a comment on Reddit!