Since the very beginning of the Data Center Light project our servers have been mostly stateless and have booted their operating system from the network.
From today on this changes: our servers are being switched to boot from a disk (SSD/NVMe/HDD). While this may at first seem counterintuitive for a growing data center, let us explain why it makes sense for us.
Netboot in a nutshell
There are different variants of how to netboot a server. In every case, the server loads an executable from the network, typically via TFTP or HTTP, and then hands over execution to it.
The first option is to load the kernel and then later switch to an NFS-based root filesystem. If the filesystem is read-write, you usually need one export per server; alternatively, you mount it read-only and possibly apply an overlay for runtime configuration.
The second option is to load the kernel and an initramfs into memory and stay inside the initramfs. The advantage of this approach is that no NFS server is needed, but the whole operating system resides in memory.
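The in-memory variant can be sketched with an iPXE script along the following lines. The host name, file names and kernel arguments are made-up placeholders, not our actual boot configuration:

```
#!ipxe
# Autoconfigure the network (DHCPv4 or, in an IPv6-only network,
# router advertisements / DHCPv6)
dhcp
# Fetch kernel and initramfs over HTTP and boot; the system then
# keeps running from the initramfs in memory
kernel http://boot.example.com/vmlinuz console=ttyS0
initrd http://boot.example.com/initramfs.img
boot
```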
The second option is what we used in Data Center Light for the last couple of years.
Netboot history at Data Center Light
Originally all our servers started with IPv4 PXE-based netboot. However, as our data center is, generally speaking, IPv6 only, the IPv4 DHCP+TFTP combination is extra maintenance overhead and also a hindrance for network debugging: in a single-stack, IPv6-only network, things are much easier to debug. No need to look at two routing tables, no need to work around DHCP settings that might interfere with what one wants to achieve via IPv6.
As the IPv4 addresses became more of a technical debt in our infrastructure, we started flashing our network cards with iPXE, which allows even older network cards to boot in IPv6-only networks.
In an IPv6-only netboot environment it is also easier to run active-active routers, as hosts are not tied to DHCP leases: they assign addresses to themselves, which scales much more nicely.
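This self-assignment works because with SLAAC a host can derive its own address from the router-advertised prefix, classically using the EUI-64 scheme (modern hosts often use randomized identifiers instead). A minimal illustration in Python, with a made-up prefix and MAC address:

```python
import ipaddress

def eui64_address(prefix: str, mac: str) -> ipaddress.IPv6Address:
    """Derive a SLAAC address from an IPv6 /64 prefix and a MAC address."""
    octets = bytearray(int(x, 16) for x in mac.split(":"))
    octets[0] ^= 0x02                     # flip the universal/local bit
    iid = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    net = ipaddress.IPv6Network(prefix)
    return net[int.from_bytes(iid, "big")]

print(eui64_address("2001:db8::/64", "aa:bb:cc:dd:ee:ff"))
# → 2001:db8::a8bb:ccff:fedd:eeff
```

Since every host can compute its address locally, no central lease database has to be kept in sync between redundant routers.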
Migrating away from netbooting
So why are we migrating away from netbooting, even after moving to IPv6-only networking? There are multiple aspects:
On power failure, netbooted hosts lose their state. The operating system that is loaded is the same for every server and needs some post-boot configuration. We have solved this using cdist; however, the authentication-trigger mechanism is non-trivial if you want to keep your netboot images and build steps public.
The second reason is state synchronisation: as we run multiple boot servers, we need to maintain the same state on all of them. That is solvable via CI/CD pipelines; however, the level of automation on the build servers is rather low, because OS changes are infrequent.
The third and main point is our ongoing migration towards kubernetes. Originally our servers would boot up and get configured to provide ceph storage or to be a virtualisation host. The amount of binaries to keep in our in-memory image was tiny, in the best case around 150MB. With the migration towards kubernetes, every node downloads containers, which can be comparatively huge (gigabytes of data). The additional pivot_root workarounds required when running from an initramfs are just a further minor point that made us question our setup.
Automating disk based boot
We have servers from a variety of brands, and each of them comes with a variety of disk controllers: from simple pass-through SATA controllers to full-fledged hardware RAID with onboard cache and a battery protecting that cache - everything is in the mix.
So it is not easily possible to pre-install a stack of disks somewhere and then insert them, as the disk controller might add its own (RAID0) metadata to them.
To work around this problem, we insert the future boot disk into the still-netbooted server, install the operating system onto it from the running environment, and at the next maintenance window ensure that the server actually boots from it.
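Such an install-from-the-running-system could look roughly like the following. This is a generic, hypothetical sketch (device name, Debian-style debootstrap and GRUB are assumptions, not our actual procedure) and is of course destructive to the target disk:

```shell
set -e
DISK=/dev/sda                       # the freshly inserted future boot disk

# 1. Partition: a small EFI system partition plus a root partition
parted -s "$DISK" mklabel gpt \
    mkpart ESP fat32 1MiB 513MiB set 1 esp on \
    mkpart root ext4 513MiB 100%
mkfs.vfat "${DISK}1"
mkfs.ext4 "${DISK}2"

# 2. Install a base system onto the mounted disk
mount "${DISK}2" /mnt
mkdir -p /mnt/boot/efi
mount "${DISK}1" /mnt/boot/efi
debootstrap stable /mnt

# 3. Install a bootloader so the next (maintenance-window) reboot
#    starts from the disk instead of the network
for d in dev proc sys; do mount --rbind "/$d" "/mnt/$d"; done
chroot /mnt grub-install "$DISK"
chroot /mnt update-grub
```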
The road continues
While a data center needs to be stable, it also needs to adapt to newer technologies and changing workflows. Disk-based boot is our current solution on the path towards kubernetes, but who knows - in the future things might look different again.
If you want to join the discussion, we have a Hacking and Learning (#hacking-and-learning:ungleich.ch) channel on Matrix for an open exchange.
Oh, and in case you were wondering what we did today: we switched to disk-based booting - that case is full of SSDs, not 1'000 CHF banknotes.