I've got a medium-sized project that's just nearing the end of the "sloppy caffeine-powered prototypes for client demos" phase and transitioning into the "think about the future" phase. The project consists of Linux-based devices with software and firmware, and a central administrative web server.
Not being well-versed in the art of auto-updates, and being short on time, I had quickly rolled my own software deployment / auto-update strategy and, frankly, it sucks. It currently consists of the following:
A hosted git repo (GitLab) with a production release branch (note the web server source is also in this same repo, as well as a few other things).
A "deploy update" button on the web interface that:
Pulls the latest version from the production release branch into a local repo area and also copies it to a temporary package prep staging area.
Runs a sanitization script (stored in the repo) in the staging area to remove unrelated source files (e.g. server source, firmware source, etc.) and .git files.
Writes the current git hash to a file in the update package (purpose will become clear below).
If all went well, it gzips it and makes it ready to serve by overwriting the previous gzipped package with a file of the same name, then deletes the staging area.
Note that there are now two copies of the current device software on the server, which are expected to be in sync: A full local git repo on the latest production branch, and a ready-to-go gzipped package that is now assumed to represent that same version.
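To make the "deploy update" steps above concrete, here is a minimal sketch of the packaging half, assuming Python on the server. The names `VERSION`, `EXCLUDE`, and `build_package` are hypothetical (the post does not name the hash file or the sanitization rules), and the temp-file-plus-rename step is my assumption about how to overwrite the served package without ever exposing a half-written file.

```python
import os
import shutil
import tarfile
import tempfile
from pathlib import Path

# Hypothetical names: the real hash file name and exclusion list are not
# given in the post.
HASH_FILE = "VERSION"
EXCLUDE = (".git", "server", "firmware_src")


def build_package(checkout: Path, commit_hash: str, served: Path) -> Path:
    """Stage, sanitize, stamp, and publish a gzipped update package."""
    staging = Path(tempfile.mkdtemp(prefix="pkg-staging-"))
    try:
        root = staging / "package"
        # Copy the checkout, skipping excluded names at any depth.
        shutil.copytree(checkout, root, ignore=shutil.ignore_patterns(*EXCLUDE))
        # Record which commit this package was built from.
        (root / HASH_FILE).write_text(commit_hash + "\n")

        served.parent.mkdir(parents=True, exist_ok=True)
        tmp = served.with_name(served.name + ".tmp")
        with tarfile.open(tmp, "w:gz") as tar:
            for child in sorted(root.iterdir()):
                tar.add(child, arcname=child.name)
        # Rename within the same directory, so clients see either the old
        # package or the new one, never a partial file.
        os.replace(tmp, served)
    finally:
        # The staging area is deleted whether or not the build succeeded.
        shutil.rmtree(staging, ignore_errors=True)
    return served
```

Because the hash is written into the package itself, the package could in principle be treated as the single source of truth for "what version is being served", rather than relying on the local repo staying in sync with it.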
Software on the device is self-contained in a directory named /opt/example/current, which is a symlink to the current version of the software.
An auto-update function on the device that, on boot:
Checks for the presence of a do_not_update file and takes no further action if it exists (for dev devices, see below).
Reads the current commit hash from the above-mentioned text file.
Makes an HTTP request to the server with that hash as a query parameter. The server will either respond with a 304 (hash is current version) or will serve the gzipped update package.
Installs the update package, if one was received, into /opt/example by:
Renaming the current software root folder to backup.
Extracting the updated software into a folder named latest.
Updating the current symlink to point to latest.
Running a post-installation script from the update package that does things like writing new firmware to various hardware components and making other necessary local changes for that update.
Reboots the device.
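The install steps above can be sketched roughly as follows, assuming Python on the device. The function name `install_update` and the `current.tmp` link name are hypothetical, and swapping the symlink via a temporary link plus rename is my assumption, not something the current updater does. The post-installation script and the reboot are left out.

```python
import os
import shutil
import tarfile
from pathlib import Path


def install_update(package: Path, base: Path) -> None:
    """Install a gzipped package under base (e.g. /opt/example)."""
    current = base / "current"
    backup = base / "backup"
    latest = base / "latest"

    # 1. Rename the folder that "current" points at to "backup".
    if current.is_symlink():
        old_root = current.resolve()
        if old_root.exists() and old_root != backup.resolve():
            if backup.exists():
                shutil.rmtree(backup)
            old_root.rename(backup)

    # 2. Extract the new version into a fresh "latest" folder.
    if latest.exists():
        shutil.rmtree(latest)
    latest.mkdir(parents=True)
    # Use the safer "data" extraction filter where this Python provides it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(package, "r:gz") as tar:
        tar.extractall(latest, **kwargs)

    # 3. Repoint "current" by creating a temporary symlink and renaming it
    #    over the old one, so the link is replaced in a single step.
    tmp_link = base / "current.tmp"
    if tmp_link.is_symlink() or tmp_link.exists():
        tmp_link.unlink()
    os.symlink("latest", tmp_link)  # relative target inside base
    os.replace(tmp_link, current)
```

Note that this follows the order described above, so between step 1 and step 3 the `current` symlink points at a folder that no longer exists. A power loss in that window is one of the ways the device can end up unbootable.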
There is also the issue of initial deployment on newly constructed devices. The devices are currently SD card based (which has its own set of problems, out of scope here), so this process consists of:
An SD image exists that has some stable earlier version of the software on it.
An SD card is created from this image.
On first boot various first-time device-specific (serial number based) initialization takes place and then the auto-updater grabs and installs the latest production version of the software as per usual.
Additionally I needed support for development devices. For development devices:
A full local git repo is maintained on the device.
The current symlink points to the development directory.
A local do_not_update file exists which prevents the auto-updater from blowing away development code with a production update.
Now, the deployment process was theoretically intended to be:
Once code is ready for deployment push it to the release branch.
Press the "deploy update" button on the server.
The update is now live and devices will auto-update the next time they check.
However there are a ton of problems in practice:
The web server code is in the same repo as the device code, and the server has a local git repo that I execute out of. The latest web server code is not on the same branch as the latest device code. The directory structure is problematic: when the "deploy update" button pulls the latest version from the production branch, it pulls it into a subdirectory of the server code. This means that when I deploy to a server from scratch, I have to manually "seed" this subdirectory by grabbing the device production branch into it. If I don't, the deployment attempts to pull the device code from the parent directory's web server branch (probably due to git user error on my part). I think this is solvable by making the staging area not be a subdirectory of the server's local git repo.
The web server currently does not maintain the git hash of the device software persistently. On server startup it does a git rev-parse HEAD in its local device software repo to retrieve the current hash. For reasons I can't wrap my head around, this is also causing a ton of logic errors that I won't describe here. Suffice it to say that sometimes restarting the server screws things up, especially if the server is brand new and no production branch repo has been pulled yet. I'd happily share the source for that logic if requested, but this post is getting long.
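For reference, the server-side decision behind the update check is small. Here is a minimal sketch in Python that makes the "no hash known yet" case explicit. The function name `update_response` is hypothetical, and answering 503 when the server has no package is purely my assumption; the current server does not define that case, which may be part of what goes wrong on a brand-new server.

```python
from http import HTTPStatus
from typing import Optional


def update_response(device_hash: Optional[str],
                    current_hash: Optional[str]) -> HTTPStatus:
    """Decide how to answer a device's update check."""
    if current_hash is None:
        # Brand-new server: nothing has been pulled or packaged yet.
        # (Assumed behaviour, not what the current server does.)
        return HTTPStatus.SERVICE_UNAVAILABLE
    if device_hash == current_hash:
        # Device already runs the served version.
        return HTTPStatus.NOT_MODIFIED
    # Any other hash (older, unknown, or missing) gets the full package.
    return HTTPStatus.OK
```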
One of the biggest problems is: There is currently no separate updater daemon running on the device. Due to complications waiting for wifi internet access to come up and some last-minute hackery, it's the main device control software itself that checks and updates the device. This means that if somehow a poorly tested version makes it into production, and the control software can't start, the device is essentially bricked, as it can no longer update itself. Same deal if the device loses power at an unlucky time.
The other major problem is: There is no support for incremental updates. If a device, say, isn't turned on for a while, then the next time it's updated it skips a bunch of release versions, so it has to be able to do a direct version-skipping update. The consequence is that update deployment is a nightmare of making sure that any given update can be applied on top of any given past version.
There is a slight complication due to the fact that two (and more in the future) versions of the hardware exist. The current version of the hardware is actually stored as an environment variable on its initial SD image (they can't self-identify) and all software is designed to be compatible with all versions of the devices. Firmware updates are chosen based on this environment variable and the update package contains firmware for all versions of the hardware. I can live with this although it is a bit clunky.
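The firmware selection described above amounts to something like the following sketch, assuming Python. The environment variable name `EXAMPLE_HW_VERSION`, the `firmware/` directory, and the `hw<N>.bin` naming are all hypothetical stand-ins; the post does not give the real names or package layout.

```python
import os
from pathlib import Path
from typing import Mapping

# Hypothetical names, for illustration only.
HW_ENV_VAR = "EXAMPLE_HW_VERSION"
FIRMWARE_DIR = "firmware"


def select_firmware(package_root: Path,
                    env: Mapping[str, str] = os.environ) -> Path:
    """Pick the firmware image matching this device's hardware version."""
    hw = env.get(HW_ENV_VAR)
    if not hw:
        # The device cannot self-identify, so refuse to guess.
        raise RuntimeError(f"{HW_ENV_VAR} is not set")
    image = package_root / FIRMWARE_DIR / f"hw{hw}.bin"
    if not image.is_file():
        raise FileNotFoundError(f"no firmware for hardware version {hw}")
    return image
```

Failing loudly when the variable is missing or unknown seems safer than falling back to a default image, since flashing firmware for the wrong hardware version is presumably worse than skipping the firmware step.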
A bunch of other frustrations and general unsafeness.
So... that was long. But my question boils down to this:
How do I do this properly and safely? Are there small adjustments I can make to my existing process? Is there a time-tested strategy so that I don't have to roll my own crappy update system? Or if I do have to roll my own, what are the things that must be true in order for a deployment/update process to be safe and successful? I have to also be able to include development devices in the mix.
I hope the question is clear. I realize it's a bit fuzzy, but I am 100% sure that this is a problem that has been tackled before and successfully solved; I just do not know what the current accepted strategies are.