19th January 2017
Once a month at Mashbo we have a Patch Party. No, it isn’t as exciting as it sounds – and no, there isn’t actually any partying (or pirates) involved.
It is however a crucial part of our work, allowing us to keep our clients’ sites as secure as possible.
We use the term 'Patch Party' to refer to our monthly cycle of reviewing and applying security patches to our applications and environments. The process involves: aggregating any security advisories related to our applications, dependencies and system software packages; distilling them into a list of actionable tasks; and carrying out that work in a manner that has minimal negative effect on our clients.
With a security update the work will typically be, 'Update this package across all our environments.' We've put in a lot of work over the last 12 months bringing our legacy applications into line to allow us to carry out this work with a high level of efficiency. As always there's more to do: further efficiencies to be made, work that can be automated and tools to leverage to help expedite the manual steps in the meantime.
At a high level, the process can be described as:
- Check for security advisories relating to our application, its frameworks and dependencies
- Take an up-to-date backup of our application data
- Apply any system security updates via the appropriate package manager
- If necessary, restart the environment
- If necessary, branch from master and apply any updates identified in step 1. Test application functionality locally. Deploy.
- Evaluate production environment to ensure continued functionality
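The middle steps of that cycle can be sketched as a dry-run shell script for a single host. Everything here is illustrative: the database name, backup path and package manager are assumptions, the advisory review happens beforehand, and the branch/test/deploy steps go through the normal pipeline rather than this script.

```shell
#!/bin/sh
# Dry-run sketch of the patch cycle for one host. Each step only
# prints the command it would run; the database name, backup path
# and package manager are illustrative assumptions.

run() { echo "would run: $*"; }

# Take an up-to-date backup of application data.
run pg_dump myapp -f "/backups/myapp-$(date +%F).sql"

# Apply system security updates via the package manager.
run apt-get update
run apt-get upgrade -y

# Restart the environment only if the OS flags that it's required.
if [ -f /var/run/reboot-required ]; then
    run reboot
fi
```

Swapping `run` for direct execution (and adding proper error handling) turns the sketch into a real script, but the value is in the shape: backup first, patch second, restart only when needed.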
It's worth mentioning that version-locked dependencies are pretty crucial to our process, with as little deviation from semver as possible. Where there are deviations, we prefer to have our dependencies locked at specific commits rather than a subject-to-change development branch (even if it is labelled 'master'). Loose, whimsical dependencies like that will bite you on a project somewhere. Just don't tolerate them.
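In a Composer-based project, that kind of commit-level lock can be expressed directly in `composer.json`. The package name and hash below are invented for illustration; the `branch#commit` syntax is standard Composer:

```json
{
    "require": {
        "acme/widget": "dev-master#4f2a9c1"
    }
}
```

Composer will check out that exact commit, so a rebuilt environment gets the same code even if the branch has since moved on.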
Also, we can roll changes out to our production environments with a high degree of confidence, thanks to a combination of software tests, deployment/rollback processes and replicable environments. These are a subject unto themselves, but the result is that in the event of some unforeseen consequence of our application updates, we're never more than one command from restoring our application to its previously working state.
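As an illustration of what 'one command from restored' can look like, here's a minimal sketch of the common releases-plus-symlink deployment layout, where each deploy lands in its own timestamped directory and `current` points at the live one. The layout and paths are assumptions, not our exact tooling.

```shell
#!/bin/sh
# Minimal rollback sketch for a layout like:
#   app/releases/20170101120000/
#   app/releases/20170119090000/
#   app/current -> releases/20170119090000
# Rolling back is one command: repoint the symlink.

rollback() {
    app=$1
    # Previous release = second-newest directory name.
    prev=$(ls -1 "$app/releases" | sort | tail -n 2 | head -n 1)
    # Atomically repoint the symlink; the web server follows it.
    ln -sfn "$app/releases/$prev" "$app/current"
    echo "rolled back to $prev"
}
```

Because repointing a symlink is atomic, the web server simply starts serving the previous release on the next request.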
As efficient as our processes are, we can always improve. As we take on larger volumes of projects/environments to manage, previously suitable processes require further refinement to maintain the same efficiency. And when we inherit code, arbitrary client constraints or poor code quality can mean that we simply cannot do things like 'automatically update packages'.
Put simply: updating these applications requires more time. Even identifying which updates are necessary takes time, and doing this continuously as updates are released detracts from the rest of the team’s planned work. It stops being 'preventative maintenance' and becomes 'firefighting' - two terms anyone from operations or involved with the DevOps movement will recognise.
To remove the negative effect this has on our day-to-day work, we specifically allocate planned time to deal with everything that needs doing. That time is our 'Patch Party'. How often this happens depends on how much work you have to do and how much time you have available. We've found that monthly is a good balance for us given the number of projects we're running, the rate at which updates tend to appear, and the amount of time it takes to apply them.
The second (and arguably more important!) reason for this time is because it presents an opportunity for work to be identified to further improve our processes. Legacy projects can be brought up to date, tests can be written around existing applications, infrastructure can be specified as code, deployment strategies can be implemented etc. This work is often too large to be carried out during the Patch Party itself, but it can be identified, logged, and assigned as future planned work, often against a retainer or our internal development cycles as part of ongoing improvements. It's these development cycles that pay off technical debt and improve our efficiencies.
A process of ongoing improvement is worth some special mention: no matter how good you are at your work, or at the processes that make it happen, without regular cycles of improvement those processes will let you down.
Unreplicable changes (hotfixes, ongoing change requests, new features and so on) all contribute to the gradual shift from order to disorder in your work. With that disorder come edge cases: scenarios your processes were never designed for.
When the process fails, you're forced to improvise a solution, and so the cycle continues. System stability is compromised, change becomes high risk, and work throughput suffers as bottlenecks and a cultural fear of change develop. Clients become dissatisfied, unwilling to commit funds towards future work that is likely to stall in the never-ending cycle of incomplete work and broken environments.
Eventually the processes no longer apply. Everything is special. There are no patch parties. Every second of every day is a patch party, filled with bug fixes you're uncomfortable deploying because you're ever fearful they'll introduce more problems than they'll solve.
By iterating over your working processes to continually improve them, you can avoid this nightmare. Your processes will adapt to incorporate changes that arise naturally as part of day to day life.
Current best practices stem heavily from Agile methodologies and the DevOps movement. Cloud infrastructure and infrastructure-as-code, automated tests (as part of TDD or otherwise), semantically versioned (and locked!) dependencies, environment monitoring, version-controlled source code, continuous deployment, and security-conscious development all come together to form an ongoing process that can adapt to further changes and improvements.
There is no 'one best practice', however: in order to adapt, whatever processes you put in place need to fit not only the projects and infrastructure you're working with but your developers as well. That includes educating new hires to bring them up to speed with the process, which can only happen with the full support of the current development team, which in turn will only happen if the process actually helps them get their work done.
The 'Patch Party' itself is simply some allocated work that forms one small (but significant!) part of our development and improvement process. It's the time we set aside each month to look at our responsibilities and allow ourselves to see the work that needs to be done (preventative maintenance) before we are hit with business critical, blocking, unplanned work: firefighting.
How does all this affect our clients? Ideally, it doesn't. But not all clients are equal.
Some have stricter requirements about when work can and cannot be carried out: nobody wants work done on their e-commerce platform during the busiest hours of their week. Others are more than happy to let you manage their digital infrastructure, and their application logic is simple and isolated enough that changes are almost always low risk. As long as you're adding business value to the client and you're getting paid for the work, everyone's happy :)
With the majority of our retainers, and any clients we manage hosting for, we've agreed to put one day aside every month for the Patch Party. We can typically update all of our current projects/infrastructure within that one-day window, with a few exceptions depending on exactly how much work needs to be carried out.
Over the last 12 months we've been building this process and running it against an increasing number of projects and environments. Any service outage typically lasts the length of a single request; beyond that, the service is back up and running. Given the traffic of our current applications, this is suitable, although as I said before: improvements, improvements, improvements...
As time goes on, we've started to notice bottlenecks in the process when it comes to sourcing security advisories, and spinning up non-existent local environments to test changes before they are deployed to production. As the number of projects increases, so does the sheer volume of work.
One change to a Symfony dependency can mean three applications to implement the change on. That can represent 30 minutes or more of work depending on the change, assuming all of your tests continue to pass. The time requirement only goes up with the number of projects and changes.
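That fan-out of one dependency change across several projects is mostly mechanical, which is what makes it a candidate for scripting. A dry-run sketch (the project names are hypothetical; in reality each iteration runs Composer, the test suite and the deploy pipeline):

```shell
#!/bin/sh
# Dry-run sketch of pushing one dependency update across several
# projects. Prints what would happen per project rather than
# actually running Composer, tests and deploys.

update_projects() {
    pkg=$1; shift
    for project in "$@"; do
        echo "$project: composer update $pkg && run tests && deploy"
    done
}
```

For example, `update_projects symfony/http-foundation shop blog intranet` prints one line per affected project, making the scale of the fan-out visible before any work starts.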
We're looking into ways to automate the aggregation of security releases across multiple projects, and the creation of environments for local testing. There's some notion of integrating this with our version control and test processes to automate some of the work involved, in a 'continuous deployment'-like manner (something truly best practice!), which should spread the workload away from the Patch Party. This will free up that time to manage the truly special legacy projects that we all know and love.
We're also discussing the way we monitor applications in production, to better identify issues before they become issues (implementing this, again, is a form of preventative maintenance). With the right monitoring, we should be better able to identify bottlenecks, performance issues and other problems that often accompany non-functional requirements.