If you ask a programmer for advice—a terrible idea—they might tell you something like the following: Don’t repeat yourself. Programs should do one thing, and do it well. Never rewrite your code from scratch, ever!
Following “Don’t Repeat Yourself” might lead you to a function with four boolean flags, and a matrix of behaviours to carefully navigate when changing the code. Splitting things up into simple units can lead to awkward composition and a struggle to coordinate cross-cutting changes. Avoiding rewrites means they’re often left so late that they have no chance of succeeding.
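As a sketch of that first failure mode (the function and flags here are hypothetical, not from any real codebase):

```python
# A hypothetical function after years of diligent deduplication:
# four flags, sixteen combinations of behaviour to check before any change.
def render_report(data, as_html=False, paginate=False,
                  include_totals=False, for_export=False):
    ...  # every branch has to consider every flag above it

# Two mildly repetitive functions are often easier to change safely:
def render_html_report(data):
    ...

def render_csv_export(data):
    ...
```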
The advice isn’t inherently bad—the intent behind it is good—but following it to the letter can create more problems than it promises to solve.
Sometimes the best way to follow an adage is to do the exact opposite: embrace feature switches and constantly rewrite your code, pull things together to make coordination between them easier to manage, and repeat yourself to avoid implementing everything in one function.
This advice is much harder to follow, unfortunately.
Repeat yourself to find abstractions
“Don’t Repeat Yourself” is almost a truism—if anything, the point of programming is to avoid work.
No-one enjoys writing boilerplate. The more straightforward it is to write, the duller it is to summon into a text editor. People are already tired of writing eight exact copies of the same code before even having to do so. You don’t need to convince programmers not to repeat themselves, but you do need to teach them how and when to avoid it.
“Don’t Repeat Yourself” often gets interpreted as “Don’t Copy Paste”, that is, avoid repeating code within the codebase. But the best way to avoid repetition is to avoid reimplementing what already exists elsewhere—and thankfully most of us already do!
Almost every web application leans heavily on an operating system, a database, and a variety of other lumps of code to get the job done. A modern website reuses millions of lines of code without even trying. Unfortunately, programmers love to avoid repetition, and “Don’t Repeat Yourself” turns into “Always Use an Abstraction”.
By an abstraction, I mean two interlinked things: an idea we can think and reason about, and the way in which we model it inside our programming languages. Abstractions are a way of repeating yourself, so that you can change multiple parts of your program in one place. Abstractions allow you to manage cross-cutting changes across your system, or to share behaviours within it.
The problem with always using an abstraction is that you’re preemptively guessing which parts of the codebase need to change together. “Don’t Repeat Yourself” will lead to a rigid, tightly coupled mess of code. Repeating yourself is the best way to discover which abstractions, if any, you actually need.
As Sandi Metz put it, “duplication is far cheaper than the wrong abstraction”.
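A minimal sketch of the order of operations (send_email and the message wording are placeholders):

```python
def send_email(to, subject, body):  # stand-in for a real mailer
    print(f"-> {to}: {subject}\n{body}")

# First, repeat yourself, and let the shared shape emerge...
def welcome(user):
    send_email(user["email"], "Welcome", f"Welcome aboard, {user['name']}!")

def password_reset(user):
    send_email(user["email"], "Password reset",
               f"Hello {user['name']}, follow the link to reset.")

# ...then deduplicate, once you can see what actually varies:
def notify(user, subject, template):
    send_email(user["email"], subject, template.format(name=user["name"]))
```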
You can’t really write a reusable abstraction up front. Most successful libraries and frameworks are extracted from a larger working system, rather than created from scratch. If you haven’t built something useful with your library yet, it is unlikely anyone else will. Code reuse isn’t a good excuse to avoid duplicating code, and writing reusable code inside your project is often a form of premature optimization.
When it comes to repeating yourself inside your own project, the point isn’t to be able to reuse code, but rather to make coordinated changes. Use abstractions when you’re sure about coupling things together, rather than for opportunistic or accidental code reuse—it’s ok to repeat yourself to find out when.
Repeat yourself, but don’t repeat other people’s hard work. Repeat yourself: duplicate to find the right abstraction first, then deduplicate to implement it.
With “Don’t Repeat Yourself”, some insist that it isn’t about avoiding duplication of code, but about avoiding duplication of functionality or duplication of responsibility. This is more popularly known as the “Single Responsibility Principle”, and it’s just as easily mishandled.
Gather responsibilities to simplify interactions between them
When it comes to breaking a larger service into smaller pieces, one idea is that each piece should only do one thing within the system—do one thing, and do it well—and the hope is that by following this rule, changes and maintenance become easier.
It works out well in the small: reusing variables for different purposes is an ever-present source of bugs. It’s less successful elsewhere: although one class might do two things in a rather nasty way, disentangling it isn’t of much benefit when you end up with two nasty classes with a far more complex mess of wiring between them.
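A sketch of that trade (the classes are hypothetical; the domain doesn’t matter, the wiring does):

```python
# Before: one class doing two things, badly, but in one place.
class ReportJob:
    def run(self, db):
        rows = db.fetch("SELECT region, total FROM sales")
        return "\n".join(f"{region}: {total}" for region, total in rows)

# After: two tidy single-purpose classes, plus wiring that every
# caller must now assemble in the right order.
class SalesFetcher:
    def __init__(self, db):
        self.db = db

    def fetch(self):
        return self.db.fetch("SELECT region, total FROM sales")

class ReportFormatter:
    def format(self, rows):
        return "\n".join(f"{region}: {total}" for region, total in rows)

def run_report(db):
    return ReportFormatter().format(SalesFetcher(db).fetch())
```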
The only real difference between pushing something together and pulling something apart is that some changes become easier to perform than others.
The choice between a monolith and microservices is another example of this—the choice between developing and deploying a single service, or composing things out of smaller, independently developed services.
The big difference between them is that cross-cutting change is easier in one, and local changes are easier in the other. Which one works best for a team often depends more on environmental factors than on the specific changes being made.
Although a monolith can be painful when new features need to be added, and microservices can be painful when co-ordination is required, a monolith can run smoothly with feature flags and short-lived branches, and microservices work well when deployment is easy and heavily automated.
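A feature flag can be as small as this sketch (the flag store and the two checkout flows are stand-ins for whatever your system uses):

```python
def old_checkout_flow(cart):
    return f"old checkout: {len(cart)} items"

def new_checkout_flow(cart):
    return f"new checkout: {len(cart)} items"

# Flipped per environment or per request, not per deploy.
FLAGS = {"new_checkout": False}

def checkout(cart):
    if FLAGS["new_checkout"]:
        return new_checkout_flow(cart)  # merged early, launched dark
    return old_checkout_flow(cart)      # still the default in production
```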
Even a monolith can be decomposed internally into microservices, albeit in a single repository and deployed as a whole. Everything can be broken into smaller parts—the trick is knowing when it’s an advantage to do so.
Modularity is more than reducing things to their smallest parts
Invoking the ‘single responsibility principle’, programmers have been known to brutally decompose software into a terrifyingly large number of small interlocking pieces—a craft rarely seen outside of obscenely expensive watches, or bash.
The traditional UNIX command line is a showcase of small components that each do exactly one thing, and it can be a challenge to discover which one you need, and in which way to hold it, to get the job done. Piping things into awk '{print $2}' is almost a rite of passage.
Another example of the single responsibility principle is git. Although you can use git checkout to do six different things to the repository, they all use similar operations internally. Despite having singular functionality, components can be used in very different ways.
A layer of small components with no shared features creates a need for a layer above them where those features overlap—and if that layer is absent, users will create it themselves, out of bash aliases, scripts, or even spreadsheets of commands to copy-paste from.
Even adding this layer might not help you: git already has a notion of user-facing and automation-facing commands, and the UI is still a mess. It’s always easier to add a new flag to an existing command than it is to duplicate the command and maintain it in parallel.
Similarly, functions gain boolean flags and classes gain new methods as the needs of the codebase change. In trying to avoid duplication and keep code together, we end up entangling things.
Although components can be created with a single responsibility, over time their responsibilities will change and interact in new and unexpected ways. What a module is currently responsible for within a system does not necessarily correlate to how it will grow.
Modularity is about limiting the options for growth
A given module often gets changed because it is the easiest module to change, rather than the best place for the change to be made. In the end, what defines a module is what pieces of the system it will never be responsible for, rather than what it is currently responsible for.
When a unit has no rules about what code cannot be included, it will eventually contain larger and larger amounts of the system. This is eternally true of every module named ‘util’, and why almost everything in a Model-View-Controller system ends up in the controller.
In theory, Model-View-Controller is about three interlocking units of code. One for the database, another for the UI, and one for the glue between them. In practice, Model-View-Controller resembles a monolith with two distinct subsystems—one for the database code, another for the UI, both nestled inside the controller.
The purpose of MVC isn’t just to keep all the database code in one place, but to keep it away from the frontend code. The data we have, and how we want to view it, will change over time, independently of the frontend.
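A minimal sketch of that separation (db.query is an assumed interface, and the HTML is deliberately crude):

```python
# Model: owns the data and how it is stored; no rendering here.
def latest_articles(db):
    return db.query("SELECT title, body FROM articles ORDER BY id DESC")

# View: owns the presentation; no SQL here.
def render_article_list(articles):
    return "\n".join(f"<h2>{title}</h2><p>{body}</p>"
                     for title, body in articles)

# Controller: the glue between them, and the place where
# everything else tends to accumulate.
def list_articles(request, db):
    return render_article_list(latest_articles(db))
```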
Although code reuse is good and smaller components are good, they should be the result of other desired changes. Both are tradeoffs, introducing coupling through a lack of redundancy, or complexity in how things are composed. Decomposing things into smaller parts or unifying them is neither universally good nor bad for the codebase, and largely depends on what changes come afterwards.
In the same way abstraction isn’t about code reuse, but coupling things for change, modularity isn’t about grouping similar things together by function, but working out how to keep things apart and limiting co-ordination across the codebase.
This means recognizing which bits are slightly more entangled than others: knowing which pieces need to talk to each other, which need to share resources, which share responsibilities, and, most importantly, what external constraints are in place and which way they are moving.
In the end, it’s about optimizing for those changes—and this is rarely achieved by aiming for reusable code, as sometimes handling changes means rewriting everything.
Rewrite Everything
Usually, a rewrite is only a practical option when it’s the only option left. Technical debt, or code the seniors wrote that we can’t be rude about, accrues until all change becomes hazardous. It is only when the system is at breaking point that a rewrite is even considered an option.
Sometimes the reasons can be less dramatic: an API is being switched off, a startup has taken a beautiful journey, or there’s a new fashion in town and orders from the top to chase it. Rewrites can happen to appease a programmer too—rewarding good teamwork with a solo project.
The reason rewrites are so risky in practice is that replacing one working system with another is rarely an overnight change. We rarely understand what the previous system did—many of its properties are accidental. Documentation is scarce, tests are ornamental, and interfaces are organic, stubbornly locking behaviours in place.
If migrating to the replacement depends on switching over everything at once, make sure you’ve booked a holiday during the transition, well in advance.
Successful rewrites plan for migration to and from the old system, plan to ease in the existing load, and plan to handle things being in one or both places at once. Both systems are continuously maintained until one of them can be decommissioned. A slow, careful migration is the only option that reliably works on larger systems.
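In code, the “one or both places at once” stage often looks like a small shim (old_store and new_store here are placeholders for the two real systems):

```python
# Writes go to both systems; reads prefer the new one and fall back.
# Either side can be decommissioned once the numbers agree.
old_store, new_store = {}, {}  # placeholders for the two real systems

def write_record(key, value):
    old_store[key] = value  # the old system stays authoritative...
    new_store[key] = value  # ...until the new one has earned trust

def read_record(key):
    if key in new_store:
        return new_store[key]
    return old_store.get(key)  # not migrated yet
```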
To succeed, you have to start with the hard problems first—often these are performance related, but they can also mean dealing with the most difficult customer, or the biggest customer or user of the system. Rewrites must be driven by triage: reducing the scope of the problem to something that can be effectively improved, while staying guided by the larger problems at hand.
If a replacement isn’t doing something useful after three months, odds are it will never do anything useful.
The longer it takes to run a replacement system in production, the longer it takes to find bugs. Unfortunately, migrations get pushed back in the name of feature development. A new project has the most room for feature bloat—this is known as the second-system effect.
The second-system effect is the name for the canonical doomed rewrite: numerous features are planned, not enough are implemented, and what has been written rarely works reliably. It’s similar to writing a game engine without a game to guide decisions, or a framework without a product inside it. The resulting code is an unconstrained mess that is barely fit for purpose.
The reason we say “Never Rewrite Code” is that we leave rewrites too late, demand too much, and expect them to work immediately. It’s more important to never rewrite in a hurry than to never rewrite at all.
null is true, everything is permitted
The problem with following advice to the letter is that it rarely works in practice. The problem with following it at all costs is that eventually we cannot afford to do so.
It isn’t “Don’t Repeat Yourself”, but “Some redundancy is healthy, some isn’t”, and using abstractions when you’re sure you want to couple things together.
It isn’t “Each thing has a unique component”, or other variants of the single responsibility principle, but “Decoupling parts into smaller pieces is often worth it if the interfaces between them are simple, and try to keep the fast-changing and tricky-to-implement bits away from each other”.
It’s never “Don’t Rewrite!”, but “Don’t abandon what works”. Build a plan for migration, maintain in parallel, then decommission, eventually. In high-growth situations you can probably put off decommissioning, and possibly even migrations.
When you hear a piece of advice, you need to understand the structure and environment in place that made it true, because they can just as often make it false. Things like “Don’t Repeat Yourself” are about making a tradeoff, usually one that’s good in the small or for beginners to copy at first, but hazardous to invoke without question on larger systems.
In a larger system, it’s much harder to understand the consequences of our design choices—in many cases the consequences are only discovered far, far too late in the process and it is only by throwing more engineers into the pit that there is any hope of completion.
In the end, we call our good decisions ‘clean code’ and our bad decisions ‘technical debt’, despite following the same rules and practices to get there.