The 21st Century Rewrite

One of the side effects of being in a fast-growing company is that things are constantly changing. Any tools and processes in place made sense for the team size from 6 months ago.

It seems that I spend most of my social time at work helping folks navigate changing existing code. We discuss The Teeth. We talk about small changes. However, I’ve found that there are many different ways that rewrites appear in today’s software teams.

In fact, modern tooling is sophisticated enough to convince some teams that they are not doing a rewrite at all. Between feature flagging, serving dark traffic, and tools like scientist, we may not be aware of the risks we take. “We are shipping code constantly to production, how could it be a rewrite?”

This post looks to explore some of the ways we accidentally take on what Spolsky referred to as the Big Rewrite about 20 years ago, and how these rewrites still happen but are harder to see.

Behavioral and Structural Changes

Painting in the broadest strokes, the changes we make in software projects come in two flavors: behavioral and structural.

A behavioral change tweaks how the software behaves and what the customer sees. Behavioral changes are new features, new products, and bug fixes. Behavioral changes deliver more value to the customer and are the primary means of a piece of software remaining competitive.

Structural changes modify the program’s structure. They have no outwardly visible effect to the customer. Structural changes are small refactorings, technical migrations to new technology, and extraction of services or applications.

A healthy software team, department, or organization hones its ability to interleave these behavioral and structural changes throughout its life. There are failure modes of focusing on either behavioral or structural changes too much.

Focusing too much on behavioral changes creates a temporarily successful business that is brittle and unable to change. Developers are afraid to change the most important parts of the code because years of accumulated features have buried the company in technical debt. Adding what were once simple behaviors now takes quarters. These products have their day in the sun but eventually get overtaken by more nimble competitors.

The failure mode of structural change leaves a company that rarely delivers value to its customers. Technical migrations are always ongoing, but the added flexibility is not followed by value-delivering behavioral changes. This organization has many solutions looking for problems, a condition that generally originates from engineers not developing customer empathy.

For the rest of the business, these two failure modes look the same: engineers aren’t getting anything done.

Rewriting

When an organization gets far enough down any failure mode (although usually the behavioral failure mode), a team or department often determines that a rewrite is their only option. The group is stating the following through their decisions: “It is too difficult to deliver value in the existing system, so we need a new one.”

Here is where scope becomes critical, especially within the context of modern software development. If no behavioral or structural value is realized for months or quarters, you probably have a rewrite. If you are unable to distinguish the behavioral changes from the structural changes, you might have a rewrite. If embarking on a path of structural change will require integrating behavioral changes twice over (i.e. into two systems), you probably have a rewrite. If changes need to be meticulously hand-tested before the team develops confidence that nothing broke, you probably have a rewrite. If the word “launch” or “rollout” enters the vernacular as a finite event with a date, you might have a rewrite.

So although we may be using feature flags or technically landing things to production daily, we may still be performing a rewrite. This rewrite can be more dangerous because it is less obvious. It’s not a separate repo, it’s not labeled “v2”. If it’s weeks or months before customers interact with your code, you may have a 21st Century Rewrite. One of the key identifiers of a rewrite is the inability to distinguish between structural and behavioral changes.

What is so bad about a rewrite?

Now, we may hear of a successful rewrite from time to time. “We rewrote Service X and everything was just fine.” Successful rewrites are much like black swans: I believe they exist, but I have yet to see one in my life.

Rewrites are dangerous because they carry unquantified risk. Just about every change to software carries some unquantifiable risk. We can only develop heuristics like “larger changes carry more risk.” This risk cannot be measured in absolute terms because we do not have a complete view of the system. Every change we make will have a reaction. The larger the change, the larger the risk. Large enough risks can have material impact on the business.

A rewrite that is large enough will start off as a structural change, with behavioral changes mixed in as the project drags on and the business grows impatient. The business wants to see new or upgraded behaviors, and they wanted to see them yesterday. This creates a positive feedback loop for engineers. They too would like to deliver business impact, so they agree to add more behavioral changes to their structural work.

How should we change software?

Successful and predictable software development is a zipper. We have the left side (behavioral, B) and the right side (structural, S). A healthy team or organization interleaves these in small and large units.

Zooming in to a single pull request, we should see a few micro-refactorings preceding a few behavioral changes:

S-S-B-B-S-S-B-B
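
As a minimal sketch of that ordering, assume a hypothetical invoice calculation (the function names, the tax rate, and the use of integer cents are illustrative assumptions, not taken from any real codebase). The first step is structural: the tax calculation is extracted without changing any result. The second step is behavioral: a discount that the customer can see is added on top of the cleaner structure.

```python
TAX_PERCENT = 8  # illustrative assumption


def invoice_total_original(prices_cents):
    """Starting point: the tax arithmetic is written inline."""
    subtotal = sum(prices_cents)
    return subtotal + subtotal * TAX_PERCENT // 100


def with_tax(amount_cents):
    """S: tax arithmetic extracted into a helper; results are unchanged."""
    return amount_cents + amount_cents * TAX_PERCENT // 100


def invoice_total_structural(prices_cents):
    """S: same results as the original, now expressed through with_tax."""
    return with_tax(sum(prices_cents))


def invoice_total_behavioral(prices_cents, discount_percent=0):
    """B: new customer-visible behavior (a discount) added after the S step."""
    subtotal = sum(prices_cents) * (100 - discount_percent) // 100
    return with_tax(subtotal)


prices = [1000, 2000, 550]
# The structural step must not change what the customer sees.
assert invoice_total_structural(prices) == invoice_total_original(prices)
# The behavioral step changes results only when the new feature is used.
assert invoice_total_behavioral(prices) == invoice_total_original(prices)
print(invoice_total_behavioral(prices, discount_percent=10))
```

Each step could land on its own: the extraction is safe to ship before the discount exists, and the discount is easier to review because it does not also carry the restructuring.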

Zooming out to the work for a quarter, we should see the same fractal properties. We upgrade our strained database technology (S) before delivering a new industry-disrupting product (B).

If we are unable to separate the S from the B on the small scale, we likely do not understand the problem space. If we thought we were making a structural change but we changed the behavior, we need to augment our own understanding (probably with tests). On a larger scale, the inability to separate S and B reflects a lost understanding of the business domain and how it maps to our software. To rewrite is to discard accumulated knowledge in the code and revert to knowledge held in our heads.
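
One way to check that a change we believe to be structural really is structural is a characterization test: run the existing code and the restructured code over the same inputs and look for disagreements. The following is a minimal sketch, assuming a hypothetical `shipping_cost` function; the names, rates, and weight threshold are illustrative only.

```python
def shipping_cost_old(weight_kg, express):
    """Existing behavior, written with nested conditionals."""
    if express:
        if weight_kg > 10:
            return 40
        return 25
    if weight_kg > 10:
        return 15
    return 5


# Restructured as a lookup table; intended as a purely structural change.
# Keys are (express, is_heavy).
RATES = {
    (True, True): 40,
    (True, False): 25,
    (False, True): 15,
    (False, False): 5,
}


def shipping_cost_new(weight_kg, express):
    return RATES[(express, weight_kg > 10)]


def behavior_differences(old, new, cases):
    """Return the inputs on which the two implementations disagree."""
    return [case for case in cases if old(*case) != new(*case)]


# Include values on both sides of the threshold, and the threshold itself.
CASES = [(w, e) for w in (0, 1, 10, 10.5, 11, 100) for e in (True, False)]
differences = behavior_differences(shipping_cost_old, shipping_cost_new, CASES)
print(differences)
```

An empty list suggests the change was structural for the inputs we thought to try. Any entry in the list is a place where our understanding of the existing behavior was incomplete.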

The inability to separate S and B shows itself in the negative. “We can’t introduce a new product if we don’t write it in Go” or “We can’t add functionality without refactoring class X.” If we bundle these two together, we assume more risk.

Instead, we should clearly identify our structural and behavioral changes. “This product is under a strain that does not lend itself to single-threaded, interpreted languages like Ruby. We are going to rewrite it in Go without changing the functionality to achieve higher throughput. From there, we will add the new behavior requested without worrying about scaling concerns.”

We zip together these structural and behavioral changes, landing them one after the other. We do this against the existing system, without developing a second system hidden behind a feature flag or some other mechanism.

Conclusion

One of the key properties of a rewrite is the bundling of structural and behavioral changes. By lumping these together, we assume more risk when developing software. Skilled teams are able to unzip the two and land behavioral or structural value incrementally.

Special thanks to X for reading early drafts of this post.