I walked over slowly to the big board and flipped the number back to 0. It had been zero days since our last production incident.
We stood around the board solemnly for a moment to acknowledge the occasion. Our hard work had been reset, and it would take some effort before we could get back out of the single digits.
Now, we didn’t actually have a big board with vinyl numbers. But this mentality is something I’ve seen at some of the high-performing teams I’ve been on.
Budgeting and the Age of Money
Using the budgeting app YNAB (You Need A Budget) pays for itself pretty quickly. Just being aware of how money moves through your bank accounts often yields unexpected savings.
One of my favorite details of YNAB is its “Age of Money” indicator.
The Age of Money readout answers the question: “When you spend your ‘oldest’ dollar, how long has that dollar been in your possession?” The dollars that I’m spending today have been in one of my accounts for 89 days.1
Growing this number means my wife and I have even more of a cushion. It provides more flexibility and resiliency toward unexpected large expenses. It means we might be budgeting for April expenses in January. A larger number here means more peace of mind.
It’s also possible for this number to be negative, e.g. if your liabilities (credit card debt) measure more than your assets. A negative Age of Money means I’ve already spent the money I’m earning today. If the number is increasingly negative, we are digging a hole we will never get out of.
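To make the "oldest dollar" idea concrete, here is a minimal Python sketch. It assumes a simple first-in, first-out model where every outflow spends the oldest dollars still on hand. This is an illustration of the question the readout answers, not YNAB's actual calculation, and the function name `age_of_money` and the transaction format are made up for this example. It also does not model the negative (debt) case.

```python
from collections import deque
from datetime import date


def age_of_money(transactions):
    """Age in days of the oldest dollar consumed by the most recent outflow.

    `transactions` is a list of (date, amount) pairs sorted by date.
    Positive amounts are inflows, negative amounts are outflows.
    Returns None if nothing has been spent from money on hand.
    Assumption: dollars are spent oldest-first (FIFO).
    """
    buckets = deque()  # [inflow_date, remaining_amount], oldest first
    last_age = None
    for day, amount in transactions:
        if amount > 0:
            buckets.append([day, amount])
            continue
        to_spend = -amount
        if to_spend > 0 and buckets:
            # The oldest dollar on hand is the one this outflow spends first.
            last_age = (day - buckets[0][0]).days
        while to_spend > 0 and buckets:
            remaining = buckets[0][1]
            used = min(remaining, to_spend)
            to_spend -= used
            if used == remaining:
                buckets.popleft()  # this inflow is fully spent
            else:
                buckets[0][1] = remaining - used
    return last_age


example = [
    (date(2024, 1, 1), 1000),   # paycheck
    (date(2024, 2, 1), 1000),   # paycheck
    (date(2024, 3, 1), -500),   # spends January dollars
    (date(2024, 3, 31), -800),  # finishes January, starts on February
]
print(age_of_money(example))  # 90
```

The last outflow still begins with dollars that arrived on January 1, so the readout is 90 days even though part of that purchase is paid with younger February money.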
What does this have to do with software engineering?
Applying to Software Engineering
When assessing the health of a software team, this idea fits nicely after a few tweaks.
This also allows us to avoid misleading averages. The average time between incidents might not be as informative as the time since the last incident. The trend of that number can also tell us which direction the team’s health is heading in.
Furthermore, growing this number by definition reduces the interruptions a team faces. If you are trying to get a team into a flow state, grow this metric.
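A small Python sketch can show why the average can mislead. The incident dates below are invented for illustration, and the helper names `days_since_last_incident` and `mean_days_between_incidents` are hypothetical.

```python
from datetime import date


def days_since_last_incident(incident_dates, today):
    """Days elapsed since the most recent incident, or None if there are none."""
    if not incident_dates:
        return None
    return (today - max(incident_dates)).days


def mean_days_between_incidents(incident_dates):
    """Average gap in days between consecutive incidents, or None if fewer than two."""
    ordered = sorted(incident_dates)
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    if not gaps:
        return None
    return sum(gaps) / len(gaps)


# Made-up history: a rough first week, a long quiet stretch, then a fresh incident.
incidents = [date(2024, 1, 1), date(2024, 1, 3), date(2024, 1, 5), date(2024, 3, 5)]
today = date(2024, 3, 6)

print(round(mean_days_between_incidents(incidents), 1))  # 21.3
print(days_since_last_incident(incidents, today))        # 1
```

The average suggests an incident roughly every three weeks, which hides both the cluster in early January and the fact that the team was interrupted yesterday.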
To assess the health of an engineering team, observe the delta between something being shipped and it being needed in production. Let’s look at 3 cases.
If the number is negative, the team is very much underwater. The work they are shipping today needed to be in production days or weeks ago. This isn’t a “the PM wanted this to go live a week ago,” but more like a “this payment was due 2 weeks ago.” Morale on this team is likely also in the dumps. They might be asked to work weekends or put in longer hours, further killing morale.
There’s a good chance that your engineers are not engineering, but just operating. This state is unsustainable, unless you want to start including your engineering in the Cost of Goods Sold (COGS) line item on your profit and loss statement.2
Solving this requires careful and strong management, which I won’t go into here.
A neutral or low positive number can feel good. We have a few days of breathing room!
But a few days of breathing room still keeps a strong sense of dread around. The team is just barely treading water. If the number is slowly growing and then being reset, chances are you could unlock sustained growth in this number with a few select efforts.
Ah, the promised land: a high positive number. Incidents still occur, but they do not interrupt the entire team. The team has the space to plan and execute on long-running projects. Individuals can rotate off of the pager to tackle special projects safely.
The team is able to focus on things that must be reacted to but are outside of their control, like customer feedback.
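The three cases above can be sketched as a small Python function that takes the date something shipped and the date it was needed in production. The function name `classify_lead_time`, the labels, and the 14-day cutoff between "a few days of breathing room" and "plenty" are all assumptions made for this example; pick whatever threshold fits your team.

```python
from datetime import date


def classify_lead_time(shipped, needed, comfortable_days=14):
    """Classify the delta between shipping and being needed in production.

    A negative delta means the work shipped after it was needed.
    `comfortable_days` is a hypothetical threshold, not a recommendation.
    """
    lead_days = (needed - shipped).days
    if lead_days < 0:
        return "underwater"
    if lead_days < comfortable_days:
        return "treading water"
    return "breathing room"


print(classify_lead_time(date(2024, 6, 15), date(2024, 6, 1)))  # underwater
print(classify_lead_time(date(2024, 6, 1), date(2024, 6, 4)))   # treading water
print(classify_lead_time(date(2024, 6, 1), date(2024, 9, 1)))   # breathing room
```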
Deciding what resets the number on a team is a fun exercise. Every team can be a bit different. Here are a few things I’ve seen that would fit:
- Days since last production outage.
- Days since SSH-ing into a production machine.
- Days since last introduction of a security vulnerability.3
- Days since manually retrying a failed background job.
- Days since last incident.
It’s important to remember that we control the categorization, and that not every exceptional case needs to reset the number. Try starting with strict controls and seeing if that’s too much.
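One way to picture "we control the categorization" is a counter where the set of resetting event kinds is just configuration. The following is a minimal Python sketch. The event names, the `DEFAULT_RESETTING_KINDS` set, and the `days_since_reset` helper are hypothetical, loosely based on the list above.

```python
from datetime import date

# Hypothetical event kinds that reset the board. Each team picks its own.
DEFAULT_RESETTING_KINDS = frozenset({
    "production_outage",
    "ssh_into_production",
    "manual_job_retry",
})


def days_since_reset(events, today, resetting_kinds=DEFAULT_RESETTING_KINDS):
    """Days since the most recent event whose kind resets the number.

    `events` is an iterable of (date, kind) pairs.
    Returns None if no resetting event has been recorded.
    """
    reset_dates = [day for day, kind in events if kind in resetting_kinds]
    if not reset_dates:
        return None
    return (today - max(reset_dates)).days


events = [
    (date(2024, 5, 1), "production_outage"),
    (date(2024, 5, 10), "ssh_into_production"),
    (date(2024, 5, 20), "flaky_test"),  # exceptional, but not a reset here
]
today = date(2024, 5, 25)

print(days_since_reset(events, today))                         # 15
print(days_since_reset(events, today, {"production_outage"}))  # 24
```

Starting with the strict set and then loosening it, as suggested above, amounts to removing kinds from the set and watching how the number behaves.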
Incidents will happen; that will never change. This number brings awareness to the incidents and drives a discussion on how to prevent the next one. It’s a simple number that lets folks see progress. The number also highlights some of the unseen work that might be going on without immediate, direct customer impact.
And eventually, you might get to the point where your organization has plenty of breathing room:
“When I have a good quarterly conference call with Wall Street, people will stop me and say, ‘Congratulations on your quarter,’ and I say, ‘Thank you,’ but what I’m really thinking is that quarter was baked three years ago.” —Jeff Bezos
This post explored a new metric that took inspiration from workplace accidents, personal budgeting, and memes. I hope you enjoyed it. Even if you do not explicitly employ this tactic, I encourage you to have a discussion with your team or remain aware of this number.
Hopefully, your team will crawl-walk-run before too long.
Special thanks to Shayon Mukherjee for reading early drafts of this post and providing feedback.
1. It used to be much higher. We bought a house 3 months ago. Home ownership, oof!
2. A few fellow engineers I shared a draft of this post with were not familiar with profit and loss statements, where they fit into the business, and where engineering sits in the ideal case. The gross margin (revenue minus cost of goods sold (COGS)) is an incredibly important number. It tells us if we have a sustainable business. If revenue minus COGS is negative, we are subsidizing our customers’ activity. Engineering, product, and design are usually categorized as research and development (R&D) because they are creating future value. But if you have engineers turning the crank, they are part of servicing the customers today. I would make the case that they should be included in the COGS, thereby reducing the gross margin and making the business less sustainable. I’ll save that for another blog post.
3. It’s important on this one to measure the date of introduction and not the date of discovery. The two dates are likely different. Preventing introduction of new security vulnerabilities can be tackled with education, safer defaults, more linting rules, etc.