“The Mythical Man-Month” continues to influence how we think about software projects in the 2020s. This 1975 collection of essays by Turing Award winner Frederick Brooks is frequently referenced but rarely read. In other words, a classic.
OS/360
The book recounts what Brooks learned from managing the development of OS/360, the operating system of the IBM System/360, a family of mainframe computers introduced in 1964.
According to Brooks, the project did not go well. The system was late, used more memory than planned, exceeded the budget many times over, and initially didn’t perform well. The book is a post-mortem on what went wrong and why. The author’s central lesson: programming is hard to manage.
What is it that makes software work so hard to plan and keep on schedule?
The mythical man-month
The most famous concept of the book is the idea of “man-month”. It’s a mythical unit of productivity, defined as the amount of work that one person can perform in a month.
It’s “mythical” because, as the author points out, the effect of adding a person to the project depends on many variables: the nature of the work, the amount of coordination needed to collaborate with others, etc. The “man-month” is always an illusory approximation and should be regarded as such.
Parallelizable and non-parallelizable work
Some work is perfectly parallelizable. For example: picking strawberries. By moving from one worker to two, we can expect to cut the time needed to pick the fruits in half. As we keep adding people, we eventually get diminishing returns (adding the 100th worker impacts the completion time less than adding the second one).
Some work is not parallelizable. If a truck takes 3 days to drive from Lisbon to Warsaw, assigning more trucks will not make the cargo arrive faster.
Realistic software projects are somewhat parallelizable. When they’re understaffed, adding people does help to shorten the delivery time. But after reaching the optimal staffing level, we not only hit diminishing returns, we actually make matters worse because of the increased coordination overhead.
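The shape of this trade-off can be illustrated with a toy model. The sketch below assumes that the work itself divides evenly among n people, while coordination cost grows with the number of pairwise communication channels, n(n-1)/2, which is the count Brooks uses for intercommunication effort. The function names, the 120 person-months of work and the 0.2 months of overhead per channel are made-up values chosen only for illustration; they are not taken from the book.

```python
def delivery_time(team_size: int, work: float = 120.0,
                  overhead_per_channel: float = 0.2) -> float:
    """Months to deliver under a toy model (illustrative assumption only).

    work: total effort in person-months, assumed to split evenly.
    overhead_per_channel: schedule cost, in months, of each pair of
    people who need to coordinate (hypothetical constant).
    """
    channels = team_size * (team_size - 1) / 2  # pairwise communication paths
    return work / team_size + overhead_per_channel * channels


def optimal_team_size(max_size: int = 30) -> int:
    """Team size that minimizes delivery time under the toy model."""
    return min(range(1, max_size + 1), key=delivery_time)


if __name__ == "__main__":
    for n in (1, 2, 5, 10, 20):
        print(f"{n:2d} people: {delivery_time(n):6.1f} months")
    print("best team size in this model:", optimal_team_size())
```

With these particular numbers, going from one person to two roughly halves the schedule, while a team of 20 delivers later than a team of 10. The exact optimum depends entirely on the assumed constants; only the overall curve is the point.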
A complete product takes 9x the time
Building an isolated program that works when run by its author is significantly easier than developing a program that a) can be run by anyone (a product, with documentation and tests) and b) can integrate with other elements of a software system.
Each dimension, the author suggests, increases the work needed by a factor of 3:
| | Isolated program | Part of a system |
|---|---|---|
| Proof of concept | work required = 1 | work required = 3 |
| Product | work required = 3 | work required = 9 |
Milestones
The first thing about schedules is that there should be one.
For the schedule to be useful, the milestones should be defined in clear, absolute, objective terms. Brooks cites examples of bad milestones:
- coding 90% finished: coding tends to be “almost finished” for half of the coding time
- debugging 99% finished: there’s a reason why these last two bugs are still unresolved…
- planning complete: that’s an event that one can proclaim at will
Instead, we should opt for milestones that are not subject to hand-waving and interpretation, e.g. design doc approved by all stakeholders, implementation passes all compatibility tests, component released to staging.
The second system effect
The author warns against the general tendency to over-design a successor that comes after an initially successful system. This includes:
- a temptation to add features that are unneeded or wasteful. For example, OS/360 dedicated 26 bytes of a date-handling routine to correctly handle December 31 on leap years (which, as Brooks points out, could be left to the system administrator), and
- a tendency to refine features that are no longer important. For example, OS/360 offered state-of-the-art support for static overlays, yet the system shipped at a time when static overlays themselves were becoming redundant thanks to dynamic memory allocation.
As Brooks recounts, OS/360 was a rich source of relevant examples. The project was the “second system” for most of its architects, who had previously worked on the IBM 1410/7010, the IBM 7030 Stretch and the Project Mercury mission control system.
Team organization
Brooks recommends separating the responsibility for the external specification of the system (in the book referred to as “architecture”) from its implementation.
This is because, for him, conceptual integrity is the most important consideration in system design. The idea is that a small group of architects can develop a coherent specification, which can then be implemented by a larger group of implementers:
- the architects should only prescribe externally-visible aspects of the system, leaving the implementers authority in implementing them effectively
- the implementers should focus on efficient implementation of their part of the specification
Interestingly, Brooks notes that the architects must always be able to demonstrate at least one possible way of implementing each feature, to avoid prescribing the impossible in the specifications.
What worked well
As in every good postmortem, it’s good to note what actually worked well. There are a few ideas highlighted by Brooks as good practices to combat the risks of large software projects.
Prototyping
In chapter 11, “Plan to throw one away”, Brooks suggests that whenever a new technology or a new system is developed, it’s not possible to get it right the first time. Management should plan for throwing away the first version and starting from scratch before the resulting system is shipped to customers.
This seems to be a stretch when taken at face value, but the broader point is the value of prototyping. Indeed, when developing a novel piece of software, having an initial version early allows us to verify assumptions and de-risk the project.
Release management
Although it’s not referred to as such, the two pages of the book describing a “very successful” way of “maintaining system libraries” are in fact a description of an effective release management strategy. Brooks explains that in OS/360 development, each library had 3 versions:
- the playpen version, a live version that the responsible team works on
- the integration testing version, submitted for validation and release to the other teams
- the release version, which passed integration and is available for other teams
The key idea is that only the live (playpen) version can be edited directly by the responsible team – the stable version exposed to other teams can no longer be changed, save for critical bug fixes approved by release managers.
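The three-version scheme can be sketched as a small state model. This is only an illustration of the rules described above, not a reconstruction of the OS/360 tooling: the `Library` class, its method names and the string "contents" are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Library:
    """One team's library with its three copies (hypothetical model)."""
    playpen: str = ""                  # live copy, freely edited by the owners
    integration: Optional[str] = None  # snapshot submitted for validation
    release: Optional[str] = None      # validated copy other teams depend on

    def edit(self, new_contents: str) -> None:
        # Day-to-day work touches only the playpen copy.
        self.playpen = new_contents

    def submit_for_integration(self) -> None:
        # Snapshot the playpen; later playpen edits do not affect it.
        self.integration = self.playpen

    def promote(self, integration_tests_passed: bool) -> None:
        # Only a snapshot that passed integration becomes the release.
        if self.integration is None:
            raise ValueError("nothing has been submitted for integration")
        if not integration_tests_passed:
            raise ValueError("integration failed; release left unchanged")
        self.release = self.integration

    def hotfix(self, patched_contents: str,
               approved_by_release_manager: bool) -> None:
        # The release copy changes only for approved critical fixes.
        if not approved_by_release_manager:
            raise PermissionError("release changes require approval")
        self.release = patched_contents


lib = Library()
lib.edit("v1")
lib.submit_for_integration()
lib.promote(integration_tests_passed=True)
lib.edit("v2 work in progress")  # other teams still see the released "v1"
```

The design choice worth noticing is that ongoing edits and the released copy never share state, so the owning team can keep working without destabilizing anyone who depends on the release.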
Hustle
I guess the modern expression for this is “being proactive”, but I love how Brooks concisely expresses the importance of following up on loose threads around the daily work, reacting fast, and taking care of what needs to be taken care of, whether it’s “our job” or not. Great software engineering teams hustle.
Conclusion
While so much has changed in software engineering since the period of mainframe computers in the 1960s, at the same time, so little has changed.
We’re still prone to making the same mistakes as the authors of OS/360. The spectre of the second-system effect hangs over meeting rooms in companies big and small. Managers and TPMs intuitively think that adding more people == faster progress, falling for the mythical man-month fallacy.
Learn history so that we don’t repeat it!