Measuring accurately underpins science. Frances Buontempo considers what constitutes sensible metrics.
Having previously observed that time tends to fly by [ Buontempo16a ], I considered starting to measure how I am actually spending my time. This naturally led on to consideration of all the things we actually measure and why, causing the usual lack of time to write an editorial. Those regular readers who have been tracking my progress so far will find this means I have written a total of zero editorials to date. We have previously considered how many different authors Overload has had [ Buontempo15 ] and how many articles they have written. We have had a few new authors since then, and of course would welcome even more. Simple counts like these can be illuminating.
Measuring can be a powerful tool. It helps you to observe changes which you might not otherwise have noticed. It allows us to verify, or perhaps invalidate, a hypothesis. You can track your weight, or heart rate, or time taken to run a given distance, in order to monitor your health. You can track when you go to bed, and how long you sleep for, to find patterns in order to form better habits. Are you getting enough sleep? Do you usually go to bed at the same time? Some measurements can just report back a pass/fail, for example a build, with tests of course, on a continuous integration server. With this simple initial setup you can take it further to get more nuanced information, for example code coverage, number of warnings in the build (obviously zero, even if you had to #pragma turn off warnings to achieve this [ King16 ]). There are various metrics for code quality we can use. Some are based on possibly arbitrary, though intuitively appealing, ideas such as McCabe’s cyclomatic complexity, and cohesion [ McCabe76 ]. Recently, many organisations have been adopting agile, and therefore started to measure how a team is doing, frequently through velocity. Jim Coplien spoke to the Bath Scrum User Group [ BSUG ] just before this year’s ACCU conference about Scrum. In particular, he polled the audience, another good way of measuring things, to see if they knew why we have the Daily Scrum (clues: it’s not to answer three questions, though that may be the format it takes), and what a team’s sprint failure rate will be. Coplien constantly encouraged us to ask why we were doing things a certain way and not just do something for the sake of performing a ritual because ‘the literature’ said it would work. In fact, part of his mission is to build a body of pattern literature for the agile community to share. He did encourage us to just measure three main things: velocity, without correcting for meetings, holidays and the like, return on investment and finally happiness. Who starts a retrospective by asking if the last sprint made people happy or sad? Does your velocity have a direction? Stepping beyond scrum and process management, he pointed out another measure for that evening. His quick on the spot head-count indicated very few women were present, which is an issue Anna-Jayne Metcalfe dug into in her closing keynote, and lightning talk. The keynote, ‘Comfort Zones’, is now on the newly created ACCU conference YouTube channel [ ACCU channel ].
This closing keynote, ‘Comfort Zones’, touched on the topic of anger, and aggression as a chosen response, in particular if people feel threatened or are frightened by change. Most of us will have heard the suggestion to ‘count to ten’ before responding to something we find annoying. Simply counting can make a difference. Getting to ten before responding gives you a chance to choose your reaction, rather than just responding in anger. In the Hitch Hiker’s Guide to the Galaxy, Ford “carried on counting quietly” to annoy computers [ Adams79 ]. Before a team I previously worked with started trying to measure test coverage, some of us just counted how many tests we had – seeing the number go up was motivating. Even setting up an empty test project on Jenkins tended to sprout some actual tests relatively quickly. Several people I know have managed to lose a lot of weight recently. Just weighing themselves, or counting calories, or steps taken every day provided ‘weigh points’ on the journey and were motivating. OK, bad joke, way points are reference points in space, though beacons, “Nodnol, 871 selim” [ Red Dwarf ] and so on show any units can be used, including kilograms or pounds. Just counting things can be informative. Sometimes the numbers ‘fall into the wrong hands’ though, or form the basis of a macho competition. I have previously worked in companies where people proudly announced how many hours they had worked yesterday or this week. Putting in long hours is not always heroic. It can mean you are on the path to a heart attack, a marriage breakup, or just being very inefficient. If two people achieve the same goals and one was at work for 7 hours while the other pulled a 12-hour day, which would you rather have on your team? Working hard doesn’t mean working longer. As I said previously, being busy isn’t the same as being productive [ Buontempo16a ]. In dysfunctional teams, and organisations, there’s a danger metrics can be used as a weapon.
If a scrum team’s velocity goes down, the solution is not to make them work over the weekend. Neither is it to question the team members one at a time to find a scapegoat, or the person to blame. The team’s velocity is, after all, the team’s velocity. To slightly misquote a Japanese proverb, we are measuring to fix the problem, not the blame.
When we measure we typically have units, though not always. Agile encourages us to use story or estimation points. They capture the perceived effort involved; there will be no linear mapping to the number of hours (or days) an item of work will take. If you use planning poker cards they tend to follow a Fibonacci sequence, to break the team out of a linear mind-set. Various resources, for example [ ScrumPlop1 ], emphasise they are unitless numbers. Furthermore, bear in mind that these are an estimate or guess. Larger stories will tend to be less clear, so many teams and resources strive towards smaller stories, for example ‘Small Items’ [ ScrumPlop2 ]. I have worked on a couple of teams now where we seem to manage seven stories each sprint, regardless of the capacity, tinkered with for holidays and meetings, or even the perceived effort for each item of work. Just counting stories done, sometimes in the midst of many hours playing planning poker or the like, never ceases to amuse me. I recommend having two counts at least – the official way, and any other team metric that works for you. How many items does your team manage each day, or week or sprint? If you don’t work on an agile team, or even a team, do you know what you can achieve? Do you know what’s holding you up? Are you happy? Are you productive? If not, what can you do to change things?
Having considered some areas where a unitless approach can be appropriate, such as counts of completed work or story points, we should remember that most metrics have units. The United Kingdom uses the so-called metric system now, though informally many older people still measure their height in feet or weight in stones. Prior to this we had the ‘imperial’ system. This appears to date back to the 1824 Weights and Measures Act [ Weights ], which the internet assures me wasn’t implemented until a couple of years later. What’s in a name? This was an attempt to provide uniform weights and measures, so that throughout the Empire you knew what you would get if you asked for a gallon, or a slug. Many of the original measures were based on quite natural ideas – a foot is the length of a foot, and so on. If you need to be accurate this could be problematic, hence the attempt to standardise; however, older measures would give a ball-park figure that might be good enough for many situations. The drive to standardisation spread over several centuries [ Britannica ], meaning that American measures forked off from the 1824 Act, leaving us with differences between various units: a US gallon is smaller than a British imperial gallon. Pints differ. The list goes on. Standards change over time. C++ has had many reformulations. This is a good thing. An interesting point to note is that the precise definitions of various measures usually involve another measure. A weight might be given at a specific temperature. The benchmarks and references used vary over time. Indeed, how we accurately measure time has changed. How to measure time could fill an article or even a book. Instead consider the history of the ‘metre’, or ‘meter’ if you will. The French Academy of Sciences chose a definition for the metre based on one ten-millionth of the earth’s meridian along a quadrant [ NIST ]. This allowed more precision than another suggestion to use the length of a pendulum with a half period of one second.
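To get a feel for the pendulum suggestion, here is a minimal sketch of the arithmetic. It assumes an ideal pendulum and the small-angle formula T = 2π√(L/g), neither of which comes from the sources cited here. The function name and the sample values of g are illustrative only.

```python
import math

STANDARD_GRAVITY = 9.80665  # m/s^2, the conventional standard value


def seconds_pendulum_length(g: float = STANDARD_GRAVITY) -> float:
    """Length in metres of an ideal pendulum whose half period is one second.

    Assumes the small-angle formula T = 2*pi*sqrt(L/g). A half period of
    one second means T = 2 s, which rearranges to L = g / pi**2.
    """
    return g / math.pi ** 2


if __name__ == "__main__":
    # Approximate sea-level values of g, chosen for illustration only.
    samples = [("equator", 9.780), ("standard", STANDARD_GRAVITY), ("poles", 9.832)]
    for place, g in samples:
        print(f"{place:>8}: {seconds_pendulum_length(g) * 1000:.1f} mm")
```

Under these assumptions the length comes out a little under one metre, and it differs by roughly five millimetres between the equator and the poles.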
Since gravity varies you would have to be precise about where the pendulum was, which may have required precise coordinates or distance in metres from somewhere, which could prove awkward, being slightly recursive in definition. Having spent time conducting a lot of pendulum experiments in ‘A’ Level physics at school I suspect it’s easy to miss the exact moment it reaches the extrema of the swings. Since the 18th century the definition of a metre has been refined several times. By 1983 it became “the length of a path travelled by light in a vacuum during a time interval of 1/299,792,458 of a second” [ NIST ]. I would be even more likely to blink at the wrong moment using this, than for my pendulum experiments. Where imperial measurements had been based on some natural ideas, like a foot, and tended to be in multiples of 12 or 16 allowing easy division into various ratios, the French Revolution pushed towards multiples of ten. In fact, it seems the French Revolution experimented with a ten day week, admittedly in a twelve month year [ Calendar ]. Apart from the drive to decimalise everything, the renaming of days and months was part of a drive to remove religious names from the days and months. For whatever reason, this didn’t stick. People seem to prefer a seven day week, and long weekends. The attempt to impose a new system of measures doesn’t seem to work. An imperious imposition of the metric system over an imperial system isn’t always successful. Some British people cling to the old imperial measures out of a sense of patriotism. If you trace the units’ origins though, you quickly find that they were imposed by invaders, such as a Roman ‘mile’, or ‘millia’ being a thousand paces. Yard is from Anglo-Saxon ‘gyrd’ for a stick [ Metrics ]. The list is long.
As I finish writing, on May 4th, my musings on the imperial death march [ Imperial March ] conjure a picture of intimidation and fear. Metrics are a way to communicate. They need to be incredibly accurate in order to be scientific. History shows us that as our knowledge increases so our measures change. As we discover new things we will need to continually improve our ‘yardsticks’. Sometimes we don’t need accurate numbers. If you measure code coverage by your tests you can get bogged down in whether you include empty lines or comments, or in deciding whether each part of a logical disjunction (or) is evaluated before a branch is taken, or just one by lazy evaluation, and so on. It might be sufficient to just look at the names of the tests with a customer or product owner to get a sense of coverage of the requirements, rather than the code. Just because you can measure something doesn’t mean you should. Steve Elliot spoke at the Pipeline conference in London this year. He shared many metrics that can be useful from a DevOps perspective: measure ALL the things (starting in a development environment), but he encouraged us to make sure they weren’t used in an Orwellian way [ Buontempo16b ]. Rewarding people who write the most lines of code is asking for trouble. Catching a process that is failing before it hits QA or the customers, however, is better. Constantly reassessing the way you measure is important. The numbers, graphs and dashboards are a way to communicate between the whole team, rather than enable a witch-hunt. Simply talking can solve problems, but it should be based on a mixture of feelings and science.
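The point about disjunctions and lazy evaluation can be made concrete with a small sketch. The function can_edit and its check helper are hypothetical names invented for this illustration. The helper records which operands of the or were actually evaluated.

```python
def can_edit(is_admin, is_owner):
    """Decide access and report which operands of the `or` were evaluated."""
    evaluated = []

    def check(name, value):
        # Record that this operand was reached, then pass its value through.
        evaluated.append(name)
        return value

    # Python's `or` short-circuits: the right operand runs only if the left is falsy.
    if check("is_admin", is_admin) or check("is_owner", is_owner):
        return "edit", evaluated
    return "read-only", evaluated


if __name__ == "__main__":
    print(can_edit(True, False))   # ('edit', ['is_admin'])
    print(can_edit(False, False))  # ('read-only', ['is_admin', 'is_owner'])
```

Those two calls execute every line of can_edit and take both outcomes of the if, yet the case where is_owner alone grants access is never exercised. Whether that counts as ‘covered’ depends on which kind of coverage you decided to measure.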
Change is good. Change is constant. Embrace it, but watch and measure where things are heading. Ask yourself if you are happy. Ask yourself how you can improve. Ask yourself if I will ever write an editorial.
References
[ACCU channel] See Day 4 for the closing keynote. https://www.youtube.com/channel/UCJhay24LTpO1s4bIZxuIqKw/featured
[Adams79] The Hitch Hiker’s Guide to the Galaxy, Douglas Adams, 1979.
[Britannica] http://www.britannica.com/science/British-Imperial-System
[BSUG] Bath Scrum User Group http://www.meetup.com/Bath-Scrum-User-Group/events/228943766/
[Buontempo15] ‘How to write an article’ Overload 125, Feb 2015 http://accu.org/index.php/journals/2061
[Buontempo16a] ‘Where does all the time go?’ Overload 132, April 2016 http://accu.org/var/uploads/journals/Overload132.pdf
[Buontempo16b] http://buontempoconsulting.blogspot.co.uk/2016/03/pipeline-2016.html
[Calendar] https://en.wikipedia.org/wiki/French_Republican_Calendar
[Imperial March] https://www.youtube.com/watch?v=-bzWSJG93P8 See what I did there?
[King16] Guy Bolton King, ACCU Conference, 2016 http://accu.org/index.php/conferences/accu_conference_2016/accu2016_sessions#Without_Warning:_Keeping_the_Noise_Down_in_Legacy_Code_Builds
[McCabe76] McCabe, ‘A Complexity Measure’ IEEE Transactions on Software Engineering 2(4), 1976
[Metrics] http://www.metric.org.uk/myths/imperial#imperial-was-invented-in-britain
[NIST] http://physics.nist.gov/cuu/Units/meter.html
[Red Dwarf] http://reddwarf.wikia.com/wiki/RD:_Backwards
[ScrumPlop1] https://sites.google.com/a/scrumplop.org/published-patterns/value-stream/estimation-points
[ScrumPlop2] https://sites.google.com/a/scrumplop.org/published-patterns/value-stream/small-items