A Whitepaper on Metrics
Andreas Rau, Steinbeis Transferzentrum Softwaretechnik, 1998, 1999, 2001
Last Change: 2001-08-06

Introduction

In its early days, the making of software was more of an art than a science or field of engineering, and sometimes it still seems that way today. While it may be true that (due to the variety of applications) software development involves more creativity than other engineering disciplines, the problems solved by software have long reached a level of complexity that cannot be managed by sheer intuition anymore. In other disciplines, the use of standard components and feedback by means of measurement are daily practice. Not so in software development. While there are of course (many different) libraries for various applications and the (still unfulfilled) promise of "software ICs" through object orientation and component-based development, we still have a long way to go to reach the level of standardization already common in other fields. In mechanical engineering, everything from nuts and bolts to T-bars is standardized. In software development, more often than not, every organization has its own libraries, and few libraries are standardized and in common use. As for measurement, its application to software has turned out to be rather difficult due to the inherent complexity and intangibility of the subject. Measuring, say, the complexity of a program is just not as easy as measuring length or weight. As described in [Pfleeger97], there is a huge gap between the research on metrics and its use in industry.

Still, the benefits that can be gained from software measurement are substantial. There is, however, one fundamental difference between software measurement and the physical measurement we are used to: the results of software measurement are not final truths. There is always more than one way to measure, say, program size, and hardly ever does measurement allow an absolute statement about the subject. Part of this is because software comes in so many variations and is used in domains often radically different from one another. Another reason is that some of the attributes we are interested in are far from well defined. Even something as seemingly simple as program size turns out to be quite ambiguous when it comes to measurement, and matters are much worse for attributes like complexity, quality (and all its aspects), productivity, etc. So the objective of software measurement cannot be to make an absolute statement about a software product and its relation to others (on the market). In fact, at this time such a comparison, even between competitors in the same domain, is very difficult even if "the same" metrics are used for all subjects. Rather, measurement is intended as a tool to monitor, control and optimize the development process in a continuous measure-analyze/understand-optimize cycle, building an empirical database for the particular environment or organization along the way. This database can then be used to derive estimates for the cost and schedule of new projects and to provide reference data against which to monitor their progress. Comparison with historical data is also vital for evaluating the effectiveness of changes and optimizations to the development process.

Of course, in order for the estimates to be meaningful and the data to be comparable, the metrics must be carefully chosen and applied in a consistent manner. That is, before starting to measure, it is necessary to identify a goal or problem to motivate the measurement program and then define or select appropriate metrics for it. Effort, for example, is better measured in person-months than as cost, because costs vary over time due to market conditions and inflation, whereas a person-month (pretty much) stays a person-month. Note, however, that a person-month (or person-year) is not the same in different countries, because the length of a workday, overtime regulations and vacation vary. For that reason, great care must be taken when comparing such data, and using data collected by others can hardly replace an organization's own empirical data. As will be discussed later on, there are simply too many traps and ambiguities to trip over, and an organization's own procedures and culture have too big an impact. Consistent application of metrics is obviously facilitated by automated tools. They can also contribute a great deal to the acceptance of a metrics program. So what is a good metric then? According to [Mills83], a good metric should satisfy a number of criteria.

In addition, metrics should be easy to interpret, which is facilitated by values belonging to appropriate measurement scales (see below).

Finally, another possible use of metrics that should not go unnoticed is to support decision processes by helping to evaluate possible alternatives based on a set of requirements and target values or ranges.

Measurement Scales

There are different ways to express the data collected in software measurement. As described in [Mills88], statisticians recognize four different types of measured data, or measurement scales, each with its associated set of possible operations. As the collection of data and its use for estimates truly is a statistical method, it is important to be aware of these scales. Otherwise, if inappropriate operations are used for analysis, the results will be useless. The following table gives an overview of the measurement scales and possible operations:

Type of Data   Description of Data   Possible Operations                    Explanation
Nominal        Classification        equal, not equal                       named categories with no attached value
Ordinal        Ranking               greater/better, less/worse, median     named categories with ordered values
Interval       Differences           addition/subtraction, mean, variance   numbers without an absolute zero
Ratio          Absolute Zero         relation (ratios)                      numbers with an absolute zero

Every type of data inherits the operations of the types above it in the table.

When collecting nominal data we use one of a fixed number of named categories for our values (e.g. kind of application). A value can either fall into one (or more) categories or not, and two values can either be equal (fall into the same category) or not. Because the categories have no attached value or natural order, other operations are not possible. Although subjective most of the time, nominal data can also be objective if there are strict rules (e.g. threshold values or well defined properties) for classification.

Ordinal data, on the other hand, uses categories with attached values in a defined order and hence also allows ranking of values (low-high, better-worse, more-less, ...). Again, depending on whether the rules for classification are hard or soft, we might end up with a subjective or an objective metric.

Interval type data is expressed by numbers and adds the possibility to compute meaningful differences between values. No absolute zero is defined, however, and thus ratios may or may not make sense (e.g. 4 is not necessarily twice as much, or twice as good, as 2). As the values are normally found by some well defined procedure, interval data is usually used to express objective metrics.

Ratio type data, like interval type data, is expressed by numbers. Because an absolute zero is defined, not only differences but also ratios make sense. Again, the values are measured in the true sense of the word rather than guessed, so ratio type data also normally indicates an objective metric.
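
To illustrate the point, the following sketch (Python, with invented operation and scale names) encodes which operations are admissible on each scale, so that an analysis could reject meaningless computations such as taking the mean of ordinal rankings:

# Illustrative sketch: admissible operations per measurement scale.
# The scale names follow the table above; the operation names are examples only.

SCALE_OPERATIONS = {
    "nominal":  {"equal"},
    "ordinal":  {"equal", "rank", "median"},
    "interval": {"equal", "rank", "median", "difference", "mean", "variance"},
    "ratio":    {"equal", "rank", "median", "difference", "mean", "variance", "ratio"},
}

def check_operation(scale, operation):
    """Raise an error if the operation is not meaningful on the given scale."""
    if operation not in SCALE_OPERATIONS[scale]:
        raise ValueError(f"'{operation}' is not meaningful for {scale} data")

check_operation("ratio", "mean")        # fine: effort in person-months is ratio data
try:
    check_operation("ordinal", "mean")  # rejected: a mean of rankings is meaningless
except ValueError as err:
    print(err)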

Measurement Classification

According to [Mills88] and [Moeller93], metrics can be classified by different aspects: product vs. process metrics, objective vs. subjective metrics, phase vs. global metrics, direct vs. indirect metrics, primitive vs. computed metrics, and snapshot vs. moving metrics. These categories can also be found in other publications and are widely accepted.

Product metrics identify attributes of the software product itself. They may be applied at different stages of development. Note that the term software product is not restricted to the actual implementation but also encompasses the specification, design and documentation. The attributes of the implementation are measured by a subset of product metrics called code-based metrics. Attributes of primary interest here are size, (structural) complexity and (number of) bugs. In the early days of software measurement, great expectations were directed towards code-based metrics. Most of them have not been met, but code-based metrics are still a valuable instrument in software measurement. Process metrics, on the other hand, measure properties of the development process and its phases, such as their cost and duration compared to the schedule, the number of iterations per phase, or the amount of equipment and human resources and the skill of the latter.

The distinction between objective and subjective metrics depends on whether the value of a metric is reproducible by any (qualified) observer or dependent on his or her expertise, opinion, etc. It is closely related to the definition of the different measurement scales: anything expressed by categories or rankings tends to be subjective, whereas something that can be expressed by a numeric value is more objective. This does not imply that objective measures are better than subjective ones. On the contrary, [Moeller93] states that, although easier to measure, objective metrics are often much harder to interpret with regard to management objectives than subjective ones (an objective complexity of n might or might not be good; subjectively "good" customer satisfaction certainly is).

There are metrics to monitor the development process at different levels. Following a common definition as in [Moeller93], those which are concerned with a single phase of the development process only are called phase metrics. Because of their limited context, they are usually easy to interpret and can be used for short-term (action) decisions. They can be supplemented by global metrics, which cover multiple (subsequent) phases or even the whole development process and are more long-term (vision) oriented. As soon as the metrics program is successfully established (i.e. in operation for 2-5 years), additional (derived) metrics can be defined for fine-tuning the activities within each phase.

The distinction between direct and indirect metrics is based on the way a metric is measured. Size, for example, can be measured directly, whereas quality or complexity can only be measured indirectly by breaking them down into different aspects. This must not be confused with the distinction between primitive and computed (derived) metrics. The term primitive metric is by no means a statement about usability but refers to the fact that these values are available immediately after (direct) measurement. It is true that by themselves they are often hard to interpret. Still, they are the basis for computed metrics, which might be more valuable but cannot exist without them. Examples of primitive metrics are program size or total effort; examples of computed metrics are quality defined as the number of bugs in a normalized portion of code, or productivity as the amount of code produced in a given amount of time.

Finally, there is the distinction between snapshot metrics, which measure a momentary state or condition, and moving metrics, which measure dynamic behaviour in terms of rates (of events) or trends (for a state or condition) and typically require a database of historical information.

Primitive Metrics

Primitive metrics provide raw data, "physical" attributes of the software, that are later used as inputs for computed metrics. Such attributes are bugs, cost/effort, duration/time and size.

Bugs

Bugs can simply be counted as they are found and fixed, respectively. In addition, it is a good idea to classify them and maintain independent counters for each category. One useful classification is by severity. As there are always some nasty bugs nobody wants to fix, another useful metric is the lifetime of individual bugs. Obviously, severe bugs should be fixed first, and a severe bug that does not get fixed for a long time calls for intervention. A metric related to bug lifetime is the turn-around time for changes.

The phase of development where bugs are found and where they were introduced should also be recorded. The sooner a bug is found, the better, so it is reasonable to try to optimize here. Finally, for quality assurance and feedback to the programmers, the kind of bug (I/O, memory management, ...) might be of interest. Both counting and classification can be assisted by a bug tracking tool.

Not only the absolute (primitive) numbers of bugs and fixes are of interest, but also their (computed) differences, averages and rates. While the absolute number of (non-severe) remaining bugs (= difference between bugs found and bugs fixed) often decides whether a product ships, the bug rate provides an idea of how many more bugs are to be expected and whether testing should be continued or not. It can also be used for an estimate of the mean time between failures (MTBF).
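
As a sketch of how such counts might be derived from a bug tracker (the record layout below is hypothetical), consider:

# Minimal sketch: deriving primitive bug counts and a simple weekly find rate
# from bug-tracker records of the form (id, severity, found, fixed-or-None).
from datetime import date

bugs = [
    (1, "severe", date(2001, 7, 2),  date(2001, 7, 9)),
    (2, "minor",  date(2001, 7, 4),  None),
    (3, "severe", date(2001, 7, 16), None),
]

found = len(bugs)
fixed = sum(1 for _, _, _, f in bugs if f is not None)
remaining = found - fixed                      # open bugs
severe_open = sum(1 for _, s, _, f in bugs if s == "severe" and f is None)

# lifetime of fixed bugs, in days
lifetimes = [(f - d).days for _, _, d, f in bugs if f is not None]

# find rate: bugs found per calendar week of the observation period
first, last = min(d for _, _, d, _ in bugs), max(d for _, _, d, _ in bugs)
weeks = max((last - first).days / 7.0, 1.0)
find_rate = found / weeks

print(remaining, severe_open, lifetimes, round(find_rate, 2))  # 2 open, 1 severe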

For documents, bugs can be interpreted as the number of corrections resulting from a review (except for rare cases, spelling "bugs" are certainly not critical).

Cost/Effort

Cost is what it all boils down to for management. But as stated before, effort is much better suited for building a database of reference values, because its meaning does not change (as much) over time. Note also that cost depends not only on wages, but also on the efficiency of the people and the process they are involved in. Still, cost is important for evaluating an organization's position with respect to its competitors and the market. In order to improve estimates, the actual cost should be compared to the budget (in relative and absolute terms). Data acquisition for both is pretty straightforward, except for the aforementioned problems of converting between different units (person-months to hours, etc.) and comparing between different countries. An example conversion table is shown below.

Effort Conversions
1 person-day (PD) = 8h
1 person-week (PW) = 5 PD = 40h
1 person-month (PM) = 4 PW = 20 PD = 160h
1 person-year (PY) = 10 PM = 40 PW = 200 PD = 1600h
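
A small conversion helper based on the factors of this example table might look as follows (a sketch only; a real metrics program would plug in its own factors, which differ by country and organization):

# Sketch of an effort-unit converter using the example factors from the table above.
HOURS_PER = {
    "h": 1,
    "PD": 8,        # person-day
    "PW": 40,       # person-week  = 5 PD
    "PM": 160,      # person-month = 4 PW
    "PY": 1600,     # person-year  = 10 PM
}

def convert_effort(value, from_unit, to_unit):
    """Convert an effort figure between the units defined above."""
    return value * HOURS_PER[from_unit] / HOURS_PER[to_unit]

print(convert_effort(3, "PM", "PD"))   # 3 person-months = 60 person-days
print(convert_effort(420, "h", "PW"))  # 420 hours = 10.5 person-weeks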

Duration/Time

This refers to the duration of the whole process as well as of each individual phase or typical activity. The problem with duration is not the measuring itself but exactly defining the endpoints, that is, when a phase is (successfully) completed and the next one begins. Duration is not only interesting as a pure value but also with respect to the schedule, i.e. deviation from deadlines (relative and absolute). Time, or slices thereof, can be used to normalize all kinds of primitive metrics to define rates, such as bugs found per week.

Size

Software size is probably the most important primitive metric. It can be used both directly, e.g. to monitor progress, and to normalize other metrics to make them comparable among different projects. Trying to use it for measuring productivity or comparing productivity between different technologies, projects or organizations can easily lead to wrong conclusions because of other factors like inherent complexity, expressive power, etc. (see the LOC example below). Besides, even for something as straightforward as program size, there are many definitions (see below) and thus many ways to measure it. Fortunately, there are currently only two ways in common use. Document size is usually much less of a problem, because it can simply be measured in terms of pages.
Lines of Code
The traditional way of measuring program size is by counting lines of code (LOC), or rather thousands of them (kLOC). Although counting lines of code sounds simple, a closer look reveals a number of open questions that must be answered before starting to measure. The most important ones concern the definition of what a line of code really is: Do blank lines and comment lines count? Are declarations counted? How are multiple statements on one line, or a single statement spread over several lines, handled?

There are good arguments for almost any combination of answers, but it is not really important HOW these questions are answered; what matters is THAT they are answered and that the metric is applied in a consistent fashion or, better yet, automated with tools to get comparable results. As counting LOC is offered by most tools, this should not be a problem. Otherwise, as demonstrated in [Thaller94], results might differ by as much as 100%.
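
As a sketch of what such a tool might do, the following counter implements one possible convention for C-like sources (count non-blank lines that are not pure comment lines); other conventions are equally defensible, which is exactly why the rules must be fixed up front:

# Sketch of a LOC counter under ONE possible convention: count non-blank lines
# that are not pure comment lines. Deliberately simplified (e.g. it ignores
# code following a closing block comment on the same line).
def count_loc(lines):
    loc = 0
    in_block_comment = False
    for line in lines:
        stripped = line.strip()
        if in_block_comment:
            if "*/" in stripped:
                in_block_comment = False
            continue
        if not stripped:                      # blank line
            continue
        if stripped.startswith("//"):         # pure line comment
            continue
        if stripped.startswith("/*"):         # pure block comment
            if "*/" not in stripped:
                in_block_comment = True
            continue
        loc += 1                              # counted as a line of code
    return loc

source = """\
/* add two numbers */
int add(int a, int b)
{
    // trivial
    return a + b;
}
"""
print(count_loc(source.splitlines()))   # 4 under this convention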

The actual lines counted also depend on the kind of project. For new software, the total lines of code (TLOC) are of primary interest, while for software maintenance the new lines of code (NLOC), modified lines of code (MLOC) and deleted lines of code (DLOC) (or their total) are much more suitable. Code reused from other projects (RLOC) must also be considered separately.

The LOC paradox arises when LOC is used to compare, say, productivity between different programming languages, as done in [Thaller94]. Because they require many statements where a high-level language requires only one, low-level languages like assembly language appear to be much more productive - which of course they are not, as a glance at the cost and development time usually reveals. Bug rates normalized by LOC are distorted in the same way. Using the object code instead of the source code does not help, because this makes optimizing compilers seem less productive. If such comparisons are needed, probably the best solution is to normalize LOC to assembler-equivalent LOC. A table with empirical conversion factors can be found in [Thaller94]. Some examples:

Language     Conversion Factor (assembler LOC per LOC)
Assembler    1
C            2.5
Ada          4.5
Smalltalk    15
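
Normalization with such factors could then be sketched as follows (the factors are the example values above, taken from [Thaller94]; an organization should calibrate its own):

# Sketch: normalizing LOC to assembler-equivalent LOC with empirical factors.
ASM_FACTOR = {"Assembler": 1.0, "C": 2.5, "Ada": 4.5, "Smalltalk": 15.0}

def assembler_equivalent(loc, language):
    return loc * ASM_FACTOR[language]

# 2000 lines of Smalltalk represent roughly as much functionality as
# 30000 lines of assembler; comparing raw LOC would hide this.
print(assembler_equivalent(2000, "Smalltalk"))   # 30000.0
print(assembler_equivalent(2000, "Assembler"))   # 2000.0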

Another disadvantage of LOC is that counting cannot start until implementation is (almost) finished, so it is essentially useless for making estimates. Also, considering only the implementation phase, it is blind to the effort that goes into the other phases, especially those before implementation. This becomes worse as the project grows and implementation becomes shorter relative to the other phases. But the rate at which LOC increases gives at least some feedback on progress during implementation.

However, despite these ambiguities and disadvantages, LOC are useful because of the good tool support and the vast amount of reference material in existence.

Function Point Analysis
Developed by A.J. Albrecht of IBM in the late 1970s, this approach tries to eliminate some of the disadvantages of LOC by deriving the size of a program not from the code but from its (specified) functions as viewed by the user. This leads to a metric which is independent of the programming language and technology used. Thus, it can be used to normalize and compare results from different environments. Also, because the functions can be derived from the specification, program size can be determined earlier in the process than with LOC and can therefore be better used for estimates. Keep in mind, though, that the conversion between FP and LOC cannot be expected to be linear, because the size of the implementation depends not only on the number of functions but also on their complexity.

The core of the method consists of measuring various aspects of the software's interfaces to users, files and other systems and weighting them to compute a Function Point (FP) rating that can be compared across projects. This basis seems reasonable for an information processing system and is the only one that is initially available. Since its publication, the method has been refined by introducing additional corrective factors to account for different requirements and application domains. To improve consistency and repeatability, the counting practices have been standardized in the International Function Point Users Group's Counting Practices Manual, Release 4.0, and there is also a variant called Mk II FPA. The general problem with FP is that the computed rating is purely abstract and the procedure requires a lot of experience, which makes it very difficult to automate. Also, it can hide the effort behind the functions, which can be substantial when complicated algorithms are used. Details about FP can be found in [Albrecht83].
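
As an illustration only, a heavily simplified sketch of the unadjusted part of such a count is shown below; it uses the commonly published weights per complexity class and omits the corrective (value adjustment) factors and all detailed counting rules:

# Simplified sketch of an unadjusted Function Point count. The weights are the
# commonly published values per complexity class; corrective factors and the
# detailed counting rules from the Counting Practices Manual are omitted.
WEIGHTS = {                   # (simple, average, complex)
    "external_inputs":      (3, 4, 6),
    "external_outputs":     (4, 5, 7),
    "external_inquiries":   (3, 4, 6),
    "internal_files":       (7, 10, 15),
    "external_interfaces":  (5, 7, 10),
}

def unadjusted_fp(counts):
    """counts maps each component type to (n_simple, n_average, n_complex)."""
    total = 0
    for component, numbers in counts.items():
        total += sum(n * w for n, w in zip(numbers, WEIGHTS[component]))
    return total

example = {
    "external_inputs":      (5, 3, 1),
    "external_outputs":     (2, 4, 0),
    "external_inquiries":   (3, 0, 0),
    "internal_files":       (1, 2, 0),
    "external_interfaces":  (0, 1, 0),
}
print(unadjusted_fp(example))   # 104 unadjusted function points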

As suggested in the paper "Forgotten Aspects of Function Point Analysis" by Paul Goodman and Pam Morris, the decomposition of a system necessary for Function Point Analysis formally reflects the (functional) requirements of the system. When this decomposition is applied to business functions and the systems supporting them, missing and duplicate functions can easily be detected. When building a system, the decomposition can be used as a meaningful scale to monitor progress. The FP rating also roughly reflects the complexity of each logical transaction at the lowest level of decomposition and indicates where this complexity might come from (I/O or computation). Thus it could be used as a tool to assign implementation work to the appropriate people. Because there is empirical data which allows conversion from FP to LOC, a two-phase approach could exploit both the new benefits of FP and the vast amount of experience based on LOC.

Halstead's Software Science
Devised in the 1970s by Maurice Halstead (see [Halstead77]), this is a very formal approach to defining program size and deriving various estimates. It is not really a primitive metric, but as it measures size similarly to LOC it fits here and makes for a nice transition to computed metrics. The method is based on the idea of tokens, which stand for any syntactic entity that can be identified by the compiler. There are two kinds of tokens: operators and operands. Operators are "true" operators (+, -, ...) plus all keywords (while, for, do, ...); operands are the data they operate on (literals, variables, ...). Based on these, program size is then defined as

L = N1 + N2

where N1 and N2 are the total numbers of operators and operands. This can be interpreted as counting statements, as with LOC, and weighting them by their complexity (number of arguments). Corresponding to N1 and N2, Halstead defines n1 and n2 as the numbers of unique operators and operands and uses them to estimate the program size as

Lg = n1*log2(n1) + n2*log2(n2)

Halstead also defines a measure of complexity he calls program volume V and derives the expected number of bugs B from it:

V = L*log2(n1+n2), B = V / 3000

Because of their formal nature, these metrics can be computed by many tools. Their usefulness, however, is not clear (to me).
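
For illustration, a minimal sketch of the basic counts for a toy expression language is shown below; the naive operator/operand split is an assumption, and getting this classification right for a real language is exactly the hard part:

# Sketch of Halstead's basic counts for a toy expression language.
import math
import re

OPERATORS = {"+", "-", "*", "/", "=", "(", ")", "if", "while", "return"}

def halstead(source):
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", source)
    operators = [t for t in tokens if t in OPERATORS]
    operands  = [t for t in tokens if t not in OPERATORS]
    N1, N2 = len(operators), len(operands)             # total occurrences
    n1, n2 = len(set(operators)), len(set(operands))   # unique tokens
    length   = N1 + N2                                  # L  = N1 + N2
    estimate = (n1 * math.log2(n1) + n2 * math.log2(n2)) if n1 and n2 else 0
    volume   = length * math.log2(n1 + n2)              # V  = L * log2(n1 + n2)
    bugs     = volume / 3000                            # B  = V / 3000
    return length, estimate, volume, bugs

print(halstead("x = a + b * (a - 1)"))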

Computed Metrics

Complexity

Being purely abstract, complexity is easy to talk about but hard to express in numbers. Still, it is of interest both for (cost) estimates and for judging the quality of a design or an implementation. While it is very difficult to measure the complexity of a specification, it becomes somewhat easier as we move on to design and implementation (of course, it might already be too late by then...). Some metrics designed to measure other aspects, like FP for example, can also give an idea of complexity. Other metrics were specifically designed with complexity in mind.
McCabe's Cyclomatic Complexity
Among the most popular methods to measure implementation complexity is the cyclomatic complexity defined in [McCabe76]. The approach is based on the control flow graph, whose edges and nodes are counted, leading (for a single connected graph) to

V(G) = e - k + 2

where e is the number of edges and k is the number of nodes. In a structured program (without jumps or gotos) this corresponds to the number of enclosed areas in the control flow graph increased by 1. A linear sequence of statements does not add to the complexity. Every branch with n exits increases the complexity by n-1. For an if-statement n is 2, so the increase in complexity is 1 - no matter whether there is an else branch or not. Loops also increase the complexity by 1, because every loop can be interpreted as containing an implicit if-statement that decides whether to continue or abandon the loop. McCabe gives 10 as a complexity limit beyond which redesign of a module is advisable.
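
A minimal sketch of the computation from a control flow graph (with hypothetical node labels) is shown below; for a single connected graph with one entry and one exit, counting decisions plus one gives the same number:

# Sketch: cyclomatic complexity from a control-flow graph given as directed edges.
def cyclomatic_complexity(edges):
    nodes = {n for edge in edges for n in edge}
    return len(edges) - len(nodes) + 2          # V(G) = e - k + 2

# if/else inside a while loop:
cfg = [
    ("entry", "while"),
    ("while", "if"), ("if", "then"), ("if", "else"),
    ("then", "join"), ("else", "join"),
    ("join", "while"),                          # loop back edge
    ("while", "exit"),
]
print(cyclomatic_complexity(cfg))   # 3: one loop + one if + 1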

Design Stress
Design complexity is much harder to measure than implementation complexity, because what distinguishes a good design from a bad one is not easy to define. A promising approach is to start from commonly accepted design guidelines and try to determine the degree to which they have been met. As reported in [Thaller94], Wayne and Dolores Zage have applied this approach to the principles of cohesion and coupling as defined by Glenford J. Myers and came up with two metrics for what they call design stress. The external design stress corresponds to coupling and is defined as

Se = e1*(inflow*outflow) + e2*(fan_in*fan_out)

where e1 = 1 and e2 = 1 are weights, inflow and outflow are the numbers of arguments and return values received from and passed to other modules, and fan_in and fan_out are the numbers of modules that can call this module or can be called by it, respectively.

The metric for internal design stress does not model cohesion as closely. It is defined as

Si = c1*CC + c2*DSM + c3*IO

where c1 = c3 = 1 and c2 = 2.5 are weights, CC is the number of central calls, DSM counts uses or references of complex data types such as pointers and structures, and IO counts accesses to external devices like files, screen, keyboard or printer.

Both design stress metrics are not used to judge individual modules in absolute terms but rather to identify the modules with the highest complexity within a system, so that they can be redesigned, given to the most experienced developers, or reviewed more thoroughly than others.
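
A sketch of such a ranking with the weights given above is shown below; the per-module counts are invented and would in practice be extracted from the design documents or the code:

# Sketch: ranking modules by combined design stress (weights as given above).
E1, E2 = 1.0, 1.0
C1, C2, C3 = 1.0, 2.5, 1.0

def external_stress(inflow, outflow, fan_in, fan_out):
    return E1 * (inflow * outflow) + E2 * (fan_in * fan_out)

def internal_stress(central_calls, complex_data_uses, io_accesses):
    return C1 * central_calls + C2 * complex_data_uses + C3 * io_accesses

modules = {
    # name: (inflow, outflow, fan_in, fan_out, CC, DSM, IO) -- invented counts
    "parser":    (6, 2, 3, 5, 4, 10, 1),
    "report":    (2, 1, 1, 2, 1,  2, 6),
    "scheduler": (8, 4, 6, 7, 9, 14, 2),
}

def total_stress(counts):
    return external_stress(*counts[:4]) + internal_stress(*counts[4:])

# Highest-stress modules first: candidates for redesign or extra review.
for name, counts in sorted(modules.items(), key=lambda m: total_stress(m[1]), reverse=True):
    print(name, total_stress(counts))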

Productivity

According to [Moeller93], productivity is measured as the amount of work (size) completed with a given effort, where "completed" usually means "has passed quality control", i.e. a document has been written, reviewed, corrected and accepted. Depending on the primitive data available and the phase of development, productivity might be expressed, for example, as pages per person-day for documents or as NLOC per person-day for code.

The absolute values will vary over development time. The average values for pages/day or NLOC/day can be expected not to vary much across different projects and therefore give a more general idea of an organization's productivity. Another way to express productivity is by computing the cost per unit as

cost per unit = cost / size (e.g. cost per page or per NLOC)

where cost is the total cost for the respective phase.
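
A trivial sketch with invented figures:

# Sketch: productivity and cost-per-unit for an implementation phase.
nloc       = 12000      # new lines of code accepted after review
effort_pd  = 300        # person-days spent in the phase
total_cost = 180000     # total cost of the phase

productivity  = nloc / effort_pd      # 40 NLOC per person-day
cost_per_unit = total_cost / nloc     # 15 per NLOC

print(productivity, cost_per_unit)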

Quality

Like complexity, quality is not easy to define, much less to measure. A common metric that can be found in [Moeller93] defines quality as the degree to which a product is bug-free (where bug not only refers to a crash or a wrong result, but also includes not meeting a functional or performance requirement). Using this definition, quality can be (inversely) expressed as the number of bugs in the program or, better yet, the number of bugs normalized by program size.

For documents, the corresponding figure is "corrections per page"; less than one correction per page is OK.
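
Again a trivial sketch with invented figures:

# Sketch: quality expressed as normalized bug counts.
bugs_found  = 96
kloc        = 12.0       # program size in thousands of lines of code
corrections = 35
pages       = 60         # size of the reviewed document

defect_density       = bugs_found / kloc       # 8 bugs per kLOC
corrections_per_page = corrections / pages     # ~0.58, below the 1.0 guideline

print(defect_density, corrections_per_page)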

Landmarks (must read)

[Albrecht83] - Metrics
Albrecht, A.J. and J.E. Gaffney, Jr. Software Function, Source Lines of Code and Development Effort Prediction: A Software Science Validation, IEEE Transactions on Software Engineering SE-9,6 p639-648, Nov. 1983, A comparison between FP, software science and LOC with a detailed appendix on applying FP
[Halstead77] - Metrics
Halstead, M.H. Elements of Software Science, New York: Elsevier North Holland, 1977, The original book by Halstead on his software science. Another classic
[McCabe76] - Metrics
McCabe, T.J. A Complexity Measure, IEEE Transactions on Software Engineering SE-2,4 p308-320, Dec. 1976, McCabe's classic paper on the cyclomatic complexity of a computer program
[Watts90] - Metrics, Process
Humphrey, Watts S. Managing the Software Process, Reading MA, 1990, Definition of the Capability Maturity Model

Bibliography (have read)

[Mills88] - Metrics
Mills, Everald E. Software Metrics, SEI Curriculum Module SEI-CM-12-1.1, Carnegie Mellon University, A good overview of product and process metrics with an exhaustive bibliography
[Moeller93] - Metrics, Process
Möller, K.H. Software-Metriken in der Praxis, Handbuch der Informatik, R.Oldenbourg Verlag, A high level description of useful metrics and their use within the development process with startup guidelines and examples
[Thaller94] - Metrics, Process
Thaller, G.E. Software-Metriken einsetzen, bewerten, messen, Verlag Hans Heise 1994, A detailed description of basic metrics and ideas for their application.
[Pfleeger97] - Metrics
Pfleeger, Shari Lawrence et al. Status Report on Software Measurement, IEEE Software 14(2) [Special Issue on Measurement], p33-43, 1997, A detailed report on the current use and problems of measurement.