Still, the benefits that can be gained from software measurement are substantial. There is, however, one fundamental difference between software measurement and the physical measurement we are used to: the results of software measurement are not final truths. There is always more than one way to measure, say, program size, and measurement hardly ever allows an absolute statement about the subject. Part of this is because software comes in so many variations and is used in domains that are often radically different from one another. Another reason is that some of the attributes we are interested in are far from well defined. Even something as seemingly simple as program size turns out to be quite ambiguous when it comes to measurement, and matters are much worse for attributes like complexity, quality (and all its aspects), productivity, etc.
So the objective of software measurement cannot be to make an absolute statement about a software product and its relation to others (on the market). In fact, at this time this kind of comparison, even between competitors in the same domain, is very difficult even if "the same" metrics are used for all subjects. Rather, measurement is intended as a tool to monitor, control and optimize the development process in a continuous measure-analyze/understand-optimize cycle, building an empirical database for the particular environment or organization along the way. This database can then be used to derive estimates for the cost and schedule of new projects and to provide reference data against which their progress can be monitored. Comparison to historical data is also vital for evaluating the effectiveness of changes and optimizations to the development process.
Of course, for the estimates to be meaningful and the data to be comparable, the metrics must be carefully chosen and applied in a consistent manner. That is, before starting to measure, it is necessary to identify a goal or problem that motivates the measurement program and then to define or select appropriate metrics for it. Effort, for example, is better measured in person-months than as cost, because costs vary over time due to market conditions and inflation, whereas a person-month (pretty much) stays a person-month. Note, however, that a person-month (or person-year) is not the same in different countries, because the length of a workday, overtime regulations and vacation vary. For that reason, great care must be taken when comparing such data, and data collected by others can hardly replace an organization's own empirical data. As will be discussed later on, there are simply too many traps and ambiguities to trip over, and an organization's own procedures and culture have too big an impact. Consistent application of metrics is obviously facilitated by automated tools, which can also contribute a great deal to the acceptance of a metrics program. So what is a good metric then? According to [Mills83], a good metric should be
In addition, metrics should be easy to interpret, which is facilitated by values belonging to appropriate measurement scales (see below).
Finally, another possible use of metrics that should not go unnoticed is supporting decision processes by helping to evaluate the possible alternatives against a set of requirements and target values or ranges.
| Type of Data | Description of Data | Possible Operations | Explanation |
|---|---|---|---|
| Nominal | Classification | equal, not equal | named categories with no attached value |
| Ordinal | Ranking | greater/better, less/worse, median | named categories with ordered values |
| Interval | Differences | addition/subtraction, mean, variance | numbers without an absolute zero |
| Ratio | Absolute Zero | ratio | numbers with an absolute zero |
When collecting nominal data we use one of a fixed number of named categories for our values (e.g. kind of application). A value can either fall into one (or more) categories or not, and two values can either be equal (fall into the same category) or not. Because the categories have no attached value or natural order, other operations are not possible. Although subjective most of the time, nominal data can also be objective if there are strict rules (e.g. threshold values or well defined properties) for classification.
Ordinal data, on the other hand, uses categories with attached values that have a defined order and hence also allows values to be ranked (low-high, better-worse, more-less, ...). Again, depending on whether the rules for classification are hard or soft, we might end up with an objective or a subjective metric.
Interval type data is expressed by numbers and adds the possibility of computing meaningful differences between values. No absolute zero value is defined, however, and thus ratios might or might not make sense (e.g. 4 is not necessarily twice as much, twice as good, etc. as 2). As the values are normally found by some well-defined procedure, interval data is usually used to express objective metrics.
Ratio type data, like interval type data, is expressed by numbers. Because it has an absolute zero defined, not only differences but also ratios make sense. Again, the values are measured in the true sense of the word rather than guessed, so ratio type data normally indicates an objective metric as well.
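As a minimal sketch of what the four scales permit, the following Python fragment applies to each kind of data only the operations listed in the table above. All sample values and variable names are made up for illustration.

```python
from statistics import mean, median_low, mode

# Hypothetical sample data, one list per measurement scale.
app_kinds = ["batch", "interactive", "batch", "embedded"]  # nominal
severity_order = ["low", "medium", "high"]                 # ordinal ranking
severities = ["low", "high", "medium", "medium", "low"]    # ordinal values
release_years = [1991, 1993, 1994]                         # interval (no absolute zero)
efforts_pm = [4.0, 8.0, 12.0]                              # ratio (person-months)

# Nominal: only equality tests and counting per category make sense.
most_common_kind = mode(app_kinds)

# Ordinal: ranking and median are allowed, a mean is not.
ranks = sorted(severity_order.index(s) for s in severities)
median_severity = severity_order[median_low(ranks)]

# Interval: differences and means are meaningful, ratios are not
# (the year 1994 is not "1.001 times" the year 1992 in any useful sense).
year_span = release_years[-1] - release_years[0]
mean_year = mean(release_years)

# Ratio: an absolute zero exists, so "twice as much effort" is meaningful.
effort_ratio = efforts_pm[1] / efforts_pm[0]

print(most_common_kind, median_severity, year_span, effort_ratio)
```

The point of the sketch is the restriction, not the arithmetic: computing a mean of the severity ranks would run without error but would not be a meaningful value on an ordinal scale.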
Product metrics identify attributes of the software product itself. They may be applied at different stages of development. Note that the term software product is not restricted to the actual implementation but also encompasses the specification, design and documentation. The attributes of the implementation are measured by a subset of product metrics called code-based metrics. Attributes of primary interest here are size, (structural) complexity and (number of) bugs. In the early days of software measurement, great expectations were placed on code-based metrics. Most of them have not been met, but code-based metrics are still a valuable instrument in software measurement. Process metrics, on the other hand, measure properties of the development process and its phases, such as their cost and duration compared to the schedule, the number of iterations per phase, or the amount of equipment and human resources and the skill of the latter.
The distinction between objective and subjective metrics depends on whether the value of a metric is reproducible by any (qualified) observer or dependent on his or her expertise, opinion, etc. It is closely related to the definition of the different measurement scales: anything expressed by categories or rankings tends to be subjective, whereas something that can be expressed by a value is more objective. This does not imply that objective measures are better than subjective ones. On the contrary, [Moeller93] states that, although easier to measure, objective metrics are often much harder to interpret with regard to management objectives than subjective ones (an objective complexity of n might or might not be good, whereas subjectively "good" customer satisfaction certainly is).
There are metrics to monitor the development process at different levels. Following a common definition as in [Moeller93], those which are concerned with a single phase of the development process only are called phase-metrics. Because of their limited context, they are usually easy to interpret and can be used for short term (action) decisions. They can be supplemented by global-metrics, which cover multiple (subsequent) phases or even the whole development process and are more long term (vision) oriented. As soon as the metrics program is successfully established (i.e. in operation for 2-5 years), additional (derived) metrics can be defined for fine-tuning the activities within each phase.
The distinction between direct and indirect metrics is based on the way a metric is measured. Size, for example, can be measured directly, whereas quality or complexity can only be measured indirectly by breaking them down into different aspects. This must not be confused with the distinction between primitive and computed (derived) metrics. The term primitive metrics is by no means a statement about their usability but refers to the fact that these are "instantly ready after (direct) measurement". It is true that by themselves they are often hard to interpret. Still, they are the basis for computed metrics, which might be more valuable but cannot exist without them. Examples of primitive metrics are program size or total effort; examples of computed metrics are quality defined as the number of bugs in a normalized portion of code, or productivity defined as the amount of code produced in a given amount of time.
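To make the relation between primitive and computed metrics concrete, here is a small Python sketch that derives the two computed metrics mentioned above from primitive measurements. The function names, the normalization to 1000 lines and the sample figures are assumptions chosen for illustration, not definitions taken from the cited literature.

```python
def defect_density(bugs: int, loc: int, per: int = 1000) -> float:
    """Computed metric: bugs per `per` lines of code (default: per KLOC)."""
    if loc <= 0:
        raise ValueError("loc must be positive")
    return bugs * per / loc


def productivity(loc: int, effort_person_months: float) -> float:
    """Computed metric: lines of code produced per person-month."""
    if effort_person_months <= 0:
        raise ValueError("effort must be positive")
    return loc / effort_person_months


# Primitive metrics of a hypothetical project.
size_loc = 12_000
bugs_found = 30
effort_pm = 24.0

print(defect_density(bugs_found, size_loc))  # bugs per 1000 LOC
print(productivity(size_loc, effort_pm))     # LOC per person-month
```

Neither computed value could be obtained without the primitive ones, which is why the primitive metrics have to be collected consistently first.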
Finally, there is the distinction between snapshot metrics, which measure a momentary state or condition, and moving metrics, which measure dynamic behaviour in terms of rates (of events) or trends (for a state or condition) and typically require a database of historical information.
The phase of development in which bugs are found and the phase in which they were introduced should also be recorded. The sooner a bug is found the better, so it is reasonable to try to optimize here. Finally, for quality assurance and feedback to the programmers, the kind of bug (I/O, memory management, ...) might be of interest. Both counting and classification can be assisted by a bug tracking tool.
Not only the absolute (primitive) numbers of bugs or fixes are of interest, but also their (computed) differences, averages or rates. While the absolute number of (non-severe) remaining bugs (the difference between bugs and fixes) often decides whether a product ships, the bug rate gives an idea of how many more bugs are to be expected and whether testing should be continued or not. It can also be used to estimate the mean time between failures (MTBF).
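The computed bug metrics named above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the bug rate is taken as a plain average per test period, and the MTBF is estimated in the simplest possible way as observed test time divided by the number of failures.

```python
from typing import Sequence


def remaining_bugs(found: int, fixed: int) -> int:
    """Difference between bugs reported and bugs fixed so far."""
    return found - fixed


def bug_rate(found_per_period: Sequence[int]) -> float:
    """Average number of newly found bugs per test period (e.g. per week)."""
    if not found_per_period:
        raise ValueError("need at least one period")
    return sum(found_per_period) / len(found_per_period)


def mtbf_estimate(test_hours: float, failures: int) -> float:
    """Rough MTBF estimate: observed test time divided by failures."""
    if failures <= 0:
        raise ValueError("need at least one observed failure")
    return test_hours / failures


# Hypothetical test log: bugs found in four consecutive weeks.
weekly_bugs = [12, 9, 5, 2]

print(remaining_bugs(found=sum(weekly_bugs), fixed=25))
print(bug_rate(weekly_bugs))
print(mtbf_estimate(test_hours=160.0, failures=4))
```

A falling sequence such as the one in `weekly_bugs` is the kind of trend that would support a decision to stop testing; the average alone does not show it, so in practice the per-period values are worth keeping.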
For documents, bugs can be interpreted as the number of corrections resulting from a review (except for rare cases, spelling "bugs" are certainly not critical).
There are good arguments for almost any combination of answers. What matters is not HOW these questions are answered but THAT they are answered, and that the metric is applied in a consistent fashion or, better yet, automated with tools to get comparable results. As counting LOC is offered by most tools, this should not be a problem. Otherwise, as demonstrated in [Thaller94], results might differ by as much as 100%.
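How much the counting rules matter can be seen from a minimal LOC counter in which every rule is an explicit parameter. This is a sketch, not a description of any particular tool: the function name and parameters are hypothetical, and for simplicity only full-line comments starting with a given prefix are recognized.

```python
def count_loc(source: str,
              count_blank: bool = False,
              count_comments: bool = False,
              comment_prefix: str = "#") -> int:
    """Count lines of code under explicitly stated counting rules."""
    total = 0
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            # Blank line: counted only if the rules say so.
            total += int(count_blank)
        elif stripped.startswith(comment_prefix):
            # Full-line comment: counted only if the rules say so.
            total += int(count_comments)
        else:
            # Anything else counts as code, including lines
            # that carry a trailing comment.
            total += 1
    return total


sample = """# compute a sum
total = 0

for i in range(3):
    total += i  # trailing comment
"""

strict = count_loc(sample)
generous = count_loc(sample, count_blank=True, count_comments=True)
print(strict, generous)
```

Even on this tiny sample the two rule sets disagree noticeably, which is why the chosen rules have to be fixed once and then applied to every measurement.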
The actual lines counted also depend on the kind of project. For new software, total lines of code (TLOC) are of primary interest, while for software maintenance new lines of code (NLOC), modified lines of code (MLOC) and deleted lines of code (DLOC) (or their total) are much more suitable. Code reused from other projects (RLOC) must also be considered separately.
The LOC paradox arises when LOC is used to compare, say, productivity across different programming languages, as done in [Thaller94]. Because they require many statements where a high-level language requires only one, low-level languages like assembly language appear to be much more productive - which of course they are not, as a glance at the cost and development time usually reveals. The picture is likewise distorted when bug rates are normalized using LOC. Using the object code instead of the source code does not help, because this makes optimizing compilers seem less productive. If such comparisons are needed, probably the best solution is to normalize LOC themselves to assembler LOC. A table with empirical conversion factors can be found in [Thaller94]. Some examples:
| Language | Conversion |
|---|---|
| Assembler | 1 |
| C | 2.5 |
| Ada | 4.5 |
| Smalltalk | 15 |
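The normalization can be sketched as follows in Python, using only the four factors from the table. The sketch assumes that a factor states how many assembler lines correspond to one line of the given language, so that normalized LOC is obtained by multiplication; the project figures are invented.

```python
# Conversion factors from the table above (assembler LOC per source LOC).
ASSEMBLER_EQUIVALENT = {
    "Assembler": 1.0,
    "C": 2.5,
    "Ada": 4.5,
    "Smalltalk": 15.0,
}


def normalized_loc(loc: int, language: str) -> float:
    """Convert LOC in `language` to assembler-equivalent LOC."""
    return loc * ASSEMBLER_EQUIVALENT[language]


def productivity(loc: float, effort_person_months: float) -> float:
    """LOC per person-month."""
    return loc / effort_person_months


# Two hypothetical projects: (language, LOC, effort in person-months).
projects = [("Assembler", 10_000, 20.0), ("Smalltalk", 2_000, 10.0)]

for language, loc, effort in projects:
    raw = productivity(loc, effort)
    norm = productivity(normalized_loc(loc, language), effort)
    print(f"{language}: raw {raw:.0f} LOC/PM, normalized {norm:.0f} LOC/PM")
```

With raw LOC the assembler project looks more productive; after normalization the order is reversed, which is the effect the paradox describes.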
Another disadvantage of LOC is that counting cannot start until implementation is (almost) finished, so it is essentially useless for making estimates. Also, considering only the implementation phase, it is blind to the effort that goes into the other phases, especially those before implementation. This becomes worse as the project grows and implementation becomes shorter relative to the other phases. But the rate at which LOC increases gives at least some feedback on progress during implementation.
However, despite these ambiguities and disadvantages, LOC are useful because of the good tool support and the vast amount of reference material in existence.
The core of the method consists of measuring various aspects of the software's interfaces to users, files and other systems and weighting them to compute a Function Point (FP) rating that can be compared to other projects. This basis seems reasonable for an information processing system and is the only one that is initially available. Since its publication, the method has been refined by introducing additional corrective factors to account for different requirements and application domains. To improve consistency and repeatability, the counting practices have been standardized in the International Function Point Users Group's Counting Practices Manual, Release 4.0, and there is also a method called Mk II FPA. The general problem with FP is that the computed rating is purely abstract and that the procedure requires a lot of experience, which makes it very difficult to automate. Also, it can hide the effort behind the functions, which can be substantial when complicated algorithms are used. Details about FP can be found in [Albrecht83].
As suggested in the paper "Forgotten Aspects of Function Point Analysis" by Paul Goodman and Pam Morris, the decomposition of a system necessary for Function Point Analysis formally reflects the (functional) requirements of a system. When this decomposition is applied to business functions and the systems supporting them, missing and duplicate functions can easily be detected. When building a system, the decomposition can be used as a meaningful scale to monitor progress. The FP rating also roughly reflects the complexity of each logical transaction on the lowest level of decomposition and indicates where this complexity might come from (I/O or computation). Thus it could be used as a tool to assign implementation to the appropriate people. Because there is empirical data which allows conversion from FP to LOC, a two-phase approach could exploit both the new benefits of FP and the vast amount of experience based on LOC.
L = N1 + N2
where N1 and N2 are the total numbers of operators and operands. This could be interpreted as counting statements as in LOC and weighting them by their complexity (number of arguments). Corresponding to N1 and N2, Halstead defines n1 and n2 to be the numbers of unique operators and operands and uses them to estimate program size as
Lg = n1*log2(n1) + n2*log2(n2)
Halstead also defines a measure for complexity, which he calls program volume V, and derives the number of bugs B from it:
V = L*log2(n1+n2), B = V / 3000
Because of their formal nature, many tools are capable of computing these metrics. Their usefulness, however, is not clear (to me).
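The formulas above translate directly into code once a program has been split into operators and operands. The following Python sketch assumes that this split has already been done (deciding what counts as an operator is itself a counting rule and is not addressed here); the function name and the sample tokens are hypothetical.

```python
from math import log2
from typing import Dict, Sequence


def halstead(operators: Sequence[str], operands: Sequence[str]) -> Dict[str, float]:
    """Compute L, Lg, V and B from lists of operator and operand tokens."""
    N1, N2 = len(operators), len(operands)            # total occurrences
    n1, n2 = len(set(operators)), len(set(operands))  # unique tokens

    def n_log_n(n: int) -> float:
        # Guard against log2(0) for empty token lists.
        return n * log2(n) if n > 0 else 0.0

    length = N1 + N2                                  # L  = N1 + N2
    estimated_length = n_log_n(n1) + n_log_n(n2)      # Lg
    volume = length * log2(n1 + n2) if n1 + n2 > 0 else 0.0  # V
    bugs = volume / 3000                              # B
    return {"L": length, "Lg": estimated_length, "V": volume, "B": bugs}


# Tokens of the two statements "z = x + y" and "w = x * z".
operators = ["=", "+", "=", "*"]
operands = ["z", "x", "y", "w", "x", "z"]

for name, value in halstead(operators, operands).items():
    print(f"{name} = {value:.4f}")
```

For this fragment there are 4 operator and 6 operand occurrences, of which 3 and 4 are unique, so L is 10 and V is 10 times log2(7).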
V(G) = G - K + 2
where G is the number of edges and K is the number of nodes of the control-flow graph of a module with a single entry and a single exit. In a structured program (without jumps or gotos) this corresponds to the number of enclosed areas in the control-flow graph increased by 1. A linear sequence of statements does not add to complexity. For every branch with n exits, complexity increases by n-1. For an if-statement n is 2 and the increase in complexity is 1 - no matter whether there is an else or not. Loops also increase the complexity by 1, because every loop can be interpreted as containing an implicit if-statement that decides whether to continue or abandon the loop. McCabe gives 10 as the complexity limit above which redesign of a module is advisable.
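The counting rule described above (a linear sequence adds nothing, each if-statement and each loop adds 1) can be sketched for Python source using the standard `ast` module. This is a deliberately simplified illustration: it does not build the control-flow graph, it looks only at `if`, `for` and `while` statements, and it ignores other sources of branching such as boolean operators or exception handlers. The function name is hypothetical.

```python
import ast


def cyclomatic_complexity(source: str) -> int:
    """Approximate complexity as 1 plus the number of if/for/while statements."""
    tree = ast.parse(source)
    decisions = sum(
        isinstance(node, (ast.If, ast.For, ast.While))
        for node in ast.walk(tree)
    )
    return decisions + 1


sample = '''
def classify(values):
    result = []
    for v in values:
        if v < 0:
            result.append("neg")
        elif v == 0:
            result.append("zero")
        else:
            result.append("pos")
    return result
'''

print(cyclomatic_complexity(sample))
```

In the sample, the loop counts once and the `if`/`elif` chain counts twice (an `elif` is parsed as a nested `if`), while the final `else` adds nothing, in line with the rule that an if-statement increases complexity by 1 whether or not it has an else.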
Se=e1(inflow*outflow) + e2(fan_in*fan_out)
where e1=1 and e2=1 are weights, inflow and outflow are the numbers of arguments and return values received from and passed to other modules, and fan_in and fan_out are the numbers of modules that can call this module and that can be called by it, respectively.
The metric for internal design stress does not as closely model cohesion. It is defined as
Si = c1*CC + c2*DSM + c3*IO
where c1=c3=1 and c2=2.5 are weights, and CC, DSM and IO are the numbers of central calls, of uses or references of complex data types such as pointers and structures, and of accesses to external devices like files, screen, keyboard or printer.
Neither design stress metric is used to judge individual modules in absolute terms. Rather, they serve to identify the modules with the highest complexity within a system, so that these can be redesigned, or given to the most experienced developers and reviewed more thoroughly than others.
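Used in this relative way, the two formulas amount to computing Se and Si per module and sorting. The Python sketch below uses the weights given above as defaults; the module names and counts are invented, and ranking by Se first is just one possible choice.

```python
def external_design_stress(inflow: int, outflow: int, fan_in: int, fan_out: int,
                           e1: float = 1.0, e2: float = 1.0) -> float:
    """Se = e1*(inflow*outflow) + e2*(fan_in*fan_out)."""
    return e1 * (inflow * outflow) + e2 * (fan_in * fan_out)


def internal_design_stress(cc: int, dsm: int, io: int,
                           c1: float = 1.0, c2: float = 2.5, c3: float = 1.0) -> float:
    """Si = c1*CC + c2*DSM + c3*IO."""
    return c1 * cc + c2 * dsm + c3 * io


# Hypothetical modules with their raw counts.
modules = {
    "parser": {"ext": (3, 2, 4, 5), "int": (5, 4, 1)},
    "logger": {"ext": (2, 1, 10, 1), "int": (1, 0, 3)},
}

stress = {
    name: (external_design_stress(*m["ext"]), internal_design_stress(*m["int"]))
    for name, m in modules.items()
}

# Rank modules by external stress, highest first, to pick review candidates.
ranking = sorted(stress, key=lambda name: stress[name][0], reverse=True)
for name in ranking:
    se, si = stress[name]
    print(f"{name}: Se={se}, Si={si}")
```

Only the order of the modules is interpreted here; the absolute numbers carry no meaning outside the system being analyzed.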
The absolute values will vary over development time. The average values for pages/day or NLOC/day can be expected not to vary a lot across different projects and therefore give a more general idea of an organization's productivity. Another way to express productivity is by computing the cost per unit as
where cost is the total cost for the respective phase.
For documents, this would be "corrections per page" and less than 1 is ok.