Some time ago, we discussed the need for a Journal of Negative Results in Software Engineering. Well, today, we’re not yet announcing the creation of such a journal, but rather the publication of a Special Issue on “Negative Results in Empirical SE” in the Empirical Software Engineering Journal, which is a more realistic first step to gauge the community’s interest in the topic and see what kind of negative results get contributed.
The foreword we (Neil Ernst, Richard Paige and myself, as the editors of this special issue) wrote for this special issue is now available online but in case you hit a paywall, you can also read it below:
Importance of negative results in Software Engineering
First, what do we mean by negative results? Negative or null results (that is, results which fail to show an effect) are all too uncommon in the published literature for many reasons, including publication bias and self-selection effects. Such results are nevertheless important in showing the research directions that did not pay off. In particular, “replication cannot be meaningful without the potential acknowledgment of failed replications”.
We believe negative results are especially important in software engineering, in order to firmly embrace the nature of experimentation in software research, just like most of us believe industry should do. This means scientific inquiry that is conducted along Lean Startup principles: start small, use validated learning and be prepared to ‘pivot’, or change course, if the learning outcome was negative. In this context, negative results are, given their methodology, failed approaches that are just as useful as successful approaches: they point out what hasn’t worked, in order to redirect our collective scientific efforts. As Walter Tichy writes, “Negative results, if trustworthy, are extremely important for narrowing down the search space. They eliminate useless hypotheses and thus reorient and speed up the search for better approaches.”
The software industry has long had a strong belief in the power of intuition: that productivity can vary by 10x among programmers, or that problems found doing requirements analysis cost dramatically less to fix than code bugs, among many others (see Glass and Bossavit for more details on this folklore). Part of our job as researchers must be to give good empirical weight for, or against, commonly held beliefs. Negative results are an important part of this discussion.
Lack of negative results
Publication of negative results is rare, even more so in software engineering where, in contrast to life sciences, there are no specific tracks or journals to present such results. Data that led to negative results in software engineering is very rarely shared. Even open datasets are uncommon in software engineering, despite emerging efforts such as the PROMISE repository (http://openscience.us/repo/). Many researchers remain reliant on single sources like GitHub for a limited set of artefacts (predominantly code). By contrast, other fields in computing emphasise, as a community effort, the production and maintenance of open datasets, e.g., in machine learning (see, for example, the UCI machine learning repository at https://archive.ics.uci.edu/ml/). It seems that the software engineering community needs to consider developing a culture that accepts negative results as something that is as important to share as novel inventions.
How does this influence the papers we targeted for this issue? First and foremost, each of the papers selected adheres to the highest standards of the journal. Our view is that published negative results cannot sacrifice empirical rigour. A negative result due primarily to misaligned expectations or to a lack of statistical power (small samples) is not a negative result, but rather a poorly designed experiment. The negative result should be a result of a lack of effect, not lack of methodological rigour.
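To make the point about statistical power concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration and is not taken from any paper in the issue: the function name `estimated_power`, a true effect of 0.5 standard deviations, normally distributed measurements with known unit variance, and a two-sided z-test at the 0.05 level.

```python
import math
import random


def estimated_power(n_per_group, true_effect=0.5, trials=2000, seed=1):
    """Estimate how often an experiment detects an effect that really exists.

    Hypothetical setup: control ~ N(0, 1), treated ~ N(true_effect, 1),
    compared with a two-sided z-test (sigma assumed known) at alpha = 0.05.
    """
    rng = random.Random(seed)
    # Standard error of the difference between two group means.
    se = math.sqrt(2.0 / n_per_group)
    detected = 0
    for _ in range(trials):
        control = sum(rng.gauss(0.0, 1.0) for _ in range(n_per_group)) / n_per_group
        treated = sum(rng.gauss(true_effect, 1.0) for _ in range(n_per_group)) / n_per_group
        if abs((treated - control) / se) > 1.96:
            detected += 1
    return detected / trials


if __name__ == "__main__":
    for n in (10, 100):
        print(f"N={n} per group: estimated power = {estimated_power(n):.2f}")
```

Under these assumptions the analytic power is roughly 0.2 with 10 subjects per group and above 0.9 with 100, so a "no significant difference" outcome from the small study says very little about whether the effect exists.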
Indeed, negative results often come because the investigation is so well conducted. For instance, insisting on sufficient power for an experiment (for example, by choosing larger numbers of subjects) can mean that the noisy result that confirmed your suspicions when N=10 (a ‘positive’ result) disappears when N=100, simply because there is always a random chance you will reject the null despite the true effect not existing. Statistician Andrew Gelman compares this to trying to measure the weight of a feather with a bathroom scale while the feather is in the pouch of a jumping kangaroo.
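The following sketch illustrates why a ‘positive’ result from a small sample can evaporate in a larger one. It is only an illustration under stated assumptions: the function name `simulate_null` is hypothetical, both groups are drawn from the same standard normal population (so no true effect exists), and significance is judged with a two-sided z-test at the 0.05 level.

```python
import math
import random


def simulate_null(n_per_group, trials=2000, seed=0):
    """Repeatedly compare two groups drawn from the SAME N(0, 1) population.

    Returns (false_positive_rate, mean_abs_diff_among_significant_runs).
    """
    rng = random.Random(seed)
    # Standard error of the difference between two group means (sigma known = 1).
    se = math.sqrt(2.0 / n_per_group)
    significant_diffs = []
    for _ in range(trials):
        mean_a = sum(rng.gauss(0.0, 1.0) for _ in range(n_per_group)) / n_per_group
        mean_b = sum(rng.gauss(0.0, 1.0) for _ in range(n_per_group)) / n_per_group
        diff = mean_a - mean_b
        if abs(diff / se) > 1.96:  # "significant" even though no effect exists
            significant_diffs.append(abs(diff))
    rate = len(significant_diffs) / trials
    if significant_diffs:
        mean_sig = sum(significant_diffs) / len(significant_diffs)
    else:
        mean_sig = float("nan")
    return rate, mean_sig


if __name__ == "__main__":
    for n in (10, 100):
        rate, mean_sig = simulate_null(n)
        print(f"N={n}: false-positive rate={rate:.3f}, "
              f"mean |difference| among significant runs={mean_sig:.2f}")
```

The false-positive rate stays near the nominal 5% at both sample sizes, but a run can only reach significance at N=10 if the observed difference exceeds about 0.88 standard deviations, against about 0.28 at N=100. The small study's spurious findings therefore look like large effects, which is exactly the kind of result a better-powered replication fails to reproduce.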
Summary of Papers
The six papers presented in this special issue cover a wide array of software engineering topics, and tend to tackle problems for which experimental approaches are more applicable. The topics range from experiments related to logging, the validity of metrics, and sentiment analysis through to productivity analysis, effort estimation and revisiting the fragile base class problem. We do not have any submissions reporting negative results using qualitative research approaches; we think this is partly because there is no such thing in a qualitative framework (vs. the erroneous but often-used “p < 0.05” approach, where a ‘negative’ result is when p > 0.05, which is not at all what p-values are telling you). The papers are as follows:
The first paper in this special issue, titled “Empirical Evaluation of the Effects of Experience on Code Quality and Programmer Productivity: An Exploratory Study”, by Dieste et al., examines the claim that expertise improves code quality and programming productivity. The authors conducted an experimental task with both industry and academia (students), focusing on an Iterative Test-Last assignment (developing tests and production code in parallel). They measured the experience of each subject, and assessed their performance on the task. Their results show that there is no evidence that experience is a significant factor in either quality or productivity.
Serebrenik et al. then consider what sentiment analysis tools are bringing to software engineering research, in “On Negative Results when Using Sentiment Analysis Tools for Software Engineering Research”. Sentiment analysis tools claim to measure the positive (e.g., “great code”) or negative opinions (e.g., “this code is not very good”) expressed in text corpora. Since researchers are now using these tools to evaluate software artefacts, e.g. in discussions of pull requests on GitHub, the authors conducted an experiment to understand how well these tools worked, and whether they matched what developers actually thought of the artefact. They conclude that a) the tools only weakly agree with each other, and b) replications of existing studies that use these tools are highly sensitive to the choice of tool.
In “On the Correlation between Size and Metric Validity”, Gil and Lalouche investigate the correlation between software metrics and code size (as measured in Source Lines of Code (SLOC)). Given the importance of software metrics in the field, and in a nice example of replications providing multiple data points, similar studies were conducted earlier (see [8, 6]). This study adds more project data, but also examines the impact of both size and a given metric on external features, such as quality. They conclude that “one has to either decouple the metrics from size or show that the external features themselves are not correlated with size”, since there is a strong correlation between metric and size.
Saban et al., in “Fragile Base-class Problem, Problem?”, explore the well-accepted assumption that “misusing” inheritance and composition in object-oriented programming (e.g. in the context of framework reuse where these two mechanisms are extensively used) negatively affects software maintenance, since changes on the superclasses might cause faults in the subclasses (which therefore become more “fragile”). After a quantitative analysis, the authors conclude that fragile classes are not more fault prone than other classes. An additional qualitative study shows that the detected faults in those classes also were not caused by fragile base classes, suggesting that the fragile class problem may not be as problematic as previously thought in the literature.
Our fifth paper, “Negative Results for Software Effort Estimation”, by Menzies et al., seeks to assess whether new software effort estimation methods are actually better than older COCOMO-based methods initially proposed many years ago. Menzies et al. show that Boehm’s 2000 COCOMO II model works as well as (or better than) all approaches proposed since COCOMO II, for projects with enough data available to enable parametric estimation based on the 23 COCOMO attributes characterizing a software project. In short, as of 2016, new innovations in effort estimation have not superseded parametric estimation.
Finally, in “To Log, or Not To Log: Using Heuristics to Identify Mandatory Log Events – A Controlled Experiment”, King et al. highlight the challenge of deciding the Mandatory Log Events (user activities that must be logged to enable forensics) to track for a given security analysis goal. The authors conducted a controlled experiment to evaluate the effectiveness of three different methods (standards-driven, resource-driven and heuristics-driven) to perform this task and show that none of them is significantly better than the others. This highlights the need for additional research in this area.
(Some papers are not yet online; I’ll add the links when available. To get a free copy of a paper, you can also write directly to its authors, or ping me if necessary.)
We would like to acknowledge the rigour and dedication of the many reviewers of this special issue. One observation that we would make as editors of this special issue is that reviewing negative results is hard: not only are negative results papers different from the norm, they require a different way of thinking and critiquing. In many cases, reviewing the papers was a collaborative effort between reviewer and editor, to try to ensure that the negative result was as clearly expressed as possible. We thank our open-minded reviewers for their support in this. We are also thankful to the Empirical Software Engineering Editors in Chief, Lionel Briand and Thomas Zimmermann, for their support, help and patience throughout the process of preparing this special issue.
- Barry Boehm, Chris Abts, A. Winsor Brown, Sunita Chulani, Bradford K. Clark, Ellis Horowitz, Ray Madachy, Donald J. Reifer, and Bert Steece. “Software Cost Estimation with COCOMO II”, Prentice-Hall (2000).
- Bossavit, Laurent, “The Leprechauns of Software Engineering: How folklore turns into fact and what to do about it”, LeanPub (2015).
- Ferguson, Christopher and Moritz Heene, “A Vast Graveyard of Undead Theories: Publication Bias and Psychological Science’s Aversion to the Null”, Perspectives on Psychological Science, 7(6) pp. 555-561 doi: 10.1177/1745691612459059 (2012)
- Andrew Gelman and J. Carlin. “Beyond power calculations: Assessing type S (sign) and type M (magnitude) errors”. Perspectives on Psychological Science, 9:641–651, (2014)
- Glass, Robert L., “Facts and fallacies of software engineering”, Addison-Wesley (2012).
- Herraiz, I., Gonzalez-Barahona, J.M., Robles, G.: Towards a theoretical model for software growth. In: Proceedings of the Fourth International Workshop on Mining Software Repositories, (2007)
- Ries, Eric, “The Lean Startup”, Crown Publishing Group (2014).
- Shepperd, M.: A critique of cyclomatic complexity as a software metric. Software Engineering Journal 3(2), 30–36 (1988)
- Tichy, Walter F., “Hints for Reviewing Empirical Work in Software Engineering”, Empirical Software Engineering, 5, 309–312 (2000).
Initial call for papers
Even though the issue has now appeared online, I think it’s still worth keeping the call for papers below, since it may help anybody interested in organizing some other kind of issue or event around these same ideas (and we hope somebody does, so that our effort does not remain a once-in-a-lifetime initiative!).
Call for Papers — EMSE Special Issue on “Negative Results in Empirical Software Engineering”
Editors of the Special Issue:
Richard Paige (University of York) – Jordi Cabot (ICREA – Universitat Oberta de Catalunya) – Neil Ernst (Software Engineering Institute)
Description of the Special Issue:
Negative or null results — that is, results which fail to show an effect — are all too uncommon in the published literature, for many reasons, including publication bias and self-selection effects. And yet, particularly in engineering disciplines, such results are important in showing the paths which did not pay off. In particular, “replication cannot be meaningful without the potential acknowledgment of failed replications” [LAEW, FH]. For example, did your controlled experiment on the value of dual monitors in pair programming not show an improvement over single monitors? Even if negative, results obtained are still valuable when they are either not obvious or disprove widely accepted wisdom. As Walter Tichy writes, “Negative results, if trustworthy, are extremely important for narrowing down the search space. They eliminate useless hypotheses and thus reorient and speed up the search for better approaches.” [Tic]
In this special issue, we seek papers that report on negative results. We seek negative results for all types of software engineering research in any empirical approach (qualitative, quantitative, case study, experiment, etc.). Evaluation criteria will be based on:
- The quality of the reporting.
- The significance of the non-replication or negative result.
- Underlying methodological rigour. For example, a negative result due primarily to misaligned expectations or due to lack of statistical power (small samples) is not a good paper. The negative result should be a result of a lack of effect, not lack of methodological rigour.
Additionally, we seek reviewers who would, besides reviewing submissions for the special issue, also be willing to provide an evaluation of negative results based on their experience of reviewing submissions at major venues (such as ICSE, EMSE, etc.). The reviewers’ evaluations of the impact of a possible ‘positivity bias’ will be compiled into a report included in the special issue (either in the editors’ introduction or as a separate paper). Such evaluations should be no more than 2 pages and describe either one specific experience or general perceptions of the problem.
Deadline for submission: October 7, 2015
Papers should be submitted through the Empirical Software Engineering website (http://www.editorialmanager.com/emse/). Choose “SI:Negative” (full-paper) or “SI:NegativeComment” (short comment/experience report with reviewing negative results) as the Article Type.
If you have questions/comments or would like to volunteer to be a reviewer of the papers, please contact the guest editors.
- [LAEW] Jonathan Lung, Jorge Aranda, Steve M. Easterbrook, Gregory V. Wilson: On the difficulty of replicating human subjects studies in software engineering. ICSE 2008: 191-200
- [FH] Ferguson, Christopher and Moritz Heene, “A Vast Graveyard of Undead Theories: Publication Bias and Psychological Science’s Aversion to the Null”, Perspectives on Psychological Science, November 2012 vol. 7 no. 6 555-561 doi: 10.1177/1745691612459059
- [FNR] The Forum for Negative Results http://page.mi.fu-berlin.de/prechelt/fnr/
- [Pre] Prechelt, Lutz, “Why We Need An Explicit Forum For Negative Results”, Journal of Universal Computer Science, vol. 3, no. 9 (1997), 1074-1083
- [Tic] Tichy, Walter F., “Hints for Reviewing Empirical Work in Software Engineering”, Empirical Software Engineering, 5, 309–312, 2000