Reader Comments

Post a new comment on this article

Dropped data, inadequate controls, inappropriate analysis

Posted by Phil_Davis on 19 Feb 2016 at 17:01 GMT

As a reviewer of this manuscript, I couldn't help noticing that the authors altered the size of their dataset, from 34,940 papers reported in the manuscript to 31,216 in the published paper, a removal of nearly 4,000 data points. As a result, the main results of their paper change drastically, although they still, unsurprisingly, support their main claim that Academia.edu boosts citations. There is no explanation of why 11% of their dataset was deleted, nor did the editor feel that such a change warranted a justification or re-review of the manuscript.

In spite of this editorial oversight, I have two major reservations about this paper:

1. Inappropriate Control Group. The authors go to great lengths to select an appropriate control group (Off-Academia) that is similar to their study group (On-Academia). While I am confident that both groups of papers are original research articles, they do not show similar online availability: from Table 5, 66.5% of On-Academia papers were found freely available from other websites, compared to 36.9% of Off-Academia papers. Indeed, the authors note this large difference themselves: "This indicates that there may be some self-selection by availability in our data." (p. 11)

Unfortunately, the authors do not explore self-selection further in their paper, as they suggest they will at the end of their Introduction. Without attempting to investigate this source of bias, the authors are unable to disambiguate the effect of Academia.edu from a general online availability bias. Put another way, the strong causal claim they make about Academia.edu may simply be an artifact of statistical confounding.
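The confounding worry can be illustrated with a minimal simulation. All numbers below are made up except the 66.5%/36.9% availability rates from Table 5; the variable names and the zero true effect are my own assumptions, not claims about the actual data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Hypothetical setup: being freely available elsewhere ("online") raises
# citations, and On-Academia papers are more often online (66.5% vs. 36.9%,
# the Table 5 rates). The true Academia effect here is zero by construction.
on_academia = rng.random(n) < 0.5
online = rng.random(n) < np.where(on_academia, 0.665, 0.369)
citations = 5.0 + 3.0 * online + rng.normal(0.0, 1.0, n)  # no Academia term

# A naive On- vs. Off-Academia comparison picks up the availability effect.
naive_gap = citations[on_academia].mean() - citations[~on_academia].mean()

# Comparing within availability strata removes the spurious "effect".
adj_gap = np.mean([citations[on_academia & (online == v)].mean()
                   - citations[~on_academia & (online == v)].mean()
                   for v in (True, False)])
print(round(naive_gap, 2), round(adj_gap, 2))
```

The naive gap is positive even though posting to the repository does nothing in this toy world; only the within-stratum comparison recovers the true null.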

2. Inappropriate Access Variable. The authors search for all freely available sources of a paper using Google Scholar but classify their results dichotomously: either the paper is found elsewhere or it isn't. Because of this rudimentary classification, they can only compare On-Academia versus Off-Academia and ignore the context of availability. For example, if they counted the number of available copies, they could investigate whether more copies predict more citations. If they noted the source of the copies, they could measure the relative effect of each source. For instance, we would know the magnitude of the effect that PubMed, arXiv, journal websites, personal/departmental websites, and competing services like ResearchGate have compared to Academia.edu. Without a more discriminating access variable, we are left with merely an On/Off indicator, which is not very helpful for understanding how discoverability relates to citation performance, the intended purpose of this paper.

In sum, without addressing this bias and refining how access is measured, we are left with a paper that may be valuable only as a tool for promoting and marketing a single commercial repository, and that adds little to our understanding of access to the scientific literature.

No competing interests declared.

RE: Dropped data, inadequate controls, inappropriate analysis

MichaelMortonson replied to Phil_Davis on 22 Feb 2016 at 21:36 GMT

As one of the authors of this paper, I first want to thank Phil and the other reviewers for their time and their feedback on the manuscript. All of the issues raised in Phil's comment were addressed already during the review process, but I'll summarize our responses here for the benefit of other readers.

It's true that the sample of papers was larger in an earlier version. We have not tried to hide this change from the editors or reviewers; a detailed list of the changes to the manuscript and our reasons for making them were submitted to the editor, and we expected that our reviewers would have access to this information as well. Briefly, in the process of making revisions requested by the reviewers, we discovered a couple of technical errors that had resulted in the inclusion of some papers that violated the selection criteria (mainly the requirement that publication year and upload year be the same for papers on Academia). Fixing these errors reduced the size of the sample by about 11% (3724 papers). As we were correcting a software bug to make the analysis consistent with its description in the original manuscript, rather than altering the methodology, it did not seem necessary or appropriate to describe this change in the paper itself.

In principle, the results could have changed "drastically" after correcting the sample selection, but we found that the changes were in fact quite small and did not affect the main conclusions. For example, a linear model prediction for the Academia citation advantage shifted from 50% to 51% for 3 years after publication, and from 73% to 69% after 5 years. The nature of the sample selection errors was such that the 3724 papers that were mistakenly included in the original analysis were essentially a random selection of papers as far as we can tell, so it is not surprising that the results were affected little by their removal.
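For intuition about why removing an essentially random 11% of papers moves the estimates so little, here is a toy sketch. The ~50% effect size and the log-citation model are illustrative assumptions for this sketch, not our actual analysis:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 34_940  # the sample size reported for the earlier manuscript

# Hypothetical log-citation model with a true ~50% Academia advantage.
on_academia = rng.random(n) < 0.5
log_cites = 1.0 + np.log(1.5) * on_academia + rng.normal(0.0, 1.0, n)

def advantage(mask):
    # Mean log-citation gap, expressed as a percent citation advantage.
    gap = (log_cites[mask & on_academia].mean()
           - log_cites[mask & ~on_academia].mean())
    return 100.0 * (np.exp(gap) - 1.0)

full = advantage(np.ones(n, dtype=bool))
keep = rng.random(n) > 0.11  # drop ~11% of papers uniformly at random
trimmed = advantage(keep)
print(round(full, 1), round(trimmed, 1))
```

Because the dropped papers are a random subset, the full-sample and trimmed estimates differ by far less than the estimates' own sampling error, which matches the small 50%-to-51% and 73%-to-69% shifts we observed.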

As a general note, both the original and updated versions of the code, data, and paper are publicly available at the link provided on the main article page. Anyone who is curious about how the results changed can check for themselves, and all the tools are there for reproducing the results in the paper and performing variations on the analysis.

The two "major reservations" in Phil's comment are identical to the criticisms that he brought to our attention in his review of the submitted manuscript. We have already carefully considered these points and provided our responses to the editor and reviewers as part of the normal peer review process; the content of these responses is summarized below.

1. The observation that articles posted to Academia.edu are more likely to be available elsewhere online is definitely interesting and, as the reviewer points out, we noted this property of the sample in the article. We disagree that this discrepancy biases our estimate of the On-Academia effect, as the online availability effect is explicitly controlled for in the regression by including a variable that indicates whether or not a paper is online elsewhere.
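The logic of that control can be sketched with a toy regression. The coefficients, error model, and sample construction below are invented for illustration; only the availability rates echo Table 5:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 31_216  # the published sample size

# Hypothetical data in which both posting to Academia and being online
# elsewhere raise (log) citations; availability rates follow Table 5.
on_academia = (rng.random(n) < 0.5).astype(float)
online = (rng.random(n) < np.where(on_academia == 1.0, 0.665, 0.369)).astype(float)
y = 1.0 + 0.41 * on_academia + 0.80 * online + rng.normal(0.0, 1.0, n)

# OLS with an intercept plus dummies for both variables: including the
# "online" indicator absorbs the availability effect, so the Academia
# coefficient is estimated net of it.
X = np.column_stack([np.ones(n), on_academia, online])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta, 2))
```

With the availability dummy in the design matrix, the fitted Academia coefficient recovers its true value despite the correlation between the two regressors.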

2. The reviewer believes that a binary available/not-available variable is inadequate for measuring online accessibility. We agree that a more refined metric would be ideal: for example, the count of different sites with a full-text version of the paper, or a battery of dummy variables for different sources such as arXiv, SSRN, department home pages, etc. But such a metric would be very noisy in practice: for example, determining when the full-text version was posted to each of many sites is difficult or even impossible. As such, it could cause attenuation bias (i.e., bias towards zero) in our estimates. Furthermore, we are not aware of any open access study that provides such a detailed access metric. Finally, we point out that we find a 69% increase in citations after 5 years even among articles that are not online elsewhere. These articles are not affected by any potential problems with our "online" classification; their values would not change if we used a more refined online variable, since they would still be assigned a value of zero for this variable.
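The attenuation-bias point can be illustrated with a toy simulation. The copy counts, noise levels, and the classical measurement-error assumption below are all invented for the sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000

# Hypothetical: citations depend on the true number of free copies of a
# paper, but we can only observe a noisy count (classical measurement error).
true_copies = rng.poisson(2.0, n).astype(float)
citations = 0.5 * true_copies + rng.normal(0.0, 1.0, n)
noisy_copies = true_copies + rng.normal(0.0, 2.0, n)  # mismeasured regressor

def ols_slope(x, y):
    # Simple-regression slope: cov(x, y) / var(x).
    x, y = x - x.mean(), y - y.mean()
    return (x @ y) / (x @ x)

clean = ols_slope(true_copies, citations)    # near the true 0.5
biased = ols_slope(noisy_copies, citations)  # attenuated toward zero
print(round(clean, 2), round(biased, 2))
```

The slope on the noisy count is shrunk by the factor var(true)/(var(true) + var(noise)), which is why we preferred a binary indicator we could measure reliably over a richer but error-ridden count.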

Competing interests declared: I am an author of this paper.