Yesterday afternoon, I received an exciting press release from PLoS (the text of which is largely similar to this blog post) – article usage data is now available for (nearly) all articles published in any of the open-access PLoS journals! This is a big deal for at least two reasons:
1. authors now have a great incentive to publish in PLoS journals, and
2. this could be a wonderful data set to mine for those interested in both the public and scientific reception of open access publishing
PLoS has published an xls file containing all the article-level metric data, which can be downloaded here. I’ve played around with the data for a few minutes, and a few things became clear rather quickly.
First, though I love the potential for interactivity that is a hallmark of the PLoS journals, it appears to be used very infrequently. Of the 11059 research articles included in this data set, only 651 (~5.9%) have one or more ratings, and only 29 (< 0.3%) have 3 or more ratings. I find this quite surprising – though I’m personally quite reluctant to comment on, leave notes on, or rate a peer-reviewed research article, that’s in large part due to my present academic standing (I don’t even have a Bachelor’s degree yet). What authority do I have to say anything about the research of another lab? This isn’t to say I never have something to say, but if I did, I’d write a post here. Conveniently, PLoS is also keeping track of blog trackbacks and mentions (though I’m sure there are mentions not counted in these metrics). Examined this way, the situation appears a bit less bleak: 1196 of the 11059 included research articles (10.81%) were mentioned in at least one blog post (i.e., have a non-zero number of trackbacks).
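The tallies above are easy to reproduce from the spreadsheet. Here’s a minimal pandas sketch on a toy stand-in for the PLoS data – the column names (`ratings`, `trackbacks`) and the four-row DataFrame are my own assumptions, not the actual headers or contents of the released file:

```python
import pandas as pd

# Toy stand-in for the PLoS spreadsheet: four articles with varying engagement.
df = pd.DataFrame({
    "doi": ["10.1371/a", "10.1371/b", "10.1371/c", "10.1371/d"],
    "ratings": [0, 1, 3, 0],       # number of user ratings on the article page
    "trackbacks": [2, 0, 1, 0],    # blog trackbacks recorded by PLoS
})

n = len(df)
rated = (df["ratings"] >= 1).sum()        # articles with at least one rating
well_rated = (df["ratings"] >= 3).sum()   # articles with three or more ratings
blogged = (df["trackbacks"] >= 1).sum()   # articles mentioned in at least one blog post

print(f"rated: {rated / n:.1%}, 3+ ratings: {well_rated / n:.1%}, "
      f"blogged: {blogged / n:.1%}")
```

Against the real file, the same three boolean filters would give the 651, 29, and 1196 counts reported above (modulo whatever the actual column names turn out to be).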
The second obvious feature of the data set is that, as one would expect, a small number of articles are viewed an enormous number of times (the 70 most-viewed articles – 0.63% of all research articles – account for over 10% of all article views). A notable example is one of this year’s media darlings published in PLoS One, “Complete Primate Skeleton…”, with over 56000 views in just 2.5 months (the data go through 7/31/09). Papers like this contrast starkly with the mean, which is around 2100 views per research article. It’s likely that similar reasons explain the large number of views for many of the other top articles. In the future, I’ll try comparing number of views to time since publication (a factor PLoS readily acknowledges skews these metrics) – perhaps this number simply reflects a trend toward publishing in open access journals in recent years. PLoS is doing a great job of tracking blog references to papers, but perhaps there could be a similar metric for popular media mentions (Google News results/queries, say?). This would be an interesting metric for understanding (and maybe factoring out) the role of mass media in article popularity. If researchers want to gauge the influence of their article within their scientific community, raw page views, especially when driven by media outlets, may not be the best metric.
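This kind of concentration is a one-liner to measure: sort the view counts, take the top slice, and compare its sum to the total. A sketch with made-up, deliberately skewed view counts (the real calculation would use the views column of the full 11059-row data set, with a top-70 slice rather than top-2):

```python
import numpy as np

# Skewed toy view counts: one blockbuster, a few popular papers, many modest ones.
views = np.array([56000, 9000, 4000, 2500, 2100, 1800, 1500, 1200, 900, 500])

views_sorted = np.sort(views)[::-1]   # descending order
top_k = 2                             # analogue of the top-70 slice in the real data
top_share = views_sorted[:top_k].sum() / views.sum()

print(f"top {top_k} articles: {top_share:.1%} of all views; "
      f"mean = {views.mean():.0f} views/article")
```

With distributions this skewed, the mean is pulled well above the typical article, which is exactly why a median (or views normalized by time since publication) might be the more honest summary.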
Number of citations found in PubMed Central vs. those found in CrossRef, for each article (n = 11059)
Finally, it becomes clear that the various citation-tracking services (specifically CrossRef, PubMed Central, and Scopus) return quite disparate results. Even if we ignore Scopus for now (the folks at PLoS acknowledge an issue with their database in the ALM FAQ), we can see that there is generally not a 1:1 relationship between the number of citations reported by PubMed Central and the number reported by CrossRef. I’m sure someone out there has one or more good reasons why this is the case – but it does seem a bit strange.
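Quantifying that disagreement is straightforward: count how often the two services match exactly, and check how correlated they are overall. Another toy sketch – the column names (`pmc`, `crossref`) and the five pairs of counts are invented for illustration, not drawn from the actual spreadsheet:

```python
import pandas as pd

# Toy citation counts for five hypothetical articles, as reported by two services.
cites = pd.DataFrame({
    "pmc":      [0, 3, 5, 10, 2],
    "crossref": [0, 4, 5, 14, 1],
})

agree = (cites["pmc"] == cites["crossref"]).mean()  # fraction with identical counts
corr = cites["pmc"].corr(cites["crossref"])         # Pearson correlation of the two

print(f"exact agreement: {agree:.0%}, Pearson r = {corr:.2f}")
```

A pattern like this – high correlation but low exact agreement – would suggest the services index overlapping but not identical sets of citing literature, rather than one of them simply being broken.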
So what does this early data tell us about open-access and community-focused publishing? Most importantly, despite all the encouragement on the part of PLoS and the blogging community, it appears that the enterprise of bringing an interactive discussion to an article (rather than having the discussion take place in the comments of blog posts) has been largely unsuccessful. I find this at least a bit baffling – it’s not at all uncommon for readers (often researchers) to post comments on blog posts discussing peer-reviewed research using their real names. If the trackback feature is working properly, anyone reading the original research article is just a click away from seeing this feedback. Why the willingness to post a comment in one place, but not the other?

Or take the 5-star rating system – a quick way to post a very general reaction to a paper. The mean of the average ratings of the 651 rated papers is 4.16 – should this be taken as an indicator of very high average quality of research published in PLoS journals, or as a signal of an inherent selection bias in who rates articles? I know that I personally have a very tough time rating anything 1/5 (even my iTunes library reflects this – something like 1000 of 6000 songs are rated 4+, and the remainder are unrated).

The problem here appears to be the lack of anonymity, and I think there’s a simple solution. Currently, when you look at a rated article (for example), you can see the user who rated it. It makes complete sense to require a user to be registered to rate an article, but s/he should also be given the option of anonymity. Requiring sign-in would still be useful for filtering out spam ratings (ensuring only one rating per user, etc.), but anonymity would allow users to post more honest reactions to articles. Though it wouldn’t be infeasible to institute a similar system for comments or text notes, I think those should remain tied to a particular user.
Though these data reveal some initial reluctance within the community to adopt this particular means of scientific dialogue, I’m optimistic that as the MySpace and Facebook generations grow into competent graduate students and scientists, we’ll have a more open and interactive culture of science. It’s absolutely fantastic that PLoS released this data, and I hope in the near future they release increasingly detailed metrics that can be further mined for interesting usage and publication patterns. I really do think that the future of the scientific publishing “industry” will lie in managing and profiting from usage data rather than from scientific discourse. The folks at Mendeley seem to understand this, and are certainly on their way to having a very impressive data set to work with. Knowing which articles tend to cluster in researchers’ libraries (and in the reference sections of articles) would allow for an iTunes Genius-like algorithm for suggesting papers – an absolutely killer feature many labs (and hopefully institutional libraries) would certainly be willing to pay for.
For further information about the PLoS Article-Level Metrics and the prospect of community-oriented science communication, check out the (non-exhaustive) set of links/articles below. As always, I’d love any feedback you might have, so feel free to comment or email any thoughts/ideas/criticisms.
PLoS Journals – measuring impact where it matters – this post is especially interesting
Improving Science Through Online Commentary (Eagleman & Holcombe, 2003) [PDF]
PLoS Takes a Giant Leap Toward Science 2.0
EDIT: new links/posts/mentions
PLoS Article-Level Metrics Home
A Blog Around the Clock: Article-Level Metrics at PLoS