Link to the published version: IEEE Computer, March 2022


A Collapsing Academy, Part III: Scientometrics and Metric Mania


Hal Berghel

In the golden era of higher education, an institution advertised itself by Latin inscriptions like fiat lux. Now the operative watch phrase is fiat numerus.


When it comes to higher education in the U.S., the 20th century will be known as the century of retro-dorsal, reactive regression. For all of the progress made in science and technology in the past century, we have precious little to show by way of an understanding of the nature and causes of our success. Don't misunderstand me. Science and technology are doing fine by themselves - it's our post facto assessment of how we accomplished what we did that falls short. Nowhere is this more obvious than in the penchant for the absurd quantification of research quality by citation metrics. To re-purpose one of Edsger Dijkstra's observations, this has been a mistake carried through to perfection.

Socrates is reported to have said that an unexamined life is not worth living. In the academy this amounts to continually challenging ourselves with questions like "Are we accomplishing anything of enduring value?" and "Is what we do worth doing?" Instead, we all too often fall into herd-like behavior and substitute relatively meaningless goals like, for example, maximizing citation scores. Unfortunately, this has found acceptance in the academy. At any given moment, it seems like the academy is one bad idea, strategic plan, or SWOT analysis away from total breakdown. When it comes to the assessment of scholarship we've wandered away from legitimate appraisal to bean counting. To further compound the problem, we're even counting the wrong beans.

Trying to quantify the quality of scholarship and research is akin to measuring the intensity of emotion: we may recognize qualitative differences when we see them, but it's impossible to score them objectively, fairly, and without bias. But in the world of the self-absorbed, short-term, upwardly mobile, narrow-focused, criticism-averse modern university administrator, bureaucrat, politician, and corporate leader, bean counting all too often becomes the go-to tool for institutional assessment. A textbook case is the rank-and-yank personnel evaluation program that contributed so significantly to Enron's ascendancy as the corporate paradigm of excellence it enjoys today.

To be fair, such administrative shortcomings are in no small measure a response to myriad external pressures that come with the job. A central administrator has to simultaneously satisfy key stakeholders like trustees and regents, politicians, potential and real benefactors, legislators, and business leaders – any one of whom can throw the proverbial guano into the institutional punchbowl. It goes without saying that faculty and students are above reproach (I say this with tongue firmly placed in cheek). This falls under the rubric of a kind of administrative anorexia wherein administrators discover that while their ability to actually improve their academy is largely beyond their control, they are in total control of the flow of memos, reports, budget requests, meeting schedules, off-campus retreats, strategic plans, SWOT analyses, and evaluation cycles. These latter activities therefore occupy much of their waking moments because – wait for it – they can control them. The watchword is: if you can't figure out a way to achieve institutional greatness, issue a memo, demand a report, or schedule an off-campus retreat. At least that way you can document that you've done something for your salary.


The use of metrics in the evaluation of scholarship appears to have begun in the 1950s when Eugene Garfield created the field of scientometrics – the quantitative measurement of scholarly work – for his Institute for Scientific Information [GARF]. This evolved into the cottage industry of citation indexing of scholarly journals that we know today. To be clear, I take no particular issue with either Garfield's scholarly work or citation indexing in general. Both fall within the sphere of legitimate curiosity-inspired research. Rather, it is the subsequent use to which it has been put that is problematic. What has happened is that the relatively obscure and moderately interesting study of scientometrics has been adopted as a legitimate indicator of quality, relevance, and importance in the evaluation and assessment of faculty scholarship - in some cases on an equal footing with peer review. Some faculty now wear their citation index scores as a badge of honor without any critical assessment of the accuracy, precision, and value of the processes involved. Regrettably, scientometrics is too often considered on a par with more relevant, subjective, and impossible-to-quantify criteria in decision making when it comes to the evaluation of scholarship. The de facto mantra of this metric mania should be: cost savings before insight. There ought to be a placard to that effect on the wall of every provost and dean so prospective faculty realize what they're up against when it's time for promotion and tenure.

In an earlier academy (pre-1980) the assessment of scholarly prowess relied primarily on local peers – colleagues and academic administrators (as opposed to career administrators) – complemented with external peer review from alleged experts. In most subject areas, the grist for this mill was primarily scholarly and published work presented in some form appropriate to the discipline. In that bygone era it was not unusual for all responsible parties to actually read publications and make an informed judgment of quality. This method was not without shortcomings, but it was light years ahead of bean counting.

But those halcyon days are largely past. A byproduct of our current post-modern, digital age is the enthusiastic acceptance of putatively objective measures like metrics to “assist” in scholarly assessments. Many varieties have been developed for such purposes, and several online resources utilize them. Google Scholar and Microsoft Academic Search are well-known websites that build metrics into their indexing services. Computer professionals long enough in the tooth will recall that the idea of web resources that would integrate document search with some sort of metadata analysis dates back at least to the 1990s, when CiteSeer was developed at the NEC Research Institute [LAWRENCE]. While CiteSeer's goal focused primarily on adding citations to the index of bibliographic entries in a central database or digital library, later web tools sought to expand this to all online resources. The slide from listing citations to calculating metrics was transformative, for the latter can be used like a rapier by the unenlightened and self-serving administrator – not to mention narcissistic, competitive faculty. As the saying goes, figures don't lie, but liars figure. What must be remembered is that metrics are disentangled from the underlying scholarship and, because they are de-contextualized, can be manipulated into pretty much any narrative one chooses. This flexibility makes the reliance on metrics inherently subject to abuse. There is a close parallel between metrics and scholarship on the one hand, and movie reviews and motion pictures on the other. But with movies it is understood by all that reviews may be ad hoc and arbitrary and are not always reliable indicators of quality. With scholarship, metrics are assumed to be reliable because they are quantitative in nature, objectively determined, and based on publicly accessible data derived from the Internet. What could possibly be wrong with that?

Clearly, indexing hyperlinks of scholarly resources, analyzing and comparing document metadata, indexing and analyzing non-textual data such as imagery and graphs, analyzing text, and deploying a host of advanced file management techniques that support efficient and effective comprehension are legitimate activities in service to scholarship. These ideas can all be traced back to the pioneering work of Vannevar Bush [BUSH], Theodor Nelson [NELS], and Douglas Engelbart [ENGE], dating back more than a half-century. But the intention of this triumvirate was to facilitate the acquisition of knowledge, not the evaluation and ranking of it. The difference between comprehension and critique is profound and irreconcilable.


So what metrics might be useful to measure scholarly quality? Most academics would agree that any metric worthy of the name should measure impact. That should be our starting point. But how do we do that? Worthy assessment of a scholarly paper, for example, might include a spectrum of reasonable measures that consider a multitude of factors. For example:

  1. Did anyone read it?
  2. Did anyone react to it? (e.g., review, letter to editor)
  3. Did any teacher outside the authors' circle of influence assign it to their students? Is there any reason to suspect that such students reacted positively?
  4. Was it included as a course reading assignment by teachers?
  5. Was it referenced in MS theses or PhD dissertations?
  6. Did anyone support or challenge it in a review?
  7. Was it republished or reprinted?
  9. If it was included in a digital library, how many full-text downloads did it produce?
  9. Do any standard reference sources (encyclopedias, texts) refer to it?
  10. Was it cited? If so, by whom?
    1. How many citations were self-citations?
    2. How many citations were produced by co-authors and students?
    3. How many co-authors were involved?
    4. How many citations were venue-specific?
    5. How many citations were external to the topic?
    6. How many citations were produced by domain knowledge experts?
    7. What percentage of citations were produced by archival (journal) publications?

Note that citation counting is but one measure of many. And like the other measures, it has its own unique set of problems. For one, using citation metrics to measure impact must necessarily include an estimate of the credibility and importance of the citation sources. Obviously, in the special case where the list of co-authors is co-extensive with the sources of the citations, we might question whether the citations mean much. The phrase “write-only publication” has been used to describe such cases. At the other extreme, citations could come from leading scholars in the field who have little or no connection to the authors. In this case, the citations would seem to be a much more accurate measure of impact. Unfortunately, any situation between these two extremes requires interpretation, and any reasonable interpretation would require contextualization, which in turn would require reading the source material and familiarity with the literature. So for the majority of cases, the interpretation of the metrics takes as much effort as the old-timey method of the halcyon days.
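The triage implied by sub-questions 10.1-10.7 is easy to sketch in code, which underscores the point: the tallying is trivial, while gathering and judging the underlying facts is the hard part. In the sketch below every record field (citing authors, venue type, student/expert flags) is a hypothetical assumption for illustration; no real citation database exposes most of them, which is exactly the problem.

```python
# Hypothetical sketch of citation triage per questions 10.1-10.7.
# All record fields are illustrative assumptions, not real database fields.

def classify_citations(paper_authors, citations):
    """Tally citation categories for a paper by a set of authors.

    `citations` is a list of dicts with assumed fields:
    'authors' (set of names), 'venue_type' ('journal' or other),
    'citer_is_student' (bool), 'citer_is_expert' (bool).
    """
    authors = set(paper_authors)
    tally = {"self": 0, "circle": 0, "expert": 0, "other": 0, "journal": 0}
    for c in citations:
        if authors & set(c["authors"]):
            tally["self"] += 1      # 10.1/10.2: self- or co-author citations
        elif c.get("citer_is_student"):
            tally["circle"] += 1    # 10.2: students in the circle of influence
        elif c.get("citer_is_expert"):
            tally["expert"] += 1    # 10.6: domain knowledge experts
        else:
            tally["other"] += 1
        if c.get("venue_type") == "journal":
            tally["journal"] += 1   # 10.7: archival (journal) citations
    return tally
```

Even this toy classifier rests on judgments no automated index can make: who counts as a domain expert, and where the circle of influence ends.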

In addition, special consideration must be given to self-citations, which only measure the esteem in which the author(s) hold their own work. While self-citations are useful for reconstructing research lineage, they are not useful for much else. We may generalize one step beyond self-citation to citations produced by “circles of influence” that include the authors and subsequent generations of their students and postgraduates. These citation threads work much as natural selection does in evolution: long-term success cannot be achieved in the absence of diversity. And, as with evolution, the long-term effects of inbreeding in scholarship are likely to be undesirable and unsustainable. Such was the case with Lysenkoism, geocentric astronomy, cold fusion, and the endless outpouring of cargo cult science [FEYN] that has been enjoying a renaissance for the past fifty years. It is worth noting that this very phenomenon caused Einstein to criticize quantum physics. It wasn't that he thought quantum physics was wrong so much as an incomplete product of a herd mentality. Predictive capacity wasn't enough for Einstein; explanatory capacity was also required. Quantum entanglement without the latter was but spooky action at a distance – akin to vaporware in computing.

You can see where this is going: the use of citations as a measure of quality is necessarily sketchy. Citations are like recommendations. They are only as good as the recommender. This leads to an infinite regress as the burden of credibility passes down step-by-step through succeeding generations of recommendations. What is more, naked metrics do not distinguish between critical citations and supportive ones. So we need another metric for the strength of each endorsement, and so forth. In short, citation metrics are worthless apart from interpretation, and interpretation requires the same level of domain knowledge and understanding as peer review.

And so far, we've only dealt with parameter 10. What about 1-9? Are they equal in value to 10? They are certainly relevant in some circumstances, so why aren't they also quantified? The plain fact is that only parameter 10 – citations – is easy to count and simple to automate. Citations are the lowest-hanging fruit on the tree of assessment - not because they provide any more meaningful assessments than the alternatives, but rather because they are inexpensive to generate. Let's label citation analysis for what it is – scholarly evaluation on the cheap.

The point is that questions like 1-10 are all collectively necessary for any thorough evaluation of scholarship, but no subset is, in itself, sufficient. All require an interpretation cycle that is as challenging as the understanding of the scholarly work that they purport to represent. Failure to appreciate this is the first of two primary causes of the evaluative nonsense facing university faculty today. I'll drive home this point with two extreme examples: the May 15, 2015 issue of Physical Review Letters reporting the combined measurement of the Higgs boson mass at the CERN Large Hadron Collider, which listed 5,154 co-authors [CERN], and a 1996 hoax paper submitted by Alan Sokal, apropos of nothing in particular, to a social science journal [SOKOL].

In the case of the nine-page Hadron paper, the authors/page ratio is 573:1, and the authors/word ratio is roughly 1:1. These are accurate, objective, and quantitative measures, to be sure. But no reasonable person would suggest that they accurately reflect the importance of the paper. One wonders whether the co-author list is due to gatecrashing (co-authors are listed who are largely unfamiliar with the content), pro forma recognition of level of effort, horse trading between co-authors, misuse of authority or power, or the inflation of author lists through group, lab, institutional, or political associations. There is no way to interpret these ratios (metrics) without context. About all that we can conclude from the metrics is that the optics are wrong, because they involve the reader in the politics behind the publication. [PLAWRENCE2]

While the Hadron paper example illustrates the difficulty of apportioning scholarly credit, the Sokal paper highlights the fragility of the peer-review system. [BERSOK] Sokal submitted a paper that was obtuse, unintelligible nonsense. He suspected that inattentive editors might be so biased toward a submission that appeared rigorous and formal that they would fail to verify the relevance of the references. He was correct. Sokal snowed the editors with footwork that purported to integrate weighty topics like relativity, quantum physics, Minkowskian space-time, and Einstein's field equations with philosophical topics such as phenomenology, semiotics, deconstructivism, and hermeneutics. The prospect that quantum physics and relativity could be unified with the social sciences was a temptation too great to be ignored, so the editors published it, and in so doing received the 1996 Ig Nobel Prize “for eagerly publishing research that they could not understand, that the author said was meaningless, and which claimed that reality does not exist.” [IGNOBEL] In its own way, the Sokal hoax confirms our point that there are no shortcuts in the evaluation of scholarship. As the good folks at literacy central have been saying for seventy years, reading is fundamental.

These two examples illustrate just how deceptive citations can be: it isn't at all clear what a citation to the CERN article would imply for any particular author, and in the case of Sokal's paper it isn't clear whether any of the included citations should be assigned any significance. The critical point is that decontextualized citations are vacuous. Even a statistical correlation between the number of citations and quality or importance requires a thoughtful, measured assessment of quality; correlations can be coincidental. Universities' failure to recognize this is one of the two primary causes of the evaluative nonsense facing university faculty today. The other is the elephant in the room: external funding, which we'll defer to another forum!


Metrics may be thought of as quantitative metadata. [CARPENTER] Currently fashionable metrics include the h-index, the hi-k index (h-index over k years), m-values, the Carbon h-index, etc., all of which are assumed to be “useful” measures of research and publication quality in some circles. The h-index, which is defined as the number N of publications that have been cited at least N times in a rank-ordered list, seems to predominate. There is now an online cottage industry associated with the development of research quality metrics: so-called “reputation” sites. One use of reputation sites is the assessment of the alleged impact of publications for purposes of faculty evaluation. Unfortunately, faculty have been encouraged to support these for-profit sites by creating “profiles” that provide metric summaries of their work because they are quantitative, objective, and, of course, free. But this is misguided. Throwing papers down a staircase and assigning weights depending on the number of stairs they have traversed is also quantitative, objective, and free, but it hardly qualifies as a useful tool for meaningful assessment. Encouraging the use of metric ranking services introduces a moral hazard into the evaluation of scholarly work – it encourages the participants to chase the metrics rather than focus on creating durable scholarship. It also blurs the distinction between correlation and causation.
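For concreteness, the rank-ordered definition of the h-index amounts to only a few lines of code, a reminder of just how little information the number compresses. A minimal sketch:

```python
def h_index(citations):
    """Compute the h-index: the largest N such that N publications
    have been cited at least N times each (rank-ordered definition)."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank    # the first `rank` papers all have >= `rank` citations
        else:
            break
    return h

print(h_index([10, 8, 5, 4, 3]))  # 4
print(h_index([25, 8, 5, 3, 3]))  # 3
```

Note what the second example shows: a single heavily cited paper barely moves the h-index, while the distribution of citations across a career, their sources, and their sentiment vanish entirely.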

<<PULL QUOTE: Encouraging the use of metric ranking services introduces a moral hazard into the evaluation of scholarly work – it encourages the participants to chase the metrics rather than focus on creating durable scholarship. >>

The overarching concern with the use of metrics in such evaluation is the open question: what do such measures measure? It must be remembered that correlations are no guarantee of a causal connection; they are just estimates, and there is no standard method for determining confidence in them. [DEUTSCH] Consider the following two statistical “facts”: (1) total revenue generated by arcades correlates with the number of computer science doctorates awarded in the US (98.51%, r=0.985065), and (2) spending on science, space, and technology correlates with suicides by hanging, strangulation, and suffocation at the level of 99.79% (r=0.99789126). [VIGEN] Even if we concede that these correlations are objective and unbiased, we are certainly not committed to accepting them as important, relevant, or useful. That metrics and statistics are not infallible guides to optimal decision making, can be misleading, and are supplements to, not surrogates for, careful reasoning is well documented [HUFF] [ROSLING] [TUFTE] but seemingly ignored by consumers of ranking services.
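Correlation coefficients like those quoted above are trivially computable, and just as trivially meaningless without context. The sketch below computes Pearson's r for two invented series that merely both trend upward; the numbers are illustrative fabrications for the demonstration, not the actual arcade-revenue or doctorate figures from [VIGEN].

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Two made-up yearly series that merely both happen to trend upward:
series_a = [1.2, 1.3, 1.4, 1.6, 1.7, 1.9]       # illustrative only
series_b = [800, 860, 910, 1010, 1060, 1180]    # illustrative only
print(round(pearson_r(series_a, series_b), 3))  # 0.999, with no causal link
```

Any two monotonically increasing series will score near 1.0, which is all the “99.79%” figure above demonstrates.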

This is not to say that evaluative metrics have no use in the academy. For example, metrics can be useful in identifying so-called “predatory” or “deceptive” publishing (e.g., open-access publications that lack rigorous peer review). But, as University of Colorado librarian Jeffrey Beall discovered, such identification can lead to defamation litigation, threats, and persecution. [BASKEN] [STRAUM] So personal risk may attend the use of metrics in this domain.

The use of metrics in the evaluation of scholarship has had its share of detractors. [PLAWRENCE] As Hicks et al. observe, “research evaluations are now routine and reliant on metrics. The problem is that evaluation is led by data rather than judgment.” [HICKS] The so-called Leiden Manifesto for research metrics offers ten principles as a “distillation of best practice in metrics-based research assessment” which are worthy of consideration. These principles bring into focus the challenges of using metrics for this purpose. A parallel effort, the San Francisco Declaration on Research Assessment (DORA), offers equally valuable best-practice suggestions for organizations, publishers, researchers, and funding agencies. [DORA] An online document entitled “5 common myths about the perceived value of journal impact factor (JIF)” is particularly noteworthy. [5MYTHS] A separate 2015 report commissioned by the UK Higher Education Funding Council for England elaborates on the uses and misuses of research metrics and indicators, [WILSDON] and its supplementary correlation analysis revealed that author-based metrics yield different results than the peer review process. [MTCOR] The extensive companion literature review is also noteworthy. [WOUTERS]

So it is not the absence of criticism that accounts for the increasing use of metrics in the evaluation of scholarship, but rather the fact that the criticism is largely ignored by evaluators. This is not an oversight, but an expedient! In order to deal with the political and ethical issues that accompany the misuse of metrics, one would have to challenge some basic tenets of academic capitalism, not the least of which are cost savings and reduced administrative burden.

The moral hazards and perverse incentives that accompany the use of metrics in the academy are accepted within the framework of the need for cost-effective, objective, and uncontentious evaluations. Evaluation of scholarship with metrics may be thought of as a parallel construction to another, hidden, metric-based evaluation based on external funding, where the standards are even less prescriptive but in some disciplines even more important. In a sense the former metric can serve as a façade for the latter – required to give legitimacy to formal process and administrative code. While it would be déclassé to demand in writing that faculty become institutional profit centers, there is nothing to prohibit the demand for scholarly excellence according to standards befitting the discipline. That demand is vague enough to avoid legal scrutiny.

Evaluative metrics should always be taken for what they are: anecdotage, no more, no less. Consider how the greatest scholars in history would have fared were their work assessed on the basis of current metrics. One may get an idea of the absurdity of the current fascination with metrics in some circles by considering how Copernicus, Galileo, Newton, Einstein, Pasteur, and Salk would have reacted to the use of quantitative metrics in the evaluation of their work. The fact that Einstein's work on relativity was passed over by the Nobel committee should not be overlooked. We should regard h-indices, Journal Impact Factors (JIFs), and the like just as we do Facebook likes, numbers of re-tweets, Amazon review scores, page views, numbers of downloads, and Google Scholar citation counts – along with collegiate football rankings, political and opinion polls, customer satisfaction surveys, and sundry performance rankings. They are all just artificial abstractions from data that are subject to wide interpretation, and certainly neither necessary nor sufficient criteria for the evaluation of anything in particular. Even something as basic as statistical measures of central tendency (various means, medians, modes, and norms) must be interpreted and is not immune to misunderstanding and misuse. There is only one way to evaluate scholarship - invest the time to read and understand it.

The issue is not whether the use of metrics in the evaluation of scholarship has kept abreast of new technology, or whether it is sensitive to societal goals regarding diversity, equal opportunity, and objectivity. The emphasis on metrics is a de facto corruption of the reward system, as it encourages counter-productive behavior. So why are we still calculating h-indices and JIFs? The answer is that they are inexpensive and objective surrogates for legitimate peer review. While bibliographic databases [ScienceDirect (Elsevier), Scopus (Elsevier), Mendeley (Elsevier), Web of Science (Clarivate Analytics, formerly Thomson Reuters), Springer Link (Springer), West Law (Thomson Reuters)] and academic social networking sites [ResearchGate (private), Google Scholar (Google), Academia (private)] may be objective and unbiased sources of information, they suffer from the same deficiencies as throwing papers down the staircase or counting words – they circumvent the necessity of personal, measured, intellectually engaged assessment. Anything that draws our attention away from the actual content of a scholarly work is misplaced. Legitimate assessment entails (1) familiarization with the literature on the subject, (2) intense study of the object under review, and (3) assessment and verification of the positions taken. This process cannot be outsourced or automated. The fact that something has been referenced, downloaded, or liked is prima facie irrelevant for purposes of evaluation. It is not that quantitative metrics are not yet the equal of peer review in evaluation, but rather that they will never be the equal of peer review, in just the same way that watching faux news will never be a legitimate substitute for a formal education.

In conclusion, we need to be cynical about the utility and value of metrics – most especially in the evaluation of scholarship. But, to paraphrase Lily Tomlin: no amount of cynicism is ever sufficient. ( )


[GARF] Garfield, E., Citation Indexes for Science: A New Dimension in Documentation through Association of Ideas, Science, Vol 122, Issue 3159, 15 Jul 1955, pp. 108-111 ( )

[LAWRENCE] S. Lawrence, C. Lee Giles and K. Bollacker, "Digital libraries and autonomous citation indexing," Computer , vol. 32, no. 6, pp. 67-71, June 1999.

[BUSH] Bush, V. As We May Think, The Atlantic, July, 1945. ( )

[NELS] Nelson, T., Complex information processing: a file structure for the complex, the changing and the indeterminate, Proceedings of ACM'65 20th National Computer Conference, Assn. Comp. Machinery, pp. 84-100, 1965. ( ) Nelson's account of much of this history has been captured in a video interview by Devon Zuegel ( ).

[ENGE] Engelbart, D., Augmenting Human Intellect: A Conceptual Framework, Air Force Office of Scientific Research, Washington, D.C. and SRI Summary Report AFOSR-3223, SRI Project No. 3578, October 1962. ( )

[FEYN] Feynman, R., Cargo Cult Science - Some remarks on science, pseudoscience and learning how to not fool yourself, 1974 Caltech commencement address. ( )

[CERN] G. Aad et al., “Combined Measurement of the Higgs Boson Mass in pp Collisions at √s = 7 and 8 TeV with the ATLAS and CMS Experiments,” Physical Rev. Letters, vol. 114, no. 19, 2015, pp. 191803-1–191803-33. ( )

[SOKOL] Sokal, A., “Transgressing the boundaries: Towards a transformative hermeneutics of quantum gravity,” Social Text, vol. 46/47, pp. 217–252, Spring/Summer 1996. ( )

[PLAWRENCE2] Lawrence, P., The politics of publication, Nature, v. 422, 20 March 2003. ( )

[BERSOK] Berghel, H., The Sokal Hoax: A 25-Year Retrospective, Computer, 53:3, pp. 67-72, March 2020.

[IGNOBEL] 1996 Ig Nobel Prize Winners, The Annals of Improbable Research, Harvard Computer Society. ( ) (Induction ceremony available online on C-SPAN2 @ ).

[CARPENTER] C. Carpenter, David Cone and Cathy Sarli, “Using Publication Metrics to Highlight Academic Productivity and Research Impact,” Acad Emerg Med., vol. 21, no. 10, pp. 1160–1172, Oct 2014.

[DEUTSCH] Deutsch, D., R. Dror and D. Roth, Statistical Analysis of Summarization Evaluation Metrics Using Resampling Methods, Transactions of the Association for Computational Linguistics, 9: 1132-1146 (2021). ( )

[VIGEN] Vigen, T., Spurious Correlations, Hachette Books, New York (2015).

[HUFF] Huff, D., How to Lie with Statistics, W.W. Norton reissue, New York (1993).

[ROSLING] Rosling, H., A. Rosling Ronnlund and O. Rosling, Factfulness: Ten Reasons We're Wrong About the World – and Why Things are Better Than You Think, Flatiron Books reprint, New York (2020)

[TUFTE] Tufte, E., The Visual Display of Quantitative Information, Graphics Press, 2nd ed., Cheshire, CT (2001)

[BASKEN] Basken, P., Why Beall's List Died – and What It Left Unresolved About Open Access, The Chronicle of Higher Education, September 12, 2017. ( )

[STRAUM] Straumsheim, C., Academic Terrorist, Inside Higher Ed, June 2, 2017. ( )

[PLAWRENCE] Lawrence, P., The mismeasurement of science, Current Biology, 17:15, pp. R583-R585, August 7, 2007. ( )

[HICKS] Hicks, D., P. Wouters, et al., The Leiden Manifesto for research metrics, Nature, v. 520, 23 April 2015, pp. 429-431. ( )

[DORA] San Francisco Declaration on Research Assessment, 2012. ( )

[5MYTHS] Hatch, A. and R. Schmidt, 5 Common Myths About Evaluation, Rethinking Research Assessment: Ideas for Action, DORA, (2020). ( )

[WILSDON] Wilsdon, J. et al, The Metric Tide: Report of the Independent Review of the Role of Metrics in Research Assessment and Management. DOI: 10.13140/RG.2.1.4929.1363 (2015). ( )

[MTCOR] The Metric Tide Correlation analysis of REF2014 scores and metrics (Supplementary Report II to the Independent Review of the Role of Metrics in Research Assessment and Management), HEFCE. DOI: 10.13140/RG.2.1.3362.4162 (2015). ( )

[WOUTERS] Wouters, P. et al, The Metric Tide: Literature review (Supplementary Report I to the Independent Review of the Role of Metrics in Research Assessment and Management), HEFCE. DOI:
10.13140/RG.2.1.5066.3520 (2015) ( )