The blog

2024.06.12: Bibliography keys: It's as easy as [1], [2], [3]. #bibliographies #citations #bibtex #votemanipulation #paperwriting

The bibliography at the end of the paper "New directions in cryptography" has 14 entries, labeled "[1]" through "[14]". The numbers "1" through "14" are examples of bibliography keys. I'll abbreviate "bibliography key" as "bibkey" or just "key".

One sentence from the paper refers to what happened after "the invention of the telegraph [2, p. 191]". This refers to the bibliography entry having bibkey 2. A reader who wants to follow the citation looks at the bibliography and immediately finds "[2] D. Kahn. The Codebreakers, The Story of Secret Writing. New York: Macmillan, 1967." This is enough information for the reader to track down the book in question (although the bibliography should have had more information to help with this process, perhaps most importantly an ISBN) and to check what page 191 of the book says about this topic. This can be a fast process for readers who happen to have Kahn's book on a nearby bookshelf.

Bibkeys don't have to be numeric. Almost all of my papers use numeric bibkeys, but of course this statistic is biased by my own preferences and by the norms in areas of science where I publish. Sometimes journals require other styles for bibkeys. Commonly used tools for managing bibliographies (e.g., BibTeX and BibLaTeX) support many different styles. Another paper citing the Kahn book might cite it as "[Kahn 1967]" or "[Kahn67]" or "[Kah67]" (there's a reason that the n disappears; bear with me) or "[K67]". Typesetting also varies: Nature and many other big science journals put bibkeys in superscripts; maybe the bibliography says "2." rather than "[2]"; maybe the main text says "(Kahn 1967)" rather than "[Kahn 1967]".

This blog post goes through various arguments that I've seen for non-numeric bibkeys, arguments that mostly sound good at first glance but that turn out to collapse upon closer investigation. This blog post then goes through various counterarguments.

The skipping-clicks argument. The central argument I've seen for having keys contain more than just a number is that this saves time for readers as follows: a reader looking at "[Kahn 1967]" will often think "Of course that's Kahn's Codebreakers book", a reader looking at "[CLRS09]" will often think "Of course that's the algorithms textbook from Cormen, Leiserson, Rivest, and Stein", etc., saving the time to click on the link or to flip through a printout of the paper.

Fundamentally, the problem with this is that it's actively encouraging guesses to override the communication of ethically required citation information.

Sometimes the guesses are wrong. Keys are unique within any particular paper but collide across papers. The wrong guesses are likely to have the following pattern: the guess says a well-known source is being cited; the actual situation is that a lesser-known source is being cited. Even when readers don't guess incorrectly, having a fast path for the better-known sources creates a bias (the "streetlight effect") against taking the slow path.

All of this is unfair to the lesser-known sources. Instead of scientific work being evaluated purely on the basis of its merit, privileged authors end up receiving credit they don't deserve, while it becomes harder for other authors to succeed. Even when an unprivileged paper manages to attract enough attention to be cited, readers are discouraged from seeing that it's cited!

I tried using BibTeX's "alpha" style to reformat a paper I had recently written on some math questions appearing in cryptanalysis. At one point alpha style produces a citation to "[Len17]". Some readers will have seen Arjen Lenstra's 2017 integer-factorization survey, and will think "Of course [Len17] is this survey". These readers are wrong, but they aren't wrong in a way that's obvious from the text surrounding the "[Len17]" citation, or from anything else in the main body of the paper; seeing that they're wrong requires looking at the "[Len17]" entry in the bibliography.

Some other readers will think of the 2017 paper on lattices by Hendrik Lenstra and Alice Silverberg. That's another wrong guess. With a numeric bibkey ("[70]"), the reader isn't being given fragments of information that encourage wrong guesses; the reader is more likely to click (or flip) to see the correct information, namely that this is a 1917 paper on notation from someone named Lennes.
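The collision behind these wrong guesses is easy to reproduce. Below is a minimal Python sketch of the single-author case only, assuming a label is the first three letters of the last name followed by the last two digits of the year. The function name `single_author_label` is hypothetical, and real BibTeX styles handle many more cases.

```python
def single_author_label(last_name: str, year: int) -> str:
    """Simplified alpha-style label for a single-author work (assumed rule)."""
    return f"{last_name[:3]}{year % 100:02d}"


# Two of the sources discussed above, published a century apart.
lennes = single_author_label("Lennes", 1917)
lenstra = single_author_label("Lenstra", 2017)

print(lennes, lenstra)    # Len17 Len17
print(lennes == lenstra)  # True
```

Both the truncated name and the two-digit year discard information, so nothing in the label itself can tell the reader which of the two sources is meant.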

Nels Johann Lennes hasn't been in a position after 1951 to notice how much credit he receives. But presumably he cared how he would be remembered, and wouldn't have expected that his death would suddenly result in people editing him out of the history. It's also wrong that "[Len17]" ends up sometimes miscrediting one of the Lenstra brothers. My experience is that authors making excuses for not properly citing dead people are also authors who aren't properly citing today's graduate students.

BibTeX's alpha style turns another reference in the same paper into "[GST21]". This time it's a 2021 paper (and, yes, with a student author, Aude Le Gluher, who defended her Ph.D. thesis in December 2021). Will readers who see "[GST21]" be thinking of that paper? Or of another 2021 paper that happens to have authors G, S, and T?

At this point I'd expect some people to jump in saying something like this: "The obvious solution is to provide full last names and years, as in [Kahn 1967]." But then another reference in the same paper would be "[Li 2014]", and the reality is that some readers will think that this is a different 2014 paper from a different Li. To reiterate: this is an unacceptably high error rate for communicating information that as authors we are ethically obliged to communicate.

I'd expect some other people to jump in saying something like this: "No, these errors won't happen. Readers don't use the label [GST21] to make guesses; they use it as a mnemonic after studying the bibliography."

But it's human nature to guess, and to be overconfident about it. Putting provocative fragments of information into citation labels is begging the reader to guess what's being cited. I've watched people in person talking about multi-initial bibkeys that they were looking at as references to paper X when in fact the references turned out to be to paper Y. I find it utterly implausible that these were isolated incidents.

A 2011 blog post opposing numeric keys explained how the author of the blog post, paleontologist Mike Taylor, reads papers:

So, for example, if someone writes “Haplocanthosaurus has been recovered as a non-diplodocimorph diplodocoid (Wilson 2002)”, you know that the paper that recovered Haplo in that position was, well, Wilson 2002. ... But there are some journals — our old friends Science and Nature are the most significant perpetrators here — that use numbered references instead. So you’d read “Haplocanthosaurus has been recovered as a non-diplodocimorph diplodocoid [35]”, and then you have to flip to the end of the paper, look up number 35 in the numbered list of notes, and see that the reference is given as “J. Wilson, Zool. J. Linn. Soc. 136, 217 (2002).” (Yes, it’s true: Science doesn’t even bother to tell you the title or end-page of the cited paper. You just get author name, hghly abbrvted jrnl ttl, volume, start-page and publication year.)

The complaint about Science omitting information from its bibliographies is of course valid, but that's a different issue. Let's think about what Taylor is saying about bibkeys.

The advantage being claimed here of "Wilson 2002" over "[35]" is that, upon first seeing "Wilson 2002" as a citation, "you know" that the source "was, well, Wilson 2002", without having to "flip to the end of the paper". The stated process isn't that "Wilson 2002" is being used as a mnemonic after the reader has studied the bibliography. The stated process is using "Wilson 2002" as a substitute for looking at the bibliography. What if the reader is thinking of a different Wilson?

The same Taylor blog post also makes a different argument: "And everyone who works on sauropods is familiar with Wilson 2002." Yes, if a famous source known to the reader is being cited for what it's famous for, then the reader's guess about the bibkey will be correct. But that reader doesn't even need to see a citation, let alone a "Wilson 2002" bibkey. My concern here isn't with the corner case of people hearing about famous papers they already know; my concern is with what happens in all the other cases.

Yes, looking at a bibliography entry takes a moment. Science involves many different error-reduction practices that take time. It's part of the job.

Side note: interactions with the level of description of prior work. The reader's error rate in guessing what "Wilson 2002" is referring to depends not just on the reader's background but also on how much information is in the surrounding text. Without getting up to speed on the study of sauropods, let's assume that the particular statement "Haplocanthosaurus has been recovered as a non-diplodocimorph diplodocoid (Wilson 2002)" has enough information that no readers will be guessing the wrong source.

If I'm spending that many words describing prior work, it's probably because the prior work plays an important enough role in my paper that I'd want to credit Wilson in the text: "Wilson [35] recovered Haplocanthosaurus as a non-diplodocimorph diplodocoid". If the timeline mattered then I'd say that too: "In 2002, Wilson [35] reported recovering Haplocanthosaurus as a non-diplodocimorph diplodocoid". And, of course, if I saw a risk of this triggering a miscredit, then I would add further detail to prevent that miscredit.

But often I have smaller citations where I don't want names in the text. For example, my Lennes citation with numeric keys is "[70]" in text about a particular "mathematical convention (see, e.g., [70] and [44, page xi])" that my paper is following. The "e.g." is telling the reader that there are more sources I could have cited here. I'm crediting, in the bibliography, the sources I chose. Including author names in the main text after "see, e.g." would turn a short parenthetical note into a much longer parenthetical note, and would give the reader a wrong impression regarding the importance of those sources.

The fact that many sources would fit here is also the reason that "[Len17]" leads some readers to miscredits. I could compensate for this by writing "the 1917 Lennes paper [Len17]", but this makes the "Len17" information redundant and again has the problem of giving the reader a wrong impression regarding importance. I would much rather just say "[70]".

The readers-are-using-technology argument. After the quotes above, Taylor's blog post continues as follows:

This is completely stupid, of course, but you can just about see why the world is that way: S&N are basically print journals (albeit with online editions), and space is at a very tight premium. So this kind of extreme compression may be worth the pain it costs in terms of space recovered.

But you want to know what’s really stupid?

PLoS ONE, and the rest of the PLoS journals, use numbered references. Yes, PLoS ONE, the online-only journal that has no length restrictions whatsoever. In PLoS ONE, you can have as many giant figures as you need, and as many tables, and as much text. Yet still they niggle away at space by using the objectively inferior numbered-references format.

I agree that the overall trend is towards fewer printouts, making the costs of pages less important. I almost never print out papers these days. But I have colleagues who regularly do: they find printouts easier to read than screens, or their note-taking styles work well on printouts, or they're traveling and don't want devices running out of battery power.

More importantly, I don't agree with the idea that reading papers on screen eliminates the value of conciseness. More concise text generally takes less time for me to read. Also, more of it fits on the screen at once; when I'm bouncing from text to context, the context I'm looking for is more likely to already be on the screen without my having to scroll.

Sure, conciseness has to be weighed against other desiderata. Writing "updated ? x : y" in the C programming language is more concise than writing "x if updated else y" in Python; does the brevity justify the cognitive load? If papers had been limited to, say, 30 pages in the name of conciseness, would Wiles have been forced to split his proof of Fermat's last theorem into several papers, taking space in each paper to review the relationship to each of the other papers? Should he have skipped every second proof step in the name of conciseness? And, getting back to bibliographies: I don't like omitting information from bibliography entries, even when this allows a paper to be squeezed into fewer pages.

But the point I'm making here is simply that the virtues of conciseness are not limited to printouts. A 2011 blog post by Zen Faulkes makes the same point ("the advantage goes beyond saving paper").

If I'm reading something that's obviously less concise than it could have been, I expect that there's a reason that the author took the extra space. If the reason is "I don't like taking the time to look at bibliographies, so as a reader I prefer bibkeys with fragments of bibliographic information, so as an author I prefer to provide such keys": sure, that sounds like it outweighs conciseness, but a closer look shows that it creates bigger problems; see above. Portraying conciseness as a historical relic doesn't strengthen the case.

The writers-are-not-using-technology argument. One of the supportive comments after the Taylor blog post was that "if you forget a reference or decide to add one or more you have to completely renumber all of your references through the entire paper an reference section. Not fun".

How many science authors in 2024 aren't using software such as BibTeX that numbers references automatically? Do we seriously believe that those authors are getting their bibliographies correct in the first place, even without traps such as a second 2020 Chen paper being added, forcing a relabeling to distinguish "[Chen 2020a]" from "[Chen 2020b]"? Does saving time for these authors justify inflicting inferior results on the readers?
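The relabeling trap can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than any particular tool's behavior: references that share an author and year are assumed to receive suffixes a, b, ... in title order, and the function name `author_year_keys` and both titles are hypothetical placeholders.

```python
from collections import defaultdict
from string import ascii_lowercase


def author_year_keys(refs):
    """Map each (author, year, title) tuple to an author-year key.

    Assumption: references sharing author and year get suffixes
    a, b, c, ... in title order; a lone reference gets no suffix.
    """
    groups = defaultdict(list)
    for ref in refs:
        author, year, _title = ref
        groups[(author, year)].append(ref)
    keys = {}
    for (author, year), members in groups.items():
        members.sort(key=lambda ref: ref[2])
        for i, ref in enumerate(members):
            suffix = ascii_lowercase[i] if len(members) > 1 else ""
            keys[ref] = f"[{author} {year}{suffix}]"
    return keys


# Hypothetical entries; the titles are placeholders.
first = ("Chen", 2020, "Placeholder title B")
second = ("Chen", 2020, "Placeholder title A")

before = author_year_keys([first])
after = author_year_keys([first, second])

print(before[first])  # [Chen 2020]
print(after[first])   # [Chen 2020b]
print(after[second])  # [Chen 2020a]
```

Adding the second paper changes the key of the first one, so every existing citation of it in the text has to be updated. That is the same kind of bookkeeping as renumbering, and it is equally unpleasant to do by hand.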

Most importantly, I'm trying to do the best for my readers. I don't want my papers dragged down by another author's struggles with paper-writing technology. Even if some publication venues decide to make concessions to those authors, they shouldn't let this pollute their other publications.

The education argument. One of the followups to the Taylor blog post was a blog post from paleontologist Andy Farke. Beyond praising the time savings of not looking at the bibliography ("It's easy for readers who are familiar with the literature to know exactly what's being discussed" and "You don't have to flip back and forth between the main text and the reference list"), Farke claimed the following as an advantage of having full names in bibkeys:

It helps readers new to the field to become familiar with the major names and papers. See the names "Wedel," "Taylor," "Wilson," "Curry-Rogers," and others often enough, and you probably have a good picture of a few of the major recent workers in sauropods.

I'm baffled by this argument. For a student who wants to figure out who's cited by skimming papers (let's ignore the existence of Google Scholar etc.), skimming for names in the bibliography is much more efficient than skimming for names in the main body of the paper. The information is packed into the bibliography, while it's scattered throughout the rest of the paper. Skimming the bibliography also gives the student a quick idea of time scales, publication venues, what authors choose to highlight in titles, etc.

For a while now I've been making sure to put back-references into my bibliographies: each bibliography entry ends with a red list of links back to where the entry is cited. Among other virtues, this makes it easy to skim the bibliography pages for what's cited most often.
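To make the idea of back-references concrete, here is a rough Python sketch that builds such an index from page text. It is not how the back-reference lists described above are actually produced, which this post does not specify. The function name `back_references`, the page numbers, and the page contents are all hypothetical, and the pattern only handles single numeric keys such as "[70]" or "[44, page xi]".

```python
import re
from collections import defaultdict

# Matches numeric citations such as "[70]" or "[44, page xi]".
CITATION = re.compile(r"\[(\d+)(?:,[^\]]*)?\]")


def back_references(pages):
    """Return {bibkey: sorted list of page numbers where it is cited}."""
    index = defaultdict(set)
    for page_number, text in pages.items():
        for match in CITATION.finditer(text):
            index[int(match.group(1))].add(page_number)
    return {key: sorted(found) for key, found in index.items()}


# Hypothetical page contents; the page numbers are made up for illustration.
pages = {
    3: "a mathematical convention (see, e.g., [70] and [44, page xi])",
    9: "as in [70], the same convention applies",
}

index = back_references(pages)
print(index)  # {70: [3, 9], 44: [3]}

# Keys ordered by how many pages cite them, most cited first.
most_cited = sorted(index, key=lambda key: len(index[key]), reverse=True)
print(most_cited)  # [70, 44]
```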

The everyone-else-is-jumping-off-the-bridge argument. Another argument I've heard privately is that alpha style, like "[CLRS09]", is what authors prefer, as shown by a particular anonymous poll. Obviously this isn't directly arguing that alpha style is better; it's making an indirect argument that you should accept the wisdom of a bunch of other people who have considered the issue and decided that alpha style is better.

Before looking at that particular poll, let's look at five examples of public evidence that authors are voting with their feet against alpha style.

First: The Farke blog post reported observing that numeric bibkeys, despite their unpopularity in paleontology, were the most popular choice across science:

I did a quick survey of the other 99 percent of the scientific literature, and numbered citations simply dominate. Even arXiv - the epitome of digital presentation with no real standard format - has a vast majority of papers with the [1,2,3] style (in fact, the only counterexamples I found were in a handful of biologically-oriented papers). The medical literature (medically oriented papers are the great majority of PLoS ONE submissions), computing literature, physics literature, etc., most often use numbered citations.

Second: A 2018 blog post from biodata scientist Daniel Himmelstein measured 86% of PubMed articles as using numeric bibkeys. This measurement included some efforts at reproducibility, and was accompanied by some analysis; well worth reading.

Third: I looked at the 10 most recent papers on the cryptographic preprint server. I promised myself in advance that I would report the results no matter what:

Fourth: I looked at the 10 most recent papers on I again promised myself in advance that I would report the results no matter what:

Fifth: My impression has always been that the American Mathematical Society allows authors to choose the bibliography style, and that authors usually choose numeric style. AMS's journal-author guide provides both amsplain (numeric) and amsalpha, saying that amsplain is "preferred". AMS's style guide says that "The most common style is a number within brackets, e.g., [1], [2]. The next most common style is a short alpha-numeric form, e.g., [Serre71] or (even shorter) [S71]. When multiple authors are involved, the citations become lengthy, e.g., [dlVP-SD04] for de la Valleé-Poussin and Swinnerton-Dyer from 2004."

You could accuse AMS of biasing the results by saying that amsplain is "preferred". But that's much less extreme than what happened in the poll supposedly showing an author preference for alpha. Let's look at that poll.

The context here is the Association for Computing Machinery, a big computer-science publisher. A decade ago, ACM was asking authors to use last names and years as bibkeys; today, ACM is asking authors to use numeric bibkeys. A quick scan of archived web pages shows that ACM made this change between February 2017 and May 2017.

In January 2020, computer scientist Andrew Myers posted a poll on Twitter:

ACM is thinking about how citations should appear in computer science papers. What's your opinion? 1. Author-year: recognizable but verbose 2. numeric: compact but not recognizable 3. alphanumeric: somewhat compact and somewhat recognizable.

Myers labeled the specific buttons as "[Hammer and Hicks 2014]", "numeric: [37]", and "alphanumeric: [HS09]".

There are studies showing how wording choices in questions have an impact on answers. (Do you find yourself influenced by the word "alpha"? If you think "of course not", then you need to read more studies of how irrational humans are.) So it's worth taking a moment to contemplate the biases built into this poll's questions: the links aren't clickable, for example, and the alpha example "[HS09]" has below-average length.

Anyway, author-year received 35.9% of the 301 votes; numeric received 32.6% of the votes; and alpha received 31.6% of the votes.

Wait a minute. I was supposed to be telling you about a poll showing that authors prefer alpha style. Instead I'm showing you a poll with three options that were highly controversial within the polled audience, with no majority option and with alpha style as the least favorite option. How could anyone imagine this showing a preference for alpha? Bear with me.

Myers complained that there was "vote splitting"; claimed that "Current citations use author-year format"; and set up another poll, which I'll describe in a moment. He also tweeted that "I need more votes so we can convince ACM to make a change!" and that "I have been trying to create the evidence needed to make it happen".

The new poll asked the following question, again claiming that the "current standard format is author-year":

ACM is considering changing the format of citations used in papers. The current standard format is author–year, like [Smyth and Plotkin 1982] or [Adya et al. 2014] or, used as a noun, Smyth and Plotkin (1982). Many people feel that format is too verbose and can interfere with reading the text around the citation, and want to use numeric citations like [37] instead. Numeric citations are compact but cannot be recognized when seen in the text. A proposed compromise is alphanumeric citations, such as [SP82], which are almost as compact as numeric citations but with the advantage that they are globally recognizable and searchable. What citation format do you prefer? (Here is a side-by-side comparison)

The bias is even more blatant here: alpha is presented as a "compromise" that's supposedly "almost as compact" as numeric and that supposedly produces keys with "the advantage that they are globally recognizable and searchable". What exactly does "globally recognizable and searchable" mean here? What's the justification for that claim? Why is there no description of disadvantages of alpha?

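What I find particularly interesting here is the choice of poll mechanism, where poll participants were asked for head-to-head preferences between pairs of options: alpha vs. numeric, alpha vs. author-year, and numeric vs. author-year.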

Let's think about what to expect here. If your favorite option is full last names, then of course you would vote for that over the other options, and probably you would vote for the "compromise" over numeric, right? If your favorite option is numeric, then of course you would vote for that over the other options, and probably you would vote for the "compromise" over full last names, right?

(The arguments we've seen against numeric are also arguments against alpha: alpha does a worse job than full last names of skipping clicks, doesn't provide names to students, etc. The arguments we've seen for numeric are also arguments against alpha: most importantly, alpha will trigger miscredits much more frequently than numeric, probably more frequently than full last names.)

So asking specifically about the "compromise" vs. numeric should get not just the 31.6% of the poll participants who actually wanted the "compromise", but also the 35.9% who wanted author-year. There's similar inflation in "compromise" vs. author-year. This choice of polling mechanism is inherently biased towards "compromises", even when those "compromises" are (1) objectively bad and (2) disliked by most of the polled audience.

Unsurprisingly, the alpha "compromise" won over numeric (by 216-116, out of 342 total), and won over full last names (by 208-128). Meanwhile 174-156 preferred numeric over full last names, so if there had instead been a runoff vote then numeric would have won.
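The arithmetic behind this inflation can be checked directly. The following Python sketch uses only the numbers quoted above. The prediction rests on a simplifying assumption that is not a measured fact: every participant whose first choice was not in a given head-to-head question is assumed to side with the "compromise". Observed shares are computed among the votes reported for each pairing, not out of the 342 total. The function names are hypothetical.

```python
# First-choice shares from the first poll (percent of 301 votes).
first_choice = {"author-year": 35.9, "numeric": 32.6, "alpha": 31.6}

# Head-to-head counts from the second poll: (votes for first, votes for second).
pairwise = {
    ("alpha", "numeric"): (216, 116),
    ("alpha", "author-year"): (208, 128),
    ("numeric", "author-year"): (174, 156),
}


def predicted_alpha_share(opponent):
    """Alpha's predicted share against `opponent`, in percent.

    Assumption: everyone whose first choice is not `opponent`
    prefers alpha to `opponent`.
    """
    total = sum(share for option, share in first_choice.items()
                if option != opponent)
    return round(total, 1)


def observed_share(votes_for, votes_against):
    """Share of the first option among the votes cast in one pairing."""
    return round(100.0 * votes_for / (votes_for + votes_against), 1)


print(predicted_alpha_share("numeric"))                     # 67.5
print(observed_share(*pairwise[("alpha", "numeric")]))      # 65.1
print(predicted_alpha_share("author-year"))                 # 64.2
print(observed_share(*pairwise[("alpha", "author-year")]))  # 61.9
print(observed_share(*pairwise[("numeric", "author-year")]))  # 52.7
```

Under that assumption, an option that came last as a first choice is predicted to win both of its head-to-head questions by roughly two to one, which is close to what the second poll reported.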

Myers tweeted his thanks to the people who had helped him create his desired evidence, and said that the "ACM folks are studying the results". I haven't located records of ACM's discussions of these topics, but, given that the alpha proposal hasn't been adopted in four years, it's safe to presume that the proposal ran into objections.

Is this a story of authors preferring alpha? No. It's the story of three choices all being highly controversial within an anonymous, unscientifically selected group of people, and then the polling mechanism being blatantly manipulated to turn this into outright misinformation that those people prefer alpha.

The consistency argument. One other indirect pro-alpha argument I've heard is that it's important for a journal to be "consistent": in particular, if the journal editors have decided to encourage alpha (for whatever reason), then they should require all papers to use alpha.

What exactly is the scientific value of this superficial stylistic "consistency"? Do we think the reader is using printouts of journal articles as wallpaper, and will have the artistic effect spoiled if one article says "[Len17]" and another one says "[70]"? Why is consistency within a journal more important than consistency with other papers by the same author, or with science more broadly? How about trying, for each paper, to do what's best for the readers of that paper?

The bias-against-late-authors counterargument. Let's move on to advantages of numeric bibkeys. The above discussion of arguments for non-numeric styles already covered some numeric advantages, most importantly the fundamental advantage of encouraging proper credits; but trying multiple styles shows further advantages of numeric.

In some areas of science, multiple authors on a paper are conventionally listed in decreasing order of contributions. For example, student Smith is doing a ton of lab work, with occasional input from supervisor Jones; the paper ends up being by Smith and Jones.

There are other areas of science (e.g., math) where figuring out decreasing order of contributions is difficult and where authors are conventionally listed in alphabetical order: Jones and Smith. My coauthored papers are usually in these areas. To avoid any ambiguity, I've been including explicit footnotes saying that the authors are listed in alphabetical order.

Full-last-name styles for bibkeys typically give only one author name followed by "et al.", or two author names if there are exactly two. It's clear how this is driven by some attention to conciseness, but it's also clear that this creates a credit issue: subsequent author names have less visibility. This is glaringly unfair to the "et al." authors for any paper with equally contributing authors in alphabetical order.

For areas with decreasing order of contributions, the unfairness isn't as glaring but is still there. As the Faulkes blog post put it:

Why should one person get on a large team get such a disproportionate amount of personal advertising by having their name printed throughout the main body of a paper, while the name of author 2 of 3 is only printed once in the literature cited?

Faulkes also comments on the different impact of name-year bibkeys upon different types of readers: such keys can work for professional scientists deeply specialized in the area, but problems appear when one considers a broader group of readers.

Now let's look at alpha style. Compared to listing author last names and years, alpha is clearly designed to put higher weight on conciseness. For single-author papers, the author's last name is abbreviated to three letters, as in "[Kah67]". For multiple-author papers, each author's last name is abbreviated to just one letter. To limit length when there are many authors, the style produces three letters and then a "+" for the remaining authors.

My own paper cites a paper by Albrecht, Curtis, Deo, Davidson, Player, Postlethwaite, Virdia, and Wunderer. This turns into "[ACD+18]" with alpha: Davidson, Player, Postlethwaite, Virdia, and Wunderer have all been squeezed into the "+". Meanwhile "[Adl91]" receives three letters because that happens to be a single-author paper, and more readers will guess (correctly) that it's from Len Adleman (something that the bibliography says but that I had decided not to mention in the text). Compared to "et al.", alpha's advertising bias against later authors isn't as glaring, but it's still there.
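The abbreviation rules just described can be written out as a short Python sketch. This is a simplification, not BibTeX's actual implementation: the function name `alpha_label` is hypothetical, and the cutoff of four authors before the "+" appears is an assumption based on the usual behavior of BibTeX's alpha style rather than something stated above.

```python
def alpha_label(last_names, year):
    """Simplified alpha-style label.

    Assumed rules: one author -> first three letters of the last name;
    two to four authors -> one initial each; more than four authors ->
    the first three initials followed by "+".
    """
    if len(last_names) == 1:
        stem = last_names[0][:3]
    elif len(last_names) <= 4:
        stem = "".join(name[0] for name in last_names)
    else:
        stem = "".join(name[0] for name in last_names[:3]) + "+"
    return f"{stem}{year % 100:02d}"


print(alpha_label(["Kahn"], 1967))     # Kah67
print(alpha_label(["Adleman"], 1991))  # Adl91
print(alpha_label(["Cormen", "Leiserson", "Rivest", "Stein"], 2009))  # CLRS09
print(alpha_label(
    ["Albrecht", "Curtis", "Deo", "Davidson", "Player",
     "Postlethwaite", "Virdia", "Wunderer"], 2018))  # ACD+18
```

The sketch shows where the visibility goes: a single author keeps three letters, each of up to four coauthors keeps one letter, and in the eight-author case five of the eight names contribute nothing to the label.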

I'm not saying that numeric bibkeys are completely free of any possible bias. Assuming a bibliography sorted by names, the reader knows that "[2]" is a name early in the alphabet, and might guess that it's Adleman. But this is less likely for "[2]" than it is for "[Adl91]" or "[Adleman 1991]".

Interlude: The offensiveness counterargument.

One of the papers at Crypto 2018 was from Tore Kasper Frederiksen, Yehuda Lindell, Valery Osheter, and Benny Pinkas. Do they really want this paper cited in alpha style as "[FLOP18]"?

Maybe they don't care. Maybe they think it's amusing and will draw more attention to their paper. But maybe they're unhappy about it. If in 2025 I find myself citing a new single-author paper by Duman, do I first have to ask that author whether "[Dum25]" is acceptable?

The forward-lookup-time counterargument. The primary use of a bibliography is to identify a work cited in the text, most importantly to help the reader find a copy of that work. Can we make this process more efficient without damaging its reliability?

The process starts with looking at a bibkey in the text and finding the corresponding bibliography entry. For someone reading on screen, that's a single click, assuming keys are linked, as they usually (not always!) are these days.

But imagine looking at a bibkey in a printout and flipping forward to find the bibliography entry with the same key. Bibliographies are, of course, sorted by key to support this process, but the choice of keys also affects how fast this is.

I've spent a lot of time looking at bibliographies over the past four decades. I'm certainly faster finding "[70]" in a bibliography than finding "[Len17]". Numeric bibkeys are very well correlated with quantitative position within the bibliography; I find the right page faster, and I find the spot within that page faster. This paper happens to have 100 references, and it certainly isn't obvious in advance that "[Len17]" is 70% of the way through: this isn't coming just from the overall distribution of names in society, but also from how many authors an average paper has in this area and from what this particular paper happens to cite. For a more severe example of the slowdown, try your luck at the Himmelstein challenge: "If you disagree, I challenge you to pick a Himmelstein citation from the snippet above and try to find it in the alphabetical reference compilation."

The reverse-lookup-time counterargument. Let's look at another use of bibliographies. A reader is starting to look at paper X, for example to referee it, and is thinking something like this: "Hmmm, is paper X citing relevant paper Y by Aude Le Gluher and Pierre-Jean Spaenlehauer and Emmanuel Thomé?"

A reliable way to answer this question is to read the bibliography of paper X. Also, a reliable way to see where paper X cites paper Y and what paper X says about Y is to read paper X, which one hopes a referee would be doing anyway.

But sometimes readers aren't spending that much time. I systematically go through all papers that I'm refereeing and many more papers, but, beyond that, I very often look at selected portions of further papers, and sometimes seeing whether a paper I know is cited gives me useful extra information in how the literature fits together. Meanwhile, both as an author and as a reviewer, I'm disappointed at how often I see reviews from people who obviously haven't taken the time to read through the papers they're supposed to be reviewing; the sloppiness is so extreme that, yes, sometimes a paper citing Y is met with a review claiming that Y wasn't cited. (Most recent example I saw: 30 April 2024, just six weeks ago.)

Certainly the process of working backwards to where paper X cites paper Y, starting from paper X's bibliography entry for paper Y, can be reliably accelerated. Thinking about the human process here led to some important details of how I handle back-references in my own papers:

What about the process of seeing whether paper X cites paper Y in the first place? Can that be reliably accelerated?

Maybe my computer knows the DOI for paper Y and can search for the DOI in the bibliography for paper X. But often DOIs aren't being used. So instead I'll ask my PDF reader to search for, say, Gluher. Done. But this can be an annoyingly slow process if I'm searching for a more common name such as Li. Searching for a short title word has the same problem. Searching for a longer title phrase isn't reliable, in part because people writing bibliographies are often sloppy with titles and in part because PDF searches often fail across line breaks. Also, let's not forget the case of people reading printouts.

A different answer, working both for screens and for printouts, takes advantage of bibliographies with numeric bibkeys sorted alphabetically by author names. An important caveat is that the reader has to be careful to check for Gluher within the G entries and check for Le Gluher within the L entries, since realistically bibliographies can end up either way. Another caveat is that some papers instead sort bibliographies in order of appearance; the time savings disappear when this happens, although fortunately the situation is also immediately obvious and doesn't cause a reliability issue.

Is the speedup here, in the cases where it happens, really an advantage of numeric bibkeys? Aren't bibliographies also sorted by author names with other styles of bibkeys? No, it's not that simple.

Bibliographies are always sorted by bibkeys, to support the critical process of forward lookups. For numeric bibkeys, this is compatible with sorting bibliographies by author names to support reverse lookups: the bibkeys are simply chosen to match that order.

What happens with alpha? My citation of Kuffner turns into "[Kuf16]", while my citation of Kuhn and Struik turns into "[KS01]", which has to be sorted before "[Kuf16]" since otherwise the reader looking up "[KS01]" won't find it. The issue here isn't just with 3-letter abbreviations vs. 1-letter abbreviations: my citation of Lee and Venkatesan turns into "[LV18]", while my citation of Lenstra, Lenstra, Manasse, and Pollard turns into "[LLMP93]", which has to be sorted before "[LV18]".
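To see the mismatch concretely, here is a minimal Python sketch. The function alpha_key is a hypothetical simplification of alpha-style key generation (three letters for a single author, initials for several authors, two-digit year); it is an assumption for illustration, not the actual code of BibTeX's alpha style, and it ignores details such as "von" parts and papers with many authors.

```python
def alpha_key(last_names, year):
    """Simplified alpha-style key (an approximation, not the real alpha style)."""
    if len(last_names) == 1:
        prefix = last_names[0][:3]          # single author: first three letters
    else:
        prefix = "".join(name[0] for name in last_names[:4])  # initials
    return f"{prefix}{year % 100:02d}"      # two-digit year


refs = [
    (["Kuffner"], 2016),
    (["Kuhn", "Struik"], 2001),
    (["Lee", "Venkatesan"], 2018),
    (["Lenstra", "Lenstra", "Manasse", "Pollard"], 1993),
]

# Normal bibliography order: sort by author last names.
by_author = [alpha_key(*r) for r in sorted(refs, key=lambda r: r[0])]

# Order forced by forward lookups: sort by the keys themselves.
by_key = sorted((alpha_key(*r) for r in refs), key=str.casefold)

print(by_author)  # ['Kuf16', 'KS01', 'LV18', 'LLMP93']
print(by_key)     # ['KS01', 'Kuf16', 'LLMP93', 'LV18']
```

The two orders disagree within both the K entries and the L entries, which is exactly the situation where a reader scanning by author name can miss a reference.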

A reader coming from a numeric community and looking for Lenstra, Lenstra, Manasse, and Pollard will see a bibliography that at first glance looks like it's sorted normally by author names, and can easily miss the reference (because of, e.g., spotting Lee and Venkatesan first and seeing no L entries after that). To avoid missing the reference, the reader has to think "Oh, I guess this will be sorted by, um, LLMP" and then search for LLMP. Sure, one can say that readers should already know about this, but there's still an unnecessarily high burden on the reader: beyond the well-known traps of, e.g., "Gluher" and "Le Gluher", the reader has to understand variations in how bibliography styles decide to choose their bibkeys.

Using full last names as bibkeys tends to bring bibliographies closer to normal sorting order, but there are still discrepancies. For example, my paper cites a 2016 paper by Becker, Ducas, Gama, and Laarhoven, which turns into "[Becker et al. 2016]", and also cites a 2014 paper by Becker, Gama, and Joux, which turns into "[Becker et al. 2014]" and thus has to be sorted earlier. If the reader spots the Becker, Gama, and Joux reference, but is searching for Becker, Ducas, Gama, and Laarhoven, will it occur to the reader to look downwards in the list?

For reasons outside the scope of this blog post, I submitted my paper to a misguided new journal that requires alpha. (It was accepted.) I tweaked alpha to insert numbers as necessary after initials so that the order of keys matches normal sorting order for authors in the bibliography. For example, this changed "[CLRS09]" to "[C9LRS09]", since Cormen was the ninth C name. (When the added number was right before the year, I added a dot in between to avoid confusion.) A reviewer objected: "everything that makes the first author look more important than the others is to be avoided in our community where alphabetical order is the most common". Well, yes, alpha is biased against late authors in any case; I would be happier just using [1], [2], [3], etc.

First appearance vs. reverse lookups. [Section added 2024.07.06.] I don't see any biases in numeric bibkeys when the numbers are in order of first appearance in the text: [1] is the reference cited first in the body of the paper, [2] is the reference cited first in the body after [1], etc. This is something done by, e.g., IEEE journals.

The usual complaint about this is that it doesn't support reverse lookups. But why don't we handle the reverse lookups with a separate index? The body cites [1], then [2], then [3]. The bibliography then says (of course with much more detail) [1] Zhang, [2] Jones, [3] Kim. Then there's a separate index sorted by author, saying [2] Jones, [3] Kim, [1] Zhang. Even better, if it's actually a paper by Jones and Smith, the separate index could have [2] Jones and Smith, [3] Kim, [2] Smith and Jones, [1] Zhang. Has anyone written tools for this already?
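I don't know of an existing tool, but the core of such an index is small. Here is a minimal Python sketch, assuming the bibliography is already available as a mapping from numeric key to a list of author last names; the function name author_index and this data layout are hypothetical, and a real tool would have to extract this information from the bibliography database and handle name particles, accents, and so on.

```python
def author_index(bibliography):
    """Build an author-sorted index for a first-appearance bibliography.

    bibliography: dict mapping numeric bibkey -> list of author last names.
    Each work is listed once per author, with that author moved to the front.
    """
    entries = []
    for key, authors in bibliography.items():
        for i, lead in enumerate(authors):
            others = authors[:i] + authors[i + 1:]
            entries.append((lead, others, key))
    # Sort by leading author, then by the remaining authors, then by key.
    entries.sort(key=lambda e: (e[0].casefold(),
                                [o.casefold() for o in e[1]],
                                e[2]))
    return [f"[{key}] {' and '.join([lead] + others)}"
            for lead, others, key in entries]


bibliography = {1: ["Zhang"], 2: ["Jones", "Smith"], 3: ["Kim"]}
for line in author_index(bibliography):
    print(line)
```

For this input the index lists the Jones and Smith paper twice, once under each author, so a reader looking for either name finds key [2].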

The Western-bias counterargument. [Section added 2024.06.18.] In giving examples above of the damage done by collisions, I naturally picked short last names that I had noticed frequently colliding, such as Li and Chen.

Replies have noted that this damage is not evenly spread worldwide, and that this opens up the use of non-numeric bibkeys such as "[Li 2014]" or "[Li14]" to accusations of anti-Asian bias. These accusations seem hard to rebut. An online list of the most common names worldwide says there are 107 million people named Wang, 105 million people named Li, 98 million people named Zhang, 75 million people named Chen, etc. Smith, a name that Americans think is very common, is listed as only 4.5 million people.

This issue goes beyond bibkeys. Sometimes bibliographies abbreviate full names to just initial and last name, for example. We can and should always give full names and ban these abbreviations, although there will still be problematic collisions for various people named Hao Chen.

As for names appearing for important credits in text, maybe we should be expanding "In 2002, Wilson [35] reported recovering Haplocanthosaurus as a non-diplodocimorph diplodocoid" to "In 2002, Jeffrey A. Wilson [35] reported recovering Haplocanthosaurus as a non-diplodocimorph diplodocoid". Some readers will object that we're supposed to leave out first names in papers; but that's just a convention, and if the convention is problematic then we should change it.

Mathematicians writing about the "Bolzano–Weierstrass theorem" would be unhappy writing about the "Bernard Bolzano–Karl Weierstrass theorem". But both of these names are wrong anyway: the theorem was proven by Bolzano when Weierstrass was two years old. In any case, theorems (and methods and so on) should be named after concepts ("the convergent-subsequence theorem") rather than people. Mathematician Paul Halmos, who at age 31 won the Chauvenet Prize for mathematical exposition, commented in "How to write mathematics" that " 'the closed graph theorem' is good and 'the Cauchy-Buniakowski-Schwarz theorem' is bad".

Obviously there are tensions between the general goal of avoiding bias and the general goal of conciseness. But, for bibkeys in particular, numeric keys avoid the inherent biases of non-numeric keys and are more concise than non-numeric keys.

The lazy-programmers counterargument. One last point. This one is an indirect argument, like the popularity argument.

Consider alpha's abbreviation of years as two digits. Whoever wrote that line of code in alpha was obviously not thinking about the problems this would cause, such as turning "[Len1917]" into "[Len17]", or turning "[HPS1998]" followed by "[HPS2016]" into "[HPS16]" followed by "[HPS98]". The issue here isn't tied to crossing 2000: I have another recent paper citing, among other things, Padé 1891, Kronecker 1881, Gauss 1801, Waring 1779, Lagrange 1776, Lagrange 1773, Stevin 1585, and Euclid around 300 B.C.
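Both failures are easy to reproduce. The following Python sketch uses a hypothetical helper, two_digit_key, that simply appends the last two digits of the year to a given prefix; it stands in for the year abbreviation described above and is not the actual alpha code.

```python
def two_digit_key(prefix, year):
    """Append the year truncated to two digits, as in alpha-style keys."""
    return f"{prefix}{year % 100:02d}"


# Two papers by the same author initials, 18 years apart.
cited = [("HPS", 1998), ("HPS", 2016)]
keys_in_sorted_order = sorted(two_digit_key(p, y) for p, y in cited)
print(keys_in_sorted_order)  # ['HPS16', 'HPS98']: the 2016 paper sorts first

# Years a century apart become indistinguishable.
print(two_digit_key("Len", 1917) == two_digit_key("Len", 2017))  # True
```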

Do you really want to be using a bibliography style designed by some programmer who wasn't thinking? I'm writing this with all due respect to the alpha designers.

Maybe numbers [1], [2], [3], etc. look like they show even less thought. But remember that these numeric bibkeys are typically chosen to match author-sorting order for bibliographies. There's long experience and quite a bit of code that goes into the details of that author-sorting order, in particular to support reverse lookups. Readers still can't be sure about, e.g., "Gluher" vs. "Le Gluher", but the overall bibliography style feels compatible with scholarship, while mixing up centuries does not.

For the submission mentioned above, I tweaked alpha to use four-digit years for any years that were more than 90 years before the most recent cited year (although, after more thought, I think a cutoff of 80 years before the paper date is safer) and to use four-digit years when years after any particular sequence of initials were crossing century boundaries. Also, I compressed single-author papers to one letter, reducing the advertising bias somewhat, and added numbers to get the bibliography order right, as discussed above. Then "[H4PS1998]" and "[H4PS2016]" are in the right order, and "[L5.1917]" doesn't confuse readers regarding the year, while most years are still compressed to two digits.
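The year rules can be sketched as follows. This is an illustrative Python reconstruction, not the code used for the submission: the function year_labels is hypothetical, the prefixes are taken as given (including any inserted numbers and dots), and "crossing century boundaries" is interpreted here as the same prefix being used with years whose hundreds digits differ, which is an assumption.

```python
from collections import defaultdict


def year_labels(refs, cutoff=90):
    """Choose a two-digit or four-digit year for each (prefix, year) pair.

    A year is written with four digits if it is more than `cutoff` years
    before the most recent cited year, or if the same prefix is used with
    years from more than one century.
    """
    latest = max(year for _, year in refs)
    centuries = defaultdict(set)
    for prefix, year in refs:
        centuries[prefix].add(year // 100)

    labels = []
    for prefix, year in refs:
        too_old = latest - year > cutoff
        crosses_century = len(centuries[prefix]) > 1
        if too_old or crosses_century:
            labels.append(f"{prefix}{year:04d}")
        else:
            labels.append(f"{prefix}{year % 100:02d}")
    return labels


refs = [("H4PS", 1998), ("H4PS", 2016), ("L5.", 1917), ("C9LRS", 2009)]
print(year_labels(refs))  # ['H4PS1998', 'H4PS2016', 'L5.1917', 'C9LRS09']
```

Changing the cutoff parameter to 80, measured against the paper date rather than the most recent cited year, would be the safer variant mentioned above; the sketch does not implement that variant.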

Sure, this was more work than basic alpha. It also consumes slightly more space overall. But it's an example of the sort of thing that of course you do if you're thinking about how readers actually interact with papers.

My point isn't to recommend this variant of alpha. On the contrary: this is rearranging deck chairs on the Titanic. The more you think about the role that bibkeys play in science, the more confident you become that keys should simply be numbers.

Version: This is version 2024.07.06 of the 20240612-bibkeys.html web page.