1 Animating Acronyms

Inspired by the paper, Meta-Research: The growth of acronyms in the scientific literature (radio interview here), we'll use gganimate to create an animation of the top twenty acronyms used in the titles of scientific publications 1900--2019 (see similar animation here).

If you want download and wrangle to data yourself then feel free to use this R script; however for simplicity we'll download the data we'll need directly from GitHub.

data_url <- "https://github.com/cmjt/statbiscuits/raw/master/swots/data/top_twenty.RData"

load(url(data_url))

1.1 Data exploration

The R object we've loaded, named top_twenty, is a list of length 30, where each element contains a data.frame for each year with the following variables: pmid, acronyms, nchar, source, Journal.Title, ISSN, eISSN, Year, Volume, Issue, Page, DOI, PMCID, Manuscript.Id, Release.Date.

The name of each element in the list indicate the year of publication.

names(top_twenty)
##  [1] "1990" "1991" "1992" "1993" "1994" "1995" "1996" "1997" "1998" "1999"
## [11] "2000" "2001" "2002" "2003" "2004" "2005" "2006" "2007" "2008" "2009"
## [21] "2010" "2011" "2012" "2013" "2014" "2015" "2016" "2017" "2018" "2019"

The column acronyms contains the top-twenty acronym used in the title of a publication in Journal.Title identified by pmid and PMCID. The number of characters in an acronym are given in the nchar column. Other information relating to the published article is given in the other columns.

head(top_twenty[[1]])
##      pmid acronyms nchar source Journal.Title      ISSN     eISSN Year Volume
## 1 1688383      RNA     3  Title       J Virol 0022-538X 1098-5514 1990     64
## 2 1688384     mRNA     4  Title       J Virol 0022-538X 1098-5514 1990     64
## 3 1688384     mRNA     4  Title       J Virol 0022-538X 1098-5514 1990     64
## 4 1688464      RNA     3  Title Mol Cell Biol 0270-7306 1098-5549 1990     10
## 5 1688465      RNA     3  Title Mol Cell Biol 0270-7306 1098-5549 1990     10
## 6 1688530      RNA     3  Title        EMBO J 0261-4189 1460-2075 1990      9
##   Issue Page                  DOI     PMCID Manuscript.Id Release.Date
## 1     1  222                      PMC249091                       live
## 2     1  239                      PMC249096                       live
## 3     1  239                      PMC249096                       live
## 4     1  184 10.1128/mcb.10.1.184 PMC360726                       live
## 5     1   28  10.1128/mcb.10.1.28 PMC360709                       live
## 6     1  257                      PMC551656                       live

As you can imagine for the top-twenty acronyms alone there are thousands of observations each year:

sapply(top_twenty, nrow)
##  1990  1991  1992  1993  1994  1995  1996  1997  1998  1999  2000  2001  2002 
##  3814  3913  3923  4130  4305  4358  4409  4424  4649  4411  4535  4540  4475 
##  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015 
##  5196  5754  6324  7303  8610 11823 14540 17129 19374 22391 25060 26968 29226 
##  2016  2017  2018  2019 
## 30040 32890 33776 33779

Let's look at the top-twenty acronyms in 1990:

table(top_twenty[[1]]$acronyms)
## 
## AIDS  AMP  ATP cAMP  CD3  CD4  CD8 cDNA  DNA  HIV  HLA  IgG   IL  MHC mRNA  PCR 
##  139   81  155   42   43  126   58  342 1212  230  107   59   83   48  277   68 
## RFLP  RNA rRNA tRNA 
##  108  465   51  120

1.2 Format the data to suit our porposes

A long list probably isn't the best way to summarise these data - We could combine all the elements into one big dataframe (remember we have already have a column specifying year)

library(dplyr)
df <- bind_rows(top_twenty)
head(df)
##      pmid acronyms nchar source Journal.Title      ISSN     eISSN Year Volume
## 1 1688383      RNA     3  Title       J Virol 0022-538X 1098-5514 1990     64
## 2 1688384     mRNA     4  Title       J Virol 0022-538X 1098-5514 1990     64
## 3 1688384     mRNA     4  Title       J Virol 0022-538X 1098-5514 1990     64
## 4 1688464      RNA     3  Title Mol Cell Biol 0270-7306 1098-5549 1990     10
## 5 1688465      RNA     3  Title Mol Cell Biol 0270-7306 1098-5549 1990     10
## 6 1688530      RNA     3  Title        EMBO J 0261-4189 1460-2075 1990      9
##   Issue Page                  DOI     PMCID Manuscript.Id Release.Date
## 1     1  222                      PMC249091                       live
## 2     1  239                      PMC249096                       live
## 3     1  239                      PMC249096                       live
## 4     1  184 10.1128/mcb.10.1.184 PMC360726                       live
## 5     1   28  10.1128/mcb.10.1.28 PMC360709                       live
## 6     1  257                      PMC551656                       live

Let's count the acronyms

count <- df %>% count(acronyms)
head(count)
##   acronyms    n
## 1      ABC  160
## 2      ADP   60
## 3     AIDS 5020
## 4      AKT  657
## 5      AMP  377
## 6      ATP 9302

Not quite what we want, we forgot to group by year...

count <- df %>% group_by(Year) %>% count(acronyms)
head(count)
## # A tibble: 6 x 3
## # Groups:   Year [1]
##    Year acronyms     n
##   <int> <chr>    <int>
## 1  1990 AIDS       139
## 2  1990 AMP         81
## 3  1990 ATP        155
## 4  1990 cAMP        42
## 5  1990 CD3         43
## 6  1990 CD4        126

Now let's sort by the most used by year and rank

ranked <- count %>%
  arrange(Year, -n) %>%
  mutate(rank = as.factor(1:n()))
head(ranked)
## # A tibble: 6 x 4
## # Groups:   Year [1]
##    Year acronyms     n rank 
##   <int> <chr>    <int> <fct>
## 1  1990 DNA       1212 1    
## 2  1990 RNA        465 2    
## 3  1990 cDNA       342 3    
## 4  1990 mRNA       277 4    
## 5  1990 HIV        230 5    
## 6  1990 ATP        155 6

1.3 Plotting

Numbers of each acronym by year

library(ggplot2)
ggplot(ranked, aes(x = Year, y = n, col = acronyms)) + geom_line() 

Quite a lot going on... What about a bar graph so we can easily compare acronyms? One thing to note is that as we've already "counted" the acronyms then we need to specify stat = "identity" in our call to geom_bar() so that ggplot() knows to use our data as bar heights. Note also that we can flip x = and y = inside aes() to switch the axes.

ggplot(ranked, aes(x = n, y = rank, fill = acronyms)) + 
  geom_bar(stat = "identity")

We've forgotten about year

  • what about using facet_wrap()?

The colours are awful

  • how about a decent palette (e.g., scale_fill_brewer(palette = "Dark2"))? See others here.

The trouble with the "Dark2" palette from RColorBrewer is that it only contains 8 different colours and we need 54! No problem we can simply extend the colour palette using the colorRampPalette() function from ColorBrewer:

## We need 54 colours as there are 54 unique acronyms
library(RColorBrewer)
n.cols <- 54
mycols <- colorRampPalette(brewer.pal(8, "Dark2"))(n.cols)
head(mycols)
## [1] "#1B9E77" "#349567" "#4D8D58" "#668548" "#7F7C39" "#987429"

So, we use facet_wrap() to create a multi panel plot (splitting by Year) and scale_fill_manual() to specify our chosen (better) colour palette for the fill aesthetic.

ggplot(ranked, aes(x = n, y = rank, fill = acronyms)) +
  geom_bar(stat = "identity") +
  facet_wrap(~Year) +
  scale_fill_manual(values = mycols)

Well there's quite a lot going on there; let's let the axes differ for each year as - not all counts are the same, and - not all acronyms turn up each year. We can use scales = "free" in our facet_wrap() call for this.

ggplot(ranked, aes(x = n, y = rank, fill = acronyms)) +
  geom_bar(stat = "identity") +
  facet_wrap(~Year, scales = "free") +
  scale_fill_manual(values = mycols)

There are a number of issues with this plot; let's sort a few of them out:

  • using geom_text() we'll label each bar by the acronym it represents and left justify this text, hjust = "left"
  • we'll suppress the x- and y-axis labels using xlab("") and ylab("") (is there good reason to do this?)
  • we'll change the theme to theme_gray(), other options discussed here
  • in addition to the theme we choose we may want to fine tune some other elements so using theme() we'll suppress the y-axis ticks and labels with axis.text.y = element_blank(),axis.ticks.y = element_blank() and scrap the needless legend using legend.position = "none". Remember we have the acronym labels now!
  • finally, to reverse the order of the factor ranks, 1...20, we use scale_y_discrete() so that 1 is on top
plot <- ggplot(ranked, aes(x = n, y = rank, fill = acronyms)) +
          geom_bar(stat = "identity") +
          facet_wrap(~Year, scales = "free") +
          scale_fill_manual(values = mycols) +
          geom_text(aes(label = acronyms), hjust = "left", col = "darkgrey") +
          xlab("") + ylab("") + 
          theme_gray() +
          theme(axis.text.y = element_blank(),axis.ticks.y = element_blank(),legend.position = "none") +
          scale_y_discrete(limits = rev(levels(ranked$rank)))

plot

There's still a lot going on! So let's us gganimate to create a racing barchart!

It's as simple as adding + transition_time()! Although we do have to use facet_null() to forget the facet_wrap() stuff (for reasons only ggplot2 wizards know). Let's also add a title specifying year using labs().

library(gganimate)
anim <- plot +  
  transition_time(Year) +
  facet_null()  + ## we have to forget facet stuff for reasons only ggplot2 wizards know
  labs(title = "Year: {frame_time}") 
anim