I have spent a ton of time doing metrics for open source scientific software as part of my job at Columbia. Stan has been the focus since Columbia contributes to much of the ecosystem, but the work also includes tracking other software as a counterpoint for grant proposals and progress reports. I’d like to see these metrics used more broadly if it helps people write better proposals. I am currently working on the Chan Zuckerberg Initiative (CZI) EOSS proposal https://chanzuckerberg.com/wp-content/uploads/2021/03/EOSS-4-Combined-RFA-Packet.pdf on behalf of the Stan organization, and it looks like many other NumFOCUS projects are applying as well. This document is for them.
In January of 2021 NASA had an open source infrastructure call for proposals, and the Stan org joined up with PyMC and ArviZ since we all play in the Bayesian space, so why not do it together? I didn’t want to get the funds at the expense of PyMC and ArviZ since we are all on the same team. Towards the end of the NASA effort it became clear that around 10 proposals were coming out of NumFOCUS. Isn’t AstroPy on my team too? It made me uncomfortable.
The NASA proposal was a solid month’s work, and I had the privilege of working on it pretty much full time thanks to my boss’s generosity (Andrew Gelman). I was using Columbia resources to support the Stan org at NumFOCUS, and Columbia only indirectly benefits from Stan org funding. I had a think about our competitive advantage, which came down to my having the cycles to focus exclusively on getting a good pitch, and then I had a thunk: resources beget more resources at the expense of those without resources, and here I am competing with organizations that I have no interest in ‘defeating’ in the funding game.
In response to my ‘thunk’ I decided to create this repo with my metrics in the hope of helping other NumFOCUS projects apply to CZI. Sorry, the scripts are in R, the language I am currently trying to learn, but it is all pretty simple stuff; the hassle is in sorting out queries, API access, etc., which takes more time than I am willing to admit. The repo also represents what ‘worked’–many things were tried. Looking at you, Google Scholar, with your lack of an API or any way to retrieve results without getting locked out. I also tried many high level access packages that inevitably lacked some feature or information I wanted, so all the code is pretty low level GET interactions, but that level of control is worth it and has become my starting place.
So in short I am not accepting the zero-sum basis of scientific funding and I don’t want to compete with my NumFOCUS neighbors. Maybe CZI allocates more funds because of all of the compelling proposals and excellent metrics justifying the projects. How about a threshold system of funding, meaning that all proposals above an evaluation threshold are funded, or, if funds are limited, the above-threshold awards are selected randomly? Totally ordering the relative merit of proposals, or ‘going by gut’, does not appeal to me. How about assigning 50% of awards as I suggest and seeing how they do over time versus the more standard process?
Stan, and Bayesian software as a whole, is growing rapidly, so most of the scripts focus on growth over time as a proxy for relevance to science. Your package may have different qualities worth highlighting. There are probably inertial effects at play that bake in growth: software gets downloaded more as automated systems like continuous integration grow in popularity, which apparently drives downloads on its own. For CRAN downloads I add regression lines to show growth relative to baseline packages like ggplot2 to account for this.
Research citations have inertial effects too, and scopus.com undercounts considerably relative to what scholar.google.com reports.
WARNING: This is a quick effort and not a well designed open source project. Documentation is minimal and mistakes have probably been made, but I have tried to keep variable names understandable and comments helpful. I welcome bug fixes and/or extensions, and I hope it helps your proposal.
On to the scripts:
Scopus.com is Elsevier’s academic search engine over the research literature, both what they publish and other sources. They offer a decent API for searching and subject classification by category.
This code uses a subscription to scopus.com which I got because of my Columbia University affiliation. Information about getting credentials is at https://dev.elsevier.com/sc_apis.html; there appears to be a free access tier, but I don’t know if it provides all the features I am using below. Getting credentials does require an interaction with a human, I believe–just submit a request and you get a response in a day or so, as I recall.
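Before running the full script it can be worth sanity-checking the credentials with a single request. The sketch below is my own minimal example (the ‘matplotlib’ query is just an illustration) and assumes the Scopus_credentials.json file described further down.
# minimal Scopus Search API request (sketch; not part of the main script below)
library(httr)
library(jsonlite)
creds <- read_json('Scopus_credentials.json')
check <- GET('https://api.elsevier.com/content/search/scopus?query=matplotlib',
             add_headers('X-ELS-APIKey' = creds$API_KEY,
                         'X-ELS-Insttoken' = creds$INSTITUTION_TOKEN))
stopifnot(check$status_code == 200) # anything else usually means a credential problem
hits <- fromJSON(rawToChar(check$content))$`search-results`$`opensearch:totalResults`
cat("total matching documents for 'matplotlib':", hits, "\n")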
I will be losing these credentials soon but am happy to run queries for other projects in the remaining month. I have been giving projects a wide and a long version of all the data broken down by Scopus category–look at the write.csv calls below.
Email fbb2116@columbia.edu with likely search strings that might be mentioned in the citations of a research publication. Hopefully your project has a unique name that functions as a rigid designator, i.e., use of the name is unique to your project, which ‘Stan’ entirely fails at–see https://statmodeling.stat.columbia.edu/2019/04/29/we-shouldntve-called-it-stan-i-shouldve-listened-to-bob-and-hadley/.
A good rigid designator is ‘matplotlib’ because it is unlikely to refer to anything else. Note that articles that mention ‘matplotlib’ in passing are counted just as much as those that are all about the library; no distinction between the body and references section is made.
After having done a few of these for others, I find that the inclusion of ‘py’ somewhere in the package name really helps with search because it makes the name unique.
The subject categories are filtered to be biomedicine relevant, which is CZI specific. The complete list is at https://service.elsevier.com/app/answers/detail/a_id/15181/supporthub/scopus/; see the categories() function below for possible values.
The query being run below is ‘jupyter’–see how that nice ‘py’ in the middle makes it a good rigid designator.
library(tidyr)     # data reshaping (gather)
library(ggplot2)   # plotting
library(dplyr)     # data manipulation
library(lubridate) # dates
library(httr)      # web access
library(jsonlite)  # JSON processing
library(stringr)   # regex
library(ggrepel)   # non-overlapping plot labels
library(wkb)
credentials <- read_json('Scopus_credentials.json')
API_KEY <- credentials$API_KEY
INSTITUTION_TOKEN <- credentials$INSTITUTION_TOKEN
# Format of Scopus_credentials.json
# {
# "API_KEY":"XXXXXXXXXXXXXXXXXXXXXXXXXxx",
# "INSTITUTION_TOKEN":"XXXXXXXXXXXXXXXXXXXXXXXX"
# }
BASE_URL = 'https://api.elsevier.com/content/search/scopus'
USE_CACHE = TRUE #will have to setup a redis server
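# NOTE (my assumption): caching expects a local Redis server on the default port 6379,
# e.g. started with `redis-server` or `docker run -p 6379:6379 redis`;
# redux::hiredis() below connects to localhost:6379 by default.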
REPORT_PROGRESS = FALSE
if (USE_CACHE) {
  library(redux)
  redis <- redux::hiredis()
  get_results <- function(url) {
    cache <- redis$GET(url) # check redis first
    if (!is.null(cache)) {
      result <- unserialize(cache)
      if (result$status_code == 200) {
        if (REPORT_PROGRESS) {
          cat(paste("\nhitting redis for", url))
        }
        return(result)
      }
    }
    random_wait <- abs(rnorm(1, 1, 1))
    if (REPORT_PROGRESS) {
      cat(paste("\ncache miss, querying:", url, "\n"))
      cat(paste(
        "\nWaiting",
        random_wait,
        "seconds to be nice to webserver\n"
      ))
    }
    Sys.sleep(random_wait)
    result <- GET(
      url,
      add_headers('X-ELS-APIKey' = API_KEY,
                  'X-ELS-Insttoken' = INSTITUTION_TOKEN)
    )
    redis$SET(url, serialize(result, NULL))
    return(result)
  }
}
year_start <- 2012
year_end <- 2020 # want complete years or graph looks odd
years <- year_start:year_end
df <- data.frame(years)
stan_eco_q <-
'(brms+AND+burkner)+OR+(gelman+AND+hoffman+AND+stan)+OR+mc-stan.org+OR+rstanarm+OR+pystan+OR+(rstan+AND+NOT+mit)'
pymc_arviz_stan_eco_q <-
paste('pymc*', 'arviz', stan_eco_q, sep = '+OR+')
matplotlib_q <- 'matplotlib'
jupyter_q <- 'jupyter'
query = jupyter_q
years <- year_start:year_end
scopus.df <- data.frame(years)
package = query
total_count <- 0
for (year in year_start:year_end) {
  year_span <- paste(year - 1, "-", year, sep = '')
  url <-
    paste(
      BASE_URL,
      '?query=',
      package,
      "+AND+PUBYEAR+=+",
      year,
      '&facets=subjarea(count=101)',
      sep = ''
    )
  if (USE_CACHE) {
    result <- get_results(url)
  } else {
    if (REPORT_PROGRESS) {
      cat(paste("hitting scopus with:", url, "\n"))
    }
    result <- GET(
      url,
      add_headers('X-ELS-APIKey' = API_KEY,
                  'X-ELS-Insttoken' = INSTITUTION_TOKEN)
    )
  }
  if (result$status_code != 200) {
    print(sprintf("got non 200 status from query: %d", result$status_code))
    stop()
  }
  json_txt <- rawToChar(as.raw(strtoi(result$content, 16L)))
  data <- jsonlite::fromJSON(json_txt)
  total_count <-
    as.numeric(data$`search-results`$`opensearch:totalResults`) + total_count
  facet_count <- length(data$`search-results`$facet$category$name)
  j <- 1
  while (j <= facet_count) { # <= so the last facet is not skipped
    name <- data$`search-results`$facet$category$label[j]
    name <- str_replace(name, " \\(all\\)", "")
    hitCount <-
      as.numeric(data$`search-results`$facet$category$hitCount[j])
    if (!name %in% colnames(scopus.df)) {
      scopus.df[name] <- rep(0, year_end - year_start + 1)
      # print(paste("name=",name,", count=",hitCount))
    }
    scopus.df[name][scopus.df$years == year, ] <- hitCount
    j <- j + 1
  }
}
column_names <- colnames(scopus.df)
column_sums <- colSums(scopus.df)
df_long <- gather(scopus.df,
key = 'topic',
value = 'yr_count',
column_names[2]:column_names[length(column_names)])
write.csv(
scopus.df,
file = paste("scopus_data/", query, ".csv", sep = ""),
row.names = FALSE
) #not going to work in general
write.csv(
df_long,
file = paste("scopus_data/", query, "_long.csv", sep = ""),
row.names = FALSE
) #not going to work in general
Two result files are created and named after the query: scopus_data/jupyter.csv and scopus_data/jupyter_long.csv. The query string is used verbatim as the file name, so a more complex query may not produce a valid file name.
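If your query contains characters that are awkward in file names (‘+’, ‘(’, ‘*’, etc.), a small sanitizing step like the sketch below can be dropped in before the write.csv calls. This is my own addition, not part of the script above; safe_name is a hypothetical helper variable, and stringr is already loaded.
# replace anything outside a conservative character set with '_' (hypothetical helper)
safe_name <- str_replace_all(query, "[^A-Za-z0-9._-]", "_")
write.csv(scopus.df, file = paste("scopus_data/", safe_name, ".csv", sep = ""),
          row.names = FALSE) # and similarly for the long version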
Note that the same article can be counted in more than one subject category
Processing continues below where I filter for biomedicine categories:
# continues with values from previous chunk
#got the raw csv data, now let's graph it
#add total count to data frame for each category
df_long$total <- rep(0, nrow(df_long))
for (t in column_names[2:length(column_names)]) {
  df_long[df_long$topic == t, ]$total <- column_sums[[t]]
}
# assign label to last year for display '<topic> <total>'
# can use to scatter the labels to points other than max(years)
df_long_label <- df_long %>%
mutate(label = if_else(years == max(years),
paste(as.character(topic), total), NA_character_))
# category mapping to description at
# https://service.elsevier.com/app/answers/detail/a_id/15181/supporthub/scopus/
categories <- function() {
  url <- 'https://api.elsevier.com/content/subject/scopus'
  result <- GET(url,
                add_headers('X-ELS-APIKey' = API_KEY,
                            'X-ELS-Insttoken' = INSTITUTION_TOKEN))
  json_txt <- rawToChar(as.raw(strtoi(result$content, 16L)))
  data <- jsonlite::fromJSON(json_txt)
  catsDf = data$`subject-classifications`$`subject-classification`
  return(catsDf)
}
# pulled from categories returned by below, uncomment to run
# unique(categories()$description)
# [1] "Multidisciplinary" "Agricultural and Biological Sciences"
# [3] "Arts and Humanities" "Biochemistry, Genetics and Molecular Biology"
# [5] "Business, Management and Accounting" "Chemical Engineering"
# [7] "Chemistry" "Computer Science"
# [9] "Decision Sciences" "Earth and Planetary Sciences"
# [11] "Economics, Econometrics and Finance" "Energy"
# [13] "Engineering" "Environmental Science"
# [15] "Immunology and Microbiology" "Materials Science"
# [17] "Mathematics" "Medicine"
# [19] "Neuroscience" "Nursing"
# [21] "Pharmacology, Toxicology and Pharmaceutics" "Physics and Astronomy"
# [23] "Psychology" "Social Sciences"
# [25] "Veterinary" "Dentistry"
# [27] "Health Professions"
medicine_categories = paste(
"Health Professions",
"Pharmacology, Toxicology and Pharmaceutics",
"Psychology",
"Biochemistry, Genetics and Molecular Biology",
"Immunology and Microbiology",
"Nursing",
"Medicine",
"Neuroscience",
"Veterinary",
"Agricultural and Biological Sciences",
sep = "|"
)
# filter for medicine categories
df_long_label_filtered <-
df_long_label[str_detect(df_long_label$topic, medicine_categories), ]
#plot df_long_label to see all categories
plot2 <-
ggplot(data = df_long_label_filtered, aes(
x = years,
y = yr_count,
group = topic,
color = topic
)) +
geom_line() +
geom_point() +
geom_label_repel(aes(label = label),
max.overlaps = 17, # adjust to allow for all labels
na.rm = TRUE) +
scale_color_discrete(guide = FALSE) #removes guide on right
print(plot2)
rm(list = ls()) #clean up environment
Annual count of Scopus subject categories for the query “jupyter” with total counts across biomedicine subject categories. Note that the same article can be counted in more than one subject category.
PyPi download data is available through Google BigQuery’s public data sets; I query the bigquery-public-data.pypi.file_downloads table from my BigQuery project, indigo-epigram-312023. The query for ‘Keras’ is below:
#standardSQL
SELECT
COUNT(*) AS num_downloads,
DATE_TRUNC(DATE(timestamp), MONTH) AS `month`
FROM `bigquery-public-data.pypi.file_downloads`
WHERE
file.project = 'keras'
-- Only query the last x months of history
AND DATE(timestamp)
BETWEEN DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 120 MONTH), MONTH)
AND CURRENT_DATE()
GROUP BY `month`
ORDER BY `month` DESC
You will have to set up an account; you get $300 in credit as of now, so this will be free, but they want a credit card anyway. I can also run queries for projects–it takes about 5 minutes.
Place the resulting .csv file in the data/PyPi folder; I have accumulated some current examples below and they are in the repo. Name yours so the package name follows ‘data/PyPi/’ and comes before the first ‘-’, since the display name is extracted from the file name.
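If you would rather pull the data from R than from the BigQuery console, a sketch along these lines should work with the bigrquery package. This is my own addition: "my-gcp-project-id" is a placeholder for your billing-enabled project, and the output file name is just an example that follows the naming convention above.
# run the PyPi download query from R via bigrquery (sketch, not part of the original scripts)
library(bigrquery)
sql <- "SELECT COUNT(*) AS num_downloads,
               DATE_TRUNC(DATE(timestamp), MONTH) AS `month`
        FROM `bigquery-public-data.pypi.file_downloads`
        WHERE file.project = 'keras'
          AND DATE(timestamp) BETWEEN
              DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 120 MONTH), MONTH)
              AND CURRENT_DATE()
        GROUP BY `month` ORDER BY `month` DESC"
tb <- bq_project_query("my-gcp-project-id", sql) # placeholder project id
df <- bq_table_download(tb)
write.csv(df, "data/PyPi/Keras-results-manual.csv", row.names = FALSE)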
library(ggplot2)
library(ggrepel)
library(tidyverse)
library(stringr)
# this list controls what is displayed
packagePyPiData = c('data/PyPi/ArviZ-results-20210502-112557.csv',
'data/PyPi/Keras-results-20210502-132857.csv',
'data/PyPi/PyMC3-results-20210502-112819.csv',
'data/PyPi/PyStan-results-20210502-132744.csv',
'data/PyPi/PyTorch-results-20210502-130326.csv',
'data/PyPi/TensorFlow-results-20210502-131806.csv',
'data/PyPi/NumPy-results-20210503-200308.csv',
'data/PyPi/ggplot-results-20210503-201446.csv')
# note format for display name extraction: "data/PyPi/<display name>-results-....csv"
packagesPyPi = str_match(packagePyPiData, "data/PyPi/([^-]+)-results.*")[,2]
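# e.g. "data/PyPi/ArviZ-results-20210502-112557.csv" yields the display name "ArviZ"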
pkgPyPiDf = data.frame()
longest = 0
for (i in 1:length(packagePyPiData)) { # iterate from one list
  df = read.csv(packagePyPiData[i])
  if (longest < nrow(df)) {
    longest = nrow(df)
    pkgPyPiDf = data.frame(month = as.Date(df$month))
  }
}
# month is ordered newest-first in the query results, so shorter histories are
# padded with NA at the (older) end to align with the longest series
for (i in 1:length(packagesPyPi)) { # iterate from co-indexed list
  df = read.csv(packagePyPiData[i])
  pkgPyPiDf[[packagesPyPi[i]]] = c(df$num_downloads, rep(NA, longest - nrow(df)))
}
pkgLongDf = gather(pkgPyPiDf, key = "package", value = "downloads", packagesPyPi)
label_month = as.Date("2018-08-01")
pkgLongDf = pkgLongDf %>% mutate(label = if_else(month == label_month,
package,
NA_character_))
pyPiPlot = ggplot(data = pkgLongDf, aes(x = month, y = downloads,
color = package, group = package)) +
geom_line(na.rm = TRUE) +
scale_x_date(limits = as.Date(c(min(pkgLongDf$month), "2021-04-01")),
breaks = seq.Date(from = as.Date("2016-01-01"),
to = as.Date("2021-01-01"),
by = "1 year")) +
scale_color_discrete(guide = FALSE) +
scale_y_continuous(breaks=c(0, 100, 1000, 10000, 100000, 1e+06, 1e+07, 1e+08),
trans = scales::log_trans()) +
geom_label_repel(label = pkgLongDf$label, na.rm = TRUE)
print(pyPiPlot)
rm(list = ls()) #cleanup
Monthly counts of PyPi downloads.
CZI asks for PR aging information. The code below hits the GitHub API and does some counting given a public repo path. It is very easy to get a personal access token–see https://github.com/settings/tokens–and without one the requests will get rate limited after just a few pages. Format of ‘github_credentials.json’:
{
"PERSONAL_TOKEN":"XXXXXXXXXXXXXXXXXXXXXXXXxx",
"USER":"XXXXXXXXXXXXXXXXXXXXXX"
}
Make that file a sibling of the Rscript or this page and you should be good.
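A quick way to confirm the token works before launching the full crawl is to hit GitHub’s rate-limit endpoint. This sketch is my own addition and assumes github_credentials.json is in place as described above.
# sanity check: authenticated requests get a much higher rate limit (5000/hour vs 60/hour)
library(httr)
library(jsonlite)
credentials <- read_json('github_credentials.json')
check <- GET('https://api.github.com/rate_limit',
             config = authenticate(user = credentials$USER,
                                   password = credentials$PERSONAL_TOKEN))
limits <- fromJSON(rawToChar(check$content))$resources$core
cat("rate limit:", limits$limit, "remaining:", limits$remaining, "\n")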
rm(list = ls()) #cleanup
library(httr) # web access
library(jsonlite) #json processing
library(stringr) #regex
library(lubridate) #date
USE_CACHE = TRUE #will have to setup a redis server
REPORT_PROGRESS = FALSE
credentials <- read_json('github_credentials.json')
PERSONAL_TOKEN <- credentials$PERSONAL_TOKEN
USER <- credentials$USER
# Format of github_credentials.json
# {
# "PERSONAL_TOKEN":"XXXXXXXXXXXXXXXXXXXXXXXXxx",
# "USER":"XXXXXXXXXXXXXXXXXXXXXX"
# }
# get your token at:
# https://github.com/settings/tokens
# you want to check the 'public repo access' at least.
if (USE_CACHE) {
  library(redux)
  redis <- redux::hiredis()
  get_results <- function(url) {
    cache <- redis$GET(url) # check redis first
    if (!is.null(cache)) {
      result <- unserialize(cache)
      if (result$status_code == 200) {
        if (REPORT_PROGRESS) {
          cat(paste("\nhitting redis for", url))
        }
        return(result)
      }
    }
    random_wait <- abs(rnorm(1, 1, 1))
    if (REPORT_PROGRESS) {
      cat(paste("\ncache miss, querying:", url, "\n"))
      cat(paste(
        "\nWaiting",
        random_wait,
        "seconds to be nice to webserver\n"
      ))
    }
    Sys.sleep(random_wait)
    result <- GET(url, config = authenticate(user = USER,
                                             password = PERSONAL_TOKEN))
    redis$SET(url, serialize(result, NULL))
    return(result)
  }
}
# package names below ('org/repo'), look them up at github.com,
# e.g. https://github.com/stan-dev/stanc3 -> 'stan-dev/stanc3'
# packages = c('stan-dev/cmdstan', 'stan-dev/stan', 'stan-dev/rstanarm',
# 'stan-dev/cmdstanpy')
packages = c('stan-dev/rstan', 'stan-dev/cmdstanr', 'stan-dev/math',
'stan-dev/cmdstanpy')
# packages = c('stan-dev/stanc3', 'stan-dev/pystan',
packageDataDf = data.frame()
for (i in 1:length(packages)) {
  page = 1L
  while(TRUE) {
    url <- paste('https://api.github.com/repos/', packages[i],
                 '/pulls?state=all&page=', as.character(page), sep='')
    if (USE_CACHE) {
      result <- get_results(url)
    } else {
      result <- GET(url, config = authenticate(user = USER,
                                               password = PERSONAL_TOKEN))
    }
    if (result$status_code == 200) {
      jsonTxt <- rawToChar(as.raw(strtoi(result$content, 16L)))
      newDataDf <- jsonlite::fromJSON(jsonTxt)
      n <- nrow(newDataDf)
      newDataLongDf <- data.frame(package = rep(packages[i], n),
                                  created = as.Date(newDataDf$created_at),
                                  closed = as.Date(newDataDf$closed_at))
      packageDataDf <- rbind(packageDataDf, newDataLongDf)
      # last page: no 'next' relation in the Link header (the header is absent
      # entirely when everything fits on one page)
      if (is.null(result$headers$link) ||
          !str_detect(result$headers$link, 'next')) {
        break
      }
      page <- page + 1
      # print(sprintf("doing page %d", page))
    } else {
      print(paste("Error", result))
      stop()
    }
  }
}
packageDataDf$age <- packageDataDf$closed - packageDataDf$created
for (i in 1:length(packages)) {
  print(sprintf("package %s has %d closed pull requests, mean age to closure of %.0f days",
                packages[i],
                nrow(packageDataDf[packageDataDf$package == packages[i] &
                                     !is.na(packageDataDf$closed),]),
                mean(packageDataDf[packageDataDf$package == packages[i] &
                                     !is.na(packageDataDf$closed),]$age)))
  print(sprintf("package %s has %d open pull requests, mean age of %.0f days from %s",
                packages[i],
                nrow(packageDataDf[packageDataDf$package == packages[i] &
                                     is.na(packageDataDf$closed),]),
                mean(today(tzone = "UTC") -
                       packageDataDf[packageDataDf$package == packages[i] &
                                       is.na(packageDataDf$closed),]$created),
                today(tzone = "UTC")))
}
[1] "package stan-dev/rstan has 159 closed pull requests, mean age to closure of 77 days"
[1] "package stan-dev/rstan has 10 open pull requests, mean age of 283 days from 2021-05-17"
[1] "package stan-dev/cmdstanr has 208 closed pull requests, mean age to closure of 5 days"
[1] "package stan-dev/cmdstanr has 2 open pull requests, mean age of 139 days from 2021-05-17"
[1] "package stan-dev/math has 1395 closed pull requests, mean age to closure of 19 days"
[1] "package stan-dev/math has 15 open pull requests, mean age of 99 days from 2021-05-17"
[1] "package stan-dev/cmdstanpy has 155 closed pull requests, mean age to closure of 6 days"
[1] "package stan-dev/cmdstanpy has 5 open pull requests, mean age of 121 days from 2021-05-17"
print(sprintf("Across all packages mean aging to closure is %.0f days",
mean(packageDataDf$age, na.rm = TRUE)))
[1] "Across all packages mean aging to closure is 21 days"
print(sprintf("All packages mean open PR length is %.0f days from %s",
mean(today(tzone = "UTC") -
packageDataDf[is.na(packageDataDf$closed),]$created),
today(tzone = "UTC")))
[1] "All packages mean open PR length is 162 days from 2021-05-17"
Super ugly code that I used to get aging graphing going. Just saving it for later.
rm(list = ls())
library(httr) # web access
library(jsonlite) #json processing
library(stringr) #regex
library(lubridate) #date
USE_CACHE = TRUE #will have to setup a redis server
REPORT_PROGRESS = FALSE
credentials <- read_json('github_credentials.json')
PERSONAL_TOKEN <- credentials$PERSONAL_TOKEN
USER <- credentials$USER
# Format of github_credentials.json
# {
# "PERSONAL_TOKEN":"XXXXXXXXXXXXXXXXXXXXXXXXxx",
# "USER":"XXXXXXXXXXXXXXXXXXXXXX"
# }
# get your token at:
# https://github.com/settings/tokens
# you want to check the 'public repo access' at least.
if (USE_CACHE) {
  library(redux)
  redis <- redux::hiredis()
  get_results <- function(url) {
    cache <- redis$GET(url) # check redis first
    if (!is.null(cache)) {
      result <- unserialize(cache)
      if (result$status_code == 200) {
        if (REPORT_PROGRESS) {
          cat(paste("\nhitting redis for", url))
        }
        return(result)
      }
    }
    random_wait <- abs(rnorm(1, 1, 1))
    if (REPORT_PROGRESS) {
      cat(paste("\ncache miss, querying:", url, "\n"))
      cat(paste(
        "\nWaiting",
        random_wait,
        "seconds to be nice to webserver\n"
      ))
    }
    Sys.sleep(random_wait)
    result <- GET(url, config = authenticate(user = USER,
                                             password = PERSONAL_TOKEN))
    redis$SET(url, serialize(result, NULL))
    return(result)
  }
}
# package names below ('org/repo'), look them up at github.com,
# e.g. https://github.com/stan-dev/stanc3 -> 'stan-dev/stanc3'
# packages = c('stan-dev/cmdstan', 'stan-dev/stan', 'stan-dev/rstanarm',
#              'stan-dev/cmdstanpy')
packages = c('stan-dev/rstan', 'stan-dev/math',
             'stan-dev/stan', 'stan-dev/stanc3', 'stan-dev/pystan',
             'stan-dev/rstanarm', 'arviz-devs/arviz', 'pymc-devs/pymc3')
#packages =c('pymc-devs/pymc3','stan-dev/rstanarm', 'stan-dev/rstan','arviz-devs/arviz')
#packages = c('stan-dev/rstanarm')
packageDataDf = data.frame()
for (i in 1:length(packages)) {
  page = 1L
  while(TRUE) {
    url <- paste('https://api.github.com/repos/', packages[i],
                 '/pulls?state=all&page=', as.character(page), sep='')
    if (USE_CACHE) {
      result <- get_results(url)
    } else {
      result <- GET(url, config = authenticate(user = USER,
                                               password = PERSONAL_TOKEN))
    }
    if (result$status_code == 200) {
      jsonTxt <- rawToChar(as.raw(strtoi(result$content, 16L)))
      newDataDf <- jsonlite::fromJSON(jsonTxt)
      n <- nrow(newDataDf)
      newDataLongDf <- data.frame(package = rep(packages[i], n),
                                  created = as.Date(newDataDf$created_at),
                                  closed = as.Date(newDataDf$closed_at))
      packageDataDf <- rbind(packageDataDf, newDataLongDf)
      # last page: no 'next' relation in the Link header (the header is absent
      # entirely when everything fits on one page)
      if (is.null(result$headers$link) ||
          !str_detect(result$headers$link, 'next')) {
        break
      }
      page <- page + 1
      # print(sprintf("doing page %d", page))
    } else {
      print(paste("Error", result))
      stop()
    }
  }
}
packageDataDf$age <- packageDataDf$closed - packageDataDf$created
packageDataDf$count <- 1 # one row per PR, used for the counts below and the plot
packageDataDf$org = str_extract(packageDataDf$package, "([^/]+)") #get org level
for (org in unique(packageDataDf$org)) {
  print(sprintf("Organization %s has mean aging to closure of %.0f days for %d PRs",
                as.character(org),
                mean(packageDataDf[packageDataDf$org == org,]$age, na.rm = TRUE),
                sum(packageDataDf[packageDataDf$org == org,]$count, na.rm = TRUE)))
  print(sprintf("Mean open PR length is %.0f days from %s for %d PRs",
                mean(today(tzone = "UTC") -
                       packageDataDf[packageDataDf$org == org &
                                       is.na(packageDataDf$closed),]$created),
                today(tzone = "UTC"),
                sum(packageDataDf[packageDataDf$org == org &
                                    is.na(packageDataDf$closed),]$count)))
}
[1] "Organization stan-dev has mean aging to closure of 20 days for 0 PRs"
[1] "Mean open PR length is 215 days from 2021-05-17 for 0 PRs"
[1] "Organization arviz-devs has mean aging to closure of 7 days for 0 PRs"
[1] "Mean open PR length is 106 days from 2021-05-17 for 0 PRs"
[1] "Organization pymc-devs has mean aging to closure of 17 days for 0 PRs"
[1] "Mean open PR length is 69 days from 2021-05-17 for 0 PRs"
library(tidyverse)
library(ggrepel)
packageDataDf$floor_date_created = floor_date(packageDataDf$created, 'halfyear')
packageDataDf$count = 1
packageDataDf = packageDataDf %>% mutate(orgPrStat = if_else(is.na(closed),
paste(org,"open",
sep = '_'),
paste(org,"closed",
sep = '_')))
yearlyPackageDf = packageDataDf %>%
group_by(floor_date_created, orgPrStat) %>%
summarize(PR_count = sum(count))
orgLabels = c()
orgs = unique(yearlyPackageDf$orgPrStat)
for (i in 1:length(orgs)) {
  orgVal = orgs[i]
  if (endsWith(orgVal, "closed")) {
    meanVal = mean(packageDataDf[packageDataDf$orgPrStat == orgVal,]$closed -
                     packageDataDf[packageDataDf$orgPrStat == orgVal,]$created)
    orgLabels[i] = sprintf("%s: mean days to closure = %.0f", orgVal, meanVal)
  } else {
    meanVal = mean(today(tzone = "UTC") -
                     packageDataDf[packageDataDf$orgPrStat == orgVal,]$created)
    orgLabels[i] = sprintf("%s: mean days open = %.0f", orgVal, meanVal)
  }
}
agingPlot = ggplot(data = yearlyPackageDf, aes(
x = floor_date_created,
y = PR_count,
group = orgPrStat,
color = orgPrStat
)) +
scale_x_date(limits = as.Date(c(min(yearlyPackageDf$floor_date_created),
"2021-01-01"))) +
geom_line() +
scale_color_discrete(name = "Organization PR aging",
breaks = orgs,
labels = orgLabels) +
labs(x = "Semi-annual PR creation counts",
y = "Pull request count"
)
print(agingPlot)
Plot downloads from RStudio’s CRAN mirror for specified packages. Baseline packages are included to chart relative growth.
library(cranlogs)
library(ggplot2)
library(dplyr)
library(lubridate)
library(httr)
library(jsonlite)
library(stringr)
library(R.cache)
library(ggrepel)
#https://www.ubuntupit.com/best-r-machine-learning-packages/
packages <- c('rstan','lme4','Rcpp','randomForest','coda','glmnet','caret','mlr3','e1071','Rpart','KernLab','mlr','arules','mboost')
packages <- c('ggplot2','lme4','rstan','rstanarm','brms')
dls <- cran_downloads(
packages = packages,
from ="2016-01-01",
to = "2021-04-30"
)
# map to month from day data
mls <- dls %>% mutate(month=floor_date(date, "monthly")) %>%
group_by(month,package) %>%
summarize(monthly_downloads=sum(count))
# mls data check, don't trust the above
mls_val = (mls %>% filter(package=='rstan') %>%
filter(month=='2018-02-01'))$monthly_downloads
dls_val = sum(cran_downloads(packages = c('rstan'),
from ="2018-02-01", to = "2018-02-28")$count)
if (mls_val != dls_val) {
stop(sprintf(paste("Problems with data, expect computed monthly total",
"mls_val=%d and more simply computed monthly total dls_val=%d to be equal"),
mls_val, dls_val))
}
label_month <- max(mls$month)
mls_label <- mls %>%
mutate(label=if_else(month == label_month,
str_replace(package,
'ggplot2',
'BASELINE ggplot2'),
NA_character_))
plot1 <- ggplot(data=mls_label,
aes(x=month, y=monthly_downloads, color=package,
group=package)) +
geom_line()
b_plot1 <- ggplot(data=mls_label,
aes(x=as.numeric(month), y=log(monthly_downloads), color=package,
group=package)) +
geom_line()
log_plot1 <- plot1 + scale_y_continuous(breaks=c(0,100,1000,10000,100000,1000000),
trans = scales::log_trans())
log_plot1_display <- log_plot1 +
geom_smooth(method='lm',formula=y~x, fullrange=TRUE, se=FALSE) +
geom_label_repel(aes(label = label), na.rm = TRUE) +
scale_color_discrete(guide = FALSE)
log_plot1_2024_scale <- log_plot1 +
xlim(as.Date('2016-01-01'),as.Date('2024-06-30'))
log_plot1_2024_slopes_display <- log_plot1_2024_scale +
geom_smooth(method='lm',formula=y~x, fullrange=TRUE, se=FALSE) +
geom_label_repel(aes(label = label), na.rm = TRUE) +
scale_color_discrete(guide = FALSE)
print(log_plot1_2024_slopes_display)
Regression lines shown to express relative growth rates for Stan ecosystem components compared to ggplot2 and lme4.
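If you want the numbers behind those regression lines rather than eyeballing slopes, a sketch like the one below (my addition, assuming the mls data frame from the chunk above is still in the environment) fits a per-package log-linear model and reports the implied annual growth factor.
# per-package log-linear fit of monthly downloads; the slope is per day since month is a Date
growth_rates <- mls %>%
  filter(monthly_downloads > 0) %>% # guard against log(0)
  group_by(package) %>%
  summarize(slope_per_day = coef(lm(log(monthly_downloads) ~ as.numeric(month)))[2]) %>%
  mutate(annual_growth_factor = exp(slope_per_day * 365)) # >1 means year-over-year growth
print(growth_rates)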
Entirely taken from https://blog.revolutionanalytics.com/2014/12/a-reproducible-r-example-finding-the-most-popular-packages-using-the-pagerank-algorithm.html.
A version that worked over the dependencies in PyPi might be useful.
This takes a long time to run so no output shown.
library(miniCRAN)
library(igraph)
library(magrittr)
# taken entirely from: https://blog.revolutionanalytics.com/2014/12/a-reproducible-r-example-finding-the-most-popular-packages-using-the-pagerank-algorithm.html
#need to change date to yesterday
MRAN <- "http://mran.revolutionanalytics.com/snapshot/2021-05-06/"
pdb <- MRAN %>%
contrib.url(type = "source") %>%
available.packages(type="source", filters = NULL)
g <- pdb[, "Package"] %>%
makeDepGraph(availPkgs = pdb, suggests=FALSE, enhances=FALSE, includeBasePkgs = FALSE)
pr <- g %>%
page.rank(directed = FALSE) %>%
use_series("vector") %>%
sort(decreasing = TRUE) %>%
as.matrix %>%
set_colnames("page.rank")
set.seed(42)
pr %>%
head(100) %>%
rownames %>%
makeDepGraph(pdb) %>%
plot(main="Top packages by page rank", cex=0.5)
print(sprintf("Rstan is %dth highest page rank score for R package dependencies out of %d packages", which(row.names(pr) == 'rstan'), nrow(pr)))