Exploring Github Topics

Github

Scrape Github for Repos of Interest

Author

Bryan Hanson

Published

January 25, 2020

As the code driving FOSS for Spectroscopy matured, I began to think about how to explore Github in a systematic way for additional repositories with tools for spectroscopy. It turns out that a Github repo can have topics assigned to it, and you can use the Github API to search them. Wait, what? I didn’t know one could add topics to a repo, even though there is a little invitation to do so right there under the repo name.

Naturally I turned to StackOverflow to find out how to do this, and quickly encountered this question. It was asked when the topics feature was new, so one needs to do things just a bit differently now, but there is a way forward.

Before we get to implementation, let’s think about limitations:

This will only find repositories where topics have been set. I don’t know how broadly people use this feature; I had missed it when it was added.
Github topics are essentially tags with a controlled vocabulary, so for the best results you’ll need to manually explore the tags and then use these as your search terms.
The Github API only returns 30 results at a time for most types of queries. For our purposes this probably doesn’t matter much. The documentation explains how to iterate to get all possible results.
The Github API also limits the number of queries you can make to 60/hr unless you authenticate, in which case the limit goes to 5000/hr. The search endpoints additionally have their own, lower per-minute limits.
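To make the paging limitation concrete, here is a minimal sketch of how one might walk through all pages for a single topic. The helper names (build_search_url, fetch_page_httr, fetch_all_items) are hypothetical and not part of the original workflow. The sketch relies on the API's per_page and page query parameters, and on the fact that the search API returns at most 1000 results per query, which is why it stops after 10 pages of 100.

```r
# Build the search URL for one page of results for a single topic.
build_search_url <- function(term, page = 1L, per_page = 100L) {
  paste0("https://api.github.com/search/repositories?q=topic:", term,
         "&per_page=", per_page, "&page=", page)
}

# Default fetcher (hypothetical helper): an unauthenticated request via httr.
# Pass your own function if you want to add config(token = ...) for authentication.
fetch_page_httr <- function(url) {
  request <- httr::GET(url)
  httr::stop_for_status(request)
  httr::content(request)
}

# Collect the items from successive pages until a short page signals the end.
fetch_all_items <- function(term, per_page = 100L, max_pages = 10L,
                            fetch_page = fetch_page_httr) {
  items <- list()
  for (page in seq_len(max_pages)) {
    res <- fetch_page(build_search_url(term, page, per_page))
    items <- c(items, res$items)
    if (length(res$items) < per_page) break # fewer than a full page: done
  }
  items
}
```

Calling fetch_all_items("raman") would then return one list holding the items from every page. Passing the fetcher in as an argument is only a convenience, so the paging logic can be tried out without touching the network.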

Let’s get to it! First, create a Github access token on your local machine using the instructions in this gist. Next, load the needed libraries:

set.seed(123)
library("httr")
library("knitr")
library("kableExtra")
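The request code further down refers to an object named github_token, which the gist's instructions are meant to produce. For readers without the gist at hand, the following is a sketch of one common way to create such an object with httr's OAuth helpers. It is an assumption that the gist works this way. The function name make_github_token and the environment variable names are hypothetical, and you would first need to register an OAuth application with Github to obtain the key and secret.

```r
# Sketch only: create an OAuth token object usable as config(token = github_token).
# Assumption: the app's credentials live in environment variables with these
# (hypothetical) names rather than being hard-coded in the script.
make_github_token <- function(key = Sys.getenv("GITHUB_CLIENT_ID"),
                              secret = Sys.getenv("GITHUB_CLIENT_SECRET")) {
  if (!nzchar(key) || !nzchar(secret)) {
    stop("Set GITHUB_CLIENT_ID and GITHUB_CLIENT_SECRET before requesting a token.")
  }
  app <- httr::oauth_app("github", key = key, secret = secret)
  # Opens a browser window so you can authorize the app interactively
  httr::oauth2.0_token(httr::oauth_endpoints("github"), app)
}

# github_token <- make_github_token() # run interactively, once per session
```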

Specify your desired search terms, and create a list structure to hold the results:

search_terms <- c("NMR", "infrared", "raman", "ultraviolet", "visible", "XRF", "spectroscopy")
results <- list()

Create the string needed to access the Github API, then GET the results, and stash them in the list we created:

nt <- length(search_terms) # nt = no. of search terms
for (i in seq_len(nt)) {
  search_string <- paste0("https://api.github.com/search/repositories?q=topic:", search_terms[i])
  # github_token is the access token created by following the gist mentioned above
  request <- GET(search_string, config(token = github_token))
  stop_for_status(request) # converts http errors to R errors
  results[[i]] <- content(request)
}
names(results) <- search_terms

Figure out how many results we have found, set up a data frame and then put the results into the table. The i, j, and k counters required a little experimentation to get right, as content(request) returns a deeply nested list and only certain items are desired.

nr <- 0L # nr = no. of responses
for (i in 1:nt) { # compute total number of results/items found
    nr <- nr + length(results[[i]]$items)
}

DF <- data.frame(term = rep(NA_character_, nr),
  repo_name = rep(NA_character_, nr),
  repo_url = rep(NA_character_, nr),
  stringsAsFactors = FALSE)

k <- 1L
for (i in 1:nt) {
    ni <- length(results[[i]]$items) # ni = no. of items
    for (j in seq_len(ni)) { # seq_len() skips the loop when a term has no hits
        DF$term[k] <- names(results)[[i]]
        DF$repo_name[k] <- results[[i]]$items[[j]]$name
        DF$repo_url[k] <- results[[i]]$items[[j]]$html_url
        k <- k + 1L
    }
}
# remove duplicated repos which result when repos have several of our
# search terms of interest. Logical indexing is used because
# DF[-which(...), ] would drop every row if there were no duplicates.
DF <- DF[!duplicated(DF$repo_name), ]
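Because the nesting of content(request) is the fiddly part, it may help to see the same extraction on a small stand-in. The sketch below uses a made-up list, mock_results, with the same shape as the real results (term, then items, then name and html_url), and a hypothetical helper, flatten_results, that does the job of the i, j, k loops with lapply and vapply. One deliberate difference: it removes duplicates by URL rather than by name, so that two unrelated repos that happen to share a name would both be kept.

```r
# Mock data with the same shape as the real results list; the repository
# names and URLs are invented for illustration.
mock_results <- list(
  raman = list(items = list(
    list(name = "repo-a", html_url = "https://github.com/example/repo-a"),
    list(name = "repo-b", html_url = "https://github.com/example/repo-b"))),
  infrared = list(items = list(
    list(name = "repo-b", html_url = "https://github.com/example/repo-b"))),
  XRF = list(items = list()) # a term with no hits
)

# Turn the nested list into one data frame, one row per unique repo.
flatten_results <- function(results) {
  rows <- lapply(names(results), function(term) {
    items <- results[[term]]$items
    data.frame(
      term = rep(term, length(items)),
      repo_name = vapply(items, function(x) x$name, character(1)),
      repo_url = vapply(items, function(x) x$html_url, character(1)),
      stringsAsFactors = FALSE)
  })
  DF <- do.call(rbind, rows)
  DF[!duplicated(DF$repo_url), ] # keep the first term under which a repo appeared
}

flat <- flatten_results(mock_results)
```

Here flat has one row each for repo-a and repo-b, both filed under "raman", since that is the first term under which they appear.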

Now put it all in a table we can inspect manually, send to a web page so it’s clickable, or potentially write out as a csv (if you want a csv, you should probably write the results out a bit differently). In this case I want the results as a table in a web page so I can click the repo links and go straight to them.

namelink <- paste0("[", DF$repo_name, "](", DF$repo_url, ")")
DF2 <- data.frame(DF$term, namelink, stringsAsFactors = FALSE)
names(DF2) <- c("Search Term", "Link to Repo")
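For the csv route, one possible reading of "a bit differently" is to write the plain columns (term, name, URL) rather than the markdown links. A minimal sketch, using a small stand-in data frame, DF_demo, with invented rows and a temporary file so nothing is overwritten:

```r
# Stand-in for the DF built earlier; these rows are invented for illustration.
DF_demo <- data.frame(
  term = c("raman", "NMR"),
  repo_name = c("repo-a", "repo-b"),
  repo_url = c("https://github.com/example/repo-a",
               "https://github.com/example/repo-b"),
  stringsAsFactors = FALSE)

# Write plain URLs (not markdown links) so the file opens cleanly elsewhere.
csv_file <- file.path(tempdir(), "github_topics.csv")
write.csv(DF_demo, csv_file, row.names = FALSE)

# Read it back to confirm the round trip.
check <- read.csv(csv_file, stringsAsFactors = FALSE)
```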

We’ll show just 10 random rows as an example:

keep <- sample(1:nrow(DF2), 10)
options(knitr.kable.NA = '')
kable(DF2[keep, ]) %>%
  kable_styling(c("striped", "bordered"))

	Search Term	Link to Repo
31	infrared	pycroscopy
79	ultraviolet	woudc-data-registry
51	infrared	ir-repeater
14	NMR	spectra-data
67	raman	Raman-spectra
42	infrared	PrecIR
50	infrared	esp32-room-control-panel
118	spectroscopy	LiveViewLegacy
43	infrared	arduino-primo-tutorials
101	XRF	web_geochemistry

Obviously, these results must be inspected carefully as terms like “infrared” will pick up projects that deal with infrared remote control of robots and so forth. As far as my case goes, I have a lot of new material to look through…
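One way to cut down on such false positives, offered here as a sketch rather than something the search above does, is to require more than one topic at once. Github's search syntax treats several topic: qualifiers in the same query as a logical AND. The helper name build_topic_query is hypothetical.

```r
# Build a search URL that requires ALL of the given topics on a repo.
build_topic_query <- function(topics) {
  q <- paste0("topic:", topics, collapse = "+") # "+" encodes the space between qualifiers
  paste0("https://api.github.com/search/repositories?q=", q)
}

narrow_url <- build_topic_query(c("infrared", "spectroscopy"))
```

The resulting URL can be passed to GET() exactly like search_string in the loop above. The trade-off is that repos tagged only with "infrared" will be missed, so this narrows the search at the cost of coverage.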

A complete .Rmd file that carries out the search described above, and has a few enhancements, can be found at this gist.

Reuse

CC BY 4.0

Citation

BibTeX citation:

@online{hanson2020,
  author = {Hanson, Bryan},
  title = {Exploring {Github} {Topics}},
  date = {2020-01-25},
  url = {http://chemospec.org/posts/2020-01-25-GH-Topics/2020-01-25-GH-Topics.html},
  langid = {en}
}

For attribution, please cite this work as:

Hanson, Bryan. 2020. “Exploring Github Topics.” January 25, 2020. http://chemospec.org/posts/2020-01-25-GH-Topics/2020-01-25-GH-Topics.html.
