Exploring Github Topics
As the code driving FOSS for Spectroscopy has matured, I began to think about how to explore Github in a systematic way for additional repositories with tools for spectroscopy. It turns out that a Github repo can have topics assigned to it, and you can use the Github API to search them. Wait, what? I didn’t know one could add topics to a repo, even though there is a little invite right there under the repo name:
Naturally I turned to StackOverflow to find out how to do this, and quickly encountered this question. It was asked when the topics feature was new, so one needs to do things just a bit differently now, but there is a way forward.
Before we get to implementation, let’s think about limitations:
- This will only find repositories where topics have been set. I don’t know how broadly people use this feature; I had missed it myself when it was added.
- Github topics are essentially free-form tags without a controlled vocabulary, so for the best results you’ll need to manually explore the tags and then use these as your search terms.
- The Github API only returns 30 results at a time for most types of queries. For our purposes this probably doesn’t matter much. The documentation explains how to iterate to get all possible results.
- The Github API also limits the number of queries you can make to 60/hr unless you authenticate, in which case the limit goes to 5000/hr.
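To make the paging limitation concrete, here is a minimal sketch of how the page requests could be laid out. The helper names `build_topic_url` and `pages_needed` are hypothetical (they are not part of the original workflow); `per_page` and `page` are query parameters of the Github REST API, `total_count` is a field in each search response, and the sketch assumes the search endpoints expose at most the first 1000 hits.

```r
# Hypothetical helper (not from the original post): URL for one page of a
# topic search. per_page can be raised from the default of 30 up to 100.
build_topic_url <- function(term, page = 1L, per_page = 100L) {
  paste0("https://api.github.com/search/repositories?q=topic:", term,
         "&per_page=", per_page, "&page=", page)
}

# Hypothetical helper: number of pages needed to collect total_count hits,
# assuming only the first max_results hits can be retrieved.
pages_needed <- function(total_count, per_page = 100L, max_results = 1000L) {
  as.integer(ceiling(min(total_count, max_results) / per_page))
}

build_topic_url("raman", page = 2L)
pages_needed(250L)
```

One would then loop over `seq_len(pages_needed(...))` and issue one request per page, keeping the rate limits above in mind.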
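Let’s get to it! First, create a Github access token on your local machine using the instructions in this gist. Next, load the needed libraries (the code that follows relies on httr for the API calls, and on knitr and kableExtra for the table):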
Specify your desired search terms, and create a list structure to hold the results:
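A minimal sketch of this step follows. The variable names match the code further down, but the search terms themselves are an assumption: they are simply the topics that appear in the sample output table, not necessarily the full list used for the post.

```r
# Assumed search terms: taken from the sample output table, for illustration
search_terms <- c("NMR", "infrared", "raman", "ultraviolet", "XRF", "spectroscopy")

# One empty slot per search term, to be filled with the API responses
results <- vector("list", length(search_terms))
```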
Create the string needed to access the Github API, then GET the results, and stash them in the list we created:
```r
nt <- length(search_terms) # nt = no. of search terms
for (i in seq_len(nt)) {
  search_string <- paste0("https://api.github.com/search/repositories?q=topic:", search_terms[i])
  request <- GET(search_string, config(token = github_token))
  stop_for_status(request) # converts http errors to R errors
  results[[i]] <- content(request)
}
names(results) <- search_terms
```
Figure out how many results we have found, set up a data frame and then put the results into the table. The i, j, and k counters required a little experimentation to get right, as content(request) returns a deeply nested list and only certain items are desired.
```r
nr <- 0L # nr = no. of responses
for (i in seq_len(nt)) { # compute total number of results/items found
  nr <- nr + length(results[[i]]$items)
}

DF <- data.frame(term = rep(NA_character_, nr),
                 repo_name = rep(NA_character_, nr),
                 repo_url = rep(NA_character_, nr),
                 stringsAsFactors = FALSE)

k <- 1L
for (i in seq_len(nt)) {
  ni <- length(results[[i]]$items) # ni = no. of items
  for (j in seq_len(ni)) { # seq_len() skips the loop when a term has no hits
    DF$term[k] <- names(results)[[i]]
    DF$repo_name[k] <- results[[i]]$items[[j]]$name
    DF$repo_url[k] <- results[[i]]$items[[j]]$html_url
    k <- k + 1L
  }
}

# remove duplicated repos which result when repos have several of our
# search terms of interest; logical indexing is safe even if there are
# no duplicates (negative indexing with an empty which() would drop all rows)
DF <- DF[!duplicated(DF$repo_name), ]
```
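Since the nesting is the tricky part, it may help to see it on a small mock object. The sketch below is not the code used for the post: `flatten_results` is a hypothetical helper, and the repository names and URLs are invented. It assumes only the structure described above, namely that each search term holds an `items` list whose elements carry `name` and `html_url`.

```r
# Mock of the nested structure: res[[term]]$items[[j]]$name / $html_url
# (all names and URLs are invented for illustration)
mock <- list(
  NMR   = list(items = list(list(name = "repoA", html_url = "https://example.org/A"),
                            list(name = "repoB", html_url = "https://example.org/B"))),
  raman = list(items = list(list(name = "repoB", html_url = "https://example.org/B"))),
  XRF   = list(items = list()) # a term with no hits
)

# Hypothetical helper: one data frame per term, stacked, then de-duplicated
flatten_results <- function(res) {
  rows <- lapply(names(res), function(term) {
    items <- res[[term]]$items
    data.frame(term = rep(term, length(items)),
               repo_name = vapply(items, function(x) x$name, character(1)),
               repo_url = vapply(items, function(x) x$html_url, character(1)),
               stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, rows)
  out[!duplicated(out$repo_name), ] # keep the first hit for each repo
}

DF_mock <- flatten_results(mock)
```

Here `repoB` is found under two terms but kept only once, under the first term that returned it. Logical indexing with `!duplicated()` also behaves well when nothing is duplicated, whereas negative indexing with an empty `which()` result selects zero rows.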
Now put it all in a table we can inspect manually, send to a web page so it’s clickable, or potentially write out as a csv (if you want a csv you should probably write the results out a bit differently). In this case I want the results as a table in a web page so I can click the repo links and go straight to them.
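The table code that follows works on a data frame called DF2 whose construction is not shown in the text. One plausible way to build it, offered as a sketch rather than the code actually used, is to turn each repo name and URL into a markdown link. The helper name `make_link_table` and the demo values are hypothetical; the column headings are taken from the output table.

```r
# Hypothetical helper: pair each search term with a markdown link of the
# form [name](url), which should become clickable once knit to a web page
make_link_table <- function(df) {
  namelink <- paste0("[", df$repo_name, "](", df$repo_url, ")")
  out <- data.frame(df$term, namelink, stringsAsFactors = FALSE)
  names(out) <- c("Search Term", "Link to Repo")
  out
}

# Stand-in data so the snippet runs on its own (invented values)
demo <- data.frame(term = c("NMR", "raman"),
                   repo_name = c("repoA", "repoB"),
                   repo_url = c("https://example.org/A", "https://example.org/B"),
                   stringsAsFactors = FALSE)
make_link_table(demo)

# With the real results from the previous step:
# DF2 <- make_link_table(DF)
```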
We’ll show just 10 random rows as an example:
```r
keep <- sample(1:nrow(DF2), 10)
options(knitr.kable.NA = '')
kable(DF2[keep, ]) %>%
  kable_styling(c("striped", "bordered"))
```
| | Search Term | Link to Repo |
|---|---|---|
| 31 | infrared | pycroscopy |
| 79 | ultraviolet | woudc-data-registry |
| 51 | infrared | ir-repeater |
| 14 | NMR | spectra-data |
| 67 | raman | Raman-spectra |
| 42 | infrared | PrecIR |
| 50 | infrared | esp32-room-control-panel |
| 118 | spectroscopy | LiveViewLegacy |
| 43 | infrared | arduino-primo-tutorials |
| 101 | XRF | web_geochemistry |
Obviously, these results must be inspected carefully as terms like “infrared” will pick up projects that deal with infrared remote control of robots and so forth. As far as my case goes, I have a lot of new material to look through…
A complete .Rmd file that carries out the search described above, and has a few enhancements, can be found at this gist.
Citation
@online{hanson2020,
author = {Hanson, Bryan},
title = {Exploring {Github} {Topics}},
date = {2020-01-25},
url = {http://chemospec.org/posts/2020-01-25-GH-Topics/2020-01-25-GH-Topics.html},
langid = {en}
}