Exploring Github Topics

R
Github
Scrape Github for Repos of Interest
Author

Bryan Hanson

Published

January 25, 2020

As the code driving FOSS for Spectroscopy has matured, I began to think about how to explore Github in a systematic way for additional repositories with tools for spectroscopy. It turns out that a Github repo can have topics assigned to it, and you can use the Github API to search them. Wait, what? I didn’t know one could add topics to a repo, even though there is a little invite right there under the repo name:

[Image: Repo Topics]

Naturally I turned to StackOverflow to find out how to do this, and quickly encountered this question. It was asked when the topics feature was new, so one needs to do things just a bit differently now, but there is a way forward.

Before we get to implementation, let’s think about limitations:

Let’s get to it! First, create a Github access token on your local machine using the instructions in this gist. Next, load the needed libraries:

set.seed(123)
library("httr")
library("knitr")
library("kableExtra")
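
The search code below expects a token object called github_token to exist. If you would rather use a plain personal access token, here is a minimal sketch of one alternative. It assumes the token is stored in an environment variable named GITHUB_PAT (a common convention, not something taken from the gist), and the helper name github_auth_header is made up for illustration.

```r
# Assumption: a personal access token lives in the environment variable
# GITHUB_PAT. Both the variable name and this helper are illustrative.
github_auth_header <- function(pat = Sys.getenv("GITHUB_PAT")) {
  if (!nzchar(pat)) {
    stop("No token found; set the GITHUB_PAT environment variable first.")
  }
  # GitHub accepts a personal access token in the Authorization header
  c(Authorization = paste("token", pat))
}

# Hypothetical usage with httr (not run here, since it needs network access):
# request <- httr::GET(search_string,
#                      httr::add_headers(.headers = github_auth_header()))
```

This replaces the config(token = github_token) argument used in the loop further down; everything else stays the same.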

Specify your desired search terms, and create a list structure to hold the results:

search_terms <- c("NMR", "infrared", "raman", "ultraviolet", "visible", "XRF", "spectroscopy")
results <- list()

Create the string needed to access the Github API, then GET the results, and stash them in the list we created:

nt <- length(search_terms) # nt = no. of search terms
for (i in seq_len(nt)) {
  search_string <- paste0("https://api.github.com/search/repositories?q=topic:", search_terms[i])
  # github_token holds the access token created earlier
  request <- GET(search_string, config(token = github_token))
  stop_for_status(request) # converts http errors to R errors
  results[[i]] <- content(request)
}
names(results) <- search_terms
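
One thing to be aware of: the search endpoint returns results in pages, with 30 items per page by default and at most 100 when you pass per_page. Here is a small sketch of a URL builder that asks for the larger page and also percent-encodes the query. The function name build_topic_url is hypothetical, and this is only one way to assemble the string.

```r
# Hypothetical helper: build a topic search URL using base R only.
build_topic_url <- function(term, per_page = 100L) {
  stopifnot(is.character(term), length(term) == 1L,
            per_page >= 1, per_page <= 100)
  # reserved = TRUE also encodes ":" and spaces, so multi-word terms are safe
  query <- utils::URLencode(paste0("topic:", term), reserved = TRUE)
  paste0("https://api.github.com/search/repositories?q=", query,
         "&per_page=", per_page)
}

build_topic_url("NMR")
# "https://api.github.com/search/repositories?q=topic%3ANMR&per_page=100"
```

The result can be passed to GET() in place of search_string. Topics with more hits than fit on one page would still need additional requests using the page parameter.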

Figure out how many results we have found, set up a data frame and then put the results into the table. The i, j, and k counters required a little experimentation to get right, as content(request) returns a deeply nested list and only certain items are desired.

nr <- 0L # nr = no. of responses
for (i in 1:nt) { # compute total number of results/items found
    nr <- nr + length(results[[i]]$items)
}

DF <- data.frame(term = rep(NA_character_, nr),
  repo_name = rep(NA_character_, nr),
  repo_url = rep(NA_character_, nr),
  stringsAsFactors = FALSE)

k <- 1L
for (i in seq_len(nt)) {
  ni <- length(results[[i]]$items) # ni = no. of items
  for (j in seq_len(ni)) { # seq_len() skips the loop when a term has no hits
    DF$term[k] <- names(results)[[i]]
    DF$repo_name[k] <- results[[i]]$items[[j]]$name
    DF$repo_url[k] <- results[[i]]$items[[j]]$html_url
    k <- k + 1L
  }
}
# remove duplicated repos, which result when a repo carries several of our
# search terms of interest; logical indexing is safe even with no duplicates
DF <- DF[!duplicated(DF$repo_name), ]
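
To make the shape of the nested list easier to see, here is a sketch that flattens a small mock of it without explicit counters. The mock values are invented, the function name flatten_results is hypothetical, and unlike the loop above it removes duplicates by URL rather than by name, since two owners can have repos with the same name.

```r
# Mock of the structure returned by content(request); all values are invented.
mock_results <- list(
  NMR = list(items = list(
    list(name = "repo-a", html_url = "https://example.org/repo-a"),
    list(name = "repo-b", html_url = "https://example.org/repo-b")
  )),
  raman = list(items = list(
    list(name = "repo-b", html_url = "https://example.org/repo-b")
  )),
  XRF = list(items = list()) # a term with no hits
)

# Hypothetical helper: one small data frame per search term, then stack them.
flatten_results <- function(res) {
  rows <- lapply(names(res), function(term) {
    items <- res[[term]]$items
    data.frame(
      term = rep(term, length(items)),
      repo_name = vapply(items, function(x) x$name, character(1)),
      repo_url = vapply(items, function(x) x$html_url, character(1)),
      stringsAsFactors = FALSE
    )
  })
  DF <- do.call(rbind, rows)
  # keep the first occurrence of each repo URL
  DF[!duplicated(DF$repo_url), , drop = FALSE]
}

flat <- flatten_results(mock_results)
```

With the mock above, flat has two rows: repo-b turns up under both NMR and raman, and only its first occurrence is kept.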

Now put it all in a table we can inspect manually, send to a web page so it’s clickable, or potentially write out as a csv (for a csv you would probably want to write the results out a bit differently). In this case I want the results as a table in a web page so I can click the repo links and go straight to them.

namelink <- paste0("[", DF$repo_name, "](", DF$repo_url, ")")
DF2 <- data.frame(DF$term, namelink, stringsAsFactors = FALSE)
names(DF2) <- c("Search Term", "Link to Repo")
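
If a csv is what you are after, one option is to skip the markdown links and write the plain columns instead. A minimal sketch, using an invented stand-in for DF and a temporary file so nothing in your working directory is touched:

```r
# Invented stand-in for the DF built above, with the same column names.
DF_demo <- data.frame(
  term = c("NMR", "raman"),
  repo_name = c("repo-a", "repo-b"),
  repo_url = c("https://example.org/repo-a", "https://example.org/repo-b"),
  stringsAsFactors = FALSE
)

csv_file <- tempfile(fileext = ".csv")
# row.names = FALSE keeps R's row numbers out of the file
write.csv(DF_demo, csv_file, row.names = FALSE)

# read it back to confirm the round trip
round_trip <- read.csv(csv_file, stringsAsFactors = FALSE)
```

Keeping the URL as a plain column means a spreadsheet program can treat it as a link without any markdown clutter.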

We’ll show just 10 random rows as an example:

keep <- sample(seq_len(nrow(DF2)), min(10L, nrow(DF2))) # at most 10 rows
options(knitr.kable.NA = '')
kable(DF2[keep, ]) %>%
  kable_styling(c("striped", "bordered"))

      Search Term    Link to Repo
31    infrared       pycroscopy
79    ultraviolet    woudc-data-registry
51    infrared       ir-repeater
14    NMR            spectra-data
67    raman          Raman-spectra
42    infrared       PrecIR
50    infrared       esp32-room-control-panel
118   spectroscopy   LiveViewLegacy
43    infrared       arduino-primo-tutorials
101   XRF            web_geochemistry

Obviously, these results must be inspected carefully as terms like “infrared” will pick up projects that deal with infrared remote control of robots and so forth. As far as my case goes, I have a lot of new material to look through…

A complete .Rmd file that carries out the search described above, and has a few enhancements, can be found at this gist.

Citation

BibTeX citation:
@online{hanson2020,
  author = {Hanson, Bryan},
  title = {Exploring {Github} {Topics}},
  date = {2020-01-25},
  url = {http://chemospec.org/posts/2020-01-25-GH-Topics/2020-01-25-GH-Topics.html},
  langid = {en}
}
For attribution, please cite this work as:
Hanson, Bryan. 2020. “Exploring Github Topics.” January 25, 2020. http://chemospec.org/posts/2020-01-25-GH-Topics/2020-01-25-GH-Topics.html.