Back in Part 2 I mentioned some of the challenges of learning linear algebra. One of those challenges is making sense of all the special types of matrices one encounters. In this post I hope to shed a little light on that topic.
I am strongly drawn to thinking in terms of categories and relationships. I find visual presentations like phylogenies showing the relationships between species very useful. In the course of my linear algebra journey, I came across an interesting Venn diagram developed by the very creative thinker Kenji Hiranabe. The diagram is discussed at Matrix World, but the latest version is at the GitHub link. A Venn diagram is a useful format, but I was inspired to recast the information in a different format. Figure 1 shows a taxonomy I created using a portion of the information in Hiranabe’s Venn diagram.^{1} The taxonomy is primarily organized around what I am calling the structure of a matrix: what does it look like upon visual inspection? Of course this is most obvious with small matrices. To me at least, structure is one of the most obvious characteristics of a matrix: an upper triangular matrix really stands out, for instance. Secondarily, the taxonomy includes a number of queries that one can ask about a matrix: for instance, is the matrix invertible? We’ll need to expand on all of this of course, but first take a look at the figure.^{2}
Let’s use R to construct and inspect examples of each type of matrix. We’ll use integer matrices to keep the print output nice and neat, but of course real numbers could be used as well.^{3} Most of these are pretty straightforward so we’ll keep comments to a minimum for the simple cases.
A_rect <- matrix(1:12, nrow = 3) # if you give nrow,
A_rect # R will compute ncol from the length of the data
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
Notice that R is “column major”, meaning data fills the first column, then the second column, and so forth.
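As a small sketch of the column-major point, the same data can be filled row by row instead by passing byrow = TRUE to matrix(). The object names by_col and by_row are chosen only for illustration.

```r
# Same data, two fill orders (object names are illustrative only)
by_col <- matrix(1:12, nrow = 3)               # default: fill down the columns
by_row <- matrix(1:12, nrow = 3, byrow = TRUE) # fill across the rows instead
by_col[1, ] # first row when filled by column
by_row[1, ] # first row when filled by row
```

Comparing the first rows makes the difference in fill order easy to see.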
A_row <- matrix(1:4, nrow = 1)
A_row
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
A_col <- matrix(1:4, ncol = 1)
A_col
[,1]
[1,] 1
[2,] 2
[3,] 3
[4,] 4
Keep in mind that to save space in a text-dense document one would often write A_col as its transpose.^{4}
A_sq <- matrix(1:9, nrow = 3)
A_sq
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
Creating an upper triangular matrix requires a few more steps. Function upper.tri() returns a logical matrix which can be used as a mask to select entries. Function lower.tri() can be used similarly. Both functions have an argument diag = TRUE/FALSE indicating whether to include the diagonal.^{5}
upper.tri(A_sq, diag = TRUE)
[,1] [,2] [,3]
[1,] TRUE TRUE TRUE
[2,] FALSE TRUE TRUE
[3,] FALSE FALSE TRUE
A_upper <- A_sq[upper.tri(A_sq)] # upper.tri() gives a logical matrix, used here as a mask
A_upper # notice that a vector is returned, not quite what might have been expected!
[1] 4 7 8
A_upper <- A_sq # instead, create a copy to be modified
A_upper[lower.tri(A_upper)] <- 0L # assign the lower entries to zero
A_upper
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 0 5 8
[3,] 0 0 9
Notice that to create an upper triangular matrix we use lower.tri() to assign zeros to the lower part of an existing matrix.
If you give diag() a single value it defines the dimensions and creates a matrix with ones on the diagonal, in other words, an identity matrix.
A_ident <- diag(4)
A_ident
[,1] [,2] [,3] [,4]
[1,] 1 0 0 0
[2,] 0 1 0 0
[3,] 0 0 1 0
[4,] 0 0 0 1
If instead you give diag() a vector of values these go on the diagonal and the length of the vector determines the dimensions.
A_diag <- diag(1:4)
A_diag
[,1] [,2] [,3] [,4]
[1,] 1 0 0 0
[2,] 0 2 0 0
[3,] 0 0 3 0
[4,] 0 0 0 4
Matrices created by diag() are symmetric matrices, but any matrix $\mathbf{A}$ where $\mathbf{A} = \mathbf{A}^\mathsf{T}$ is symmetric. There is no general function to create symmetric matrices since there is no way to know what data should be used. However, one can ask if a matrix is symmetric, using the function isSymmetric().
isSymmetric(A_diag)
[1] TRUE
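Although there is no general constructor, one common way to obtain a symmetric matrix from arbitrary data is to multiply a matrix by its own transpose. The sketch below assumes nothing beyond base R; the names B and S are chosen only for illustration.

```r
# B and S are illustrative names, not objects used elsewhere in this post
B <- matrix(1:6, nrow = 2)  # any 2 x 3 matrix
S <- B %*% t(B)             # a 2 x 2 matrix, symmetric by construction
isSymmetric(S)
all(S == t(S))              # the definition of symmetry, checked directly
```

Both checks should agree, since isSymmetric() is testing the same condition up to a numerical tolerance.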
Let’s take the queries in the taxonomy in order, as the hierarchy is everything.
A singular matrix is one in which one or more rows are linear combinations of the other rows (a row that is a multiple of another row is the simplest case), or equivalently, one or more columns are linear combinations of the other columns. Why do we care? Well, it turns out a singular matrix is a bit of a dead end; you can’t do much with it. An invertible matrix, however, is a very useful entity and has many applications. What is an invertible matrix? In simple terms, being invertible means the matrix has an inverse. This is not the same as the algebraic definition of an inverse, which is related to division:
$$x^{-1} = \frac{1}{x} \tag{1}$$
Instead, for matrices, invertibility of $\mathbf{A}$ is defined as the existence of another matrix $\mathbf{B}$ such that
$$\mathbf{A}\mathbf{B} = \mathbf{B}\mathbf{A} = \mathbf{I} \tag{2}$$
Just as $x^{-1}$ cancels out $x$ in $x^{-1}x = 1$, $\mathbf{B}$ cancels out $\mathbf{A}$ to give the identity matrix. In other words, $\mathbf{B}$ is really $\mathbf{A}^{-1}$.
A singular matrix has a determinant of zero. On the other hand, an invertible matrix has a nonzero determinant. So to determine which type of matrix we have before us, we can simply compute the determinant.
Let’s look at a few simple examples.
A_singular <- matrix(c(1, 2, 3, 6), nrow = 2, ncol = 2)
A_singular # notice that col 2 is col 1 * 3, they are not independent
[,1] [,2]
[1,] 1 3
[2,] 2 6
det(A_singular)
[1] 0
A_invertible <- matrix(c(2, 2, 7, 8), nrow = 2, ncol = 2)
A_invertible
[,1] [,2]
[1,] 2 7
[2,] 2 8
det(A_invertible)
[1] 2
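To connect the nonzero determinant back to the definition of invertibility, here is a minimal sketch that computes the inverse with solve() and checks that it cancels the original matrix. The names A and A_inv are illustrative; A simply repeats the values of A_invertible.

```r
A <- matrix(c(2, 2, 7, 8), nrow = 2, ncol = 2) # same values as A_invertible
A_inv <- solve(A)                 # with one argument, solve() returns the inverse
A_inv %*% A                       # the 2 x 2 identity, up to rounding
all.equal(A_inv %*% A, diag(2))   # compare with a tolerance, not exactly
```

Using all.equal() rather than == avoids being misled by tiny floating point differences.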
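A matrix $\mathbf{A}$ that is diagonalizable can be expressed as:
$$\mathbf{A} = \mathbf{P}\mathbf{D}\mathbf{P}^{-1} \tag{3}$$
where $\mathbf{D}$ is a diagonal matrix, the diagonalized version of the original matrix $\mathbf{A}$. How do we find out if this is possible, and if possible, what are the values of $\mathbf{P}$ and $\mathbf{D}$? The answer is to decompose $\mathbf{A}$ using the eigendecomposition:
$$\mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{-1} \tag{4}$$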
Now there is a lot to know about the eigendecomposition, but for now let’s just focus on a few key points:
We can answer the original question by using the eigen() function in R. Let’s do an example.
A_eigen <- matrix(c(1, 0, 2, 2, 3, -4, 0, 0, 2), ncol = 3)
A_eigen
[,1] [,2] [,3]
[1,] 1 2 0
[2,] 0 3 0
[3,] 2 -4 2
eA <- eigen(A_eigen)
eA
eigen() decomposition
$values
[1] 3 2 1
$vectors
[,1] [,2] [,3]
[1,] 0.4082483 0 0.4472136
[2,] 0.4082483 0 0.0000000
[3,] -0.8164966 1 -0.8944272
Since eigen(A_eigen) returned three distinct eigenvalues, the corresponding eigenvectors are linearly independent and we can conclude that A_eigen is diagonalizable. You can see the eigenvalues and eigenvectors in the returned value. We can reconstruct A_eigen using Equation 4:
eA$vectors %*% diag(eA$values) %*% solve(eA$vectors)
[,1] [,2] [,3]
[1,] 1 2 0
[2,] 0 3 0
[3,] 2 -4 2
Remember, diag() creates a matrix with the values along the diagonal, and solve() computes the inverse when it gets only one argument.
The only loose end is: which matrices are not diagonalizable? These are covered in this Wikipedia article. Briefly, most non-diagonalizable matrices are fairly exotic, and real data sets are unlikely to produce them.
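A normal matrix is one where $\mathbf{A}^\mathsf{T}\mathbf{A} = \mathbf{A}\mathbf{A}^\mathsf{T}$. As far as I know, there is no function in R to check this condition, but we’ll write our own in a moment. One reason being “normal” is interesting is if $\mathbf{A}$ is a normal matrix, then the results of the eigendecomposition change slightly:
$$\mathbf{A} = \mathbf{O}\mathbf{\Lambda}\mathbf{O}^\mathsf{T} \tag{5}$$
where $\mathbf{O}$ is an orthogonal matrix, which we’ll talk about next.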
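A symmetric matrix is the most familiar kind of normal matrix, and for it the transposed form of the eigendecomposition can be checked directly. This is a minimal sketch; S, e and O are names chosen for illustration.

```r
# S, e and O are illustrative names
S <- matrix(c(2, 1, 1, 3), nrow = 2)            # a small symmetric matrix
e <- eigen(S)
O <- e$vectors                                  # eigenvectors as columns
all.equal(O %*% diag(e$values) %*% t(O), S)     # transpose stands in for the inverse
all.equal(crossprod(O), diag(2))                # the columns of O are orthonormal
```

Note that t(O) is used where the general eigendecomposition would need solve(O).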
An orthogonal matrix takes the definition of a normal matrix one step further: $\mathbf{A}^\mathsf{T}\mathbf{A} = \mathbf{A}\mathbf{A}^\mathsf{T} = \mathbf{I}$. If a matrix is orthogonal, then its transpose is equal to its inverse: $\mathbf{A}^\mathsf{T} = \mathbf{A}^{-1}$, which of course makes any special computation of the inverse unnecessary. This is a significant advantage in computations.
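A 2D rotation matrix is a convenient orthogonal matrix to experiment with. The sketch below checks both properties just described; the angle theta and the name Q are arbitrary choices made for illustration.

```r
theta <- pi / 6  # an arbitrary angle, chosen only for illustration
Q <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), nrow = 2)
all.equal(t(Q) %*% Q, diag(2))  # the product with the transpose is the identity
all.equal(t(Q), solve(Q))       # the transpose equals the inverse
```

Any other angle should behave the same way, since every rotation matrix is orthogonal.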
To aid our learning, let’s write a simple function that will report if a matrix is normal, orthogonal, or neither.^{8}
normal_or_orthogonal <- function(M) {
  if (!inherits(M, "matrix")) stop("M must be a matrix")
  norm <- orthog <- FALSE
  tst1 <- M %*% t(M)
  tst2 <- t(M) %*% M
  norm <- isTRUE(all.equal(tst1, tst2))
  if (norm) orthog <- isTRUE(all.equal(tst1, diag(dim(M)[1])))
  if (orthog) message("This matrix is orthogonal\n") else
    if (norm) message("This matrix is normal\n") else
      message("This matrix is neither orthogonal nor normal\n")
  invisible(NULL)
}
And let’s run a couple of tests.
normal_or_orthogonal(A_singular)
This matrix is neither orthogonal nor normal
Norm <- matrix(c(1, 0, 1, 1, 1, 0, 0, 1, 1), nrow = 3)
normal_or_orthogonal(Norm)
This matrix is normal
normal_or_orthogonal(diag(3)) # the identity matrix is orthogonal
This matrix is orthogonal
Orth <- matrix(c(0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0), nrow = 4)
normal_or_orthogonal(Orth)
This matrix is orthogonal
Taking these queries together, we see that symmetric and diagonal matrices are necessarily diagonalizable and normal, and they are invertible whenever their determinant is nonzero. They are not, however, orthogonal in general. Identity matrices, on the other hand, have all these properties. Let’s double-check these statements.
A_sym <- matrix(
c(1, 5, 4, 5, 2, 9, 4, 9, 3),
ncol = 3) # symmetric matrix, not diagonal
A_sym
[,1] [,2] [,3]
[1,] 1 5 4
[2,] 5 2 9
[3,] 4 9 3
normal_or_orthogonal(A_sym)
This matrix is normal
normal_or_orthogonal(diag(1:3)) # diagonal matrix, symmetric, but not the identity matrix
This matrix is normal
normal_or_orthogonal(diag(3)) # identity matrix (also symmetric, diagonal)
This matrix is orthogonal
So what’s the value of these queries? As mentioned, they help us understand the relationships between different types of matrices, so they help us learn more deeply. On a practical computational level they may not have much value, especially when dealing with real-world data sets. However, there are some other interesting aspects of these queries that deal with decompositions and eigenvalues. We might cover these in the future.
A more personal thought: In the course of writing these posts, and learning more linear algebra, it increasingly seems to me that a lot of the “effort” that goes into linear algebra is about making tedious operations simpler. Anytime one can have more zeros in a matrix, or have orthogonal vectors, or break a matrix into parts, the simpler things become. However, I haven’t really seen this point driven home in texts or tutorials. I think linear algebra learners would do well to keep this in mind.
These are the main sources I relied on for this post.
I’m only using a portion because Hiranabe’s original contains a bit too much information for someone trying to get their footing in the field.↩︎
I’m using the term taxonomy a little loosely of course, you can call it whatever you want. The name is not so important really, what is important is the hierarchy of concepts.↩︎
As could complex numbers.↩︎
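Usually in written text a row matrix, sometimes called a row vector, is written as $\mathbf{x} = \begin{bmatrix} x_1 & x_2 & x_3 \end{bmatrix}$. In order to save space in documents, rather than writing $\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}$, a column matrix/vector can be kept to a single line by writing it as its transpose: $\mathbf{x} = \begin{bmatrix} x_1 & x_2 & x_3 \end{bmatrix}^\mathsf{T}$, but this requires a little mental gymnastics to visualize.↩︎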
Upper and lower triangular matrices play a special role in linear algebra. Because of the presence of many zeros, multiplying them and inverting them is relatively easy, because the zeros cause terms to drop out.↩︎
This idea of the “most natural basis” is most easily visualized in two dimensions. If you have some data plotted on $x$ and $y$ axes, determining the line of best fit is one way of finding the most natural basis for describing the data. However, more generally and in more dimensions, principal component analysis (PCA) is the most rigorous way of finding this natural basis, and PCA can be calculated with the eigen() function. Lots more information here.↩︎
The drop argument to subsetting/extracting defaults to TRUE, which means that if subsetting reduces the necessary number of dimensions, the unneeded dimension attributes are dropped. Under the default, selecting a single column of a matrix leads to a vector, not a one-column matrix. In this all.equal() expression we need both sides to evaluate to a matrix.↩︎
One might ask why R does not provide a user-facing version of such a function. I think a good argument can be made that the authors of R passed down a robust and lean set of linear algebra functions, geared toward getting work done, and throwing errors as necessary.↩︎
@online{hanson2022,
author = {Bryan Hanson},
editor = {},
title = {Notes on {Linear} {Algebra} {Part} 4},
date = {2022-09-26},
url = {http://chemospec.org/LinearAlgNotesPt4.html},
langid = {en}
}
Update 19 September 2022: in “Use of outer() for Matrix Multiplication”, corrected use of “cross” to be “outer” and added example in R. Also added links to work by Hiranabe.
This post is a survey of the linear algebra-related functions from base R. Some of these I’ve discussed in other posts and some I may discuss in the future, but this post is primarily an inventory: these are the key tools we have available. “Notes” in the table are taken from the help files.
Matrices, including row and column vectors, will be shown in bold e.g. $\mathbf{A}$ or $\mathbf{a}$ while scalars and variables will be shown in script, e.g. $n$. R code will appear like x <- y.
In the table, $\mathbf{R}$ or $\mathbf{U}$ is an upper/right triangular matrix. $\mathbf{L}$ is a lower/left triangular matrix (triangular matrices are square). $\mathbf{A}$ is a generic matrix of dimensions $m \times n$. $\mathbf{M}$ is a square matrix of dimensions $n \times n$.
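| Function | Uses | Notes |
|---|---|---|
| operators | | |
| * | scalar multiplication | |
| %*% | matrix multiplication | two vectors gives the dot product; vector + matrix gives the cross product (vector will be promoted as needed)^{1} |
| basic functions | | |
| t() | transpose | interchange rows and columns |
| crossprod() | matrix multiplication | faster version of t(A) %*% A |
| tcrossprod() | matrix multiplication | faster version of A %*% t(A) |
| outer() | outer product & more | see discussion below |
| det() | computes determinant | uses the LU decomposition; determinant is a volume |
| isSymmetric() | name says it all | |
| Conj() | computes complex conjugate | |
| decompositions | | |
| backsolve() | solves $\mathbf{U}\mathbf{x} = \mathbf{b}$ | |
| forwardsolve() | solves $\mathbf{L}\mathbf{x} = \mathbf{b}$ | |
| solve() | solves $\mathbf{M}\mathbf{x} = \mathbf{b}$ and $\mathbf{M}^{-1}$ | e.g. linear systems; if given only one matrix returns the inverse |
| qr() | solves $\mathbf{A} = \mathbf{Q}\mathbf{R}$ | $\mathbf{Q}$ is an orthogonal matrix; can be used to solve $\mathbf{A}\mathbf{x} = \mathbf{b}$; see ?qr for several qr.* extractor functions |
| chol() | solves $\mathbf{M} = \mathbf{L}\mathbf{L}^\mathsf{T} = \mathbf{U}^\mathsf{T}\mathbf{U}$ | Only applies to positive semi-definite matrices (where $\lambda \ge 0$); related to LU decomposition |
| chol2inv() | computes $\mathbf{M}^{-1}$ from the results of chol(M) | |
| svd() | singular value decomposition | input $\mathbf{A}^{(n \times p)}$; can compute PCA; details |
| eigen() | eigen decomposition | requires $\mathbf{M}^{(n \times n)}$; can compute PCA; details |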
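A few of the entries in the table can be checked in a couple of lines each. The sketch below uses small made-up matrices (A, U, b and x are illustrative names) to confirm what crossprod(), tcrossprod() and backsolve() compute.

```r
# All objects here are illustrative
A <- matrix(1:6, nrow = 3)
all.equal(crossprod(A), t(A) %*% A)    # crossprod(A) is t(A) %*% A
all.equal(tcrossprod(A), A %*% t(A))   # tcrossprod(A) is A %*% t(A)

U <- matrix(c(2, 0, 0, 1, 3, 0, 4, 5, 6), nrow = 3) # an upper triangular matrix
b <- c(1, 2, 3)
x <- backsolve(U, b)                   # solves U x = b by back substitution
all.equal(drop(U %*% x), b)            # multiplying back recovers b
```

forwardsolve() works the same way for a lower triangular matrix.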
One thing to notice is that there is no LU decomposition in base R. It is apparently used “under the hood” in solve() and there are versions available in contributed packages.^{2}
For details see the discussion in Part 1.↩︎
Discussed in this Stackoverflow question, which also has an implementation.↩︎
In fact, for the default outer(), FUN = "*", outer() actually calls tcrossprod().↩︎
@online{hanson2022,
author = {Bryan Hanson},
editor = {},
title = {Notes on {Linear} {Algebra} {Part} 3},
date = {2022-09-10},
url = {http://chemospec.org/LinearAlgNotesPt3.html},
langid = {en}
}
For Part 1 of this series, see here.
If you open a linear algebra text, it’s quickly apparent how complex the field is. There are so many special types of matrices, so many different decompositions of matrices. Why are all these needed? Should I care about null spaces? What’s really important? What are the threads that tie the different concepts together? As someone who is trying to improve their understanding of the field, especially with regard to its applications in chemometrics, it can be a tough slog.
In this post I’m going to try to demonstrate how some simple chemometric tasks can be solved using linear algebra. Though I cover some math here, the math is secondary right now – the conceptual connections are more important. I’m more interested in finding (and sharing) a path through the thicket of linear algebra. We can return as needed to expand the basic math concepts. The cognitive effort to work through the math details is likely a lot lower if we have a sense of the big picture.
In this post, matrices, including row and column vectors, will be shown in bold e.g. $\mathbf{A}$ while scalars and variables will be shown in script, e.g. $n$. Variables used in R code will appear like A.
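If you’ve had algebra, you have certainly run into a “system of equations” such as the following:
$$\begin{aligned} x + 2y - 3z &= 3 \\ 2x - y - z &= 11 \\ 3x + 2y + z &= -5 \end{aligned} \tag{1}$$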
In algebra, such systems can be solved several ways, for instance by isolating one or more variables and substituting, or geometrically (particularly for 2D systems, by plotting the lines and looking for the intersection). Once there are more than a few variables however, the only manageable way to solve them is with matrix operations, or more explicitly, linear algebra. This sort of problem is the core of linear algebra, and the reason the field is called linear algebra.
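To solve the system above using linear algebra, we have to write it in the form of matrices and column vectors:
$$\begin{bmatrix} 1 & 2 & -3 \\ 2 & -1 & -1 \\ 3 & 2 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 3 \\ 11 \\ -5 \end{bmatrix} \tag{2}$$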
or more generally
$$\mathbf{A}\mathbf{x} = \mathbf{b} \tag{3}$$
where $\mathbf{A}$ is the matrix of coefficients, $\mathbf{x}$ is the column vector of variable names^{1} and $\mathbf{b}$ is a column vector of constants. Notice that these matrices are conformable:^{2}
$$\mathbf{A}^{n \times n}\,\mathbf{x}^{n \times 1} = \mathbf{b}^{n \times 1} \tag{4}$$
To solve such a system, when we have $n$ unknowns, we need $n$ equations.^{3} This means that $\mathbf{A}$ has to be a square matrix, and square matrices play a special role in linear algebra. I’m not sure this point is always conveyed clearly when this material is introduced. In fact, many texts on linear algebra seem to bury the lede.
To find the values of $\mathbf{x}$^{4}, we can do a little rearranging following the rules of linear algebra and matrix operations. First we pre-multiply both sides by the inverse of $\mathbf{A}$, which then gives us the identity matrix $\mathbf{I}$, which drops out.^{5}
$$\begin{aligned} \mathbf{A}\mathbf{x} &= \mathbf{b} \\ \mathbf{A}^{-1}\mathbf{A}\mathbf{x} &= \mathbf{A}^{-1}\mathbf{b} \\ \mathbf{I}\mathbf{x} &= \mathbf{A}^{-1}\mathbf{b} \\ \mathbf{x} &= \mathbf{A}^{-1}\mathbf{b} \end{aligned} \tag{5}$$
So it’s all sounding pretty simple, right? Ha. This is actually where things potentially break down. For this to work, $\mathbf{A}$ must be invertible, which is not always the case.^{6} If there is no inverse, then the system of equations either has no solution or infinitely many solutions. So finding the inverse of a matrix, or discovering it doesn’t exist, is essential to solving these systems of linear equations.^{7} More on this eventually, but for now, we know $\mathbf{A}$ must be a square matrix and we hope it is invertible.
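We learn in algebra that a line takes the form $y = mx + b$. If one has measurements in the form of $(x, y)$ pairs that one expects to fit to a line, we need linear regression. Carrying out a linear regression is arguably one of the most important, and certainly one of the most common, applications of the linear systems described above. One can get the values of $m$ and $b$ by hand using algebra, but any computer will solve the system using a matrix approach.^{8} Consider this data:
$$\begin{array}{c|ccccc} x & 2.1 & 0.9 & 3.9 & 3.2 & 5.1 \\ \hline y & 11.8 & 7.2 & 21.5 & 17.2 & 26.8 \end{array} \tag{6}$$
To express this in a matrix form, we recast
$$y = mx + b \tag{7}$$
into
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \tag{8}$$
where $\mathbf{y}$ is the column vector of measured responses, $\mathbf{X}$ is the design matrix (a column of ones followed by the $x$ values), $\boldsymbol{\beta}$ is the column vector of coefficients and $\boldsymbol{\epsilon}$ is the column vector of errors.
With our data above, this looks like:
$$\begin{bmatrix} 11.8 \\ 7.2 \\ 21.5 \\ 17.2 \\ 26.8 \end{bmatrix} = \begin{bmatrix} 1 & 2.1 \\ 1 & 0.9 \\ 1 & 3.9 \\ 1 & 3.2 \\ 1 & 5.1 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \epsilon_4 \\ \epsilon_5 \end{bmatrix} \tag{9}$$
If we multiply this out, each row works out to be an instance of $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$. Hopefully you can appreciate that $\beta_1$ corresponds to $m$ and $\beta_0$ corresponds to $b$.^{9}
This looks similar to $\mathbf{A}\mathbf{x} = \mathbf{b}$ seen in Equation 3, if you set $\mathbf{b}$ to $\mathbf{y}$, $\mathbf{A}$ to $\mathbf{X}$ and $\mathbf{x}$ to $\boldsymbol{\beta}$:
$$\mathbf{A}\mathbf{x} = \mathbf{b} \quad \longrightarrow \quad \mathbf{X}\boldsymbol{\beta} \approx \mathbf{y} \tag{10}$$
This contortion of symbols is pretty nasty, but honestly not uncommon when moving about in the world of linear algebra.
As it is composed of real data, presumably with measurement errors, there is not an exact solution to $\mathbf{X}\boldsymbol{\beta} = \mathbf{y}$ due to the error term. There is however, an approximate solution, which is what is meant when we say we are looking for the line of best fit. This is how linear regression is carried out on a computer. The relevant equation is:
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}\mathbf{X}^\mathsf{T}\mathbf{y} \tag{11}$$
The key point here is that once again we need to invert a matrix to solve this. The details of where Equation 11 comes from are covered in a number of places, but I will note here that $\hat{\boldsymbol{\beta}}$ refers to the best estimate of $\boldsymbol{\beta}$.^{10}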
We now have two examples where inverting a matrix is a key step: solving a system of linear equations, and approximating the solution to a system of linear equations (the regression case). These cases are not outliers, the ability to invert a matrix is very important. So how do we do this? The LU decomposition can do it, and is widely used so worth spending some time on. A decomposition is the process of breaking a matrix into pieces that are easier to handle, or that give us special insight, or both. If you are a chemometrician you have almost certainly carried out Principal Components Analysis (PCA). Under the hood, PCA requires either a singular value decomposition, or an eigen decomposition (more info here).
So, about the LU decomposition: it breaks a matrix $\mathbf{A}$ into two matrices, $\mathbf{L}$, a “lower triangular matrix”, and $\mathbf{U}$, an “upper triangular matrix”. These special matrices contain only zeros except along the diagonal and the entries below it (in the lower case), or along the diagonal and the entries above it (in the upper case). The advantage of triangular matrices is that they are very easy to invert (all those zeros make many terms drop out). So the LU decomposition breaks the tough job of inverting $\mathbf{A}$ into two easier jobs.
$$\begin{aligned} \mathbf{A} &= \mathbf{L}\mathbf{U} \\ \mathbf{A}^{-1} &= (\mathbf{L}\mathbf{U})^{-1} \\ \mathbf{A}^{-1} &= \mathbf{U}^{-1}\mathbf{L}^{-1} \end{aligned} \tag{12}$$
When all is done, we only need to figure out $\mathbf{U}^{-1}$ and $\mathbf{L}^{-1}$, which as mentioned is straightforward.^{11}
To summarize, if we want to solve a system of equations we need to carry out matrix inversion, which in turn is much easier to do if one uses the LU decomposition to get two easy-to-invert triangular matrices. I hope you are beginning to see how pieces of linear algebra fit together, and why it might be good to learn more.
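To make the idea concrete, here is a teaching sketch of an LU factorization written in plain R. The function lu_sketch() is hypothetical (it is not part of base R or any package), it uses the Doolittle scheme without row pivoting, and it assumes every pivot is nonzero; real implementations pivot for numerical stability. The test matrix is made up for illustration.

```r
# lu_sketch() is a hypothetical helper for illustration only.
# Doolittle LU without pivoting: assumes each pivot U[k, k] is nonzero.
lu_sketch <- function(A) {
  n <- nrow(A)
  L <- diag(n)   # will hold the multipliers below a unit diagonal
  U <- A         # will be reduced to upper triangular form
  for (k in seq_len(n - 1)) {
    for (i in (k + 1):n) {
      L[i, k] <- U[i, k] / U[k, k]         # multiplier that zeros U[i, k]
      U[i, ] <- U[i, ] - L[i, k] * U[k, ]  # row elimination step
    }
  }
  list(L = L, U = U)
}

A <- matrix(c(4, 2, 2, 3, 5, 1, 1, 2, 6), nrow = 3) # made-up test matrix
b <- c(1, 2, 3)
f <- lu_sketch(A)
all.equal(f$L %*% f$U, A)   # the two factors reproduce A

# Solve A x = b as two triangular solves: L y = b, then U x = y
y <- forwardsolve(f$L, b)
x <- backsolve(f$U, y)
all.equal(x, solve(A, b))   # agrees with base R's solve()
```

The last two steps show why the factorization is useful: once the triangular factors are known, no explicit inverse is needed to solve the system.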
Let’s look at how R does these operations, and check our understanding along the way. R makes this really easy. We’ll start with the issue of invertibility. Let’s create a matrix for testing.
A1 <- matrix(c(3, 5, 1, 11, 2, 0, 5, 2, 5), ncol = 3)
A1
[,1] [,2] [,3]
[1,] 3 11 5
[2,] 5 2 2
[3,] 1 0 5
In the matlib package there is a function inv that inverts matrices. It returns the inverted matrix, which we can verify by multiplying the inverted matrix by the original matrix to give the identity matrix (if inversion was successful). diag(3) creates a 3 x 3 matrix with 1’s on the diagonal, in other words an identity matrix.
library("matlib")
A1_inv <- inv(A1)
all.equal(A1_inv %*% A1, diag(3))
[1] "Mean relative difference: 8.999999e-08"
The difference here is really small, but not zero. Let’s use a different function, solve, which is part of base R. If solve is given a single matrix, it returns the inverse of that matrix.
A1_solve <- solve(A1) %*% A1
all.equal(A1_solve, diag(3))
[1] TRUE
That’s a better result. Why are there differences? inv uses a method called Gaussian elimination which is similar to how one would invert a matrix using pencil and paper. On the other hand, solve uses the LU decomposition discussed earlier, and no matrix inversion is necessary. Looks like the LU decomposition gives a somewhat better numerical result.
Now let’s look at a different matrix, created by replacing the third column of A1 with different values.
A2 <- matrix(c(3, 5, 1, 11, 2, 0, -6, -10, -2), ncol = 3)
A2
[,1] [,2] [,3]
[1,] 3 11 -6
[2,] 5 2 -10
[3,] 1 0 -2
And let’s compute its inverse using solve.
solve(A2)
Error in solve.default(A2): system is computationally singular: reciprocal condition number = 6.71337e-19
When R reports that A2 is computationally singular, it is saying that it cannot be inverted. Why not? If you look at A2, notice that column 3 is a multiple of column 1. Anytime one column is a multiple of another, or one row is a multiple of another, then the matrix cannot be inverted because the rows or columns are not independent.^{12} If this was a matrix of coefficients from an experimental measurement of variables, this would mean that some of your variables are not independent; they must be measuring the same underlying phenomenon.
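One way to detect this kind of dependence without attempting the inversion is to ask for the rank estimated by the QR decomposition. This is a sketch on a made-up matrix S whose third column is twice its first; qr() is in base R and its result has a rank component.

```r
# S is an illustrative matrix: column 3 is 2 * column 1
S <- matrix(c(1, 2, 3, 4, 5, 6, 2, 4, 6), ncol = 3)
qr(S)$rank        # fewer than 3 independent columns
qr(diag(3))$rank  # the identity matrix has full rank
```

A rank smaller than the number of columns signals that the matrix is singular.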
Let’s solve the system from Equation 2. It turns out that the solve function also handles this case, if you give it two arguments. Remember, solve is using the LU decomposition behind the scenes; no matrix inversion is required.
A3 <- matrix(c(1, 2, 3, 2, -1, 2, -3, -1, 1), ncol = 3)
A3
[,1] [,2] [,3]
[1,] 1 2 -3
[2,] 2 -1 -1
[3,] 3 2 1
colnames(A3) <- c("x", "y", "z") # naming the columns will label the answer
b <- c(3, 11, -5)
solve(A3, b)
x y z
2 -4 -3
The answer is the values of $x$, $y$ and $z$ that make the system of equations true.
Let’s compute the values for $\hat{\boldsymbol{\beta}}$ in our regression data shown in Equation 6. First, let’s set up the needed matrices and plot the data since visualizing the data is always a good idea.
y = matrix(c(11.8, 7.2, 21.5, 17.2, 26.8), ncol = 1)
X = matrix(c(rep(1, 5), 2.1, 0.9, 3.9, 3.2, 5.1), ncol = 2) # design matrix
X
[,1] [,2]
[1,] 1 2.1
[2,] 1 0.9
[3,] 1 3.9
[4,] 1 3.2
[5,] 1 5.1
plot(X[,2], y, xlab = "x") # column 2 of X has the x values
The value of $\hat{\boldsymbol{\beta}}$ can be found via Equation 11:
solve((t(X) %*% X)) %*% t(X) %*% y
[,1]
[1,] 2.399618
[2,] 4.769862
The first value is for $b$ or $\beta_0$ or intercept, the second value is for $m$ or $\beta_1$ or slope.
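The same coefficients can be obtained without forming the inverse explicitly. The sketch below rebuilds the data and compares two base R alternatives: solving the normal equations with a two-argument solve(), and a QR-based least squares fit with qr.solve(). The names beta_ne and beta_qr are illustrative.

```r
y <- matrix(c(11.8, 7.2, 21.5, 17.2, 26.8), ncol = 1)
X <- cbind(1, c(2.1, 0.9, 3.9, 3.2, 5.1))        # design matrix
# Solve (X'X) beta = X'y directly rather than inverting X'X
beta_ne <- solve(crossprod(X), crossprod(X, y))
# Least squares via the QR decomposition
beta_qr <- qr.solve(X, y)
all.equal(drop(beta_ne), drop(beta_qr))          # the two approaches agree
```

Routes that avoid an explicit inverse are generally preferred numerically, though for a small, well-behaved problem like this one the answers are indistinguishable.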
Let’s compare this answer to R’s built-in lm function (for linear model):
fit <- lm(y ~ X[,2])
fit
Call:
lm(formula = y ~ X[, 2])
Coefficients:
(Intercept) X[, 2]
2.40 4.77
We have good agreement! If you care to learn about the goodness of the fit, the residuals etc, then you can look at the help file ?lm and str(fit). lm returns pretty much all one needs to know about the results, but if you wish to calculate all the interesting values yourself you can do so by manipulating Equation 11 and its relatives.
Finally, let’s plot the line of best fit found by lm to make sure everything looks reasonable.
plot(X[,2], y, xlab = "x")
abline(coef = coef(fit), col = "red")
That’s all for now, and a lot to digest. I hope you are closer to finding your own path through linear algebra. Remember that investing in learning the fundamentals prepares you for tackling the more complex topics. Thanks for reading!
These are the main sources I relied on for this post.
matlib package are very helpful.
Here we have the slightly unfortunate circumstance where symbol conventions cannot be completely harmonized. We are saying that $\mathbf{x} = \begin{bmatrix} x & y & z \end{bmatrix}^\mathsf{T}$, which seems a bit silly since vector $\mathbf{x}$ contains $y$ and $z$ components in addition to $x$. I ask you to accept this for two reasons: First, most linear algebra texts use the symbols in Equation 3 as the general form for this topic, so if you go to study this further that’s what you’ll find. Second, I feel like using $x$, $y$ and $z$ in Equation 1 will be familiar to the most people. If you want to get rid of this infelicity, then you have to write Equation 1 (in part) as $x_1 + 2x_2 - 3x_3 = 3$, which I think clouds the interpretation. Perhaps however you feel my choices are equally bad.↩︎
Conformable means that the number of columns in the first matrix equals the number of rows in the second matrix. This is necessary because of the dot product definition of matrix multiplication. More details here.↩︎
Remember “story problems” where you had to read closely to express what was given in terms of equations, and find enough equations? “If Sally bought 10 pieces of candy and a drink for $1.50…”↩︎
We could also write this as $\mathbf{x} = \begin{bmatrix} x & y & z \end{bmatrix}^\mathsf{T}$ to emphasize that it is a column vector. One might prefer this because the only vector one can write in a row of text is a row vector, so if we mean a column vector many people would prefer to write it transposed.↩︎
The inverse of a matrix is analogous to dividing a variable by itself, since it leads to that variable canceling out and thus simplifying the equation. However, strictly speaking there is no operation that qualifies as division in the matrix world.↩︎
For a matrix $\mathbf{A}$ to be invertible, there must exist another matrix $\mathbf{B}$ such that $\mathbf{A}\mathbf{B} = \mathbf{B}\mathbf{A} = \mathbf{I}$. However, this definition doesn’t offer any clues about how we might find the inverse.↩︎
In truth, there are other ways to solve $\mathbf{A}\mathbf{x} = \mathbf{b}$ that don’t require inversion of a matrix. However, if a matrix isn’t invertible, these other methods will also break down. We’ll demonstrate this later when we talk about the LU decomposition.↩︎
A very good discussion of the algebraic approach is available here.↩︎
This is another example of an infelicity of symbol conventions. The typical math/statistics text symbols are not the same as the symbols a student in Physics 101 would likely encounter.↩︎
The careful reader will note that the data set shown in Equation 9 is not square; there are more observations (rows) than variables (columns). This is fine and desirable for a linear regression: we don’t want to use just two data points, as that would have no error but not necessarily be accurate. However, only square matrices have inverses, so what’s going on here? In practice, what’s happening is we are using something called a pseudoinverse. The first part of the right side of Equation 11 is in fact the pseudoinverse: $\mathbf{X}^{+} = (\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}\mathbf{X}^\mathsf{T}$. Perhaps we’ll cover this in a future post.↩︎
The switch in the order of matrices on the last line of Equation 12 is one of the properties of the inverse operator.↩︎
This means that the rank of the matrix is less than the number of columns. You can get the rank of a matrix by counting the number of nonzero eigenvalues via eigen(A2)$values, which in this case gives 8.9330344, -5.9330344 and $3.5953271 \times 10^{-16}$ (effectively zero). There are only two nonzero values, so the rank is two. Perhaps in another post we’ll discuss this in more detail.↩︎
@online{hanson2022,
author = {Bryan Hanson},
editor = {},
title = {Notes on {Linear} {Algebra} {Part} 2},
date = {2022-09-01},
url = {http://chemospec.org/LinearAlgNotesPt2.html},
langid = {en}
}
R
, read no further and do something else!
If you are like me, you’ve had no formal training in linear algebra, which means you learn what you need to when you need to use it. Eventually, you cobble together some hardwon knowledge. That’s good, because almost everything in chemometrics involves linear algebra.
This post is essentially a set of personal notes about the dot product and the cross product, two important manipulations in linear algebra. I’ve tried to harmonize things I learned way back in college physics and math courses, and integrate information I’ve found in various sources I have leaned on more recently. Without a doubt, the greatest impediment to really understanding this material is the use of multiple terminologies and notations. I’m going to try really hard to be clear and to the point in my discussion.
The main sources I’ve relied on are:
Let’s get started. For sanity and consistency, let’s define two 3D vectors and two matrices to illustrate our examples. Most of the time I’m going to write vectors with an arrow over the name, as a nod to the treatment usually given in a physics course. This reminds us that we are thinking about a quantity with direction and magnitude in some coordinate system, something geometric. Of course in the R language a vector is simply a list of numbers with the same data type; R doesn’t care if a vector is a vector in the geometric sense or a list of states.
The dot product goes by these other names: inner product, scalar product. Typical notations include:^{1}
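There are two main formulas for the dot product with vectors, the algebraic formula (Equation 5) and the geometric formula (Equation 6).
$$\vec{a} \cdot \vec{b} = \sum_{i} a_i b_i \tag{5}$$
$$\vec{a} \cdot \vec{b} = \|\vec{a}\| \, \|\vec{b}\| \cos\theta \tag{6}$$
Here $\theta$ is the angle between the two vectors, and $\|\vec{a}\|$ refers to the $L_2$ or Euclidean norm, namely the length of the vector:^{2}
$$\|\vec{a}\| = \sqrt{a_1^2 + a_2^2 + \cdots + a_n^2}$$
The result of the dot product is a scalar. The dot product is also commutative: $\vec{a} \cdot \vec{b} = \vec{b} \cdot \vec{a}$.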
Suppose we wanted to compute $\mathbf{A}\mathbf{B} = \mathbf{C}$.^{3} We use the idea of row and column vectors to accomplish this task. In the process, we discover that matrix multiplication is a series of dot products:
The red color shows how the dot product of the first row of $\mathbf{A}$ and the first column of $\mathbf{B}$ gives the first entry in $\mathbf{C}$. Every entry in $\mathbf{C}$ results from a dot product. Every entry is a scalar, embedded in a matrix.
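The claim that matrix multiplication is a series of dot products can be checked by building the product one entry at a time. In this sketch the matrices A and B are small made-up examples, and C is filled with explicit row-by-column dot products before being compared with the result of %*%.

```r
# A and B are made-up examples; C is built entry by entry
A <- matrix(1:6, nrow = 2)   # 2 x 3
B <- matrix(7:12, nrow = 3)  # 3 x 2
C <- matrix(0, nrow = nrow(A), ncol = ncol(B))
for (i in seq_len(nrow(A))) {
  for (j in seq_len(ncol(B))) {
    C[i, j] <- sum(A[i, ] * B[, j])  # dot product of row i of A and column j of B
  }
}
all.equal(C, A %*% B)  # same result as the built-in operator
```

The loop is slow compared with %*%, but it makes the row-times-column pattern explicit.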
The cross product goes by these other names: outer product^{4}, tensor product, vector product.
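The cross product of two vectors returns a vector rather than a scalar. Vectors are defined in terms of a basis which is a coordinate system. Earlier, when we defined $\vec{a}$ it was intrinsically defined in terms of the standard basis set $\hat{i}, \hat{j}, \hat{k}$ (in some fields this would be called the unit coordinate system). Thus a fuller definition of $\vec{a}$ would be:
$$\vec{a} = a_1\hat{i} + a_2\hat{j} + a_3\hat{k}$$
In terms of vectors, the cross product is defined as:
$$\vec{a} \times \vec{b} = (a_2 b_3 - a_3 b_2)\,\hat{i} + (a_3 b_1 - a_1 b_3)\,\hat{j} + (a_1 b_2 - a_2 b_1)\,\hat{k}$$
In my opinion, this is not exactly intuitive, but there is a pattern to it: notice that the terms for $\hat{i}$ don’t involve the first component of either vector, and similarly for $\hat{j}$ and $\hat{k}$. The details of how this result is computed rely on some properties of the basis set; this Wikipedia article has a nice explanation. We need not dwell on it however.
There is also a geometric formula for the cross product:
$$\vec{a} \times \vec{b} = \|\vec{a}\| \, \|\vec{b}\| \sin\theta \; \hat{n}$$
where $\hat{n}$ is the unit vector perpendicular to the plane defined by $\vec{a}$ and $\vec{b}$. The direction of $\hat{n}$ is defined by the right-hand rule. Because of this, the cross product is not commutative, i.e. $\vec{a} \times \vec{b} \neq \vec{b} \times \vec{a}$. The cross product is however anti-commutative:
$$\vec{a} \times \vec{b} = -(\vec{b} \times \vec{a})$$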
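Finally, there is a matrix definition of the cross product as well. Evaluation of the following determinant gives the cross product:
$$\vec{a} \times \vec{b} = \begin{vmatrix} \hat{i} & \hat{j} & \hat{k} \\ a_1 & a_2 & a_3 \\ b_1 & b_2 & b_3 \end{vmatrix}$$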
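Base R has no function for the cross product of two 3D vectors in the physics sense, so here is a small sketch of one written from the component formula. The helper cross3() is hypothetical and exists only for this illustration, as do the example vectors a and b.

```r
# cross3() is a hypothetical helper, not a base R function
cross3 <- function(a, b) {
  stopifnot(length(a) == 3, length(b) == 3)
  c(a[2] * b[3] - a[3] * b[2],
    a[3] * b[1] - a[1] * b[3],
    a[1] * b[2] - a[2] * b[1])
}
a <- c(1, 2, 3)   # illustrative vectors
b <- c(4, 5, 6)
ab <- cross3(a, b)
ab
sum(ab * a)                   # zero: the result is perpendicular to a
sum(ab * b)                   # zero: the result is perpendicular to b
all.equal(ab, -cross3(b, a))  # swapping the order flips the sign
```

The two zero dot products confirm the geometric picture: the cross product is perpendicular to both inputs.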
%*%
The workhorse for matrix multiplication in R
is the %*%
function. This function will accept any combination of vectors and matrices as inputs, so it is flexible. It is also smart: given a vector and a matrix, the vector will be treated as row or column matrix as needed to ensure conformity, if possible. Let’s look at some examples:
# Some data for examples
p <- 1:5
q <- 6:10
M <- matrix(1:15, nrow = 3, ncol = 5)
M
[,1] [,2] [,3] [,4] [,5]
[1,] 1 4 7 10 13
[2,] 2 5 8 11 14
[3,] 3 6 9 12 15
# A vector times a vector
p %*% q
[,1]
[1,] 130
Notice that R
returns a data type of matrix, but it is a $1 \times 1$ matrix, and thus a scalar value. That means we just computed the dot product, a decision R
made internally. We can verify this by noting that q %*% p
gives the same answer. Thus, R
handled these vectors as column vectors and computed $\mathbf{p}^\mathsf{T}\mathbf{q}$.
# A vector times a matrix
M %*% p
[,1]
[1,] 135
[2,] 150
[3,] 165
As M
had dimensions $3 \times 5$, R
treated p
as a column vector in order to be conformable. The result is a vector, so this is the cross product.
If we try to compute p %*% M
we get an error, because there is nothing R
can do to p
which will make it conformable to M
.
p %*% M
Error in p %*% M: nonconformable arguments
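One way to see why this fails is to make the shapes explicit rather than letting R choose them. The sketch below uses the same p and M as above; the helper conformable is a made-up name for illustration.

```r
p <- 1:5
M <- matrix(1:15, nrow = 3, ncol = 5)

# Make the orientation of p explicit
p_row <- matrix(p, nrow = 1)  # 1 x 5
p_col <- matrix(p, ncol = 1)  # 5 x 1

# Conformable means: columns of the first match rows of the second
conformable <- function(A, B) ncol(A) == nrow(B)

conformable(p_row, M)     # FALSE: 1 x 5 times 3 x 5
conformable(p_col, M)     # FALSE: 5 x 1 times 3 x 5
conformable(M, p_col)     # TRUE:  3 x 5 times 5 x 1
conformable(p_row, t(M))  # TRUE:  1 x 5 times 5 x 3

dim(M %*% p_col)          # 3 1
dim(p_row %*% t(M))       # 1 3
```

Neither orientation of p can go in front of M, which is why the error appears; transposing M is what makes p usable on the left.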
What about multiplying matrices?
M %*% M
Error in M %*% M: nonconformable arguments
As you can see, when dealing with matrices, %*%
will not change a thing, and if your matrices are nonconformable then it’s an error. Of course, if we transpose either instance of M
we do have conformable matrices, but the answers are different, and this is neither the dot product nor the cross product, just matrix multiplication.
t(M) %*% M
[,1] [,2] [,3] [,4] [,5]
[1,] 14 32 50 68 86
[2,] 32 77 122 167 212
[3,] 50 122 194 266 338
[4,] 68 167 266 365 464
[5,] 86 212 338 464 590
M %*% t(M)
[,1] [,2] [,3]
[1,] 335 370 405
[2,] 370 410 450
[3,] 405 450 495
What can we take from these examples?
R will give you the dot product if you give it two vectors. Note that this is a design decision, as it could have returned the cross product (see Equation 14).
R will promote a vector to a row or column vector if it can to make it conformable with a matrix you provide. If it cannot, R will give you an error. If it can, the cross product is returned.
When given two matrices, R will give an error when they are not conformable.
%*% does it all: dot product, cross product, or matrix multiplication, but you need to pay attention.
There are other R functions that do some of the same work:
crossprod, equivalent to t(M) %*% M but faster.
tcrossprod, equivalent to M %*% t(M) but faster.
outer or %o%
The first two functions will accept combinations of vectors and matrices, as does %*%
. Let’s try it with two vectors:
crossprod(p, q)
[,1]
[1,] 130
Huh. crossprod
is returning the dot product! So this is the case where “the cross product is not the cross product.” From a clarity perspective, this is not ideal. Let’s try the other function:
tcrossprod(p, q)
[,1] [,2] [,3] [,4] [,5]
[1,] 6 7 8 9 10
[2,] 12 14 16 18 20
[3,] 18 21 24 27 30
[4,] 24 28 32 36 40
[5,] 30 35 40 45 50
There’s the cross product!
What about outer
? Remember that another name for the cross product is the outer product. So is outer
the same as tcrossprod
? In the case of two vectors, it is:
identical(outer(p, q), tcrossprod(p, q))
[1] TRUE
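The equivalences mentioned so far can be gathered into one quick check. This is only a sketch using the same p, q and M as above; all.equal is used rather than identical so that the comparison is about values.

```r
p <- 1:5
q <- 6:10
M <- matrix(1:15, nrow = 3, ncol = 5)

# crossprod and tcrossprod versus their %*% spellings
eq1 <- all.equal(crossprod(M), t(M) %*% M)
eq2 <- all.equal(tcrossprod(M), M %*% t(M))

# With two vectors: crossprod is t(p) %*% q, tcrossprod is p %*% t(q)
eq3 <- all.equal(crossprod(p, q), t(p) %*% q)
eq4 <- all.equal(tcrossprod(p, q), p %*% t(q))

# %o% is shorthand for outer()
eq5 <- all.equal(p %o% q, outer(p, q))

c(eq1, eq2, eq3, eq4, eq5)
```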
What about a vector with a matrix?
tst < outer(p, M)
dim(tst)
[1] 5 3 5
Alright, that clearly is not a cross product. The result is an array with dimensions $5 \times 3 \times 5$, not a matrix (which would have only two dimensions). outer
does correspond to the cross product in the case of two vectors, but anything with higher dimensions gives a different beast. So perhaps using “outer” as a synonym for cross product is not a good idea.
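What the array actually holds is every element of p multiplied by every element of M. A short sketch, again with the same p and M, makes that visible:

```r
p <- 1:5
M <- matrix(1:15, nrow = 3, ncol = 5)

arr <- outer(p, M)
dim(arr)                    # 5 3 5

# Entry [i, j, k] is p[i] * M[j, k]
arr[2, 3, 4]                # 24
p[2] * M[3, 4]              # 24

# So each "slice" along the first index is a scaled copy of M
all(arr[2, , ] == 2 * M)    # TRUE
```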
Given what we’ve seen above, make your life simple and stick to %*%
, and pay close attention to the dimensions of the arguments, especially if row or column vectors are in use. In my experience, thinking about the units and dimensions of whatever it is you are calculating is very helpful. Later, if speed is really important in your work, you can use one of the faster alternatives.
An extensive discussion of notations can be found here.↩︎
And curiously, the norm works out to be equal to the square root of the dot product of a vector with itself: $\|\mathbf{a}\| = \sqrt{\mathbf{a} \cdot \mathbf{a}}$ ↩︎
To be multiplied, matrices must be conformable, namely the number of columns of the first matrix must match the number of rows of the second matrix. The reason is so that the dot product terms will match. In the present case we have .↩︎
Be careful, it turns out that “outer” may not be a great synonym for cross product, as explained later.↩︎
OK fine, here is the answer when treating and as row vectors: which expands exactly as the righthand side of Equation 14.↩︎
@online{hanson2022,
author = {Bryan Hanson},
editor = {},
title = {Notes on {Linear} {Algebra} {Part} 1},
date = {2022-08-14},
url = {http://chemospec.org/20220814LinearAlgNotes.html},
langid = {en}
}
If you aren’t familiar with it, the FOSS for Spectroscopy web site lists Free and Open Source Software for spectroscopic applications. The collection is of course never really complete, and your package suggestions are most welcome (how to contribute). My methods for finding packages are improving and at this point the major repositories have been searched reasonably well.
A few days ago I pushed a major update, and at this point Python
packages outnumber R
packages more than two to one. The update was made possible because I recently had time to figure out how to search the PyPi.org site automatically.
In a previous post I explained the methods I used to find packages related to spectroscopy. These have been updated considerably and the rest of this post will cover the updated methods.
There are four places I search for packages related to spectroscopy.^{1}
CRAN, using the packagefinder package.^{2}
Github
PyPi.org
juliapackages.org
The topics I search are as follows:
I search CRAN using packagefinder
; the process is quite straightforward and won’t be covered here. However, it is not an automated process (I should probably work on that).
The broad approach used to search Github is the same as described in the original post. However, the scripts have been refined and updated, and now exist as functions in a new package I created called webu
(for “webutilities”, but that name is taken on CRAN). The repo is here. webu
is not on CRAN and I don’t currently intend to put it there, but you can install from the repo of course if you wish to try it out.
Searching Github is now carried out by a supervising script called /utilities/run_searches.R
(in the FOSS4Spectroscopy
repo). The script contains some notes about finicky details, but is pretty simple overall and should be easy enough to follow.
Unlike Github, it is not necessary to authenticate to use the PyPi.org API. That makes things simpler than the Github case. The needed functions are in webu
and include some deliberate delays so as to not overload their servers. As for Github, searches are supervised by /utilities/run_searches.R
.
One thing I observed at PyPi.org is that authors do not always fill out all the fields that PyPi.org can accept, which means some fields are NULL
and we have to trap for that possibility. Package information is accessed via a JSON record, for instance the entry for nmrglue
can be seen here. This package is pretty typical in that the author_email
field is filled out, but the maintainer_email
field is not (they are presumably the same). If one considers these JSON files to be analogous to DESCRIPTION in R
packages, it looks like there is less oversight on PyPi.org compared to CRAN.
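Trapping for NULL fields can be sketched without touching the network. In the example below the record is a hard-coded, hypothetical stand-in for the parsed JSON of one package (the values are invented, and field_or_na is a made-up helper, not a function from webu); only the field names author_email and maintainer_email come from the discussion above, and summary is assumed to be another available field.

```r
# Hypothetical stand-in for the parsed "info" part of one JSON record
info <- list(
  name = "examplepkg",
  author_email = "someone@example.org",
  maintainer_email = NULL,   # field left empty by the author
  summary = ""               # field present but blank
)

# Return NA instead of NULL (or an empty string) so that every
# package yields the same number of columns
field_or_na <- function(x, field) {
  val <- x[[field]]
  if (is.null(val) || length(val) == 0L || identical(val, "")) {
    return(NA_character_)
  }
  as.character(val)[1]
}

fields <- c("name", "author_email", "maintainer_email", "summary")
row <- vapply(fields, function(f) field_or_na(info, f), character(1))
row
```

Because every record then has the same length, the rows for many packages can be bound together into a single table without special cases.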
Julia packages are readily searched manually at juliapackages.org.
The raw results from the searches described above still need a lot of inspection and cleaning to be usable. The PyPi.org and Github results are saved in an Excel worksheet with the relevant URLs. These links can be followed to determine the suitability of each package. In the /Utilities
folder there are additional scripts to remove entries that are already in the main database (FOSS4Spec.xlsx), as well as to check the names of the packages: Python authors and/or policies seem to lead to cases where different packages can have names differing by case, but also authors are sometimes sloppy when referring to their own packages, sometimes using mypkg
and at other times myPkg
to refer to the same package.
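Checking for names that differ only by case is straightforward in base R. This is a minimal sketch and not the script in the repo; apart from nmrglue the package names are invented.

```r
# Candidate package names; several differ only by case (made-up examples)
pkgs <- c("mypkg", "myPkg", "nmrglue", "SpecTools", "spectools", "rampy")

# Compare on a case-folded key
key <- tolower(pkgs)
dup_keys <- unique(key[duplicated(key)])

# Group the original spellings that collide
is_clash <- key %in% dup_keys
clashes <- split(pkgs[is_clash], key[is_clash])
clashes
```

Each element of clashes then needs a manual look to decide whether it is one package referred to sloppily or two genuinely different packages.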
Once in a while users submit their own package to the repo, and I also find interesting packages in my literature reading.↩︎
packagefinder
has recently been archived, but hopefully will be back soon.↩︎
@online{hanson2022,
author = {Bryan Hanson},
editor = {},
title = {FOSS4Spectroscopy: {R} Vs {Python}},
date = {2022-07-06},
url = {http://chemospec.org/20220706F4SUpdate.html},
langid = {en}
}
I’m pleased to announce that my colleague David Harvey and I have recently released LearnPCA
, an R
package to help people with understanding PCA. In LearnPCA
we’ve tried to integrate our years of experience teaching the topic, along with the best insights we can find in books, tutorials and the nooks and crannies of the internet. Though our experience is in a chemometrics context, we use examples from different disciplines so that the package will be broadly helpful.
The package contains seven vignettes that proceed from the conceptual basics to advanced topics. As of version 0.2.0, there is also a Shiny app to help visualize the process of finding the principal component axes. The current vignettes are:
You can access the vignettes at the Github Site; you don’t even have to install the package. For the Shiny app, do the following:
install.packages("LearnPCA") # you'll need version 0.2.0
library("LearnPCA")
PCsearch()
We would really appreciate your feedback on this package. You can do so in the comments below, or open an issue.
@online{hanson2022,
author = {Bryan Hanson},
editor = {},
title = {Introducing {LearnPCA}},
date = {2022-05-03},
url = {http://chemospec.org/20220503LearnPCAIntro.html},
langid = {en}
}
If you aren’t familiar with ChemoSpec
, you might wish to look at the introductory vignette first.
In this series of posts we are following the protocol as described in the printed publication closely (Blaise et al. 2021). The authors have also provided a Jupyter notebook. This is well worth your time, even if Python is not your preferred language, as there are additional examples and discussion for study.
Load the Spectra
object we created in Part 2 so we can summarize it.
library("ChemoSpec")
load("Worms2.RData") # restores the 'Worms2' Spectra object
sumSpectra(Worms2)
C. elegans metabolic phenotyping study (Blaise 2007)
There are 133 spectra in this set.
The y-axis unit is intensity.
The frequency scale runs from
8.9995 to 5e-04 ppm
There are 8600 frequency values.
The frequency resolution is
0.001 ppm/point.
This data set is not continuous
along the frequency axis.
Here are the data chunks:
beg.freq end.freq size beg.indx end.indx
1 8.9995 5.0005 3.999 1 4000
2 4.5995 0.0005 4.599 4001 8600
The spectra are divided into 4 groups:
group no. color symbol alt.sym
1 Mut_L2 28 #FB0D16FF 0 m2
2 Mut_L4 33 #FFC0CBFF 15 m4
3 WT_L2 32 #511CFCFF 1 w2
4 WT_L4 40 #2E94E9FF 16 w4
*** Note: this is an S3 object
of class 'Spectra'
If you recall in Part 2 we removed five samples. Let’s rerun PCA without these samples and show the key plots. We will simply report these here without much discussion; they are pretty much as expected.
c_pca <- c_pcaSpectra(Worms2, choice = "autoscale")
plotScree(c_pca)
p <- plotScores(Worms2, c_pca, pcs = 1:2, ellipse = "rob", tol = 0.02)
p
p <- plotScores(Worms2, c_pca, pcs = 2:3, ellipse = "rob", leg.loc = "bottomleft",
  tol = 0.02)
p
One thing the published protocol does not explicitly discuss is an inspection of the loadings, but it is covered in the Jupyter notebook. The loadings are useful in order to see if any particular frequencies are driving the separation of the samples in the score plot. Let’s plot the loadings (Figure 4). Remember that these data were autoscaled, and hence all frequencies, including noisy frequencies, will contribute to the separation. If we had not scaled the data, these plots would look dramatically different.
p <- plotLoadings(Worms2, c_pca, loads = 1:2)
p
The s-plot is another very useful way to find peaks that are important in separating the samples (Figure 5); we can see that the peaks around 1.30-1.32, 1.47-1.48, and 3.03-3.07 are important drivers of the separation in the score plot. Having discovered this, one can investigate the source of those peaks.
p <- sPlotSpectra(Worms2, c_pca, tol = 0.001)
p
ChemoSpec
carries out exploratory data analysis, which is an unsupervised process. The next step in the protocol is PLS-DA (partial least squares - discriminant analysis). I have written about ChemoSpec
+ PLS here if you would like more background on plain PLS. However, PLS-DA is a technique that combines data reduction/variable selection along with classification. We’ll need the mixOmics
package (F et al. (2017)) for this analysis; note that loading it replaces the plotLoadings
function from ChemoSpec
.
library("mixOmics")
Loading required package: MASS
Loading required package: lattice
Loaded mixOmics 6.20.0
Thank you for using mixOmics!
Tutorials: http://mixomics.org
Bookdown vignette: https://mixomicsteam.github.io/Bookdown
Questions, issues: Follow the prompts at http://mixomics.org/contactus
Cite us: citation('mixOmics')
Attaching package: 'mixOmics'
The following object is masked from 'package:ChemoSpec':
plotLoadings
Figure 6 shows the score plot; the results suggest that classification and modeling may be successful. The splsda
function carries out a single sparse computation. One computation should not be considered the ideal answer; a better approach is to use cross-validation, for instance the bootsPLS
function in the bootsPLS
package (Rohart, Le Cao, and Wells (2018) which uses splsda
under the hood). However, that computation is too time-consuming to demonstrate here.
X <- Worms2$data
Y <- Worms2$groups
splsda <- splsda(X, Y, ncomp = 8)
plotIndiv(splsda,
  col.per.group = c("#FB0D16FF", "#FFC0CBFF", "#511CFCFF", "#2E94E9FF"),
  title = "sPLS-DA Score Plot", legend = TRUE, ellipse = TRUE)
To estimate the number of components needed, the perf
function can be used. The results are in Figure 7 and suggest that five components are sufficient to describe the data.
perf.splsda <- perf(splsda, folds = 5, nrepeat = 5)
plot(perf.splsda)
At this point, we have several ideas of how to proceed. Going forward, one might choose to focus on accurate classification, or on determining which frequencies should be included in a predictive model. Any model will need to be refined and more details extracted. The reader is referred to the case study from the mixOmics folks which covers these tasks and explains the process.
This post was created using ChemoSpec
version 6.1.3 and ChemoSpecUtils
version 1.0.0.
@online{hanson2022,
author = {Bryan Hanson},
editor = {},
title = {Metabolic {Phenotyping} {Protocol} {Part} 3},
date = {2022-05-01},
url = {http://chemospec.org/20220501ProtocolPt3.html},
langid = {en}
}
Part 1 of this series is here.
If you aren’t familiar with ChemoSpec
, you might wish to look at the introductory vignette first.
In this series of posts we are following the protocol as described in the printed publication closely (Blaise et al. 2021). The authors have also provided a Jupyter notebook. This is well worth your time, even if Python is not your preferred language, as there are additional examples and discussion for study.
I saved the Spectra
object we created in Part 1 so we can read it and remind ourselves of what’s in it. Due to the compression in R’s save
function the data takes up 4.9 Mb on disk. The original csv files total about 62 Mb.
library("ChemoSpec")
load("Worms.Rdata") # restores the 'Worms' Spectra object
sumSpectra(Worms)
C. elegans metabolic phenotyping study (Blaise 2007)
There are 139 spectra in this set.
The y-axis unit is intensity.
The frequency scale runs from
8.9995 to 5e-04 ppm
There are 8600 frequency values.
The frequency resolution is
0.001 ppm/point.
This data set is not continuous
along the frequency axis.
Here are the data chunks:
beg.freq end.freq size beg.indx end.indx
1 8.9995 5.0005 3.999 1 4000
2 4.5995 0.0005 4.599 4001 8600
The spectra are divided into 4 groups:
group no. color symbol alt.sym
1 Mut_L2 32 #FB0D16FF 0 m2
2 Mut_L4 33 #FFC0CBFF 15 m4
3 WT_L2 34 #511CFCFF 1 w2
4 WT_L4 40 #2E94E9FF 16 w4
*** Note: this is an S3 object
of class 'Spectra'
We will follow the steps described in the published protocol closely.
Apply PQN normalization; scaling in ChemoSpec
is applied at the PCA stage (next).
Worms <- normSpectra(Worms) # PQN is the default
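For readers who want to see what PQN (probabilistic quotient normalization) does, here is a minimal base R sketch of the usual recipe: normalize each spectrum to unit total intensity, take the median spectrum as the reference, and divide each spectrum by the median of its quotients against that reference. This is my own illustration on synthetic data and is not necessarily how normSpectra implements it; pqn_sketch is a made-up name, and the sketch assumes strictly positive intensities.

```r
# Rows are samples, columns are frequencies
pqn_sketch <- function(X) {
  X <- X / rowSums(X)             # 1. total-intensity normalization
  ref <- apply(X, 2, median)      # 2. reference spectrum = median spectrum
  quot <- sweep(X, 2, ref, "/")   # 3. quotients against the reference
  fac <- apply(quot, 1, median)   # 4. most probable quotient per sample
  X / fac                         # 5. rescale each sample
}

# Synthetic check: the same "spectrum" at three different dilutions
set.seed(1)
s <- runif(50, min = 1, max = 10)
X <- rbind(s, 2 * s, 0.5 * s)

X_norm <- pqn_sketch(X)

# After normalization the dilution effect is gone
isTRUE(all.equal(X_norm[1, ], X_norm[2, ]))
isTRUE(all.equal(X_norm[1, ], X_norm[3, ]))
```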
Conduct classical PCA using autoscaling.^{1} Note that ChemoSpec
includes several different variants of PCA, each with scaling options. See the introductory vignette for more details. For more about what PCA is and how it works, please see the LearnPCA package.
c_pca <- c_pcaSpectra(Worms, choice = "autoscale") # no scaling is the default
A key question at this stage is how many components are needed to describe the data set. Keep in mind that this depends on the choice of scaling. Figure 1 and Figure 2 are two different types of scree plots, which show the residual variance. This is the R^{2}_{x} value in the protocol (see protocol Figure 7a). Another approach to answering this question is to do a cross-validated PCA.^{2} The results are shown in Figure 3. These are the Q^{2}_{x} values in protocol Figure 7a. All of these ways of looking at the variance explained suggest that retaining three or possibly four PCs is adequate.
plotScree(c_pca)
plotScree(c_pca, style = "trad")
cv_pcaSpectra(Worms, choice = "autoscale", pcs = 10)
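The numbers behind a scree plot are easy to compute by hand. The sketch below uses base R's prcomp on a small synthetic matrix (not the worm data) with autoscaling, i.e. centering plus scaling each variable to unit variance; it is meant only to show where the explained and residual variance come from.

```r
set.seed(42)
X <- matrix(rnorm(20 * 6), nrow = 20)    # 20 samples x 6 variables
X[, 2] <- X[, 1] + rnorm(20, sd = 0.1)   # make two variables correlated

# Autoscaling = centering plus scaling each variable to unit variance
pca <- prcomp(X, center = TRUE, scale. = TRUE)

var_expl <- pca$sdev^2 / sum(pca$sdev^2)  # fraction of variance per PC
cum_var <- cumsum(var_expl)               # cumulative variance explained
resid_var <- 1 - cum_var                  # variance left after each PC

round(data.frame(PC = seq_along(var_expl), var_expl, cum_var, resid_var), 3)
```

With autoscaled data the total variance equals the number of variables, so each variable contributes equally before the PCA is carried out.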
Next, examine the score plots (Figure 4, Figure 5). In these plots, each data point is colored by its group membership (keep in mind this is completely independent of the PCA calculation). In addition, robust confidence ellipses are shown for each group. Inspection of these plots is one way to identify potential outliers. The other use is of course to see if the sample classes separate, and by how much.
Examination of these plots shows that separation by classes has not really been achieved using autoscaling. In Figure 4 we see four clear outlier candidates (samples 37, 101, 107, and 118). In Figure 5 we see some of these samples and should probably add sample 114 for a total of five candidates.
p <- plotScores(Worms, c_pca, pcs = 1:2, ellipse = "rob", tol = 0.02)
p
p <- plotScores(Worms, c_pca, pcs = 2:3, ellipse = "rob", leg.loc = "topright", tol = 0.02)
p
To label more sample points, you can increase the value of the argument tol
.
The protocol recommends plotting Hotelling’s T^{2} ellipse for the entire data set; this is not implemented in ChemoSpec
but we can easily do it if we are using ggplot2
plots (which is the default in ChemoSpec
). We need the ellipseCoord
function from the HotellingsEllipse
package.^{3}
source("ellipseCoord.R")
xy_coord <- ellipseCoord(as.data.frame(c_pca$x), pcx = 1, pcy = 2, conf.limit = 0.95,
  pts = 500)
p <- plotScores(Worms, c_pca, which = 1:2, ellipse = "none", tol = 0.02)
p <- p + geom_path(data = xy_coord, aes(x = x, y = y)) + scale_color_manual(values = "black")
p
We can see many of the same outliers by this approach as we saw in Figure 4 and Figure 5.
Another way to identify outliers is to use the approach described in Varmuza and Filzmoser (2009) section 3.7.3. Figure 7 and Figure 8 give the plots. Please see Filzmoser for the details, but any samples that are above the plotted threshold line are candidate outliers, and any samples above the threshold in both plots should be looked at very carefully. Though we are using classical PCA, Filzmoser recommends using these plots with robust PCA. These plots are a better approach than “eyeballing it” on the score plots.
p <- pcaDiag(Worms, c_pca, plot = "OD")
p
p <- pcaDiag(Worms, c_pca, plot = "SD")
p
Comparison of these plots suggests that samples 37, 101, 107, 114 and 118 are likely outliers. These spectra should be examined to see if the reason for their outlyingness can be deduced. If good reason can be found, they can be removed as follows.^{4}
Worms2 <- removeSample(Worms, rem.sam = c("37_", "101_", "107_", "114_", "118_"))
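A word of caution about name patterns like these. As I understand the documentation, character values given to rem.sam are treated as regular expressions, so an unanchored pattern can match more names than intended. The sketch below uses plain grep on made-up sample names (the suffix _X is hypothetical) to show the effect; whether it applies to a given data set depends on the actual names.

```r
# 139 made-up sample names of the form "1_X", "2_X", ...
nm <- paste0(1:139, "_X")

# Unanchored: "37_" also matches "137_X"
grep("37_", nm)    # 37 137

# Anchored at the start of the name: only sample 37
grep("^37_", nm)   # 37

# Count the matches for each pattern used above, unanchored vs anchored
pats <- c("37_", "101_", "107_", "114_", "118_")
n_unanchored <- length(unique(unlist(lapply(pats, grep, x = nm))))
n_anchored <- length(unique(unlist(lapply(paste0("^", pats), grep, x = nm))))
c(unanchored = n_unanchored, anchored = n_anchored)   # 6 and 5
```

This would be consistent with the summary in Part 3 reporting 133 spectra, six fewer than the 139 we started with; it is worth running sumSpectra after any removal to confirm the count.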
At this point one should repeat the PCA, score plots and diagnostic plots to get a good look at how removing these samples affected the results. Those tasks are left to the reader.
We will continue in the next post with a discussion of loadings.
This post was created using ChemoSpec
version 6.1.3 and ChemoSpecUtils
version 1.0.0.
Without scaling, the largest peaks will drive the separation in the scores plot.↩︎
Be sure you have ChemoSpec
6.1.3 or higher, as cv_pcaSpectra
had a bug in it! One benefit of writing these posts is finding lame bugs…↩︎
We are sourcing in a corrected version of the function, as the CRAN version has a small error in it.↩︎
The Jupyter notebook has details about this.↩︎
@online{hanson2022,
author = {Bryan Hanson},
editor = {},
title = {Metabolic {Phenotyping} {Protocol} {Part} 2},
date = {2022-03-24},
url = {http://chemospec.org/20220324ProtocolPt2.html},
langid = {en}
}
Pro-tip: These pages load slowly in some browsers. I had the best luck with Chrome. Try the reader view for a user-friendly version that prints well (if you are into that).
@online{hanson2022,
author = {Bryan Hanson},
editor = {},
title = {Chemometrics in {Spectroscopy:} {Key} {References}},
date = {2022-02-18},
url = {http://chemospec.org/20220218KeyReferences.html},
langid = {en}
}
I’ve been developing packages in R
for over a decade now. When adding new features to a package, I often import functions from another package, and of course that package goes in the Imports:
field of the DESCRIPTION
file. Later, I might change my approach entirely and no longer need that package. Do I remember to remove it from DESCRIPTION
? Generally not. The same thing happens when writing a new vignette, and it can happen with the Suggests:
field as well. It can also happen when one splits a package into several smaller packages. If one forgets to delete a package from the DESCRIPTION
file, the dependencies become bloated, because all the imported and suggested packages have to be available to install the package. This adds overhead to the project, and increases the possibility of a namespace conflict.
In fact this just happened to me again! The author of a package I had in Suggests:
wrote to me and let me know their package would be archived. It was an easy enough fix for me, as it was a “stale” package in that I was no longer using it. I had added it for a vignette which I later deleted, as I decided a series of blog posts was a better approach.
So I decided to write a little function to check for such stale Suggests:
and Imports:
entries. This post is about that function. As far as I can tell there is no built-in function for this purpose, and CRAN does not check for stale entries. So it was worth my time to automate the process.^{1}
The first step is to read in the DESCRIPTION
file for the package (so we want our working directory to be the top level of the package). There is a built-in function for this. We’ll use the DESCRIPTION
file from the ChemoSpec
package as a demonstration.
# setwd("...") # set to the top level of the package
desc <- read.dcf("DESCRIPTION", all = TRUE)
The argument all = TRUE
is a bit odd in that it has a particular purpose (see ?read.dcf
) which isn’t really important here, but has the side effect of returning a data frame, which makes our job simpler. Let’s look at what is returned.
str(desc)
'data.frame': 1 obs. of 18 variables:
$ Package : chr "ChemoSpec"
$ Type : chr "Package"
$ Title : chr "Exploratory Chemometrics for Spectroscopy"
$ Version : chr "6.1.2"
$ Date : chr "2022-02-08"
$ Authors@R : chr "c(\nperson(\"Bryan A.\", \"Hanson\",\nrole = c(\"aut\", \"cre\"), email =\n\"hanson@depauw.edu\",\ncomment = c(" __truncated__
$ Description : chr "A collection of functions for top-down exploratory data analysis\nof spectral data including nuclear magnetic r" __truncated__
$ License : chr "GPL-3"
$ Depends : chr "R (>= 3.5),\nChemoSpecUtils (>= 1.0)"
$ Imports : chr "plyr,\nstats,\nutils,\ngrDevices,\nreshape2,\nreadJDX (>= 0.6),\npatchwork,\nggplot2,\nplotly,\nmagrittr"
$ Suggests : chr "IDPmisc,\nknitr,\njs,\nNbClust,\nlattice,\nbaseline,\nmclust,\npls,\nclusterCrit,\nR.utils,\nRColorBrewer,\nser" __truncated__
$ URL : chr "https://bryanhanson.github.io/ChemoSpec/"
$ BugReports : chr "https://github.com/bryanhanson/ChemoSpec/issues"
$ ByteCompile : chr "TRUE"
$ VignetteBuilder : chr "knitr"
$ Encoding : chr "UTF-8"
$ RoxygenNote : chr "7.1.2"
$ NeedsCompilation: chr "no"
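If you want to try this without a package at hand, the following sketch writes a tiny, made-up DESCRIPTION-style file to a temporary location and reads it both ways, to show the difference all = TRUE makes to the return type.

```r
# A minimal, made-up DESCRIPTION-style file
tmp <- tempfile()
writeLines(c(
  "Package: demoPkg",
  "Version: 0.1.0",
  "Imports: stats,",
  "    utils (>= 3.5)"   # continuation lines start with white space
), tmp)

d1 <- read.dcf(tmp)              # default: a character matrix
d2 <- read.dcf(tmp, all = TRUE)  # all = TRUE: a data frame

class(d1)
class(d2)
d2$Imports                       # still contains the newline and version spec
```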
We are interested in the Imports
and Suggests
elements. Let’s look more closely.
head(desc$Imports)
[1] "plyr,\nstats,\nutils,\ngrDevices,\nreshape2,\nreadJDX (>= 0.6),\npatchwork,\nggplot2,\nplotly,\nmagrittr"
You can see there are a bunch of newlines in there (\n
), along with some version specifications, in parentheses. We need to clean this up so we have a simple list of the packages as a vector. For clean up we’ll use the following helper function.
clean_up <- function(string) {
  string <- gsub("\n", "", string) # remove newlines
  # remove each pair of parens & anything within it; [^)]* keeps the
  # match from running on from the first "(" to the last ")"
  string <- gsub("\\([^)]*\\)", "", string)
  string <- unlist(strsplit(string, ",")) # split the long string into pieces
  trimws(string) # remove any white space around words & return the result
}
After we apply this to the raw results, we have what we are after, a clean list of imported packages.
imp <- clean_up(desc$Imports)
imp
[1] "plyr" "stats" "utils" "grDevices" "reshape2" "readJDX"
[7] "patchwork" "ggplot2" "plotly" "magrittr"
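One detail worth checking when cleaning these fields is what happens when more than one entry carries a version specification, as in the Depends field shown earlier. The sketch below is self-contained; clean_pkg_field is a hypothetical name for a cleaning helper whose pattern stops at the first closing parenthesis, compared against a greedy pattern.

```r
# Hypothetical cleaning helper using a pattern that cannot span
# from one version specification into the next
clean_pkg_field <- function(string) {
  string <- gsub("\n", "", string, fixed = TRUE)
  string <- gsub("\\([^)]*\\)", "", string)
  pkgs <- trimws(unlist(strsplit(string, ",", fixed = TRUE)))
  pkgs[nzchar(pkgs)]   # drop any empty pieces
}

depends <- "R (>= 3.5),\nChemoSpecUtils (>= 1.0)"
clean_pkg_field(depends)   # "R" "ChemoSpecUtils"

# A greedy pattern matches from the first "(" to the last ")",
# swallowing the second package name
greedy <- trimws(gsub("\\(.+\\)", "", gsub("\n", "", depends)))
greedy                     # "R"
```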
Next, we can search the entire package looking for these package names to see if they are used in the package. They might appear in import statements, vignettes, code and so forth, so it’s not sufficient to just look at code. This is a job for grep
, but we’ll call grep
from within R
so that we don’t have to use the command line and transfer the results to R
; that gets messy and is error-prone.
if (length(imp) >= 1) { # Note 1
  imp_res <- rep(FALSE, length(imp)) # logical vector to keep track of whether we found a package or not
  for (i in seq_along(imp)) {
    args <- paste("-r -e '", imp[i], "' *", sep = "") # assemble arguments for grep
    g_imp <- system2("grep", args, stdout = TRUE)
    if (length(g_imp) > 1L) imp_res[i] <- TRUE # Note 2
  }
}
g_imp
contains the results of the grep process. If there are imports in the package, each imported package name will be found by grep in the DESCRIPTION
file. That’s not so interesting, so we don’t count it. For a package to be stale, it will be found in DESCRIPTION
but nowhere else.
We can do the same process for the Suggests:
field of DESCRIPTION
. And then it would be nice to present the results in a more usable form. At this point we can put it all together in an easy-to-use function.^{2}
# run from the package top level
check_stale_imports_suggests <- function() {
  # helper function: removes extra characters
  # from strings read by read.dcf
  clean_up <- function(string) {
    string <- gsub("\n", "", string)
    string <- gsub("\\([^)]*\\)", "", string) # each (...) handled separately
    string <- unlist(strsplit(string, ","))
    trimws(string)
  }
  desc <- read.dcf("DESCRIPTION", all = TRUE)
  # look for use of imported packages
  imp <- clean_up(desc$Imports)
  # logical vector, defined even if there are no entries
  imp_res <- rep(FALSE, length(imp))
  if (length(imp) == 0L) message("No Imports: entries found")
  for (i in seq_along(imp)) {
    args <- paste("-r -e '", imp[i], "' *", sep = "")
    g_imp <- system2("grep", args, stdout = TRUE)
    # always found once in DESCRIPTION, hence > 1
    if (length(g_imp) > 1L) imp_res[i] <- TRUE
  }
  # look for use of suggested packages
  sug <- clean_up(desc$Suggests)
  sug_res <- rep(FALSE, length(sug))
  if (length(sug) == 0L) message("No Suggests: entries found")
  for (i in seq_along(sug)) {
    args <- paste("-r -e '", sug[i], "' *", sep = "")
    g_sug <- system2("grep", args, stdout = TRUE)
    # always found once in DESCRIPTION, hence > 1
    if (length(g_sug) > 1L) sug_res[i] <- TRUE
  }
  # arrange output in easy to read format
  role <- c(rep("Imports", length(imp)), rep("Suggests", length(sug)))
  return(data.frame(
    pkg = c(imp, sug),
    role = role,
    found = c(imp_res, sug_res)))
}
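Calling grep through system2 assumes a command-line grep is available, which may not be the case on every platform. As a possible alternative, here is a sketch that does the search in pure R. It counts files rather than matching lines, so it is not an exact replacement for the approach above; the function name count_files_mentioning and the toy package layout are made up for the demonstration.

```r
# Count how many files under 'dir' mention 'pkg' (literal match)
count_files_mentioning <- function(pkg, dir = ".") {
  files <- list.files(dir, recursive = TRUE, full.names = TRUE)
  hits <- vapply(files, function(f) {
    txt <- tryCatch(readLines(f, warn = FALSE), error = function(e) character(0))
    any(grepl(pkg, txt, fixed = TRUE, useBytes = TRUE))
  }, logical(1))
  sum(hits)
}

# Build a toy package in a temporary directory to try it out
pkg_dir <- file.path(tempdir(), "demoPkg")
dir.create(file.path(pkg_dir, "R"), recursive = TRUE, showWarnings = FALSE)
writeLines(c("Package: demoPkg", "Imports: plyr, ggplot2"),
  file.path(pkg_dir, "DESCRIPTION"))
writeLines("p <- ggplot2::ggplot()", file.path(pkg_dir, "R", "plot.R"))

n_ggplot2 <- count_files_mentioning("ggplot2", pkg_dir)  # DESCRIPTION + plot.R
n_plyr <- count_files_mentioning("plyr", pkg_dir)        # DESCRIPTION only

# A package mentioned only in DESCRIPTION is a stale candidate
c(ggplot2_stale = n_ggplot2 <= 1, plyr_stale = n_plyr <= 1)
```

Either way, short names such as js will match inside longer words, so the results still deserve a manual look.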
Applying this function to my ChemoSpec2D
package (as of the date of this post), we see the following output. You can see a bunch of packages are imported but never used, so I have some work to do. This was the result of copying the DESCRIPTION
file from ChemoSpec
when I started ChemoSpec2D
and obviously I never went back and cleaned things up.
pkg role found
1 plyr Imports TRUE
2 stats Imports TRUE
3 utils Imports TRUE
4 grDevices Imports TRUE
5 reshape2 Imports TRUE
6 readJDX Imports TRUE
7 patchwork Imports TRUE
8 ggplot2 Imports TRUE
9 plotly Imports TRUE
10 magrittr Imports TRUE
11 IDPmisc Suggests TRUE
12 knitr Suggests TRUE
13 js Suggests TRUE
14 NbClust Suggests TRUE
15 lattice Suggests TRUE
16 baseline Suggests TRUE
17 mclust Suggests TRUE
18 pls Suggests TRUE
19 clusterCrit Suggests TRUE
20 R.utils Suggests TRUE
21 RColorBrewer Suggests TRUE
22 seriation Suggests FALSE
23 MASS Suggests FALSE
24 robustbase Suggests FALSE
25 grid Suggests TRUE
26 pcaPP Suggests FALSE
27 jsonlite Suggests FALSE
28 gsubfn Suggests FALSE
29 signal Suggests TRUE
30 speaq Suggests FALSE
31 tinytest Suggests FALSE
32 elasticnet Suggests FALSE
33 irlba Suggests FALSE
34 amap Suggests FALSE
35 rmarkdown Suggests TRUE
36 bookdown Suggests FALSE
37 chemometrics Suggests FALSE
38 hyperSpec Suggests FALSE
As you will see in a moment, during testing I found a bunch of stale entries I need to remove from several packages!↩︎
In easy to use form as a Gist.↩︎
@online{hanson2022,
author = {Bryan Hanson},
editor = {},
title = {Do {You} Have {Stale} {Imports} or {Suggests?}},
date = {2022-02-09},
url = {http://chemospec.org/20220209ImportsSuggests.html},
langid = {en}
}
If you aren’t familiar with ChemoSpec
, you might wish to look at the introductory vignette first.
Blaise et al. (2021) have published a detailed protocol for metabolomic phenotyping. They illustrate the protocol using a data set composed of 139 ^{1}H HRMAS SSNMR spectra (Blaise et al. 2007) of the model organism Caenorhabditis elegans. There are two genotypes, wild type and a mutant, and worms from two life stages.
This series of posts follows the published protocol closely in order to illustrate how to implement the protocol using ChemoSpec
. As in any chemometric analysis, there are decisions to be made about how to process the data. In these posts we are interested in which functions to use, and how to examine the results. We are not exploring all possible data processing choices, and argument choices are not necessarily optimized.
The data set is large, over 30 Mb, so we will grab it directly from the Github repo where it is stored. We will use a custom function to grab the data (you can see the function in the source for this document if interested). The URLs given below point to the frequency scale, the raw data matrix and the variables that describe the sample classification by genotype and life stage (L2 are gravid adults, L4 are larvae).
urls <- c("https://raw.githubusercontent.com/Gscorreia89/chemometricstutorials/master/data/ppm.csv",
"https://raw.githubusercontent.com/Gscorreia89/chemometricstutorials/master/data/X_spectra.csv",
"https://raw.githubusercontent.com/Gscorreia89/chemometricstutorials/master/data/worm_yvars.csv")
raw <- get_csvs_from_github(urls, sep = ",") # a list of data sets
names(raw)
[1] "ppm.csv" "X_spectra.csv" "worm_yvars.csv"
The format of the data as provided in Github is not really suited to using either of the built-in import functions in ChemoSpec
. Therefore we will construct the Spectra
object by hand, a useful exercise in its own right. The requirements for a Spectra
object are described in ?Spectra
.
First, we’ll take the results in raw and convert them to the proper form. Each element of raw is a data frame.
# frequencies are in the 1st list element
freq <- unlist(raw[[1]], use.names = FALSE)
# intensities are in the 2nd list element
data <- as.matrix(raw[[2]])
dimnames(data) <- NULL # remove the default data frame col names
ns <- nrow(data) # ns = number of samples, used later
# get genotype & lifestage, recode into something more readable
yvars <- raw[[3]]
names(yvars) <- c("genotype", "stage")
yvars$genotype <- ifelse(yvars$genotype == 1L, "WT", "Mut")
yvars$stage <- ifelse(yvars$stage == 1L, "L2", "L4")
table(yvars) # quick look at how many in each group
        stage
genotype L2 L4
     Mut 32 33
     WT  34 40
Next we’ll construct some useful sample names, create the groups vector, assign the colors and symbols, and finally put it all together into a Spectra object.
# build up sample names to include the group membership
sample_names <- as.character(1:ns)
sample_names <- paste(sample_names, yvars$genotype, sep = "_")
sample_names <- paste(sample_names, yvars$stage, sep = "_")
head(sample_names)
[1] "1_WT_L4" "2_Mut_L4" "3_Mut_L4" "4_WT_L4" "5_Mut_L4" "6_WT_L4"
# use the sample names to create the groups vector
grp <- gsub("[0-9]+_", "", sample_names) # remove 1_ etc, leaving WT_L2 etc
groups <- as.factor(grp)
levels(groups)
[1] "Mut_L2" "Mut_L4" "WT_L2" "WT_L4"
# set up the colors based on group membership
data(Col12) # see ?colorSymbol for a swatch
colors <- grp
colors <- ifelse(colors == "WT_L2", Col12[1], colors)
colors <- ifelse(colors == "WT_L4", Col12[2], colors)
colors <- ifelse(colors == "Mut_L2", Col12[3], colors)
colors <- ifelse(colors == "Mut_L4", Col12[4], colors)
# set up the symbols based on group membership
sym <- grp # see ?points for the symbol codes
sym <- ifelse(sym == "WT_L2", 1, sym)
sym <- ifelse(sym == "WT_L4", 16, sym)
sym <- ifelse(sym == "Mut_L2", 0, sym)
sym <- ifelse(sym == "Mut_L4", 15, sym)
sym <- as.integer(sym)
# set up the alt symbols based on group membership
alt.sym <- grp
alt.sym <- ifelse(alt.sym == "WT_L2", "w2", alt.sym)
alt.sym <- ifelse(alt.sym == "WT_L4", "w4", alt.sym)
alt.sym <- ifelse(alt.sym == "Mut_L2", "m2", alt.sym)
alt.sym <- ifelse(alt.sym == "Mut_L4", "m4", alt.sym)
# put it all together; see ?Spectra for requirements
Worms <- list()
Worms$freq <- freq
Worms$data <- data
Worms$names <- sample_names
Worms$groups <- groups
Worms$colors <- colors
Worms$sym <- sym
Worms$alt.sym <- alt.sym
Worms$unit <- c("ppm", "intensity")
Worms$desc <- "C. elegans metabolic phenotyping study (Blaise 2007)"
class(Worms) <- "Spectra"
chkSpectra(Worms) # verify we have everything correct
sumSpectra(Worms)
C. elegans metabolic phenotyping study (Blaise 2007)
There are 139 spectra in this set.
The y-axis unit is intensity.
The frequency scale runs from
8.9995 to 5e-04 ppm
There are 8600 frequency values.
The frequency resolution is
0.001 ppm/point.
This data set is not continuous
along the frequency axis.
Here are the data chunks:
  beg.freq end.freq  size beg.indx end.indx
1   8.9995   5.0005 3.999        1     4000
2   4.5995   0.0005 4.599     4001     8600
The spectra are divided into 4 groups:
   group no.     color symbol alt.sym
1 Mut_L2  32 #FB0D16FF      0      m2
2 Mut_L4  33 #FFC0CBFF     15      m4
3  WT_L2  34 #511CFCFF      1      w2
4  WT_L4  40 #2E94E9FF     16      w4
*** Note: this is an S3 object
of class 'Spectra'
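The chained ifelse() calls used to set the colors and symbols work fine, but the same mapping can also be written with named lookup vectors. What follows is only a minimal sketch, not the code used to build Worms: the small grp vector is made up for illustration, and the color strings are copied from the sumSpectra output rather than pulled from Col12.

```r
# Toy group labels standing in for the real grp vector (illustrative only)
grp <- c("WT_L4", "Mut_L4", "WT_L2", "Mut_L2", "WT_L4")

# Named lookup vectors: names are group labels, values are the settings.
# Colors are copied from the sumSpectra() output, not taken from Col12.
col_map <- c(WT_L2 = "#511CFCFF", WT_L4 = "#2E94E9FF",
             Mut_L2 = "#FB0D16FF", Mut_L4 = "#FFC0CBFF")
sym_map <- c(WT_L2 = 1L, WT_L4 = 16L, Mut_L2 = 0L, Mut_L4 = 15L)
alt_map <- c(WT_L2 = "w2", WT_L4 = "w4", Mut_L2 = "m2", Mut_L4 = "m4")

# Indexing by name performs all four substitutions in one step
colors  <- unname(col_map[grp])
sym     <- unname(sym_map[grp])
alt.sym <- unname(alt_map[grp])

# An unexpected group label would show up as NA, which is easy to check for
stopifnot(!anyNA(colors), !anyNA(sym), !anyNA(alt.sym))
```

One advantage of this form is that adding a fifth group means adding one entry per lookup vector rather than another ifelse() line per setting.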
Let’s look at one sample from each group to make sure everything looks reasonable (Figure 1). At least these four spectra look good. Note that we are using the latest ChemoSpec, which uses ggplot2 graphics by default (announced here).
p <- plotSpectra(Worms, which = c(35, 1, 34, 2), lab.pos = 7.5, offset = 0.008, amplify = 35,
yrange = c(0.05, 1.1))
p
In the next post we’ll continue with some basic exploratory data analysis.
This post was created using ChemoSpec version 6.1.3 and ChemoSpecUtils version 1.0.0.
@online{hanson2022,
author = {Bryan Hanson},
editor = {},
title = {Metabolic {Phenotyping} {Protocol} {Part} 1},
date = {2022-02-01},
url = {http://chemospec.org/20220201ProtocolPt1.html},
langid = {en}
}
Thanks to Mr. Tejasvi Gupta and the support of GSOC, ChemoSpec and ChemoSpec2D were extended to produce ggplot2 graphics and plotly graphics! ggplot2 is now the default output, and the ggplot2 object is returned, so if one doesn’t like the choice of theme or any other aspect, one can customize the object as desired. The ggplot2 graphics output is generally similar in layout and spirit to the base graphics output, but significant improvements have been made in labeling data points using the ggrepel package. The original base graphics are still available as well. Much of this work required changes in ChemoSpecUtils, which supports the common needs of both packages.
Tejasvi did a really great job with this project, and I think users of these packages will really like the results. We have greatly expanded the pre-release testing of the graphics, and as far as we can see everything works as intended. Of course, please file an issue if you see any problems or unexpected behavior.
To see more about how the new graphics options work, take a look at GraphicsOptions. Here are the functions that were updated:
plotSpectra
surveySpectra
surveySpectra2
reviewAllSpectra (formerly loopThruSpectra)
plotScree (resides in ChemoSpecUtils)
plotScores (resides in ChemoSpecUtils)
plotLoadings (uses patchwork and hence plotly isn’t available)
plot2Loadings
sPlotSpectra
pcaDiag
plotSampleDist
aovPCAscores
aovPCAloadings (uses patchwork and hence plotly isn’t available)
Tejasvi and I are looking forward to your feedback. There are many other smaller changes that we’ll let users discover as they work. And there’s more work to be done, but other projects need attention and I need a little rest!
@online{hanson2021,
author = {Bryan Hanson},
editor = {},
title = {GSOC 2021: {New} {Graphics} for {ChemoSpec}},
date = {2021-10-13},
url = {http://chemospec.org/20211013GSOCCSGraphics.html},
langid = {en}
}
hyperSpec (see here for Erick’s wrap up blog post at the end of last year). Sang Truong is the very talented student who will be joining us. Sang’s project is described here.
ChemoSpec will be upgraded to use ggplot2 graphics along with interactive graphics for many of the plots that are currently rendered in base graphics. Erick, who was the student working on hyperSpec last summer, will be my co-mentor on this project. We are looking forward to having Tejasvi Gupta as the student on this project.
@online{hanson2021,
author = {Bryan Hanson},
editor = {},
title = {GSOC 2021: {hyperSpec} and {ChemoSpec!}},
date = {2021-05-22},
url = {http://chemospec.org/20210522GSOChyperSpecChemoSpec.html},
langid = {en}
}
One of the projects I maintain is the FOSS for Spectroscopy web site. The table at that site lists various software for use in spectroscopy. Historically, I have used the Github or Python Package Index search engines to manually search by topic such as “NMR” to find repositories of interest. Recently, I decided to try to automate at least some of this process. In this post I’ll present the code and steps I developed to search Github by topics. Fortunately, I wasn’t starting from scratch, as I had learned some basic web-scraping techniques when I wrote the functions that get the date of the most recent repository update. All the code for this website and project can be viewed here. The steps reported here are current as of the publication of this post, but are subject to change in the future.^{1}
First off, did you know Github allows repository owners to tag their repositories using topical keywords? I didn’t know this for a long time. So add topics to your repositories if you don’t have them already. By the way, the Achilles heel of this project is that good pieces of software may not have any topical tags at all. If you run into this, perhaps you would consider creating an issue to ask the owner to add tags.
If you look at the Utilities directory of the project, you’ll see the scripts and functions that power this search process.
Search Repos for Topics Script.R supervises the whole process. It sources:
searchRepos.R (a function)
searchTopic.R (a function)
First let’s look at the supervising script, starting with the necessary preliminaries:
library("jsonlite")
library("httr")
library("stringr")
library("readxl")
library("WriteXLS")
source("Utilities/searchTopic.R")
source("Utilities/searchRepos.R")
Note that this assumes one has the top level directory, FOSS4Spectroscopy, as the working directory (this is a bit easier than constantly jumping around).
Next, we pull in the Excel spreadsheet that contains all the basic data about the repositories that we already know about, so we can eventually remove those from the search results.
known <- as.data.frame(read_xlsx("FOSS4Spec.xlsx"))
known <- known$name
Now we define some topics and run the search (more on the search functions in a moment):
topics <- c("NMR", "EPR", "ESR")
res <- searchRepos(topics, "github_token", known.repos = known)
We’ll also talk about that github_token in a moment. With the search results in hand, we have a few steps to make a useful file name and save it in the Searches folder for future use.
file_name <- paste(topics, collapse = "_")
file_name <- paste("Search", file_name, sep = "_")
file_name <- paste(file_name, "xlsx", sep = ".")
file_name <- paste("Searches", file_name, sep = "/")
WriteXLS(res, file_name,
  row.names = FALSE, col.names = TRUE, na = "NA")
At this point, one can open the spreadsheet in Excel and check each URL (the links are live in the spreadsheet). After vetting each site,^{2} one can append the new results to the existing FOSS4Spec.xlsx database and refresh the entire site so the table is updated.
To make this job easier, I like to have the search results spreadsheet open and then open all the URLs in a browser as follows. Then I can quickly clean up the spreadsheet (it helps to have two monitors for this process).
found <- as.data.frame(read_xlsx(file_name))
for (i in seq_len(nrow(found))) { # seq_len() is safe if there are no rows
  if (grepl("^https?://", found$url[i], ignore.case = TRUE)) BROWSE(found$url[i])
}
In order to use the Github API, you have to authenticate. Otherwise you will be severely rate-limited. If you are authenticated, you can make up to 5,000 API queries per hour.
To authenticate, you need to first establish some credentials with Github, by setting up a “key” and a “secret”. You can set these up here by choosing the “Oauth Apps” tab. Record these items in a secure way, and be certain you don’t actually publish them by pushing.
Now you are ready to authenticate your R instance using “Web Application Flow”.^{3}
myapp <- oauth_app("FOSS", key = "put_your_key_here", secret = "put_your_secret_here")
github_token <- oauth2.0_token(oauth_endpoints("github"), myapp)
If successful, this will open a web page which you can immediately close. In the R console, you’ll need to choose whether to do a one-time authentication, or leave a hidden file behind with authentication details. I use the one-time option, as I don’t want to accidentally publish the secrets in the hidden file (since they are easy to overlook, being hidden and all).
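One way to keep the key and secret out of your scripts is to read them from environment variables. This is only a sketch under assumptions: the variable names GITHUB_APP_KEY and GITHUB_APP_SECRET and the helper get_credential are hypothetical and are not part of the FOSS4Spectroscopy code.

```r
# Hypothetical helper: fetch a credential from an environment variable and
# stop with a clear message if it has not been set.
get_credential <- function(var) {
  value <- Sys.getenv(var, unset = "")
  if (!nzchar(value)) stop("Environment variable ", var, " is not set")
  value
}

# The variable names below are assumptions; set them in ~/.Renviron (kept out
# of version control) or in the shell that launches R. Then, with httr loaded:
# myapp <- oauth_app("FOSS",
#                    key = get_credential("GITHUB_APP_KEY"),
#                    secret = get_credential("GITHUB_APP_SECRET"))
# github_token <- oauth2.0_token(oauth_endpoints("github"), myapp)
```

With this approach the script itself contains no secrets, so pushing it to a public repo is harmless as long as the file holding the variables is never committed.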
searchTopic
searchTopic is a function that accesses the Github API to search for a single topic.^{4} This function is “pretty simple” in that it is short, but there are six helper functions defined in the same file. So, “short not short”. This function does all the heavy lifting; the major steps are:
Carry out an authenticated query of the topics associated with all Github repositories. This first “hit” returns up to 30 results, and also a header that tells how many more pages of results are out there.
Process that first set of results by converting the response to a JSON structure, because nice people have already built functions to handle such things (I’m looking at you httr).
Check that structure for a message that will tell us if we got stopped by Github access issues (and if so, report access stats).
Extract only the name, description and repository URL from the huge volume of information captured.
Inspect the first response to see how many more pages there are, then loop over page two (we already have page 1) to the number of pages, basically repeating step 2.
Along the way, all the results are stored in a data.frame.
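To make the paging step a little more concrete, here is a rough sketch of how the number of pages could be read from the Link header that Github sends with paginated responses. The function name last_page and the example header are made up for illustration; the helper functions in searchTopic.R may well go about this differently.

```r
# Hypothetical helper: extract the last page number from a Link header.
# Returns 1 if there is no header or no rel="last" entry (a single page).
last_page <- function(link) {
  if (is.null(link) || !nzchar(link)) return(1L)
  # The header is a comma-separated list of <url>; rel="..." entries
  parts <- strsplit(link, ",\\s*")[[1]]
  last <- parts[grepl('rel="last"', parts, fixed = TRUE)]
  if (length(last) == 0) return(1L)
  # Pull out the page parameter; [?&] keeps per_page from matching
  m <- regmatches(last[1], regexpr("[?&]page=[0-9]+", last[1]))
  if (length(m) == 0) return(1L)
  as.integer(sub("[?&]page=", "", m))
}

# Made-up header in the format Github uses for paginated results
link <- paste0(
  '<https://api.github.com/search/repositories?q=topic%3ANMR&per_page=30&page=2>; rel="next", ',
  '<https://api.github.com/search/repositories?q=topic%3ANMR&per_page=30&page=34>; rel="last"')
last_page(link) # 34
```

With an httr response object, the header would typically be available as headers(response)$link, and the result could drive a loop from page 2 up to the last page.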
searchRepos
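searchRepos does two simple things:
It calls searchTopic once for each topic, since searchTopic only handles one topic at a time.
It removes the repositories already listed in known.repos from the results.
There are two other scripts in the Utilities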
folder that streamline maintenance of the project.
mergeSearches.R, which will merge several search results into one, removing duplicates along the way.
mergeMaintainers.R, which will query CRAN for the maintainers of all packages in FOSS4Spec.xlsx, and add this info to the file.^{5} Maintainers are not currently displayed on the main website. However, I hope to eventually email all maintainers so they can fine-tune the information about their entries.
Clearly it would be good for someone who knows Python to step in and write the analogous search code for PyPi.org. Depending upon time constraints, I may use this as an opportunity to learn more Python, but really, if you want to help that would be quicker!
And that folks, is how the sausage is made.
This code has been tested on a number of searches and I’ve captured every exception I’ve encountered. If you have problems using this code, please file an issue. It’s nearly impossible that it is perfect at this point!↩︎
Some search terms produce quite a few false positives. I also review each repository to make sure the project is actually FOSS, is not a student project etc (more details on the main web site).↩︎
While I link to the documentation for completeness, the steps described next do all the work.↩︎
See notes in the file: I have not been able to get the Github API to work with multiple terms, so we search each one individually.↩︎
Want to contribute? If you know the workings of the PyPi.org API it would be nice to automatically pull the maintainer’s contact info.↩︎
@online{hanson2021,
author = {Bryan Hanson},
editor = {},
title = {Automatically {Searching} {Github} {Repos} by {Topic}},
date = {2021-04-19},
url = {http://chemospec.org/20210419SearchGHTopics.html},
langid = {en}
}
hyperSpec package which had grown quite large and hard to maintain.^{1} The essence of the project was to break the original hyperSpec package into smaller packages.^{2} As part of that project, we needed to be able to:
In this post I’ll describe how we used Dirk Eddelbuettel’s drat package and Github Actions to automate the deployment of packages between repositories.
drat is a package that simplifies the creation and modification of CRAN-like repositories. The structure of a CRAN-like repository is officially described briefly here.^{3} Basically, there is a required set of subdirectories, required files containing package metadata, and source packages that are the result of the usual build and check process. One can also have platform-specific binary packages. drat will create the directories and metadata for you, and provides utilities that will move packages to the correct location and update the corresponding metadata.^{4} The link above provides access to all sorts of documentation. My advice is to not overthink the concept. A repository is simply a directory structure and a couple of required metadata files, which must be kept in sync with the packages present. drat does the heavy lifting for you.
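As a rough illustration of the kind of call involved, the sketch below finds source tarballs in a directory and hands each one to drat::insertPackage. The helper names and the directory arguments are hypothetical placeholders, and this is not the code used in the hyperSpec workflows.

```r
# Hypothetical helper: find built source packages in a directory.
# full.names = TRUE returns paths that can be passed on directly.
find_tarballs <- function(dir) {
  list.files(dir, pattern = "\\.tar\\.gz$", full.names = TRUE)
}

# Hypothetical wrapper: insert each tarball into a local clone of a
# CRAN-like repository using drat (skipped if drat is not installed).
deploy_tarballs <- function(check_dir, repo_dir) {
  tarballs <- find_tarballs(check_dir)
  if (requireNamespace("drat", quietly = TRUE)) {
    for (tb in tarballs) drat::insertPackage(tb, repodir = repo_dir)
  }
  invisible(tarballs)
}
```

In use, one would call something like deploy_tarballs("check", "path/to/cloned/repo"), where both paths are placeholders for wherever the build output and the cloned repository actually live.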
Github Actions are basically a series of tasks that one can have Github run when there is an “event” on a repo, like a push or pull. Github Actions are used extensively for continuous integration tasks, but they are not limited to such use. Github Actions are written in a simple yaml-like script that is rather easy to follow even if the details are not familiar. Github Actions uses shell commands, but much of the time the shell simply calls Rscript to run native R functions. One can run tasks on various hardware and OS versions.
The deployed packages reside on the gh-pages branch of r-hyperspec/pkg-repo in the form of the usual .tar.gz source archives, ready for users to install. One of the important features of this repo is the table of hosted packages displayed in the README. The table portion of the README.md file is generated automatically whenever someone, or something, pushes to this repo. I include the notion that something might push because as you will see next, the deploy process will automatically push archives to this repo from the repo where they are created. The details of how this README.md is generated are in dratupdatereadme.yaml. If you take a look, you’ll see that we use some shell-scripting to find any .tar.gz archives and create a markdown-ready table structure, which Github then automatically displays (as it does with all README.md files at the top level of a repo). The yaml file also contains a little drat action that will refresh the repo in case someone manually removes an archive file by git operations. Currently we do not host binary packages at this repo, but that is certainly possible by extension of the methods used for the source packages.
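The workflow itself uses shell scripting to build the table, but the idea is easy to sketch in R. The function name and the file names in the example are made up for illustration, and the real table may contain different columns.

```r
# Hypothetical helper: turn tarball file names such as "chondro_1.0.0.tar.gz"
# into the lines of a markdown table with package and version columns.
tarballs_to_markdown <- function(files) {
  base <- sub("\\.tar\\.gz$", "", basename(files))
  pkg <- sub("_.*$", "", base)    # text before the underscore
  ver <- sub("^[^_]*_", "", base) # text after the first underscore
  c("| Package | Version |",
    "|---------|---------|",
    sprintf("| %s | %s |", pkg, ver))
}

# Made-up file names, purely for illustration
files <- c("src/contrib/chondro_1.0.0.tar.gz", "src/contrib/hyperSpec_0.99.tar.gz")
writeLines(tarballs_to_markdown(files))
```

Splitting at the underscore works because R package names cannot contain underscores, so the first underscore always separates the name from the version.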
The automatic deploy process is used in several r-hyperSpec repos. I’ll use the chondro repo to illustrate the process. chondro is a simple package containing a > 2 Mb data set. If the package is updated, the package is built and checked and then deployed automatically to r-hyperSpec/pkg-repo (described above). The magic is in dratinsertpackage.yaml. The first part of this file does the standard build and check process.^{5} The second part takes care of deploying to r-hyperspec/pkg-repo. The basic steps are given next (study the file for the details). It is essential to keep in mind that each task in Github Actions starts from the same top level directory.^{6} Tasks are set off by the syntax - name: task description.
r-hyperSpec/pkg-repo. This is helpful for troubleshooting.
r-hyperSpec/pkg-repo into a temporary directory and check out the gh-pages branch.
.tar.gz files in the check folder, which is where we directed Github Actions to carry out the build and check process (the first half of this workflow).^{7} Note that the argument full.names = TRUE is essential to getting the correct path. Use drat to insert the .tar.gz files into the cloned r-hyperSpec/pkg-repo temporary directory.
r-hyperspec/pkg-repo branch back to its home, now with the new .tar.gz files included. Use a git commit message that will show where the new tarball came from.
Thanks for reading. Let me know if you have any questions, via the comments, email, etc.
This portion of the hyperSpec GSOC 2020 project was primarily the work of hyperSpec team members Erick Oduniyi, Bryan Hanson and Vilmantas Gegzna. Erick was supported by GSOC in summer 2020.
The work continues this summer, hopefully again with the support of GSOC.↩︎
Project background and results.↩︎
A more loquacious description that may be slightly dated is here.↩︎
drat is using existing R functions, mainly from the tools package. They are just organized and presented from the perspective of a user who wants to create a repo.↩︎
Modified from the recipes here.↩︎
The toughest part of writing this workflow was knowing where one was in the directory tree of the Github Actions workspace. We made liberal use of getwd(), list.files() and related functions during troubleshooting. All of these “helps” have been removed from the mature version of the workflow. As noted in the workflow, the top directory is /home/runner/work/${{ REPOSITORY_NAME }}/${{ REPOSITORY_NAME }}.↩︎
It’s helpful to understand in a general way what happens during the build and check process (e.g. the directories and files created).↩︎
@online{hanson2021,
author = {Bryan Hanson},
editor = {},
title = {Using {Github} {Actions} and Drat to {Deploy} {R} {Packages}},
date = {2021-04-11},
url = {http://chemospec.org/20210411GHAdrat.html},
langid = {en}
}
My suite of spectroscopy R packages has been updated on CRAN. There are only a few small changes, but they will be important to some of you:
ChemoSpecUtils now provides a set of colorblind-friendly colors, see ?colorSymbol. These are available for use in ChemoSpec and ChemoSpec2D.
readJDX now includes a function, splitMultiblockDX, that will split a multiblock JCAMP-DX file into separate files, which can then be imported via the usual functions in the package.
Here are the links to the documentation:
As always, let me know if you discover trouble or have questions.
@online{hanson2021,
author = {Bryan Hanson},
editor = {},
title = {Spectroscopy {Suite} {Update}},
date = {2021-03-27},
url = {http://chemospec.org/20210327SpecSuiteupdate.html},
langid = {en}
}
The ChemoSpec package carries out exploratory data analysis (EDA) on spectroscopic data. EDA is often described as “letting the data speak”, meaning that one studies various descriptive plots, carries out clustering (HCA) as well as dimension reduction (e.g. PCA), with the ultimate goal of finding any natural structure in the data.
As such, ChemoSpec does not feature any predictive modeling functions because other packages provide the necessary tools. I do however hear from users several times a year about how to interface a ChemoSpec object with these other packages, and it seems like a post about how to do this is overdue. I’ll illustrate how to carry out partial least squares (PLS) using data stored in a ChemoSpec object and the package chemometrics by Peter Filzmoser and Kurt Varmuza (Filzmoser and Varmuza 2017). One can also use the pls package (Mevik, Wehrens, and Liland 2020).
PLS is a technique related to regression and PCA that tries to develop a mathematical model between a matrix of sample vectors, in our case, spectra, and one or more separately measured dependent variables that describe the same samples (typically, chemical analyses). If one can develop a reliable model, then going forward one can measure the spectrum of a new sample and use the model to predict the value of the dependent variables, presumably saving time and money. This post will focus on interfacing ChemoSpec objects with the needed functions in chemometrics. I won’t cover how to evaluate and refine your model, but you can find plenty on this in Varmuza and Filzmoser (2009) chapter 4, along with further background (there’s a lot of math in there, but if you aren’t too keen on the math, gloss over it to get the other nuggets). Alternatively, take a look at the vignette that ships with chemometrics via browseVignettes("chemometrics").
As our example we’ll use the marzipan NIR data set that one can download in Matlab format from here.^{1} The corresponding publication is (Christensen et al. 2004). This data set contains NIR spectra of marzipan candies made with different recipes and recorded using several different instruments, along with data about moisture and sugar content. We’ll use the data recorded on the NIRSystems 6500 instrument, covering the 400-2500 nm range. The following code chunk gives a summary of the data set and shows a plot of the data. Because we are focused on how to carry out PLS, we won’t worry about whether this data needs to be normalized or otherwise preprocessed (see the Christensen paper for lots of details).
library("ChemoSpec")
Loading required package: ChemoSpecUtils
As of version 6, ChemoSpec offers new graphics output options
For details, please see ?GraphicsOptions
The ChemoSpec graphics option is set to 'ggplot2'
To change it, do
options(ChemoSpecGraphics = 'option'),
where 'option' is one of 'base' or 'ggplot2' or'plotly'.
load("Marzipan.RData")
sumSpectra(Marzipan)
Marzipan NIR data set from www.models.life.ku.dk/Marzipan
There are 32 spectra in this set.
The y-axis unit is absorbance.
The frequency scale runs from
450 to 2448 wavelength (nm)
There are 1000 frequency values.
The frequency resolution is
2 wavelength (nm)/point.
The spectra are divided into 9 groups:
group no. color symbol alt.sym
1 a 5 #FB0D16FF 1 a
2 b 4 #FFC0CBFF 16 b
3 c 4 #2AA30DFF 2 c
4 d 4 #9BCD9BFF 17 d
5 e 3 #700D87FF 3 e
6 f 3 #A777F2FF 8 f
7 g 2 #FD16D4FF 4 g
8 h 3 #B9820DFF 5 h
9 i 4 #B9820DFF 5 i
*** Note: this is an S3 object
of class 'Spectra'
plotSpectra(Marzipan, which = 1:32, lab.pos = 3000)
In order to carry out PLS, one needs to provide a matrix of spectroscopic data, with samples in rows (let’s call it X, you’ll see why in a moment). Fortunately this data is available directly from the ChemoSpec object as Marzipan$data.^{2} One also needs to provide a matrix of the additional dependent data (let’s call it Y). It is critical that the order of rows in Y correspond to the order of rows in the matrix of spectroscopic data, X.
Since we are working in R we know there are a lot of ways to do most tasks. Likely you will have the additional data in a spreadsheet, so let’s see how to bring that into the workspace. You’ll need samples in rows, and variables in columns. For your sanity and error-avoidance, you should include a header of variable names and the names of the samples in the first column. Save the spreadsheet as a csv file. I did these steps using the sugar and moisture data from the original paper. Read the file in as follows.
Y <- read.csv("Marzipan.csv", header = TRUE)
str(Y)
'data.frame': 32 obs. of 3 variables:
$ sample : chr "a1" "a2" "a3" "a4" ...
$ sugar : num 32.7 34.9 33.9 33.2 33.2 ...
$ moisture: num 15 14.9 14.7 14.9 14.9 ...
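Since the row order matters so much, it may be worth checking it explicitly before dropping the sample column. The sketch below uses toy stand-ins with made-up values; with the real data one would compare the sample column against Marzipan$names, assuming the same naming scheme was used in both places.

```r
# Toy stand-ins (illustrative only): sample names as stored in a Spectra
# object, and a data frame read from a csv file in a different order.
spectra_names <- c("a1", "a2", "b1", "b2")
Y_df <- data.frame(sample = c("b1", "a1", "b2", "a2"),
                   sugar = c(40.1, 32.7, 41.3, 34.9),
                   moisture = c(12.2, 15.0, 11.9, 14.9),
                   stringsAsFactors = FALSE)

# match() gives, for each spectrum, the row of Y_df holding its data
idx <- match(spectra_names, Y_df$sample)
if (anyNA(idx)) stop("Some spectra have no matching row in Y")
Y_df <- Y_df[idx, ]

# After reordering, the two sets of names must agree exactly
stopifnot(identical(Y_df$sample, spectra_names))
```

If the names already agree, the reordering is a no-op, so the check costs nothing and guards against a silently scrambled model.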
The function we’ll be using wants a matrix as input, so convert the data frame that read.csv generates to a matrix. Note that we’ll select only the numeric variables on the fly, as unlike a data frame, a matrix can only be composed of one data type.
Y <- as.matrix(Y[, c("sugar", "moisture")])
str(Y)
 num [1:32, 1:2] 32.7 34.9 33.9 33.2 33.2 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:2] "sugar" "moisture"
Now we are ready to carry out PLS. Since we have a multivariate Y, we need to use the appropriate function (use pls1_nipals if your Y matrix is univariate).
library("chemometrics")
Loading required package: rpart
pls_out <- pls2_nipals(X = Marzipan$data, Y, a = 5)
And we’re done! Be sure to take a look at str(pls_out) to see what you got back from the calculation. For the next steps in evaluating your model, see section 3.3 in the chemometrics vignette.
I have converted the data from Matlab to a ChemoSpec object; if anyone wants to know how to do this let me know and I’ll put up a post on that process.↩︎
str(Marzipan) will show you the structure of the ChemoSpec object (or in general, any R object). The official definition of a ChemoSpec object can be seen via ?Spectra.↩︎
@online{hanson2021,
author = {Bryan Hanson},
editor = {},
title = {Interfacing {ChemoSpec} to {PLS}},
date = {2021-02-08},
url = {http://chemospec.org/20210208PLS.html},
langid = {en}
}
Checking in from Kansas!
This past summer (2020) I had the amazing opportunity to participate in the Google Summer of Code (GSoC or GSOC). As stated on the GSOC website, GSOC is a “global program focused on bringing more student developers into open source software development. Students work with an open-source organization on a 3-month programming project during their break from school.”
This was a particularly meaningful experience as it was my last undergraduate summer internship. I’m a senior studying computer engineering at the University of Kansas, and at the beginning of the summer I still didn’t feel super comfortable working on public (open-source) projects. So, I thought this program would help build my confidence as a computer and software engineer. Moreover:
R organization because that is my favorite programming language.
r-hyperspec because I thought that would be the most impactful in terms of practicing project management and software ecosystem development.
In the process I hoped to:
Git/Github, including continuous integration
R
And through a lot of hard work all of those things came to be! Truthfully, even though the summer project was successful there is still a lot of work to do:
hyperSpec for baseline with bridge packages
hyperSpec for EMSC with bridge packages
hyperSpec for matrixStats with bridge packages.
So, I’m excited to continue to work with the team! I think there are a ton of ideas I and the team have and hopefully we will get to explore them in deeper context. Speaking of the team, I have them to thank for an awesome GSOC 2020 experience. If you are interested in the journey that was the GSoC 2020 experience (perhaps you might be interested in trying the program next year), then please feel free to jump around here to get a feel for the things that I learned and how I worked with the r-hyperspec team this summer.
Best, E. Oduniyi
@online{oduniyi2020,
author = {Erick Oduniyi},
editor = {},
title = {GSOC {Wrap} {Up}},
date = {2020-09-08},
url = {http://chemospec.org/20200908GSOChyperSpec.html},
langid = {en}
}
It is well-recognized that one of the virtues of the R language is the extensive tools it provides for working with distributions. Functions exist to generate random number draws, determine quantiles, and examine the probability density and cumulative distribution curves that describe each distribution.
This toolbox gives one the ability to create simulated data sets for testing very easily. If you need a few random numbers from a Gaussian distribution then rnorm is your friend:
rnorm(3)
[1] 1.67497913 1.49447605 0.02394601
Imagine you were developing a new technique to determine if two methods of manufacturing widgets produced widgets of the same mass.^{1} Even before the widgets were manufactured, you could test your code by simulating widget masses using rnorm:
widget_1_masses <- rnorm(100, 5.0, 0.5) # mean mass 5.0
widget_2_masses <- rnorm(100, 4.5, 0.5) # mean mass 4.5
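To show how such simulated data might be put to work, here is a small sketch that runs a standard two-sample t-test on simulated widget masses. The t-test is just a stand-in for whatever new technique one is developing, and the seed is an arbitrary choice made so the example is reproducible.

```r
set.seed(42) # arbitrary seed so the simulation is reproducible
widget_1_masses <- rnorm(100, 5.0, 0.5) # mean mass 5.0
widget_2_masses <- rnorm(100, 4.5, 0.5) # mean mass 4.5

# Welch two-sample t-test (the default for t.test)
result <- t.test(widget_1_masses, widget_2_masses)
result$p.value # a small value suggests the mean masses differ
```

Because the true means are known, one can check that the method recovers the difference that was built into the simulation, then shrink the difference or raise the standard deviation to see where it stops working.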
Variations on this approach can be used to simulate spectral data sets.^{2} The information I will share here is accumulated knowledge. I have no formal training in the theory behind the issues discussed, just skills I have picked up in various places and by experimenting. If you see something that is wrong or needs clarification or elaboration, please use the comments to set me straight!
What peak shape is expected for a given type of spectroscopy? In principle this is based on the theory behind the method, either some quantum mechanical model or an approximation of it. For some methods, like NMR, this might be fairly straightforward, at least in simple systems. But the frequencies involved in some spectroscopies are not too different from others, and coupling is observed. Two examples which “interfere” with each other are:
After theoretical considerations, we should keep in mind that all spectroscopies have some sort of detector, electronic components and basic data processing that can affect peak shape. A CCD on a UV detector is one of the simpler situations. FTIR has a mechanical interferometer, and the raw signal from both IR and NMR is Fourier-transformed prior to use. So there are not only theoretical issues to think about, but also engineering, instrument tuning, electrical engineering and mathematical issues to consider.
Even with myriad theoretical and practical considerations, a Gaussian curve is a good approximation to a simple peak, and more complex peaks can be built by summing Gaussian curves. If we want to simulate a simple peak with a Gaussian shape, we can use the dnorm function, which gives us the “density” of the distribution:
std_deviations <- seq(-5, 5, length.out = 100)
Gaussian_1 <- dnorm(std_deviations)
plot(std_deviations, Gaussian_1, type = "l",
  xlab = "standard deviations", ylab = "Gaussian Density")
If we want this to look more like a “real” peak, we can increase the x range and use x values with realistic frequency values. And if we want our spectrum to be more complex, we can add several of these curves together. Keep in mind that the area under the density curve is 1.0, and the peak width is determined by the value of argument sd (the standard deviation). For example if you want to simulate the UV spectrum of vanillin, which has maxima at about 230, 280 and 315 nm, one can do something along these lines:
wavelengths <- seq(220, 350, by = 1.0)
Peak1 <- dnorm(wavelengths, 230, 22)
Peak2 <- dnorm(wavelengths, 280, 17)
Peak3 <- dnorm(wavelengths, 315, 17)
Peaks123 <- colSums(rbind(1.6 * Peak1, Peak2, Peak3))
plot(wavelengths, Peaks123, type = "l",
xlab = "wavelengths (nm)", ylab = "arbitrary intensity")
The coefficient on Peak1 is needed to increase the contribution of that peak in order to better resemble the linked spectrum (note that the linked spectrum y-axis is ; we’re just going for a rough visual approximation).
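Because dnorm returns a density with unit area, a wider peak is automatically a shorter one: the maximum is 1/(sd * sqrt(2 * pi)). If it is more convenient to think in terms of peak height, one possible approach is to divide by that maximum. The helper gauss_peak below is a hypothetical name used only for this sketch, and the centers and widths are arbitrary.

```r
# Hypothetical helper (an assumption for illustration): a Gaussian peak
# with a chosen maximum height instead of unit area.
gauss_peak <- function(x, center, sd, height = 1) {
  # dnorm() has its maximum at x = center, so dividing by the density
  # evaluated at the center rescales the peak to a maximum of 1
  height * dnorm(x, center, sd) / dnorm(center, center, sd)
}

wavelengths <- seq(220, 350, by = 1.0)
narrow <- gauss_peak(wavelengths, 280, 5)   # sd = 5 nm
broad  <- gauss_peak(wavelengths, 280, 17)  # sd = 17 nm

plot(wavelengths, broad, type = "l",
  xlab = "wavelengths (nm)", ylab = "arbitrary intensity")
lines(wavelengths, narrow, col = "red")

# The area is no longer 1; it is now height * sd * sqrt(2 * pi)
sum(narrow) * 1.0  # simple Riemann sum with a 1 nm step
```

With equal heights the areas differ, so which normalization is appropriate depends on whether you want to control peak height or peak area in the simulated spectrum.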
It’s a simple, if tedious, task to add Gaussian curves in this manner to simulate a single spectrum. One can also create several different spectra, and then combine them in various ratios to create a data set representing samples composed of mixtures of compounds. UV spectra are tougher due to the vibrational coupling; NMR spectra are quite straightforward since we know the area of each magnetic environment in the structure (but we also have to deal with doublets etc.). If you plan to do a lot of this, take a look at the SpecHelpers package, which is designed to streamline these tasks.
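As a rough sketch of the mixture idea, the “pure” spectra can be stacked as rows of a matrix and multiplied by a matrix of mixing ratios. Everything specific here is an assumption made for illustration: the two made-up compounds, their peak positions and widths, and the ratios.

```r
wavelengths <- 200:800

# Two made-up "pure compound" spectra (centers and widths are arbitrary)
PureA <- dnorm(wavelengths, 350, 25) + 0.5 * dnorm(wavelengths, 500, 30)
PureB <- dnorm(wavelengths, 600, 20)
Pure <- rbind(PureA, PureB)  # 2 x 601, compounds in rows

# Hypothetical mixing ratios: one row per sample, one column per compound
ratios <- rbind(c(1.00, 0.00),
                c(0.75, 0.25),
                c(0.50, 0.50),
                c(0.25, 0.75),
                c(0.00, 1.00))

# Each sample is a weighted sum of the pure spectra
Mixtures <- ratios %*% Pure  # 5 x 601, samples in rows
dim(Mixtures)

matplot(wavelengths, t(Mixtures), type = "l", lty = 1,
  xlab = "wavelength (nm)", ylab = "arbitrary intensity")
```

This treats the mixture spectrum as a simple linear combination of its components, which ignores any interaction between compounds.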
A relatively minor exception to the typical Gaussian peak shape is NMR. Peaks in NMR are typically described as “Lorentzian”, which corresponds to the Cauchy distribution (Goldenberg 2016). This quick comparison shows that NMR peaks are expected to be less sharp and have fatter tails:
Gaussian_1 <- dnorm(std_deviations)
Cauchy_1 <- dcauchy(std_deviations)
plot(std_deviations, Gaussian_1, type = "l",
xlab = "standard deviations", ylab = "density")
lines(std_deviations, Cauchy_1, col = "red")
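The comparison above uses the default parameters of both functions. Another way to look at it, sketched here under my own choice of normalization, is to give both curves the same height and the same full width at half maximum (FWHM). For a Gaussian the FWHM is 2 * sqrt(2 * log(2)) * sd, and for the Cauchy distribution it is twice the scale parameter.

```r
x <- seq(-5, 5, length.out = 1001)

fwhm <- 2 * sqrt(2 * log(2))  # FWHM of a Gaussian with sd = 1

G <- dnorm(x, 0, 1)
L <- dcauchy(x, 0, fwhm / 2)  # Cauchy scale chosen so the FWHM matches

# Normalize both curves to a maximum height of 1
G <- G / max(G)
L <- L / max(L)

plot(x, G, type = "l",
  xlab = "standard deviations", ylab = "relative intensity")
lines(x, L, col = "red")
abline(h = 0.5, lty = 2)  # both curves cross half height at the same x
```

Plotted this way, the two shapes agree at half height and the difference shows up almost entirely in the tails.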
For many types of spectroscopies there is a need to correct the baseline when processing the data. But if you are simulating spectroscopic (or chromatographic) data, how can you introduce baseline anomalies? Such anomalies can take many forms, for instance a linear dependence on wavelength (i.e. a steadily rising baseline without curvature). But more often one sees complex rolling baseline issues.
Let’s play with introducing different types of baseline aberrations. First, let’s create a set of three simple spectra. We’ll use a simple function to scale the set of spectra so the range is on the interval [0…1] for ease of further manipulations.
wavelengths <- 200:800
Spec1 <- dnorm(wavelengths, 425, 30)
Spec2 <- dnorm(wavelengths, 550, 20) * 3 # boost the area
Spec3 <- dnorm(wavelengths, 615, 15)
Spec123 <- rbind(Spec1, Spec2, Spec3)
dim(Spec123) # matrix with samples in rows
[1] 3 601
scale01 <- function(M) {
# scales the range of the matrix to [0...1]
mn <- min(M)
M <- M - mn
mx <- max(M)
M <- M/mx
}
Here are the results; the dotted line is the sum of the three spectra, offset vertically for ease of comparison.
Spec123 <- scale01(Spec123)
plot(wavelengths, Spec123[1,], col = "black", type = "l",
xlab = "wavelength (nm)", ylab = "intensity",
ylim = c(0, 1.3))
lines(wavelengths, Spec123[2,], col = "red")
lines(wavelengths, Spec123[3,], col = "blue")
lines(wavelengths, colSums(Spec123) + 0.2, lty = 2)
One clever way to introduce baseline anomalies is to use a Vandermonde matrix. This is a trick I learned while working with the team on the hyperSpec overhaul funded by GSOC.^{3} It’s easiest to explain by an example:
vander <- function(x, order) outer(x, 0:order, `^`)
vdm <- vander(wavelengths, 2)
dim(vdm)
[1] 601 3
vdm[1:5, 1:3]
[,1] [,2] [,3]
[1,] 1 200 40000
[2,] 1 201 40401
[3,] 1 202 40804
[4,] 1 203 41209
[5,] 1 204 41616
vdm <- scale(vdm, center = FALSE, scale = c(1, 50, 2000))
Looking at the first few rows of vdm, you can see that the first column is a simple multiplier, in this case a vector of 1’s. This can be viewed as an offset term.^{4} The second column contains the original wavelength values, in effect a linear term. The third column contains the square of the original wavelength values. If more terms had been requested, they would be the cubed values etc. In the code above we also scaled the columns of the matrix so that the influence of the linear and especially the squared terms doesn’t dominate the absolute values of the final result. Scaling does not affect the shape of the curves.
To use this Vandermonde matrix, we need another matrix which will function as a set of coefficients.
coefs <- matrix(runif(nrow(Spec123) * 3), ncol = 3)
coefs
[,1] [,2] [,3]
[1,] 0.81126877 0.9094830 0.2902550
[2,] 0.15101528 0.9878546 0.3244570
[3,] 0.05994409 0.5978804 0.9532016
If we multiply the coefficients by the transposed Vandermonde matrix, we get back a set of offsets which are the rows of the Vandermonde matrix modified by the coefficients. We’ll scale things so that Spec123 and offsets are on the same overall scale and then further scale so that the spectra are not overwhelmed by the offsets in the next step.
offsets <- coefs %*% t(vdm)
dim(offsets) # same dimensions as Spec123 above
[1] 3 601
offsets <- scale01(offsets) * 0.1
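To see what the matrix product is doing before the final rescaling, it may help to write one row out by hand. The following sketch rebuilds the pieces so it can run on its own; the seed is an arbitrary choice of mine for reproducibility and is not part of the original code.

```r
vander <- function(x, order) outer(x, 0:order, `^`)

wavelengths <- 200:800
vdm <- scale(vander(wavelengths, 2), center = FALSE, scale = c(1, 50, 2000))

set.seed(42)  # arbitrary seed, only so this sketch is reproducible
coefs <- matrix(runif(3 * 3), ncol = 3)

offsets <- coefs %*% t(vdm)  # 3 x 601, one baseline per row

# Row 1 written out as an ordinary quadratic in wavelength:
# constant term + linear term / 50 + squared term / 2000
x <- wavelengths
by_hand <- coefs[1, 1] + coefs[1, 2] * x / 50 + coefs[1, 3] * x^2 / 2000

all.equal(as.numeric(offsets[1, ]), by_hand)
```

In other words, each row of coefs holds the three coefficients of one quadratic baseline, and the divisors 50 and 2000 are just the column scaling applied to vdm.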
These offsets can then be added to the original spectrum to obtain our spectra with a distorted baseline. Here we have summed the individual spectra. We have added a line based on extrapolating the first 20 points of the distorted data, which clearly shows the influence of the squared term.
FinalSpec1 <- offsets + Spec123
plot(wavelengths, colSums(FinalSpec1), type = "l", col = "red",
xlab = "wavelength (nm)", ylab = "intensity")
lines(wavelengths, colSums(Spec123))
fit <- lm(colSums(FinalSpec1)[1:20] ~ wavelengths[1:20])
lines(wavelengths, fit$coef[2]*wavelengths + fit$coef[1],
col = "red", lty = 2) # good ol' y = mx + b
The Vandermonde matrix approach works by creating offsets that are added to the original spectrum. However, it is limited to creating baseline distortions that generally increase at higher values. To create other types of distortions, you can use your imagination. For instance, you could reverse the order of the rows of offsets and/or use higher terms, scale a row, etc. One could also play with various polynomial functions to create the desired effect over the wavelength range of interest. For instance, the following code adds a piece of an inverted parabola to the original spectrum to simulate a baseline hump.
hump <- -1*(15*(wavelengths - 450))^2 # piece of a parabola
hump <- scale01(hump)
FinalSpec2 <- hump * 0.1 + colSums(Spec123)
plot(wavelengths, FinalSpec2, type = "l",
xlab = "wavelengths (nm)", ylab = "intensity")
lines(wavelengths, hump * 0.1, lty = 2) # trace the hump
In the plot, the dotted line traces out the value of hump * 0.1, the offset.
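Another variation, sketched here as one possibility and not as the approach used above, is a baseline that falls with wavelength. One simple way to get it is to build a rising Vandermonde-style offset and flip it along the wavelength axis with rev. The coefficients and the single test peak are arbitrary choices for this illustration.

```r
vander <- function(x, order) outer(x, 0:order, `^`)
scale01 <- function(M) {
  # scales the range to [0...1]
  M <- M - min(M)
  M / max(M)
}

wavelengths <- 200:800
Spec <- scale01(dnorm(wavelengths, 550, 20))  # a single test peak

vdm <- scale(vander(wavelengths, 2), center = FALSE, scale = c(1, 50, 2000))

# Arbitrary coefficients for the offset, linear and squared terms
rising <- scale01(as.numeric(c(0.2, 0.6, 0.9) %*% t(vdm))) * 0.1
falling <- rev(rising)  # flip along the wavelength axis

plot(wavelengths, Spec + falling, type = "l",
  xlab = "wavelengths (nm)", ylab = "intensity")
lines(wavelengths, falling, lty = 2)  # trace the falling baseline
```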
In the next post we’ll look at ways to introduce noise into simulated spectra.
Of course, this is simply the t-test.↩︎
For that matter, you can also simulate chromatograms using the methods we are about to show. It’s even possible to introduce tailing of a peak. For a function to do this, see the SpecHelpers package.↩︎
The work I’m showing here is based on original code in package hyperSpec by Claudia Beleites.↩︎
As a vector of 1’s it will have no effect on the calculations to come. However, you could multiply this column by a value to add an offset to your simulated spectra. This would be a means of simulating a steady electronic bias in an instrument’s raw data.↩︎
@online{hanson2020,
author = {Bryan Hanson},
editor = {},
title = {Simulating {Spectroscopic} {Data} {Part} 1},
date = {2020-06-28},
url = {http://chemospec.org/20200628SimSpecDataPt1.html},
langid = {en}
}
hyperSpec is an R package for working with hyperspectral data sets. Hyperspectral data can take many forms, but a common application is a series of spectra collected over an x, y grid, for instance Raman imaging of medical specimens. hyperSpec was originally written by Claudia Beleites and she currently guides a core group of contributors.^{1}
Claudia, regular hyperSpec contributor Roman Kiselev and I have joined forces this summer in a Google Summer of Code project to fortify hyperSpec. We are pleased to report that the project was accepted by R-GSOC administrators, and, as of a few days ago, the excellent proposal written by Erick Oduniyi was approved by Google. Erick is a senior computer engineering major at Wichita State University in Kansas. Erick gravitates toward interdisciplinary projects. This, and his experience with R, Python and related skills, give him an excellent background for this project.
The focus of this project is to fortify the infrastructure of hyperSpec. Over the years, keeping hyperSpec up-to-date has grown a bit unwieldy. While to-do lists always evolve, the current interrelated goals include:
hyperSpec: Prune hyperSpec back to its core functionality to keep it lightweight. Relocate portions, such as importing data, into their own dedicated packages.
hyperSpec: Analyze the ecosystem of hyperSpec with an eye to reducing dependencies as much as possible and ensuring that necessary dependencies are the best choices. Avoid “reinventing the wheel”, as long as the available “wheels” are computationally efficient and stable (code base and API).
hyperSpec: Having decided on how to reorganize hyperSpec and which dependencies are necessary and optimal, ensure that hyperSpec, the constellation of new sub-packages, and all dependencies are integrated efficiently. There are a number of data preprocessing and plotting functions that need to be streamlined and interfaced to external packages more consistently. Some portions may need substantial refactoring.
Addressing each of these goals will make hyperSpec much easier to maintain, less fragile, and easier for others to contribute to. Every step will bring enhanced documentation and vignettes, along with new unit tests. Work will begin in earnest on June 1st, and we are looking forward to a very productive summer.
Finally, on behalf of all participants, let me just say how grateful we are to Google for establishing the GSOC program and for supporting Erick’s work this summer!
A little history for the curious: the hyperSpec and ChemoSpec packages were written around the same time, independent of each other (~2009). Eventually, Claudia and I became aware of each other’s work, and we have collaborated in ways large and small ever since (I like working with Claudia because I always learn from her!). We have jointly mentored GSOC students twice before. One side project is hyperChemoBridge, a small package that converts hyperSpec objects into Spectra objects (the native ChemoSpec format) and vice versa.↩︎
The descriptors here are Erick’s clever choice of words.↩︎
@online{hanson2020,
author = {Bryan Hanson},
editor = {},
title = {Fortifying {hyperSpec:} {Getting} {Ready} for {GSOC}},
date = {2020-05-07},
url = {http://chemospec.org/20200507GSOChyperSpec.html},
langid = {en}
}