6 min read

Interfacing ChemoSpec to PLS

The ChemoSpec package carries out exploratory data analysis (EDA) on spectroscopic data. EDA is often described as “letting that data speak”, meaning that one studies various descriptive plots, carries out clustering (HCA) as well as dimension reduction (e.g. PCA), with the ultimate goal of finding any natural structure in the data.

As such, ChemoSpec does not feature any predictive modeling functions because other packages provide the necessary tools. I do however hear from users several times a year about how to interface a ChemoSpec object with these other packages, and it seems like a post about how to do this is overdue. I’ll illustrate how to carry out partial least squares (PLS) using data stored in a ChemoSpec object and the package chemometrics by Peter Filzmoser and Kurt Varmuza (Filzmoser and Varmuza 2017). One can also use the pls package (Mevik, Wehrens, and Liland 2020).

PLS is a technique related to regression and PCA that tries to develop a mathematical model between a matrix of sample vectors, in our case, spectra, and one or more separately measured dependent variables that describe the same samples (typically, chemical analyses). If one can develop a reliable model, then going forward one can measure the spectrum of a new sample and use the model to predict the value of the dependent variables, presumably saving time and money. This post will focus on interfacing ChemoSpec objects with the needed functions in chemometrics. I won’t cover how to evaluate and refine your model, but you can find plenty on this in Varmuza and Filzmoser (2009) chapter 4, along with further background (there’s a lot of math in there, but if you aren’t too keen on the math, gloss over it to get the other nuggets). Alternatively, take a look at the vignette that ships with chemometrics via browseVignettes("chemometrics").

As our example we’ll use the marzipan NIR data set that one can download in Matlab format from here.1 The corresponding publication is (Christensen et al. 2004). This data set contains NIR spectra of marzipan candies made with different recipes and recorded using several different instruments, along with data about moisture and sugar content. We’ll use the data recorded on the NIRSystems 6500 instrument, covering the 400-2500 nm range. The following code chunk gives a summary of the data set and shows a plot of the data. Because we are focused on how to carry out PLS, we won’t worry about whether this data needs to be normalized or otherwise pre-processed (see the Christensen paper for lots of details).

library("ChemoSpec")
## Loading required package: ChemoSpecUtils
load("Marzipan.RData")
sumSpectra(Marzipan)
## 
##  Marzipan NIR data set from www.models.life.ku.dk/Marzipan 
## 
##  There are 32 spectra in this set.
##  The y-axis unit is absorbance.
## 
##  The frequency scale runs from
##  450 to 2448 wavelength (nm)
##  There are 1000 frequency values.
##  The frequency resolution is
##  2 wavelength (nm)/point.
## 
## 
##  The spectra are divided into 9 groups: 
## 
##   group no.     color symbol alt.sym
## 1     a   5 #FB0D16FF      1       a
## 2     b   4 #FFC0CBFF     16       b
## 3     c   4 #2AA30DFF      2       c
## 4     d   4 #9BCD9BFF     17       d
## 5     e   3 #700D87FF      3       e
## 6     f   3 #A777F2FF      8       f
## 7     g   2 #FD16D4FF      4       g
## 8     h   3 #B9820DFF      5       h
## 9     i   4 #B9820DFF      5       i
## 
## 
## *** Note: this is an S3 object
## of class 'Spectra'
plotSpectra(Marzipan, which = 1:32, lab.pos = 3000)

In order to carry out PLS, one needs to provide a matrix of spectroscopic data, with samples in rows (let’s call it \(X\), you’ll see why in a moment). Fortunately this data is available directly from the ChemoSpec object as Marzipan$data.2 One also needs to provide a matrix of the additional dependent data (let’s call it \(Y\)). It is critical that the order of rows in \(Y\) correspond to the order of rows in the matrix of spectroscopic data, \(X\).

Since we are working in R we know there are a lot of ways to do most tasks. Likely you will have the additional data in a spreadsheet, so let’s see how to bring that into the workspace. You’ll need samples in rows, and variables in columns. For your sanity and error-avoidance, you should include a header of variable names and the names of the samples in the first column. Save the spreadsheet as a csv file. I did these steps using the sugar and moisture data from the original paper. Read the file in as follows.

Y <- read.csv("Marzipan.csv", header = TRUE)
str(Y)
## 'data.frame':    32 obs. of  3 variables:
##  $ sample  : chr  "a1" "a2" "a3" "a4" ...
##  $ sugar   : num  32.7 34.9 33.9 33.2 33.2 ...
##  $ moisture: num  15 14.9 14.7 14.9 14.9 ...

The function we’ll be using wants a matrix as input, so convert the data frame that read.csv generates to a matrix. Note that we’ll select only the numeric variables on the fly, as unlike a data frame, a matrix can only be composed of one data type.

Y <- as.matrix(Y[, c("sugar", "moisture")])
str(Y)
##  num [1:32, 1:2] 32.7 34.9 33.9 33.2 33.2 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:2] "sugar" "moisture"

Now we are ready to carry out PLS. Since we have a multivariate \(Y\), we need to use the appropriate function (use pls1_nipals if your \(Y\) matrix is univariate).

library("chemometrics")
## Loading required package: rpart
pls_out <- pls2_nipals(X = Marzipan$data, Y, a = 5)

And we’re done! Be sure to take a look at str(pls_out) to see what you got back from the calculation. For the next steps in evaluating your model, see section 3.3 in the chemometrics vignette.

References

Christensen, Jakob, Lars Nørgaard, Hanne Heimdal, Joan Grønkjær Pedersen, and Søren Balling Engelsen. 2004. “Rapid Spectroscopic Analysis of Marzipan—Comparative Instrumentation.” Journal of Near Infrared Spectroscopy 12 (1): 63–75. https://doi.org/10.1255/jnirs.408.

Filzmoser, Peter, and Kurt Varmuza. 2017. Chemometrics: Multivariate Statistical Analysis in Chemometrics. https://CRAN.R-project.org/package=chemometrics.

Mevik, Bjørn-Helge, Ron Wehrens, and Kristian Hovde Liland. 2020. Pls: Partial Least Squares and Principal Component Regression. https://CRAN.R-project.org/package=pls.

Varmuza, K., and P. Filzmoser. 2009. Introduction to Multivariate Statistical Analysis in Chemometrics. CRC Press.


  1. I have converted the data from Matlab to a ChemoSpec object; if anyone wants to know how to do this let me know and I’ll put up a post on that process.↩︎

  2. str(Marzipan) will show you the structure of the ChemoSpec object (or in general, any R object). The official definition of a ChemoSpec object can be seen via ?Spectra.↩︎