Implementing the HATS-PR algorithm

This is Part 2 of a series on aligning 2D NMR, as implemented in the package `ChemoSpec2D`

. Part 1 Part 3

In Part 1 I briefly mentioned that we would be using the HATS-PR algorithm of Robinette *et al.* (Robinette et al. 2011). I also discussed the choice of objective function which is used to report on the quality of the alignment. HATS-PR stands for “Hierachical Alignment of Two-Dimensional Spectra - Pattern Recognition”. In `ChemoSpec2D`

the algorithm is implemented in the `hats_alignSpectra2D`

function. Here are the major steps of the HATS-PR algorithm:

- Contruct a
*guide tree*using hierarchical clustering (HCA): compute the distance between the spectra, and use these distances to construct a dendrogram (the guide tree). As the name suggests, this tree is used to guide the alignment. The most similar spectra are aligned first, then the next most similar, and so on. In later rounds one applies the alignment procedure to sets of spectra that have already been aligned. In Robinette*et al.*they use the Pearson correlation coefficient as the distance measure. In`ChemoSpec2D`

you can choose from a number of distance measures. I encourage you to experiment with the choices and see how they affect the alignment process for your data sets. - For each alignment event, check the alignment using the objective function, which recall is a distance measure. If the objective function is below the threshold, no alignment is needed (“below” assumes we are minimizing the objective function, but we might also be maximizing and hence trying to exceed the threshold). If alignment is necessary, move one of the spectra relative to the other
*in some fashion*, checking each new postion with the objective function until the best alignment is found. This is an exercise in optimization.

The heart of the task is in the phrase *in some fashion*. At one extreme, one can imagine holding one spectrum fixed, and sliding the other spectrum left and right, up and down, over some range of values – essentially a grid of data points. At each position on the virtual grid, evaluate the objective function and keep track of the results. This will always find the answer, but such a brute force search will be very time-consuming and undesirable, especially if the search space is large. Alternatively, do something more efficient! Robinette *et al.* use a “simple gradient ascent” approach, but there is a vast literature on optimization strategies that we can consider. In `ChemoSpec2D`

we use a machine learning approach (details next), but the function is written in such a way that one can add other optimization approaches seamlessly. *Anything* is better than a brute force approach.

`mlrMBO`

The name `mlrMBO`

comes from “machine learning with R model-based optimization.” `mlrMBO`

is a powerful and flexible package for general purpose optimization, especially in the cases where the objective function is computationally expensive. There is a nice introductory vignette.

The basic steps in the model-based optimization using `mlrMBO`

as implemented in `hats_alignSpectra2D`

in package `ChemoSpec2D`

are as follows:

- Define your objective function. Our choice of the Euclidean distance was described in Part 1, along with other options. Most distance measures are not computationally expensive in terms of code. However, the huge number of data points in a typical 2D NMR spectrum bogs things down considerably. The approach taken in model-based optimation mitigates this to a great deal, since the objective function is only used for the initial response surface.
- Generate an “initial design”, by which we mean a strategy to search the possible optimization space.
`hats_alignSpectra2D`

takes arguments`maxF1`

and`maxF2`

which define the space that will be considered as the two spectra are shifted relative to each other. The space potentially covered is`-maxF1`

to`maxF1`

and similarly for the F2 dimension. We take advantage of concepts from the design of experiments field, and use the lhs package to generate a Latin Hypercube Sample of our space. - The sample points selected by
`lhs`

are evaluated using the objective function. - The values of the objective function at the sample points are used to create a surrogate model, essentially a response surface. The key here is that the surrogate model is computationally fast and will stand in for the actual objective function during the optimization.
`mlrMBO`

provides many options for the surrogate model. For`hats_alignSpectra2D`

we use a response surface based on kriging, which is a means of interpolating values that was originally developed in the geospatial statistics world. - New samples points are suggested by the kriging algorithm, evaluated using the surrogate function, and used to update (improve) the model. Each iteration improves the quality of the model.
- After reaching the designated threshold or the number of iterations specified, the best answer is returned. In this case the best answer is the optimal shift of one spectrum relative to the other, in each dimension.

In addition to the differences noted above, the implementation of HATS-PR in `ChemoSpec2D`

carries out only *global* alignment. The algorithm described by Robinette *et al.* includes *local* alignment steps which I have not implemented. Local alignment is a possible future addition.

If you are going to actually execute the code here (as opposed to just reading along), you’ll need the development version of `ChemoSpec2D`

(I improved some of the plots that track the alignment progress since the last CRAN release). And you’ll need certain packages. Here are the steps to install everything:

```
chooseCRANmirror() # choose a CRAN mirror
install.packages("remotes")
library("remotes")
# devel branch -- you need 0.4.156 or higher
install_github(repo = "bryanhanson/ChemoSpec2D@devel")
library("ChemoSpec2D")
# other packages needed
install.packages("mlrMBO") # will also install mlr, smoof, ParamHelpers
install.packages("lhs")
```

Now you are ready for the main event! Part 3

Robinette, Steven L., Ramadan Ajredini, Hasan Rasheed, Abdulrahman Zeinomar, Frank C. Schroeder, Aaron T. Dossey, and Arthur S. Edison. 2011. “Hierarchical Alignment and Full Resolution Pattern Recognition of 2d NMR Spectra: Application to Nematode Chemical Ecology.” *Analytical Chemistry* 83 (5): 1649–57. https://doi.org/10.1021/ac102724x.

For attribution, please cite this work as

Hanson (2020, March 2). Chemometrics & Spectroscopy Using R: Aligning 2D NMR Spectra Part 2. Retrieved from https://chemospec.org/posts/2020-03-02-NMR-Align-Pt2/

BibTeX citation

@misc{hanson2020aligning, author = {Hanson, Bryan}, title = {Chemometrics & Spectroscopy Using R: Aligning 2D NMR Spectra Part 2}, url = {https://chemospec.org/posts/2020-03-02-NMR-Align-Pt2/}, year = {2020} }