I was inspired by the really simple EFNMR instrument developed by Andy Nichol (“Nuclear Magnetic Resonance for Everybody”). Nichol’s work made it clear that one could observe an NMR signal without complex equipment. As I did more reading, however, I settled on following the build of Carl Michal (Michal (2010)), as it allows for more complex experiments and provides more opportunity to learn electronic circuits.
Michal’s design uses two coils: a polarization coil, and a transmit/receive (T/R) coil. This post will cover the construction of the polarization coil. Michal’s polarization coil is a three-layer solenoid constructed with 18 AWG magnet wire. Each layer is a separate wire, but in operation the three layers are wired in parallel. I scaled the coil dimensions down somewhat so that I could use materials that are readily accessible to me.^{1} The plan is to use a 50 mL centrifuge tube as the sample holder. The sample will be placed in a T/R coil wound around a 1.25” schedule 40 PVC pipe. The T/R coil will be located inside the polarization coil, which will be wound on a 2” schedule 40 PVC pipe. The dimensions of these pipes were chosen to allow the sample to nest easily inside the T/R coil, which nests inside the polarization coil. Figure 1 shows a cross-section of the design.^{2}
The form for the polarization coil was made from a 12 cm length of 2” PVC pipe. Two retaining rings were very carefully cut from a 2” PVC coupling. The retaining rings were 1 cm wide. The parts are shown in Figure 2. The rings were then glued to the ends of the form using a minimal amount of standard PVC glue. The inner edges of the rings correspond to the original end of the coupling which provides a clean and straight edge where it will rest against the magnet wire. The ends of the assembly were lightly sanded. As built, the length available for the windings is 102 mm.
Next, three holes were drilled close to each of the retaining rings, about 1 cm apart. The magnet wire will pass through these holes, which will serve to keep the wire in place as it is wound. Figure 3 shows these holes. A short length of wire was placed in the holes as a “keeper” as the winding was carried out. This ensured that the winding for the first layer did not block the holes for the second and third layers of wire (Figure 4).
A winding jig was constructed from 1/4” hobby plywood. The base is 6 x 12”. Small nails and glue were used to assemble the sides and back. A 1/4” threaded rod serves as the rotational axis. Nuts and washers secure a simple handle as well as position the rod overall in the jig. Figure 5 and Figure 6 show the jig.
A holder for the wire spool was constructed from 1/16” x 1” aluminum bar. The bar was bent into a shape that would provide a way to apply friction to the sides of the spool, thus controlling the tension on the wire as it pays out. The spool is mounted on a 1/4” threaded rod with wingnuts on each side, which when tightened press the aluminum bar against the spool. The threaded rod does tend to unscrew as the wire is spooled out, but the process is slow enough that one can correct this as needed. If I were going to do this a lot, I would replace the wingnut on the side that tends to unwind with two nuts locked against each other. The holder is loosely attached to the work bench so that it can pivot as needed to accommodate the changing angle of the wire as it moves across the form. Figure 7 shows the design.
The form was more or less centered on the threaded rod using a couple of wooden guide pieces. The winding process is shown in Figure 8. The wire for the first layer comes from inside the form and up through one of the holes and is wound on the form. The action of the keepers is apparent. The fingers are used to position the wire correctly. In principle tension on the wire is provided by tightening the wing nuts on the wire supply holder. However, I did not tighten them enough and I had to wrestle with getting layer one tight enough. This caused problems with the subsequent layers as you will see!
The completed layer one is shown in Figure 9. The winding looks even. Layer two is shown in Figure 10. Because layer one was a little loose, the wire for layer two would sometimes slip in between the wires of layer one and force them apart. This was exacerbated because I was using more tension on the wire supply for layer two. Clearly the layer is not even. In addition, winding layer two was more difficult because without the white background one cannot see the progress very well.
The problems only worsened with layer 3 (Figure 11). I am not happy with the final result, but the wire is positionally stable and it should carry out its function well enough. What I’ve learned here will help when winding the T/R coil.
The polymeric insulation on the leads was sanded off (Figure 12) and the resistance of each coil was measured. Each gave a resistance of about 0.7 Ω, and there were no shorts between the layers.
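As a rough sanity check on these measurements, the expected resistance of one layer can be estimated from the coil geometry and the resistance per meter of 18 AWG copper (about 21 mΩ/m). The form diameter, turn count, and insulation build below are my assumptions, not measured values from the build:

```r
# Rough estimate of one layer's resistance; dimensions are assumptions.
form_od <- 0.0603    # m, outer diameter of 2" schedule 40 PVC
wire_od <- 0.00106   # m, 18 AWG magnet wire incl. insulation (assumed)
winding_len <- 0.102 # m, available winding length (from the build)
rho_18awg <- 0.02095 # ohm/m, 18 AWG copper at 20 C

turns <- floor(winding_len / wire_od) # close-wound turns per layer
mean_diam <- form_od + wire_od        # diameter at the wire center
wire_len <- turns * pi * mean_diam    # total wire length, m
r_layer <- wire_len * rho_18awg       # ohms
round(c(turns = turns, wire_m = wire_len, ohms = r_layer), 2)
```

This predicts roughly 0.4 Ω of winding resistance; the measured 0.7 Ω presumably includes the leads and a slightly larger true winding diameter.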
The next step will be the construction of the polarization coil power supply, and integration of the Arduino controller. I’m not in a hurry!
@online{hanson2023,
author = {Hanson, Bryan},
title = {Building an {EFNMR} {Part} 1},
date = {2023-10-24},
url = {http://chemospec.org/posts/20231024EFNMRBuild1/EFNMRBuild1.html},
langid = {en}
}
As all organic chemists know, in NMR we use the n + 1 rule to determine splitting, and Pascal’s triangle as a mnemonic to remember the relative areas of the peaks within a multiplet. For instance, we expect the CH₃ group in ethanol to be a triplet with areas 1:2:1, due to its two proton neighbors on the CH₂ group. We treat the two protons of the CH₂ group as magnetically equivalent.
The n + 1 rule works at the fields typically used for structural determination, let’s say 60 MHz and above.^{1} At these fields one is working in the so-called “weak coupling” regime. However, as one lowers the field to really low values, one encounters the “strong coupling” regime, where one observes “J-coupled spectra” or JCS. Under strong coupling, the protons in ethanol are no longer magnetically equivalent, each of them couples differently to the other nuclei, and the n + 1 rule breaks down.
The strict requirement for JCS is that there be two or more protons attached to a spin-½ heteroatom and that the magnetic field be quite small. For a simple system, the strict requirement to see separate lines for the no-longer-equivalent protons is:
If this seems a bit strange, well, 1) it is, and 2) it has always been the case that the “equivalent” protons in for example a group do couple, we just don’t normally see it or worry about it.^{2}
How small does the magnetic field have to be for J-coupled spectra to appear? This is covered in detail in Appelt et al. (2010), but generally speaking JCS appear at roughly $10^{-4}$ to $10^{-6}$ Tesla.^{3} The magnetic field of the earth is around 50 μT, right in the sweet spot. The Larmor resonance frequency for ¹H at this field strength is around 2 kHz.
In the case of a system like XH$_N$ ($N$ equivalent protons attached to a spin-½ heteronucleus X), the number of lines that will be observed is

$$\text{no. lines} = 2 \sum_{\substack{n \text{ odd} \\ n \le N}} (N - n + 1)$$

where $n$ ranges over the odd numbers (for odd $N$, one evaluates until $n = N$; for even $N$, until $n = N - 1$). The leading multiplier of 2 accounts for the doublet due to $J_{HX}$. This formula doesn’t exactly roll off the tongue. We can evaluate it to get the first few terms:
N <- 5L # evaluate 1:N terms
no.lines <- rep(NA_integer_, N) # initialize storage
for (i in 1:N) {
  odd <- (1:i) %% 2
  n <- (1:i)[as.logical(odd)] # get the odd values of n no larger than i
  no.lines[i] <- sum(i - n + 1) # take advantage of R's vectorization
}
names(no.lines) <- paste("N=", 1:N, sep = "") # pretty it up
no.lines <- no.lines * 2 # account for J_HX
no.lines
N=1 N=2 N=3 N=4 N=5
2 4 8 12 18
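Incidentally, the inner sum over odd $n$ has a closed form, $\lfloor (N+1)^2/4 \rfloor$, so the line count is simply $2\lfloor (N+1)^2/4 \rfloor$. This identity is my observation, not part of the original formula; a quick check against the loop:

```r
# Closed-form line count vs. the explicit sum, for N = 1 to 10
N <- 1:10
closed <- 2 * floor((N + 1)^2 / 4)
looped <- sapply(N, function(i) {
  n <- seq(1, i, by = 2) # odd n no larger than i
  2 * sum(i - n + 1)
})
all(closed == looped)
closed[1:5] # 2 4 8 12 18, matching the output above
```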
A couple of examples should clarify the situation. All of these will be from the perspective of observing ¹H.^{4} These examples are taken from Appelt et al. (2007).
At high field, the spectrum of such a compound would be a symmetric doublet with a peak separation of $J_{HX}$.
In earth’s field, the spectrum is first split into a doublet by $J_{HX}$, but the spacing is not symmetric. Then, each part of the doublet is split further into two peaks, also asymmetrically and with varying linewidths. Figure 1 shows how the splitting changes as a function of field strength. Note that in the strong coupling region there are four peaks, as predicted above.
For the case of methanol in Earth’s field, the spectrum is first split asymmetrically by the heteronuclear coupling. Then each part of the doublet is further split into four peaks. Figure 2 shows the EF spectrum of methanol. The asymmetry of the line spacings and line widths is apparent.
For a broad overview of this topic, take a look at Kaseman et al. (2020); for a detailed walkthrough of the theory with many more examples, see Appelt et al. (2007), and other papers by Appelt et al. Be prepared to spend some time with these papers.
@online{hanson2023,
author = {Hanson, Bryan},
title = {The n + 1 Rule in {Earth’s} {Field} {NMR}},
date = {2023-09-18},
url = {http://chemospec.org/posts/20230918EFNMR2/EFNMR2.html},
langid = {en}
}
This is a good example of Free and Open Source Software (FOSS). ChemoSpec is licensed under GPL-3, which permits any reasonable use as long as there is attribution to the original authors.
Check out the first line of the “About Delta” box:
@online{hanson2023,
author = {Hanson, Bryan},
title = {JEOL’s {Delta} {Now} {Includes} {ChemoSpec}},
date = {2023-08-23},
url = {http://chemospec.org/posts/20230823CSDelta/CSDelta.html},
langid = {en}
}
It’s been nearly a year, and there are a number of new entries. Let’s do a quick comparison of the results from November 2022 versus August 2023. Back in November 2022 there were 246 packages; nearly a year later there are 287. Figure 1 shows a Venn diagram of the changes.
Software development in spectroscopy is clearly concentrated in the Python ecosystem; R has stalled (see Table 1). Interpretation of this observation is challenging. A few thoughts:
| language | Nov 2022 | Aug 2023 |
|----------|----------|----------|
| Python | 162 | 198 |
| R | 60 | 61 |
| C++ | 4 | 5 |
| Java | 4 | 4 |
| Julia | 4 | 5 |
| C | 2 | 2 |
| Qt | 2 | 2 |
| C-shell | 1 | 1 |
| C# | 1 | 2 |
| Fortran | 1 | 1 |
| Go | 1 | 1 |
| html | 1 | 1 |
| JavaScript | 1 | 2 |
| TypeScript | 1 | 1 |
| XML | 1 | 1 |
Table 2 shows the change in package focus. Most categories grew modestly.
| category | Nov 2022 | Aug 2023 |
|----------|----------|----------|
| Any | 32 | 34 |
| Data Sharing | 33 | 41 |
| EEM | 3 | 3 |
| EPR, ESR | 5 | 7 |
| IR (all flavors) | 35 | 38 |
| Raman | 28 | 34 |
| UV-Vis, UV, Vis | 19 | 20 |
| LIBS | 3 | 5 |
| Muon | 1 | 0 |
| PES | 1 | 2 |
| XRF, XAS | 10 | 15 |
| NMR | 87 | 97 |
| Time Series | 3 | 3 |
I’ve curated this site for several years now. One thing that is clear is that there is a lot of duplication of effort and features. I mentioned above a few reasons for this, but at some point it makes more sense to add to an existing package than to write one from scratch. However, this can only happen if people look around for existing software first. That of course is one purpose of the FOSS for Spectroscopy web site.
As I look at it,
This design decision is the core of building a package. Once you have decided on a structure:
In an ideal world, a data storage structure is chosen and everything else can be built later, quickly at first and then more slowly. The reality however is that people keep reinventing most of the wheel. I suppose this is not too different from people inventing entirely new computer languages…
@online{hanson2023,
author = {Hanson, Bryan},
title = {FOSS4Spectroscopy {Update}},
date = {2023-08-15},
url = {http://chemospec.org/posts/20230815F4SUpdate/F4SUpdate.html},
langid = {en}
}
Let’s take a closer look, from first principles, at what kinds of information one can glean from EFNMR. We’ll restrict our discussion to spin-½ nuclei with ~100% natural abundance, like ¹H, ¹⁹F, or ³¹P; you’ll see why soon enough. Table 1 gives some relevant physical parameters for these nuclei.
| Nuclei | Gyromagnetic ratio (10⁷ rad s⁻¹ T⁻¹) | Larmor Freq. (MHz at 2.35 T) |
|--------|--------------------------------------|------------------------------|
| ¹H | 26.7522 | 100 |
| ¹⁹F | 25.1815 | 94 |
| ³¹P | 10.8394 | 40.5 |
Excellent general references on NMR theory are Friebolin (Friebolin 2011) and Claridge (Claridge 2016).
The line width of an NMR signal is primarily dependent on the homogeneity of the field, which in the case of earth’s field is very good. Appelt et al. (2006) state that when observations are made >100 meters from buildings and ferrous structures,^{1} the homogeneity of the earth’s magnetic field over small sample volumes is extremely high. They further state that when relaxation times are on the order of seconds, line widths will be less than 0.1 Hz.^{2} This all sounds very promising: narrow lines imply good separation between peaks.
One of the characteristics of high-field NMR which makes it so useful is the dispersion of chemical shifts as a function of structure. Unfortunately, EFNMR has effectively zero chemical shift dispersion. The equation for computing the chemical shift, $\delta$, is:

$$\delta = \frac{\nu_{obs} - \nu_{ref}}{\nu_{ref}} \times 10^{6}$$

where $\delta$ is in ppm and the frequencies are in Hz; $\delta$ is a field-strength-independent quantity. Taking $\nu_{ref}$ to be the frequency of a reference such as TMS added to the sample, we can rearrange the equation to get $\Delta\nu = \delta \times \nu_{ref} \times 10^{-6}$. Consider a compound whose methyl group has a chemical shift of 2.63 ppm. Using an earth’s-field ¹H Larmor frequency of about 1.92 kHz (45 μT field), the shift of this group in Hz is $2.63 \times 10^{-6} \times 1916 \approx 0.005$ Hz. This is an extremely small value, smaller than the typical line width in earth’s field (so the promise of narrow line widths is not going to save us).

For further comparison, we can do the same calculation for a compound whose shift is 4.90 ppm; the result is about 0.009 Hz. The separation between these two signals, roughly 0.004 Hz, is far below any attainable line width. We can see that two compounds with differing numbers of halogens, which would be trivial to distinguish with a low-field benchtop instrument operating at 80 MHz, are indistinguishable in earth’s field. This is due to the very small value of earth’s magnetic field.
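The arithmetic behind the chemical shift argument can be checked in a few lines; $\gamma_H/2\pi = 42.577$ MHz T⁻¹ is a standard value, and the field of 45 μT is an assumed mid-range Earth value:

```r
# Chemical shift dispersion in earth's field (45 uT assumed)
nu0 <- 42.577e6 * 45e-6     # 1H Larmor frequency, Hz (~1916)
shifts_ppm <- c(2.63, 4.90) # the two shifts compared above
shifts_hz <- shifts_ppm * 1e-6 * nu0
shifts_hz                   # both well under 0.01 Hz
diff(shifts_hz)             # separation far below the line width
```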
While the chemical shift dispersion in earth’s field is clearly nil, heteronuclear J couplings are readily observed due to their greater magnitude, up to about 200 Hz. Appelt et al. (2006) give a number of interesting examples involving ¹⁹F- and ³¹P-containing compounds, among others.
Basic NMR theory tells us that the energy difference between the two quantum states of a spin-½ nucleus is proportional to the field strength $B_0$:

$$\Delta E = \gamma \hbar B_0$$

where $\hbar$ is $h/2\pi$. A plot for ¹H is shown in Figure 1; the rightmost point corresponds to a 1,000 MHz instrument. Clearly, as $B_0$ goes to zero, $\Delta E$ goes to zero in a simple linear fashion.
We can then relate the number of nuclei in the upper energy state, $N_{upper}$, to that in the lower energy state, $N_{lower}$, at thermal equilibrium as:

$$\frac{N_{upper}}{N_{lower}} = e^{-\Delta E / k_B T}$$

where $k_B$ is the Boltzmann constant and $T$ is the temperature in Kelvin. The ratio of the population states is nearly one for any attainable value of $B_0$, and of course gets even closer to one as $B_0$ decreases. This is the reason for the low overall sensitivity of NMR as an analytical technique. We can compute the ratio for ¹H at room temperature; we’ll compare the value for earth’s field to those of 100 and 1,000 MHz instruments:
45 uT (Earth) 2.35 T (100 MHz) 23.49 T (1,000 MHz)
 0.9999999997       0.9999839           0.9998389
As you can see, in earth’s field there is basically no difference in the two population states, meaning there is no signal to observe. Clearly a problem!
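A computation along these lines produces the equilibrium ratios (this is my reconstruction, not the original code); note that $\gamma$ must carry its full factor of $10^7$, which is easy to drop when reading it from Table 1:

```r
# Population ratio N_upper/N_lower for 1H at room temperature
gamma_H <- 26.7522e7       # rad s^-1 T^-1 (note the 10^7!)
hbar <- 1.054571817e-34    # J s
kB <- 1.380649e-23         # J K^-1
TK <- 298                  # K
B <- c(45e-6, 2.35, 23.49) # earth, 100 MHz, 1000 MHz fields in Tesla
ratio <- exp(-gamma_H * hbar * B / (kB * TK))
signif(ratio, 10)
```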
If all the nuclei were in the lower state we could measure the energy required to bump them up to the upper state, or more commonly, bump them up and then watch the energy given off as equilibrium returns. Unfortunately, the signal produced is proportional to the population difference $N_{lower} - N_{upper}$, which is effectively zero in earth’s field. At the same time, however, the more spins we have, the higher the signal will be. More spins total in the detection coil sweet spot will be helpful, but there are other factors militating against making large coils to accommodate large samples. One way around this is to use signal averaging.
In the case of earth’s field NMR, the usual way around this problem of very limited signal is to prepolarize the sample.^{3} This basically involves subjecting the sample to a fairly high magnetic field for a brief period before measuring any signals. This prepolarization field forces more of the nuclei into the lower energy state, thus increasing the population difference, which means there is a signal to be observed. Mohorič has an excellent but technical discussion of the details of this process (Mohorič and Stepišnik 2009).
What is the Larmor (resonance) frequency in earth’s field? Earth’s magnetic field varies from about 25 to 65 μT; we’ll use an intermediate value of 45 μT for our calculations. The Larmor frequency is given by the equation:

$$\nu = \frac{\gamma B_0}{2\pi}$$

Notice there is a simple linear relation between $\nu$ and $B_0$.^{4} If we plug in the values for our nuclei we get the following values in Hz:

      1H      19F      31P
1915.985 1803.492  776.315
What we have shown here is that for EFNMR, resonance frequencies fall in the audio range (20 to 20,000 Hz). Why is this important? It greatly simplifies signal detection, because the electronics for working in this frequency range are extremely well worked out and are inexpensive to buy or build.
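The Larmor frequencies can be computed directly from the $\gamma$ values in Table 1; a quick sketch using $B_0 = 45$ μT:

```r
# Larmor frequencies at 45 uT; gamma given in units of 10^7 rad s^-1 T^-1
gamma <- c(H1 = 26.7522, F19 = 25.1815, P31 = 10.8394) * 1e7
B0 <- 45e-6                 # Tesla
nu <- gamma * B0 / (2 * pi) # Hz
round(nu, 1)                # all in the audio range
```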
The first earth’s field NMR experiment was apparently conducted by Martin Packard and Russell Varian while at Varian Associates (Packard and Varian 1954). Varian Associates was of course a major instrument player, including NMR, and for a long time marketed their instruments largely toward colleges. ^{5}
@online{hanson2023,
author = {Hanson, Bryan},
title = {Earth’s {Field} {NMR}},
date = {2023-07-26},
url = {http://chemospec.org/posts/20230719EFNMR1/EFNMR1.html},
langid = {en}
}
A number of simple designs for photometers and spectrometers have been published. What drew me to McClain’s approach is that his goal is to teach some basic electronics relevant to instrument design, which is something I have wanted to learn for some time (apparently since 2014, though actually I think this goes back to watching my father build a Heathkit stereo receiver, which used tubes). Further, McClain starts with a very simple design, and then adds circuit modules to improve the design. Everything is laid out logically and is easy to follow. At each step there is an opportunity to go further to understand how the circuit actually works in detail.
In this post I’ll describe the project at various stages. All the electronics are McClain’s design, but instead of McClain’s cuvette holder I used the design of Kvittingen (Kvittingen et al. (2017)) which uses LEGO bricks as a sample holder and can accommodate an additional detector for fluorescence measurements.
This design is a photometer, and not a spectrophotometer, because only one wavelength at a time can be measured. The source LED must have an emission spectrum overlapping the $\lambda_{max}$ of the compound to be measured; LEDs are available which cover pieces of the whole visible spectrum, so it’s pretty easy to swap in a different wavelength range. The detector photodiode (a type of LED, working in reverse) responds over a broad wavelength range, though with greatly varying efficiency. If one wants to measure fluorescence, the photodiode is moved to the 90° position.^{1}
A couple of important notes:
In this version a standard “green” LED (maximum emission at 523 nm) is used as the light source and has the simplest possible power supply. As built, the system provides a current of about 26 mA to the LED. The data sheet recommends 30 mA max.
The detector in this version is a photodiode linked to a TIA, a transimpedance amplifier. This is a current-to-voltage (I-to-V) converter, and something similar can be used in any instrument where a detector generates a current. Figure 1 shows the circuit.
The main deviation from McClain’s design is that R2 needed to be set to 3 MΩ in order to reach about 1 V at the output. McClain gives a range of 100 kΩ to 1 MΩ. As the value of this resistor goes up, the output voltage goes up due to increasing amplification. This change is likely necessary because the photodiode used here is a bit different from the one McClain specified. After some experimentation, the current of I1 (which replicates the current produced by the photodiode in the simulation) was set to 1/10,000 of the current of D1, based upon currents observed when isolating D2 from the rest of the circuit.
Monitoring the current through and voltage across D2 as built and warmed up, the values were about 0.3 μA and 0.23 V; if the LEGO brick holding D1 was moved immediately adjacent to the one holding D2, these numbers rose to 0.7 μA and 0.26 V. These readings support the discussion above: the photodiode was generating a relatively small response.
Figure 2 and Figure 3 show the project from each side.
The next step in McClain’s scheme is to change the basic power supply to a more sophisticated “relaxation oscillator” which produces a square wave output at a certain frequency. The idea here is to keep stray room light from affecting the output by driving the source at a specific AC-like frequency and then modifying the detector to see only that frequency. Stray room light may consist of random light causing DC offsets in the circuit, or something more deterministic like 60 Hz flicker from light fixtures.
The relaxation oscillator circuit was modeled in CircuitLab before building the circuit. The circuit is in Figure 4 and the simulation results are shown in Figure 5.
Capacitor C2 controls the frequency of the square wave produced by the relaxation oscillator. Figure 6 shows the oscilloscope traces with C2 set to 1 μF, which gives a frequency of about 8 Hz, as seen in the video below. This serves as visual “proof of concept”. Figure 7 shows the oscilloscope traces for a value of 4700 pF for C2, which generates a square wave with a frequency of about 1,500 Hz. This is higher than the frequency of any room light flickering and thus will serve as a “carrier” of the absorbance value unaltered by any stray room light, once we add the other modules to the detection side.
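If McClain’s circuit is the classic single-op-amp relaxation oscillator with equal divider resistors, its period is $2RC\ln 3$. The 62 kΩ timing resistor below is a hypothetical value, chosen only because it reproduces both observed frequencies, not a value taken from the actual circuit:

```r
# Relaxation oscillator frequency, f = 1/(2*R*C*ln(3)), equal-divider case
osc_freq <- function(R, C) 1 / (2 * R * C * log(3))
osc_freq(62e3, 1e-6)     # ~7.3 Hz with C2 = 1 uF
osc_freq(62e3, 4700e-12) # ~1560 Hz with C2 = 4700 pF
```

That a single hypothetical R is consistent with both measured frequencies suggests the capacitor swap alone accounts for the change.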
Note that all oscilloscope traces have two vertical scales, one on the left and one on the right, color coordinated with the trace.
The built version of the relaxation oscillator corresponds well with the simulation.
This final version contains all the circuits as described by McClain. I decided to measure voltages directly at the output rather than use an Arduino and display to provide an absorbance value.
Figure 8 shows the final circuit. Note that several test points are labeled and referred to in the discussion below.
The details of the relaxation oscillator are exactly as described above.
As the simulation of the relaxation oscillator shows, the current output of the op amp is very small. Consequently a simple transistor is used to bump up the current driving the LED source to an appropriate value.
The I to V converter circuit is the same as described earlier.
A high pass filter takes a signal that is time-varying, in our case a square wave, and filters it so that only high frequency components are kept. This is a key part of the detector design, since we create an approximately 1,500 Hz square wave and any other component, like 60 Hz flicker from room lights, should be eliminated. Figure 9 shows an isolated version of our high pass filter, and Figure 10 shows its frequency-dependent filtering.
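For reference, a first-order RC high-pass has cutoff $f_c = 1/(2\pi RC)$. The component values below are hypothetical, chosen only to illustrate a cutoff that attenuates 60 Hz flicker while passing the 1,500 Hz carrier:

```r
# First-order RC high-pass: cutoff and gain magnitude
hp_cutoff <- function(R, C) 1 / (2 * pi * R * C)
fc <- hp_cutoff(10e3, 0.1e-6) # 10 kOhm, 0.1 uF -> ~159 Hz
fc

# gain magnitude |H| at frequency f for cutoff fc
hp_gain <- function(f, fc) (f / fc) / sqrt(1 + (f / fc)^2)
hp_gain(60, fc)   # 60 Hz flicker is attenuated
hp_gain(1500, fc) # the 1500 Hz carrier passes nearly unattenuated
```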
A half wave rectifier converts an alternating current, alternating between positive and negative values, into a positive only form. Essentially, the negative portion of the signal is converted to positive values, and the positive portion is set to zero. Figure 11 shows the action of the rectifier.
The final step is an active low pass filter which only passes signals below a certain frequency and amplifies them (that’s the active part). Importantly, in addition to amplifying the signal, the op amp emits a steady DC voltage which is ultimately proportional to the current hitting the photodiode. This is the value we are after when making absorbance measurements. Figure 12 shows the actual output.
If we isolate the low pass filter circuit we can try to understand its operation in greater detail. Figure 13 shows the isolated circuit with simulation inputs configured to match the measured inputs.
If we look at the frequency dependence of this circuit, we see that low frequencies are passed relatively unattenuated (Figure 14), as expected. The combination of the earlier high pass filter and this low pass filter amounts to a band pass filter. This suggests a potential follow up design which uses a band pass filter followed by rectification and conversion to DC by some combination of op amps.
In addition to the filtering behavior, we know that the circuit produces a steady DC current from the approximately square wave input. Let’s check this using the simulator again, but this time looking at output voltages. Figure 15 shows the results, which should ideally be close to those in Figure 12.
A calibration curve was prepared using a 10 mL plastic syringe and some small bottles. Two drops of red food coloring were added to 10 mL of water to create the first solution. Three mL of the stock solution was added to seven mL of water. This 2nd solution was then diluted in similar fashion and so forth, to get five total solutions. Tap water was used. The green LED was disconnected and the dark current was measured. Next, tap water was used as a blank. Then the voltage for each sample was recorded (voltage measurements are taken at point F in Figure 8). Listing 1 shows the computational steps. Figure 16 shows the samples from most concentrated to least concentrated.
Table 1 shows the results. A calibration curve is shown in Figure 17. Clearly the most concentrated samples exceed the linear behavior expected for Beer’s Law (as observed by McClain). If the two most concentrated samples are dropped, the result is a nice linear relationship, as seen in Figure 18 and the summary of the fit in Listing 2.
| Concentration | Voltage | Absorbance |
|---------------|---------|------------|
| 1.0000 | 0.0262 | 2.611089 |
| 0.3000 | 0.0268 | 2.581818 |
| 0.0900 | 0.0340 | 2.284567 |
| 0.0270 | 0.0999 | 1.074541 |
| 0.0081 | 0.1960 | 0.369747 |
Call:
lm(formula = DF35$Absorbance ~ DF35$Concentration)

Residuals:
       1        2        3
-0.03688  0.15983 -0.12294

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)          0.3118     0.1840   1.694   0.3394
DF35$Concentration  22.3292     3.3801   6.606   0.0956 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.205 on 1 degrees of freedom
Multiple R-squared:  0.9776, Adjusted R-squared:  0.9552
F-statistic: 43.64 on 1 and 1 DF,  p-value: 0.09564
Not too bad!
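The fit in Listing 2 can be reproduced directly from the three retained points in Table 1:

```r
# Reproduce the Beer's law fit using the three most dilute samples
conc <- c(0.0900, 0.0270, 0.0081)
absorb <- c(2.284567, 1.074541, 0.369747)
fit <- lm(absorb ~ conc)
round(coef(fit), 4) # intercept ~0.3118, slope ~22.3292, as in Listing 2
```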
@online{hanson2023,
author = {Hanson, Bryan},
title = {Home {Built} {Photometer}},
date = {2023-07-16},
url = {http://chemospec.org/posts/20230716Photometer/Photometer.html},
langid = {en}
}
The development of simple, home-built NMR instruments over the past two decades is very interesting and appealing. These instruments typically don’t have a magnet, but rather use the earth’s magnetic field and some type of polarization process to improve sensitivity. Most of these instruments use an inexpensive microprocessor like an Arduino or Raspberry Pi to control the instrument, along with some purpose-built electronic circuits. Good examples are the work of Michal (Michal (2010), Michal (2020)), Trevelyan (Manley (2019)), and Bryden (Bryden et al. (2021)). These instruments of course aren’t able to give the same results as higher-field instruments with superconducting magnets or Halbach arrays. What can you do with these instruments? Because earth’s magnetic field is very homogeneous locally, the line widths are very narrow, and thus coupling constants can be measured.^{1} However, the chemical shift range is really small, so structural studies are out. Sensitivity is relatively poor as well. Imaging (MRI) is in principle possible. By the way, there are also examples of DIY Nuclear Quadrupole Resonance (NQR) instruments, which require no magnetic field at all (Hiblot et al. (2008)).
Recently, a simpler DIY NMR instrument was published as a Hackaday project by Andy Nichol. This “Nuclear Magnetic Resonance for Everybody” project is unique due to its use of only off-the-shelf, commercially available hardware components. Because the hydrogen Larmor precession frequency in earth’s magnetic field is in the audio range, the project uses a standard and readily available audio amplifier to simplify the signal detection process. In addition, the complexities of pulse programming are avoided by using a mechanical switch to alternate between polarization and detection modes. Finally, a single coil is employed for both polarization and detection. Signal processing is handled by readily available software.
This is an interesting project and it is the most basic entry point into DIY NMR that I have encountered. If it whets your appetite, the project can be made progressively more sophisticated by selectively bringing in the more advanced features of some of the other designs.
@online{hanson2023,
author = {Hanson, Bryan},
title = {DIY {NMR} in {Earth’s} {Field}},
date = {2023-06-12},
url = {http://chemospec.org/posts/20230612DIYNMR/DIYNMR.html},
langid = {en}
}
@online{hanson2022,
author = {Hanson, Bryan},
title = {You {Can} {Now} {Subscribe}},
date = {2022-11-07},
url = {http://chemospec.org/posts/20221107AnnounceSubscribe/AnnounceSubscribe.html},
langid = {en}
}
Back in Part 2 I mentioned some of the challenges of learning linear algebra. One of those challenges is making sense of all the special types of matrices one encounters. In this post I hope to shed a little light on that topic.
I am strongly drawn to thinking in terms of categories and relationships. I find visual presentations like phylogenies showing the relationships between species very useful. In the course of my linear algebra journey, I came across an interesting Venn diagram developed by the very creative thinker Kenji Hiranabe. The diagram is discussed at Matrix World, but the latest version is at the GitHub link. A Venn diagram is a useful format, but I was inspired to recast the information in a different format. Figure 1 shows a taxonomy I created using a portion of the information in Hiranabe’s Venn diagram.^{1} The taxonomy is primarily organized around what I am calling the structure of a matrix: what does it look like upon visual inspection? Of course this is most obvious with small matrices. To me at least, structure is one of the most obvious characteristics of a matrix: an upper triangular matrix really stands out, for instance. Secondarily, the taxonomy includes a number of queries that one can ask about a matrix: for instance, is the matrix invertible? We’ll need to expand on all of this of course, but first take a look at the figure.^{2}
Let’s use R to construct and inspect examples of each type of matrix. We’ll use integer matrices to keep the print output nice and neat, but of course real numbers could be used as well.^{3} Most of these are pretty straightforward, so we’ll keep comments to a minimum for the simple cases.
A_rect <- matrix(1:12, nrow = 3) # if you give nrow,
A_rect # R will compute ncol from the length of the data
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
Notice that R is “column major”, meaning data fills the first column, then the second column, and so forth.
A_row <- matrix(1:4, nrow = 1)
A_row
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
A_col <- matrix(1:4, ncol = 1)
A_col
[,1]
[1,] 1
[2,] 2
[3,] 3
[4,] 4
Keep in mind that to save space in a text-dense document one would often write A_col as its transpose.^{4}
A_sq <- matrix(1:9, nrow = 3)
A_sq
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
Creating an upper triangular matrix requires a few more steps. Function upper.tri() returns a logical matrix which can be used as a mask to select entries. Function lower.tri() can be used similarly. Both functions have an argument diag = TRUE/FALSE indicating whether to include the diagonal.^{5}
upper.tri(A_sq, diag = TRUE)
[,1] [,2] [,3]
[1,] TRUE TRUE TRUE
[2,] FALSE TRUE TRUE
[3,] FALSE FALSE TRUE
A_upper <- A_sq[upper.tri(A_sq)] # subset using the logical matrix
A_upper # notice that a vector is returned, not quite what might have been expected!
[1] 4 7 8
A_upper <- A_sq # instead, create a copy to be modified
A_upper[lower.tri(A_upper)] <- 0L # assign the lower entries to zero
A_upper
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 0 5 8
[3,] 0 0 9
Notice that to create an upper triangular matrix we use lower.tri()
to assign zeros to the lower part of an existing matrix.
If you give diag()
a single value it defines the dimensions and creates a matrix with ones on the diagonal, in other words, an identity matrix.
A_ident <- diag(4)
A_ident
[,1] [,2] [,3] [,4]
[1,] 1 0 0 0
[2,] 0 1 0 0
[3,] 0 0 1 0
[4,] 0 0 0 1
If instead you give diag()
a vector of values these go on the diagonal and the length of the vector determines the dimensions.
A_diag <- diag(1:4)
A_diag
[,1] [,2] [,3] [,4]
[1,] 1 0 0 0
[2,] 0 2 0 0
[3,] 0 0 3 0
[4,] 0 0 0 4
Matrices created by diag()
are symmetric matrices, but any matrix where $\mathbf{A} = \mathbf{A}^{T}$ is symmetric. There is no general function to create symmetric matrices since there is no way to know what data should be used. However, one can ask if a matrix is symmetric, using the function isSymmetric()
.
isSymmetric(A_diag)
[1] TRUE
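While there is no constructor, one common trick (my own aside, not from the post) is to add a square matrix to its own transpose, since $\mathbf{A} + \mathbf{A}^{T}$ is always symmetric:

```r
A_sq <- matrix(1:9, nrow = 3)
A_symm <- A_sq + t(A_sq) # entry (i,j) equals entry (j,i) by construction
A_symm
isSymmetric(A_symm)
```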
Let’s take the queries in the taxonomy in order, as the hierarchy is everything.
A singular matrix is one in which one or more rows are multiples of another row, or alternatively, one or more columns are multiples of another column. Why do we care? Well, it turns out a singular matrix is a bit of a dead end; you can’t do much with it. An invertible matrix, however, is a very useful entity and has many applications. What is an invertible matrix? In simple terms, being invertible means the matrix has an inverse. This is not the same as the algebraic definition of an inverse, which is related to division:

$a \cdot a^{-1} = a \cdot \frac{1}{a} = 1$

Instead, for matrices, invertibility of $\mathbf{A}$ is defined as the existence of another matrix $\mathbf{A}^{-1}$ such that

$\mathbf{A}^{-1}\mathbf{A} = \mathbf{I}$

Just as $a^{-1}$ cancels out $a$ in $a \cdot a^{-1} = 1$, $\mathbf{A}^{-1}$ cancels out $\mathbf{A}$ to give the identity matrix. In other words, $\mathbf{I}$ is really $\mathbf{A}^{-1}\mathbf{A}$.
A singular matrix has a determinant of zero. On the other hand, an invertible matrix has a non-zero determinant. So to determine which type of matrix we have before us, we can simply compute the determinant.
Let’s look at a few simple examples.
A_singular <- matrix(c(1, 2, 3, 6), nrow = 2, ncol = 2)
A_singular # notice that col 2 is col 1 * 3; they are not independent
[,1] [,2]
[1,] 1 3
[2,] 2 6
det(A_singular)
[1] 0
A_invertible <- matrix(c(2, 2, 7, 8), nrow = 2, ncol = 2)
A_invertible
[,1] [,2]
[1,] 2 7
[2,] 2 8
det(A_invertible)
[1] 2
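In floating point arithmetic a computed determinant is rarely exactly zero, so in practice one compares it to a small tolerance. A sketch (the helper name is my own, not part of base R):

```r
# treat determinants below a tolerance as effectively zero
is_invertible <- function(M, tol = 1e-10) abs(det(M)) > tol
is_invertible(A_singular)   # FALSE
is_invertible(A_invertible) # TRUE
```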
A matrix that is diagonalizable can be expressed as:

$\mathbf{A} = \mathbf{P}\mathbf{D}\mathbf{P}^{-1}$

where $\mathbf{D}$ is a diagonal matrix – the diagonalized version of the original matrix $\mathbf{A}$. How do we find out if this is possible, and if possible, what are the values of $\mathbf{P}$ and $\mathbf{D}$? The answer is to decompose $\mathbf{A}$ using the eigendecomposition:

$\mathbf{A} = \mathbf{V}\mathbf{\Lambda}\mathbf{V}^{-1}$
Now there is a lot to know about the eigendecomposition, but for now let’s just focus on a few key points:
We can answer the original question by using the eigen()
function in R
. Let’s do an example.
A_eigen <- matrix(c(1, 0, -2, 2, 3, 4, 0, 0, 2), ncol = 3)
A_eigen
[,1] [,2] [,3]
[1,] 1 2 0
[2,] 0 3 0
[3,] -2 4 2
eA <- eigen(A_eigen)
eA
eigen() decomposition
$values
[1] 3 2 1
$vectors
[,1] [,2] [,3]
[1,] 0.4082483 0 0.4472136
[2,] 0.4082483 0 0.0000000
[3,] 0.8164966 1 0.8944272
Since eigen(A_eigen)
returned a full set of independent eigenvectors, we can conclude that A_eigen
is diagonalizable. You can see the eigenvalues and eigenvectors in the returned value. We can reconstruct A_eigen
using Equation 4:
eA$vectors %*% diag(eA$values) %*% solve(eA$vectors)
[,1] [,2] [,3]
[1,] 1 2 0
[2,] 0 3 0
[3,] -2 4 2
Remember, diag()
creates a matrix with the values along the diagonal, and solve()
computes the inverse when it gets only one argument.
The only loose end is which matrices are not diagonalizable? These are covered in this Wikipedia article. Briefly, most non-diagonalizable matrices are fairly exotic, and matrices arising from real data sets will likely not be a problem.
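One classic non-diagonalizable example is a shear matrix, which has a repeated eigenvalue but only one independent eigenvector (a small demonstration, not from the original post):

```r
A_shear <- matrix(c(1, 0, 1, 1), nrow = 2) # rows: (1, 1) and (0, 1)
eigen(A_shear)$values # the eigenvalue 1 is repeated
# the eigenvector matrix is (numerically) singular, so the
# reconstruction V %*% Lambda %*% solve(V) is not possible
abs(det(eigen(A_shear)$vectors)) # essentially zero
```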
In texts, eigenvalues and eigenvectors are universally introduced as a scaling relationship

$\mathbf{A}\vec{v} = \lambda\vec{v}$

where $\vec{v}$ is a column eigenvector and $\lambda$ is a scalar eigenvalue. One says “$\mathbf{A}$ scales $\vec{v}$ by a factor of $\lambda$.” A single vector is used as one can readily illustrate how that vector grows or shrinks in length when multiplied by $\mathbf{A}$. Let’s call this the “bottom up” explanation.
Let’s check that $\mathbf{A}\vec{v} = \lambda\vec{v}$ is true using our values from above by extracting the first eigenvector and eigenvalue from eA. Notice that we are using regular multiplication on the right-hand side, i.e. *, rather than %*%, because eA$values[1] is a scalar. Also on the right-hand side, we have to add drop = FALSE to the subsetting process or the result is no longer a matrix.^{7}
isTRUE(all.equal(
A_eigen %*% eA$vectors[,1],
eA$values[1] * eA$vectors[,1, drop = FALSE]))
[1] TRUE
If instead we start from Equation 4 and rearrange it to show the relationship between $\mathbf{A}$ and $\mathbf{V}$ we get:

$\mathbf{A}\mathbf{V} = \mathbf{V}\mathbf{\Lambda}$
Let’s call this the “top down” explanation. We can verify this as well, making sure to convert eA$values
to a diagonal matrix as the values are stored as a vector to save space.
isTRUE(all.equal(A_eigen %*% eA$vectors, eA$vectors %*% diag(eA$values)))
[1] TRUE
Notice that in Equation 6 $\mathbf{\Lambda}$ is on the right of $\mathbf{V}$, but in Equation 5 the corresponding value, $\lambda$, is to the left of $\vec{v}$. This is a bit confusing until one realizes that Equation 5 could have been written

$\mathbf{A}\vec{v} = \vec{v}\lambda$

since $\lambda$ is a scalar. It’s too bad that the usual, bottom up, presentation seems to conflict with the top down approach. Perhaps the choice in Equation 5 is a historical artifact.
A normal matrix is one where $\mathbf{A}\mathbf{A}^{T} = \mathbf{A}^{T}\mathbf{A}$. As far as I know, there is no function in R
to check this condition, but we’ll write our own in a moment. One reason being “normal” is interesting is if $\mathbf{A}$ is a normal matrix, then the results of the eigendecomposition change slightly:

$\mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{T}$

where $\mathbf{Q}$ is an orthogonal matrix, which we’ll talk about next.
An orthogonal matrix takes the definition of a normal matrix one step further: $\mathbf{A}\mathbf{A}^{T} = \mathbf{A}^{T}\mathbf{A} = \mathbf{I}$. If a matrix is orthogonal, then its transpose is equal to its inverse: $\mathbf{A}^{T} = \mathbf{A}^{-1}$, which of course makes any special computation of the inverse unnecessary. This is a significant advantage in computations.
To aid our learning, let’s write a simple function that will report if a matrix is normal, orthogonal, or neither.^{8}
normal_or_orthogonal <- function(M) {
  if (!inherits(M, "matrix")) stop("M must be a matrix")
  norm <- orthog <- FALSE
  tst1 <- M %*% t(M)
  tst2 <- t(M) %*% M
  norm <- isTRUE(all.equal(tst1, tst2))
  if (norm) orthog <- isTRUE(all.equal(tst1, diag(dim(M)[1])))
  if (orthog) message("This matrix is orthogonal\n") else
    if (norm) message("This matrix is normal\n") else
      message("This matrix is neither orthogonal nor normal\n")
  invisible(NULL)
}
And let’s run a couple of tests.
normal_or_orthogonal(A_singular)
This matrix is neither orthogonal nor normal
Norm <- matrix(c(1, 0, 1, 1, 1, 0, 0, 1, 1), nrow = 3)
normal_or_orthogonal(Norm)
This matrix is normal
normal_or_orthogonal(diag(3)) # the identity matrix is orthogonal
This matrix is orthogonal
Orth <- matrix(c(0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0), nrow = 4)
normal_or_orthogonal(Orth)
This matrix is orthogonal
The columns of an orthogonal matrix are orthogonal to each other. We can show this by taking the dot product between any pair of columns. Remember, if the dot product is zero the vectors are orthogonal.
t(Orth[,1]) %*% Orth[,2] # col 1 dot col 2
[,1]
[1,] 0
t(Orth[,1]) %*% Orth[,3] # col 1 dot col 3
[,1]
[1,] 0
Finally, not only are the columns orthogonal, but each column vector has length one, making them orthonormal.
sqrt(sum(Orth[,1]^2))
[1] 1
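Rather than checking columns pairwise, all of the dot products can be collected at once with crossprod(): for orthonormal columns, $\mathbf{Q}^{T}\mathbf{Q} = \mathbf{I}$ (a quick check, not in the original post):

```r
# t(Orth) %*% Orth collects every pairwise dot product:
# zero off-diagonal = mutually orthogonal columns;
# ones on the diagonal = each column has length one
crossprod(Orth)
```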
Taking these queries together, we see that symmetric and diagonal matrices are necessarily diagonalizable and normal, and they are invertible as long as their determinant is non-zero. They are not, however, necessarily orthogonal. Identity matrices, on the other hand, have all of these properties. Let’s double-check these statements.
A_sym <- matrix(
  c(1, 5, 4, 5, 2, 9, 4, 9, 3),
  ncol = 3) # symmetric matrix, not diagonal
A_sym
[,1] [,2] [,3]
[1,] 1 5 4
[2,] 5 2 9
[3,] 4 9 3
normal_or_orthogonal(A_sym)
This matrix is normal
normal_or_orthogonal(diag(1:3)) # diagonal matrix, symmetric, but not the identity matrix
This matrix is normal
normal_or_orthogonal(diag(3)) # identity matrix (also symmetric, diagonal)
This matrix is orthogonal
So what’s the value of these queries? As mentioned, they help us understand the relationships between different types of matrices, so they help us learn more deeply. On a practical computational level they may not have much value, especially when dealing with real-world data sets. However, there are some other interesting aspects of these queries that deal with decompositions and eigenvalues. We might cover these in the future.
A more personal thought: In the course of writing these posts, and learning more linear algebra, it increasingly seems to me that a lot of the “effort” that goes into linear algebra is about making tedious operations simpler. Anytime one can have more zeros in a matrix, or have orthogonal vectors, or break a matrix into parts, the simpler things become. However, I haven’t really seen this point driven home in texts or tutorials. I think linear algebra learners would do well to keep this in mind.
These are the main sources I relied on for this post.
@online{hanson2022,
author = {Hanson, Bryan},
title = {Notes on {Linear} {Algebra} {Part} 4},
date = {2022-09-26},
url = {http://chemospec.org/posts/20220926LinearAlgNotesPt4/LinearAlgNotesPt4.html},
langid = {en}
}
Update 19 September 2022: in “Use of outer() for Matrix Multiplication”, corrected use of “cross” to be “outer” and added example in R
. Also added links to work by Hiranabe.
This post is a survey of the linear algebra-related functions from base R. Some of these I’ve discussed in other posts and some I may discuss in the future, but this post is primarily an inventory: these are the key tools we have available. “Notes” in the table are taken from the help files.
Matrices, including row and column vectors, will be shown in bold, e.g. $\mathbf{A}$ or $\vec{x}$, while scalars and variables will be shown in script, e.g. $x$. R code will appear like x <- y.
In the table, $\mathbf{U}$ or $\mathbf{R}$ is an upper/right triangular matrix. $\mathbf{L}$ is a lower/left triangular matrix (triangular matrices are square). $\mathbf{M}$ is a generic matrix of dimensions $m \times n$. $\mathbf{A}$ is a square matrix of dimensions $n \times n$.
| Function | Uses | Notes |
|---|---|---|
| **operators** | | |
| `*` | scalar multiplication | |
| `%*%` | matrix multiplication | two vectors give the dot product; vector + matrix gives the cross product (the vector will be promoted as needed)^{1} |
| **basic functions** | | |
| `t()` | transpose | interchange rows and columns |
| `crossprod()` | matrix multiplication | faster version of `t(A) %*% A` |
| `tcrossprod()` | matrix multiplication | faster version of `A %*% t(A)` |
| `outer()` | outer product & more | see discussion below |
| `det()` | computes determinant | uses the LU decomposition; determinant is a volume |
| `isSymmetric()` | name says it all | |
| `Conj()` | computes complex conjugate | |
| **decompositions** | | |
| `backsolve()` | solves $\mathbf{U}\vec{x} = \vec{b}$ | $\mathbf{U}$ is upper triangular |
| `forwardsolve()` | solves $\mathbf{L}\vec{x} = \vec{b}$ | $\mathbf{L}$ is lower triangular |
| `solve()` | solves $\mathbf{A}\vec{x} = \vec{b}$ and $\mathbf{A}^{-1}$ | e.g. linear systems; if given only one matrix, returns the inverse |
| `qr()` | solves $\mathbf{A} = \mathbf{Q}\mathbf{R}$ | $\mathbf{Q}$ is an orthogonal matrix; can be used to solve $\mathbf{A}\vec{x} = \vec{b}$; see `?qr` for several `qr.*` extractor functions |
| `chol()` | solves $\mathbf{A} = \mathbf{L}\mathbf{L}^{T}$ | only applies to positive semi-definite matrices (where $\vec{x}^{T}\mathbf{A}\vec{x} \ge 0$); related to LU decomposition |
| `chol2inv()` | computes $\mathbf{A}^{-1}$ from the results of `chol(M)` | |
| `svd()` | singular value decomposition | input $\mathbf{M}_{m \times n}$; can compute PCA; details |
| `eigen()` | eigen decomposition | requires $\mathbf{A}_{n \times n}$; can compute PCA; details |
One thing to notice is that there is no LU decomposition in base R. It is apparently used “under the hood” in solve() and there are versions available in contributed packages.^{2}
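For example, the Matrix package (distributed with R as a recommended package) provides lu(); a sketch, assuming the factor convention $\mathbf{A} = \mathbf{P}\mathbf{L}\mathbf{U}$ described in its help pages:

```r
library(Matrix)
A <- matrix(c(3, 5, 1, 11, 2, 0, 5, 2, 5), ncol = 3)
decomp <- expand(lu(A)) # list with P (permutation), L (lower), U (upper)
# the factors multiply back to A
as.matrix(decomp$P %*% decomp$L %*% decomp$U)
```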
As seen in Part 1 calling outer()
on two vectors does indeed give the cross product (technically corresponding to tcrossprod()
). This works because the defaults carry out multiplication.^{3} However, looking through the R
source code for uses of outer()
, the function should really be thought of in simple terms as creating all possible combinations of the two inputs. In that way it is similar to expand.grid()
. Here are two illustrations of the flexibility of outer()
:
# generate a grid of x,y values modified by a function
# from ?colorRamp
m <- outer(1:20, 1:20, function(x, y) sin(sqrt(x * y)/3))
str(m)
num [1:20, 1:20] 0.327 0.454 0.546 0.618 0.678 ...
# generate all combinations of month and year
# modified from ?outer; any function accepting 2 args can be used
outer(month.abb, 2000:2002, FUN = paste)
[,1] [,2] [,3]
[1,] "Jan 2000" "Jan 2001" "Jan 2002"
[2,] "Feb 2000" "Feb 2001" "Feb 2002"
[3,] "Mar 2000" "Mar 2001" "Mar 2002"
[4,] "Apr 2000" "Apr 2001" "Apr 2002"
[5,] "May 2000" "May 2001" "May 2002"
[6,] "Jun 2000" "Jun 2001" "Jun 2002"
[7,] "Jul 2000" "Jul 2001" "Jul 2002"
[8,] "Aug 2000" "Aug 2001" "Aug 2002"
[9,] "Sep 2000" "Sep 2001" "Sep 2002"
[10,] "Oct 2000" "Oct 2001" "Oct 2002"
[11,] "Nov 2000" "Nov 2001" "Nov 2002"
[12,] "Dec 2000" "Dec 2001" "Dec 2002"
Bottom line: outer()
can be used for linear algebra but its main uses lie elsewhere. You don’t need it for linear algebra!
Here’s an interesting connection discussed in this Wikipedia entry. In Part 1 we demonstrated how the repeated application of the dot product underpins matrix multiplication. The first row of the first matrix is multiplied elementwise by the first column of the second matrix, shown in red, to give the first element of the answer matrix. This process is then repeated so that every row (first matrix) has been multiplied by every column (second matrix).
If instead, we treat the first column of the first matrix as a column vector and outer multiply it by the first row of the second matrix as a row vector, we get the following matrix:
Now if you repeat this process for the second column of the first matrix and the second row of the second matrix, you get another matrix. And if you do it one more time using the third column/third row, you get a third matrix. If you then add these three matrices together, you get $\mathbf{C}$ as seen in Equation 1. Notice how each element in $\mathbf{C}$ in Equation 1 is a sum of three terms? Each of those terms comes from one of the three matrices just described.
To sum up, one can use the dot product on each row (first matrix) by each column (second matrix) to get the answer, or you can use the outer product on the columns sequentially (first matrix) by rows sequentially (second matrix) to get several matrices, which one then sums to get the answer. It’s pretty clear which option is less work and easier to follow, but I think it’s an interesting connection between operations. The first case corresponds to view “MM1” in The Art of Linear Algebra while the second case is view “MM4”. See this work by Kenji Hiranabe.
Here’s a simple proof in R
.
M1 <- matrix(1:6, nrow = 3, byrow = TRUE)
M1
[,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6
M2 <- matrix(7:10, nrow = 2, byrow = TRUE)
M2
[,1] [,2]
[1,] 7 8
[2,] 9 10
tst1 <- M1 %*% M2 # uses dot product
# next line is sum of sequential outer products:
# 1st col M1 by 1st row M2 + 2nd col M1 by 2nd row M2
tst2 <- outer(M1[,1], M2[1,]) + outer(M1[,2], M2[2,])
all.equal(tst1, tst2)
[1] TRUE
@online{hanson2022,
author = {Hanson, Bryan},
title = {Notes on {Linear} {Algebra} {Part} 3},
date = {2022-09-10},
url = {http://chemospec.org/posts/20220910LinearAlgNotesPt3/LinearAlgNotesPt3.html},
langid = {en}
}
For Part 1 of this series, see here.
If you open a linear algebra text, it’s quickly apparent how complex the field is. There are so many special types of matrices, so many different decompositions of matrices. Why are all these needed? Should I care about null spaces? What’s really important? What are the threads that tie the different concepts together? As someone who is trying to improve their understanding of the field, especially with regard to its applications in chemometrics, it can be a tough slog.
In this post I’m going to try to demonstrate how some simple chemometric tasks can be solved using linear algebra. Though I cover some math here, the math is secondary right now – the conceptual connections are more important. I’m more interested in finding (and sharing) a path through the thicket of linear algebra. We can return as needed to expand the basic math concepts. The cognitive effort to work through the math details is likely a lot lower if we have a sense of the big picture.
In this post, matrices, including row and column vectors, will be shown in bold, e.g. $\mathbf{A}$, while scalars and variables will be shown in script, e.g. $x$. Variables used in R
code will appear like A
.
If you’ve had algebra, you have certainly run into “system of equations” such as the following:
In algebra, such systems can be solved several ways, for instance by isolating one or more variables and substituting, or geometrically (particularly for 2D systems, by plotting the lines and looking for the intersection). Once there are more than a few variables however, the only manageable way to solve them is with matrix operations, or more explicitly, linear algebra. This sort of problem is the core of linear algebra, and the reason the field is called linear algebra.
To solve the system above using linear algebra, we have to write it in the form of matrices and column vectors:
or more generally

$\mathbf{A}\vec{x} = \vec{b}$

where $\mathbf{A}$ is the matrix of coefficients, $\vec{x}$ is the column vector of variable names^{1} and $\vec{b}$ is a column vector of constants. Notice that these matrices are conformable:^{2}

$\mathbf{A}_{n \times n} \, \vec{x}_{n \times 1} = \vec{b}_{n \times 1}$
To solve such a system, when we have $n$ unknowns, we need $n$ equations.^{3} This means that $\mathbf{A}$ has to be a square matrix, and square matrices play a special role in linear algebra. I’m not sure this point is always conveyed clearly when this material is introduced. In fact, many texts on linear algebra seem to bury the lede.
To find the values of $\vec{x}$,^{4} we can do a little rearranging following the rules of linear algebra and matrix operations. First we premultiply both sides by the inverse of $\mathbf{A}$, which then gives us the identity matrix $\mathbf{I}$, which drops out.^{5}

$\mathbf{A}^{-1}\mathbf{A}\vec{x} = \mathbf{A}^{-1}\vec{b}$

$\mathbf{I}\vec{x} = \mathbf{A}^{-1}\vec{b}$

$\vec{x} = \mathbf{A}^{-1}\vec{b}$
So it’s all sounding pretty simple right? Ha. This is actually where things potentially break down. For this to work, $\mathbf{A}$ must be invertible, which is not always the case.^{6} If there is no inverse, then the system of equations has either no solution or infinitely many solutions. So finding the inverse of a matrix, or discovering it doesn’t exist, is essential to solving these systems of linear equations.^{7} More on this eventually, but for now, we know $\mathbf{A}$ must be a square matrix and we hope it is invertible.
We learn in algebra that a line takes the form $y = mx + b$. If one has measurements in the form of $(x, y)$ pairs that one expects to fit to a line, we need linear regression. Carrying out a linear regression is arguably one of the most important, and certainly a very common, application of the linear systems described above. One can get the values of $m$ and $b$ by hand using algebra, but any computer will solve the system using a matrix approach.^{8} Consider this data:

| $x$ | $y$ |
|-----|------|
| 2.1 | 11.8 |
| 0.9 | 7.2 |
| 3.9 | 21.5 |
| 3.2 | 17.2 |
| 5.1 | 26.8 |
To express this in a matrix form, we recast

$y = mx + b$

into

$\vec{y} = \mathbf{X}\vec{\beta} + \vec{\epsilon}$

where $\vec{y}$ is the column vector of measured $y$ values, $\mathbf{X}$ is the design matrix (a column of ones for the intercept and a column of the measured $x$ values), $\vec{\beta}$ is the column vector of coefficients $\beta_0$ and $\beta_1$, and $\vec{\epsilon}$ is the column vector of errors.
With our data above, this looks like:

$\begin{bmatrix} 11.8 \\ 7.2 \\ 21.5 \\ 17.2 \\ 26.8 \end{bmatrix} = \begin{bmatrix} 1 & 2.1 \\ 1 & 0.9 \\ 1 & 3.9 \\ 1 & 3.2 \\ 1 & 5.1 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \vec{\epsilon}$

If we multiply this out, each row works out to be an instance of $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$. Hopefully you can appreciate that $\beta_0$ corresponds to $b$ and $\beta_1$ corresponds to $m$.^{9}
This looks similar to $\mathbf{A}\vec{x} = \vec{b}$ seen in Equation 3, if you set $\mathbf{A}$ to $\mathbf{X}$, $\vec{x}$ to $\vec{\beta}$ and $\vec{b}$ to $\vec{y}$:

$\mathbf{X}\vec{\beta} = \vec{y}$
This contortion of symbols is pretty nasty, but honestly not uncommon when moving about in the world of linear algebra.
As it is composed of real data, presumably with measurement errors, there is not an exact solution to $\mathbf{X}\vec{\beta} = \vec{y}$ due to the error term. There is, however, an approximate solution, which is what is meant when we say we are looking for the line of best fit. This is how linear regression is carried out on a computer. The relevant equation is:

$\hat{\vec{\beta}} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\vec{y}$

The key point here is that once again we need to invert a matrix to solve this. The details of where Equation 11 comes from are covered in a number of places, but I will note here that $\hat{\vec{\beta}}$ refers to the best estimate of $\vec{\beta}$.^{10}
We now have two examples where inverting a matrix is a key step: solving a system of linear equations, and approximating the solution to a system of linear equations (the regression case). These cases are not outliers; the ability to invert a matrix is very important. So how do we do this? The LU decomposition can do it, and is widely used, so it is worth spending some time on. A decomposition is the process of breaking a matrix into pieces that are easier to handle, or that give us special insight, or both. If you are a chemometrician you have almost certainly carried out Principal Components Analysis (PCA). Under the hood, PCA requires either a singular value decomposition, or an eigendecomposition (more info here).
So, about the LU decomposition: it breaks a matrix into two matrices, $\mathbf{L}$, a “lower triangular matrix”, and $\mathbf{U}$, an “upper triangular matrix”:

$\mathbf{A} = \mathbf{L}\mathbf{U}$

These special matrices contain only zeros except along the diagonal and the entries below it (in the lower case), or along the diagonal and the entries above it (in the upper case). The advantage of triangular matrices is that they are very easy to invert (all those zeros make many terms drop out). So the LU decomposition breaks the tough job of inverting $\mathbf{A}$ into two easier jobs.
When all is done, we only need to figure out $\mathbf{L}^{-1}$ and $\mathbf{U}^{-1}$, which as mentioned is straightforward:^{11}

$\mathbf{A}^{-1} = (\mathbf{L}\mathbf{U})^{-1} = \mathbf{U}^{-1}\mathbf{L}^{-1}$
To summarize, if we want to solve a system of equations we need to carry out matrix inversion, which in turn is much easier to do if one uses the LU decomposition to get two easy-to-invert triangular matrices. I hope you are beginning to see how the pieces of linear algebra fit together, and why it might be good to learn more.
Let’s look at how R
does these operations, and check our understanding along the way. R
makes this really easy. We’ll start with the issue of invertibility. Let’s create a matrix for testing.
A1 <- matrix(c(3, 5, 1, 11, 2, 0, 5, 2, 5), ncol = 3)
A1
[,1] [,2] [,3]
[1,] 3 11 5
[2,] 5 2 2
[3,] 1 0 5
In the matlib
package there is a function inv
that inverts matrices. It returns the inverted matrix, which we can verify by multiplying the inverted matrix by the original matrix to give the identity matrix (if inversion was successful). diag(3)
creates a 3 x 3 matrix with 1’s on the diagonal, in other words an identity matrix.
library("matlib")
A1_inv <- inv(A1)
all.equal(A1_inv %*% A1, diag(3))
[1] "Mean relative difference: 8.999999e-08"
The difference here is really small, but not zero. Let’s use a different function, solve
which is part of base R
. If solve
is given a single matrix, it returns the inverse of that matrix.
A1_solve <- solve(A1) %*% A1
all.equal(A1_solve, diag(3))
[1] TRUE
That’s a better result. Why are there differences? inv
uses a method called Gaussian elimination which is similar to how one would invert a matrix using pencil and paper. On the other hand, solve
uses the LU decomposition discussed earlier, and no matrix inversion is necessary. Looks like the LU decomposition gives a somewhat better numerical result.
Now let’s look at a different matrix, created by replacing the third column of A1
with different values.
A2 <- matrix(c(3, 5, 1, 11, 2, 0, 6, 10, 2), ncol = 3)
A2
[,1] [,2] [,3]
[1,] 3 11 6
[2,] 5 2 10
[3,] 1 0 2
And let’s compute its inverse using solve
.
solve(A2)
Error in solve.default(A2): system is computationally singular: reciprocal condition number = 6.71337e-19
When R
reports that A2
is computationally singular, it is saying that it cannot be inverted. Why not? If you look at A2
, notice that column 3 is a multiple of column 1. Anytime one column is a multiple of another, or one row is a multiple of another, then the matrix cannot be inverted because the rows or columns are not independent.^{12} If this was a matrix of coefficients from an experimental measurement of variables, this would mean that some of your variables are not independent, they must be measuring the same underlying phenomenon.
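The rank of a matrix gives another view of the same problem: qr()$rank counts the independent columns (a quick check, not in the original post):

```r
qr(A1)$rank # 3: full rank, so A1 is invertible
qr(A2)$rank # 2: only two independent columns, so A2 is singular
```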
Let’s solve the system from Equation 2. It turns out that the solve
function also handles this case, if you give it two arguments. Remember, solve
is using the LU decomposition behind the scenes, no matrix inversion is required.
A3 <- matrix(c(1, 2, 3, 2, 1, 2, 3, 1, 1), ncol = 3)
A3
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 2 1 1
[3,] 3 2 1
colnames(A3) <- c("x", "y", "z") # naming the columns will label the answer
b <- c(3, 11, 5)
solve(A3, b)
x y z
2 4 3
The answer is the values of $\vec{x}$, namely $x$, $y$ and $z$, that make the system of equations true.
While we’ve emphasized the importance and challenges of inverting matrices, we’ve also pointed out that to solve a linear system there are alternatives to looking at the problem from the perspective of Equation 5. Here’s an approach using the LU decomposition, starting with substituting $\mathbf{L}\mathbf{U}$ for $\mathbf{A}$:

$\mathbf{L}\mathbf{U}\vec{x} = \vec{b}$

We want to solve for $\vec{x}$, the column vector of variables. To do so, define a new vector $\vec{y} = \mathbf{U}\vec{x}$ and substitute it in:

$\mathbf{L}\vec{y} = \vec{b}$

Next we solve for $\vec{y}$. One way we could do this is to premultiply both sides by $\mathbf{L}^{-1}$, but we are looking for a way to avoid using the inverse. Instead, we evaluate $\mathbf{L}\vec{y}$ to give a series of expressions using the dot product (in other words plain matrix multiplication). Because $\mathbf{L}$ is lower triangular, many of the terms we might have gotten actually disappear because of the zero coefficients. What remains is simple enough that we can algebraically find each element of $\vec{y}$ starting from the first row (this is called forward substitution). Once we have $\vec{y}$, we can find $\vec{x}$ by solving $\mathbf{U}\vec{x} = \vec{y}$ using a similar approach, but working from the last row upward (this is backward substitution). This is a good illustration of the utility of triangular matrices: some operations can move from the linear algebra realm to the algebra realm. Wikipedia has a good illustration of forward and backward substitution.
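To make forward and backward substitution concrete, here is a sketch with a hand-built $\mathbf{L}$ and $\mathbf{U}$ (the numbers are made up purely for illustration):

```r
L <- matrix(c(1, 2, 3, 0, 1, 4, 0, 0, 1), ncol = 3) # unit lower triangular
U <- matrix(c(2, 0, 0, 1, 3, 0, 5, 2, 4), ncol = 3) # upper triangular
A <- L %*% U             # the matrix these factors represent
b <- c(1, 2, 3)
y <- forwardsolve(L, b)  # solve L y = b from the top row down
x <- backsolve(U, y)     # solve U x = y from the bottom row up
A %*% x                  # recovers b, with no inverse computed
```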
Let’s compute the values of $\hat{\vec{\beta}}$ for our regression data shown in Equation 6. First, let’s set up the needed matrices and plot the data since visualizing the data is always a good idea.
y = matrix(c(11.8, 7.2, 21.5, 17.2, 26.8), ncol = 1)
X = matrix(c(rep(1, 5), 2.1, 0.9, 3.9, 3.2, 5.1), ncol = 2) # design matrix
X
[,1] [,2]
[1,] 1 2.1
[2,] 1 0.9
[3,] 1 3.9
[4,] 1 3.2
[5,] 1 5.1
plot(X[,2], y, xlab = "x") # column 2 of X has the x values
The value of $\hat{\vec{\beta}}$ can be found via Equation 11:
solve((t(X) %*% X)) %*% t(X) %*% y
[,1]
[1,] 2.399618
[2,] 4.769862
The first value is $\beta_0$, or $b$, the intercept; the second value is $\beta_1$, or $m$, the slope.
Let’s compare this answer to R
’s builtin lm
function (for linear model):
fit <- lm(y ~ X[,2])
fit
Call:
lm(formula = y ~ X[, 2])
Coefficients:
(Intercept) X[, 2]
2.40 4.77
We have good agreement! If you care to learn about the goodness of the fit, the residuals etc, then you can look at the help file ?lm
and str(fit)
. lm
returns pretty much all one needs to know about the results, but if you wish to calculate all the interesting values yourself you can do so by manipulating Equation 11 and its relatives.
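As a cross-check, the fitted values and residuals can also be computed directly from the matrices in Equation 11 (a sketch; lm() does all of this for you):

```r
y <- matrix(c(11.8, 7.2, 21.5, 17.2, 26.8), ncol = 1)
X <- matrix(c(rep(1, 5), 2.1, 0.9, 3.9, 3.2, 5.1), ncol = 2)
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y # Equation 11
fitted_by_hand <- X %*% beta_hat             # predicted y values
resid_by_hand <- y - fitted_by_hand          # residuals
```

These agree with fitted(fit) and resid(fit) from the lm() result.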
Finally, let’s plot the line of best fit found by lm
to make sure everything looks reasonable.
plot(X[,2], y, xlab = "x")
abline(coef = coef(fit), col = "red")
That’s all for now, and a lot to digest. I hope you are closer to finding your own path through linear algebra. Remember that investing in learning the fundamentals prepares you for tackling the more complex topics. Thanks for reading!
These are the main sources I relied on for this post.
The vignettes of the matlib
package are very helpful.
@online{hanson2022,
author = {Hanson, Bryan},
title = {Notes on {Linear} {Algebra} {Part} 2},
date = {2022-09-01},
url = {http://chemospec.org/posts/20220901LinearAlgNotesPt2/LinearAlgNotesPt2.html},
langid = {en}
}
If you are already comfortable with linear algebra in R, read no further and do something else!
If you are like me, you’ve had no formal training in linear algebra, which means you learn what you need to when you need to use it. Eventually, you cobble together some hardwon knowledge. That’s good, because almost everything in chemometrics involves linear algebra.
This post is essentially a set of personal notes about the dot product and the cross product, two important manipulations in linear algebra. I’ve tried to harmonize things I learned way back in college physics and math courses, and integrate information I’ve found in various sources I have leaned on more recently. Without a doubt, the greatest impediment to really understanding this material is the use of multiple terminologies and notations. I’m going to try really hard to be clear and to the point in my discussion.
The main sources I’ve relied on are:
Let’s get started. For sanity and consistency, let’s define two 3D vectors and two matrices to illustrate our examples. Most of the time I’m going to write vectors with an arrow over the name, as a nod to the treatment usually given in a physics course. This reminds us that we are thinking about a quantity with direction and magnitude in some coordinate system, something geometric. Of course in the R
language a vector is simply a list of numbers with the same data type; R
doesn’t care if a vector is a vector in the geometric sense or a list of states.
The dot product goes by these other names: inner product, scalar product. Typical notations include:^{1} $\vec{a} \cdot \vec{b}$, $\langle \vec{a}, \vec{b} \rangle$ and $\vec{a}^{T}\vec{b}$.
There are two main formulas for the dot product with vectors, the algebraic formula (Equation 5) and the geometric formula (Equation 6).

$\vec{a} \cdot \vec{b} = a_1 b_1 + a_2 b_2 + a_3 b_3 \tag{5}$

$\vec{a} \cdot \vec{b} = \| \vec{a} \| \, \| \vec{b} \| \cos \theta \tag{6}$
$\| \vec{a} \|$ refers to the $L_2$ or Euclidean norm, namely the length of the vector:^{2}

$\| \vec{a} \| = \sqrt{a_1^2 + a_2^2 + a_3^2} \tag{7}$
The result of the dot product is a scalar. The dot product is also commutative: $\vec{a} \cdot \vec{b} = \vec{b} \cdot \vec{a}$.
From the perspective of matrices, if we think of $\vec{a}$ and $\vec{b}$ as column vectors with dimensions 3 x 1, then transposing $\vec{a}$ gives us conformable matrices and we find the result of matrix multiplication is the dot product (compare to Equation 5):

$\vec{a}^{T} \vec{b} = \begin{bmatrix} a_1 & a_2 & a_3 \end{bmatrix} \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix} = a_1 b_1 + a_2 b_2 + a_3 b_3 \tag{8}$

Even though this is matrix multiplication, the answer is still a scalar.
Now, rather confusingly, if we think of $\vec{a}$ and $\vec{b}$ as row vectors, and we transpose $\vec{b}$, then we get the dot product:

$\vec{a} \, \vec{b}^{T} = \begin{bmatrix} a_1 & a_2 & a_3 \end{bmatrix} \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix} = a_1 b_1 + a_2 b_2 + a_3 b_3 \tag{9}$
Equations 8 and 9 can be a source of real confusion at first. They give the impression that the dot product can be either $\vec{a}^{T}\vec{b}$ or $\vec{a}\vec{b}^{T}$. However, this is only true in the limited contexts defined above. To summarize: $\vec{a}^{T}\vec{b}$ is the dot product when $\vec{a}$ and $\vec{b}$ are column vectors, while $\vec{a}\vec{b}^{T}$ is the dot product only when they are row vectors.
Unfortunately I think this distinction is not always clearly made by authors, and is a source of great confusion to linear algebra learners. Be careful when working with row and column vectors.
Suppose we wanted to compute $\mathbf{A}\mathbf{B} = \mathbf{C}$.^{3} We use the idea of row and column vectors to accomplish this task. In the process, we discover that matrix multiplication is a series of dot products: the dot product of the first row of $\mathbf{A}$ and the first column of $\mathbf{B}$ gives the first entry in $\mathbf{C}$. This process is then repeated so that every row of $\mathbf{A}$ has been multiplied by every column of $\mathbf{B}$. Every entry in $\mathbf{C}$ results from a dot product. Every entry is a scalar, embedded in a matrix.
The cross product goes by these other names: outer product^{4}, tensor product, vector product.
The cross product of two vectors returns a vector rather than a scalar. Vectors are defined in terms of a basis, which is a coordinate system. Earlier, when we defined $\vec{a}$ it was intrinsically defined in terms of the standard basis set $\hat{i}, \hat{j}, \hat{k}$ (in some fields this would be called the unit coordinate system). Thus a fuller definition of $\vec{a}$ would be:

$\vec{a} = a_1 \hat{i} + a_2 \hat{j} + a_3 \hat{k}$
In terms of vectors, the cross product is defined as:

$\vec{a} \times \vec{b} = (a_2 b_3 - a_3 b_2)\,\hat{i} + (a_3 b_1 - a_1 b_3)\,\hat{j} + (a_1 b_2 - a_2 b_1)\,\hat{k}$
In my opinion, this is not exactly intuitive, but there is a pattern to it: notice that the term for $\hat{i}$ doesn’t involve the $a_1$ or $b_1$ components (and similarly for $\hat{j}$ and $\hat{k}$). The details of how this result is computed rely on some properties of the basis set; this Wikipedia article has a nice explanation. We need not dwell on it however.
There is also a geometric formula for the cross product:

$\vec{a} \times \vec{b} = \| \vec{a} \| \, \| \vec{b} \| \sin \theta \; \hat{n}$

where $\hat{n}$ is the unit vector perpendicular to the plane defined by $\vec{a}$ and $\vec{b}$. The direction of $\hat{n}$ is defined by the right-hand rule. Because of this, the cross product is not commutative, i.e. $\vec{a} \times \vec{b} \ne \vec{b} \times \vec{a}$. The cross product is however anticommutative:

$\vec{a} \times \vec{b} = -(\vec{b} \times \vec{a})$
As we did for the dot product, we can look at the cross product from the perspective of column vectors. Instead of transposing the first matrix as we did for the dot product, we transpose the second one:

$\vec{a} \, \vec{b}^{T} = \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix} \begin{bmatrix} b_1 & b_2 & b_3 \end{bmatrix} = \begin{bmatrix} a_1 b_1 & a_1 b_2 & a_1 b_3 \\ a_2 b_1 & a_2 b_2 & a_2 b_3 \\ a_3 b_1 & a_3 b_2 & a_3 b_3 \end{bmatrix}$
Interestingly, we are using the dot product to compute the cross product.
The case where we treat $\vec{a}$ and $\vec{b}$ as row vectors is left to the reader.^{5}
Finally, there is a matrix definition of the cross product as well. Evaluation of the following determinant gives the cross product:

$\vec{a} \times \vec{b} = \begin{vmatrix} \hat{i} & \hat{j} & \hat{k} \\ a_1 & a_2 & a_3 \\ b_1 & b_2 & b_3 \end{vmatrix}$
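Base R has no function for the 3D vector cross product, but the component formula above is easy to code directly (the function name cross3 is my own):

```r
# cross product of two 3D vectors, straight from the component formula
cross3 <- function(a, b) {
  c(a[2] * b[3] - a[3] * b[2],  # i component
    a[3] * b[1] - a[1] * b[3],  # j component
    a[1] * b[2] - a[2] * b[1])  # k component
}
a <- c(1, 2, 3); b <- c(4, 5, 6)
cross3(a, b)                # c(-3, 6, -3)
cross3(a, b) + cross3(b, a) # anticommutative: the sum is the zero vector
sum(cross3(a, b) * a)       # perpendicular to a: the dot product is 0
```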
%*%
The workhorse for matrix multiplication in R
is the %*%
function. This function will accept any combination of vectors and matrices as inputs, so it is flexible. It is also smart: given a vector and a matrix, the vector will be treated as a row or column matrix as needed to ensure conformability, if possible. Let’s look at some examples:
# Some data for examples
p <- 1:5
q <- 6:10
M <- matrix(1:15, nrow = 3, ncol = 5)
M
[,1] [,2] [,3] [,4] [,5]
[1,] 1 4 7 10 13
[2,] 2 5 8 11 14
[3,] 3 6 9 12 15
# A vector times a vector
p %*% q
[,1]
[1,] 130
Notice that R returns a data type of matrix, but it is a 1 × 1 matrix, and thus a scalar value. That means we just computed the dot product, a decision R made internally. We can verify this by noting that q %*% p gives the same answer. Thus, R handled these vectors as column vectors and computed $\mathbf{p}^{\mathsf{T}}\mathbf{q}$.
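We can also check this against the elementwise definition of the dot product, using the p and q defined above:

```r
# Sanity check: %*% on two vectors matches the elementwise
# definition of the dot product.
p <- 1:5
q <- 6:10
sum(p * q)    # 130, the dot product computed "by hand"
drop(p %*% q) # 130; drop() strips the 1 x 1 matrix down to a scalar
```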
# A vector times a matrix
M %*% p
[,1]
[1,] 135
[2,] 150
[3,] 165
As M had dimensions 3 × 5, R treated p as a 5 × 1 column vector in order to be conformable. The result is a vector, so this is the cross product.
If we try to compute p %*% M we get an error, because there is nothing R can do to p which will make it conformable to M.
p %*% M
Error in p %*% M: nonconformable arguments
What about multiplying matrices?
M %*% M
Error in M %*% M: nonconformable arguments
As you can see, when dealing with matrices, %*% will not change a thing, and if your matrices are nonconformable then it’s an error. Of course, if we transpose either instance of M we do have conformable matrices, but the answers are different, and this is neither the dot product nor the cross product, just matrix multiplication.
t(M) %*% M
[,1] [,2] [,3] [,4] [,5]
[1,] 14 32 50 68 86
[2,] 32 77 122 167 212
[3,] 50 122 194 266 338
[4,] 68 167 266 365 464
[5,] 86 212 338 464 590
M %*% t(M)
[,1] [,2] [,3]
[1,] 335 370 405
[2,] 370 410 450
[3,] 405 450 495
What can we take from these examples?

- R will give you the dot product if you give it two vectors. Note that this is a design decision, as it could have returned the cross product (see Equation 14).
- R will promote a vector to a row or column vector if it can, to make it conformable with a matrix you provide. If it cannot, R will give you an error. If it can, the cross product is returned.
- Given two matrices, R will give an error when they are not conformable.
- %*% does it all: dot product, cross product, or matrix multiplication, but you need to pay attention.

There are other R functions that do some of the same work:
- crossprod: equivalent to t(M) %*% M but faster.
- tcrossprod: equivalent to M %*% t(M) but faster.
- outer, also available as the operator %o%.

The first two functions will accept combinations of vectors and matrices, as does %*%. Let’s try it with two vectors:
crossprod(p, q)
[,1]
[1,] 130
Huh. crossprod is returning the dot product! So this is the case where “the cross product is not the cross product.” From a clarity perspective, this is not ideal. Let’s try the other function:
tcrossprod(p, q)
[,1] [,2] [,3] [,4] [,5]
[1,] 6 7 8 9 10
[2,] 12 14 16 18 20
[3,] 18 21 24 27 30
[4,] 24 28 32 36 40
[5,] 30 35 40 45 50
There’s the cross product!
What about outer? Remember that another name for the cross product is the outer product. So is outer the same as tcrossprod? In the case of two vectors, it is:
identical(outer(p, q), tcrossprod(p, q))
[1] TRUE
What about a vector with a matrix?
tst <- outer(p, M)
dim(tst)
[1] 5 3 5
Alright, that clearly is not a cross product. The result is an array with dimensions 5 × 3 × 5, not a matrix (which would have only two dimensions). outer does correspond to the cross product in the case of two vectors, but anything with higher dimensions gives a different beast. So perhaps using “outer” as a synonym for cross product is not a good idea.
Given what we’ve seen above, make your life simple and stick to %*%, and pay close attention to the dimensions of the arguments, especially if row or column vectors are in use. In my experience, thinking about the units and dimensions of whatever it is you are calculating is very helpful. Later, if speed is really important in your work, you can use one of the faster alternatives.
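If you do reach that point, a rough timing comparison is easy to sketch; exact timings will vary with your machine and BLAS library, but the answers are identical:

```r
# Compare crossprod(M) with the explicit t(M) %*% M.
M <- matrix(rnorm(1000 * 200), nrow = 1000, ncol = 200)
all.equal(crossprod(M), t(M) %*% M)      # TRUE: same answer
system.time(for (i in 1:50) crossprod(M))
system.time(for (i in 1:50) t(M) %*% M)  # typically slower: forms t(M) explicitly
```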
@online{hanson2022,
author = {Hanson, Bryan},
title = {Notes on {Linear} {Algebra} {Part} 1},
date = {2022-08-14},
url = {http://chemospec.org/posts/20220814LinearAlgNotes/20220814LinearAlgNotes.html},
langid = {en}
}
A few days ago I pushed a major update, and at this point Python
packages outnumber R
packages more than two to one. The update was made possible because I recently had time to figure out how to search the PyPi.org site automatically.
In a previous post I explained the methods I used to find packages related to spectroscopy. These have been updated considerably and the rest of this post will cover the updated methods.
There are four places I search for packages related to spectroscopy.^{1} They are CRAN (searched with the packagefinder package^{2}), Github, PyPi.org, and juliapackages.org. The topics I search are as follows:
I search CRAN using packagefinder; the process is quite straightforward and won’t be covered here. However, it is not an automated process (I should probably work on that).
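For the curious, a minimal packagefinder session might look like the following (a sketch only; the search term is illustrative, and you should check the package documentation for the full set of options):

```r
# Hypothetical example search; "spectroscopy" is just an illustrative term.
library("packagefinder")
findPackage("spectroscopy") # returns/displays a table of matching CRAN packages
```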
The broad approach used to search Github is the same as described in the original post. However, the scripts have been refined and updated, and now exist as functions in a new package I created called webu (for “web utilities”, but that name is taken on CRAN). The repo is here. webu is not on CRAN and I don’t currently intend to put it there, but you can install from the repo of course if you wish to try it out.
Searching Github is now carried out by a supervising script called /utilities/run_searches.R (in the FOSS4Spectroscopy repo). The script contains some notes about finicky details, but is pretty simple overall and should be easy enough to follow.
Unlike Github, it is not necessary to authenticate to use the PyPi.org API. That makes things simpler than the Github case. The needed functions are in webu and include some deliberate delays so as to not overload their servers. As for Github, searches are supervised by /utilities/run_searches.R.
One thing I observed at PyPi.org is that authors do not always fill out all the fields that PyPi.org can accept, which means some fields are NULL and we have to trap for that possibility. Package information is accessed via a JSON record; for instance the entry for nmrglue can be seen here. This package is pretty typical in that the author_email field is filled out, but the maintainer_email field is not (they are presumably the same). If one considers these JSON files to be analogous to DESCRIPTION in R packages, it looks like there is less oversight on PyPi.org compared to CRAN.
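As a sketch of the kind of NULL-trapping needed (the field names follow the JSON record linked above; the helper shown here is hypothetical and not part of webu):

```r
# Hypothetical helper: safely pull a field from a parsed PyPI JSON record.
# A package's record lives at https://pypi.org/pypi/<name>/json
library("jsonlite")
get_field <- function(pkg_info, field) {
  value <- pkg_info$info[[field]]
  # trap NULL (field absent) and empty strings alike
  if (is.null(value) || identical(value, "")) NA_character_ else value
}
pkg_info <- fromJSON("https://pypi.org/pypi/nmrglue/json")
get_field(pkg_info, "author_email")     # filled out for this package
get_field(pkg_info, "maintainer_email") # not filled out, so NA here
```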
Julia packages are readily searched manually at juliapackages.org.
The raw results from the searches described above still need a lot of inspection and cleaning to be usable. The PyPi.org and Github results are saved in an Excel worksheet with the relevant URLs. These links can be followed to determine the suitability of each package. In the /Utilities folder there are additional scripts to remove entries that are already in the main database (FOSS4Spec.xlsx), as well as to check the names of the packages: Python authors and/or policies seem to lead to cases where different packages can have names differing only by case, but also authors are sometimes sloppy when referring to their own packages, sometimes using mypkg and at other times myPkg to refer to the same package.
@online{hanson2022,
author = {Hanson, Bryan},
title = {FOSS4Spectroscopy: {R} Vs {Python}},
date = {2022-07-06},
url = {http://chemospec.org/posts/20220706F4SUpdate/20220706F4SUpdate.html},
langid = {en}
}
I’m pleased to announce that my colleague David Harvey and I have recently released LearnPCA, an R package to help people with understanding PCA. In LearnPCA we’ve tried to integrate our years of experience teaching the topic, along with the best insights we can find in books, tutorials and the nooks and crannies of the internet. Though our experience is in a chemometrics context, we use examples from different disciplines so that the package will be broadly helpful.
The package contains seven vignettes that proceed from the conceptual basics to advanced topics. As of version 0.2.0, there is also a Shiny app to help visualize the process of finding the principal component axes. The current vignettes are:
You can access the vignettes at the Github Site; you don’t even have to install the package. For the Shiny app, do the following:
install.packages("LearnPCA") # you'll need version 0.2.0
library("LearnPCA")
PCsearch()
We would really appreciate your feedback on this package. You can do so in the comments below, or open an issue.
@online{hanson2022,
author = {Hanson, Bryan},
title = {Introducing {LearnPCA}},
date = {2022-05-03},
url = {http://chemospec.org/posts/20220503LearnPCAIntro/20220503LearnPCAIntro.html},
langid = {en}
}
If you aren’t familiar with ChemoSpec, you might wish to look at the introductory vignette first.
In this series of posts we are following the protocol as described in the printed publication closely (Blaise et al. 2021). The authors have also provided a Jupyter notebook. This is well worth your time, even if Python is not your preferred language, as there are additional examples and discussion for study.
Load the Spectra object we created in Part 2 so we can summarize it.
library("ChemoSpec")
load("Worms2.RData") # restores the 'Worms2' Spectra object
sumSpectra(Worms2)
C. elegans metabolic phenotyping study (Blaise 2007)
There are 133 spectra in this set.
The yaxis unit is intensity.
The frequency scale runs from
8.9995 to 5e-04 ppm
There are 8600 frequency values.
The frequency resolution is
0.001 ppm/point.
This data set is not continuous
along the frequency axis.
Here are the data chunks:
beg.freq end.freq size beg.indx end.indx
1 8.9995 5.0005 3.999 1 4000
2 4.5995 0.0005 4.599 4001 8600
The spectra are divided into 4 groups:
group no. color symbol alt.sym
1 Mut_L2 28 #FB0D16FF 0 m2
2 Mut_L4 33 #FFC0CBFF 15 m4
3 WT_L2 32 #511CFCFF 1 w2
4 WT_L4 40 #2E94E9FF 16 w4
*** Note: this is an S3 object
of class 'Spectra'
If you recall, in Part 2 we removed five samples. Let’s rerun PCA without these samples and show the key plots. We will simply report these here without much discussion; they are pretty much as expected.
c_pca <- c_pcaSpectra(Worms2, choice = "autoscale")
plotScree(c_pca)
p <- plotScores(Worms2, c_pca, pcs = 1:2, ellipse = "rob", tol = 0.02)
p
p <- plotScores(Worms2, c_pca, pcs = 2:3, ellipse = "rob", leg.loc = "bottomleft",
  tol = 0.02)
p
One thing the published protocol does not explicitly discuss is an inspection of the loadings, but it is covered in the Jupyter notebook. The loadings are useful in order to see if any particular frequencies are driving the separation of the samples in the score plot. Let’s plot the loadings (Figure 4). Remember that these data were autoscaled, and hence all frequencies, including noisy frequencies, will contribute to the separation. If we had not scaled the data, these plots would look dramatically different.
p <- plotLoadings(Worms2, c_pca, loads = 1:2)
p
The s-plot is another very useful way to find peaks that are important in separating the samples (Figure 5); we can see that the peaks around 1.30-1.32, 1.47-1.48, and 3.03-3.07 ppm are important drivers of the separation in the score plot. Having discovered this, one can investigate the source of those peaks.
p <- sPlotSpectra(Worms2, c_pca, tol = 0.001)
p
ChemoSpec carries out exploratory data analysis, which is an unsupervised process. The next step in the protocol is PLS-DA (partial least squares discriminant analysis). I have written about ChemoSpec + PLS here if you would like more background on plain PLS. However, PLS-DA is a technique that combines data reduction/variable selection along with classification. We’ll need the mixOmics package (Rohart et al. (2017)) for this analysis; note that loading it replaces the plotLoadings function from ChemoSpec.
library("mixOmics")
Loading required package: MASS
Loading required package: lattice
Loaded mixOmics 6.20.0
Thank you for using mixOmics!
Tutorials: http://mixomics.org
Bookdown vignette: https://mixomicsteam.github.io/Bookdown
Questions, issues: Follow the prompts at http://mixomics.org/contactus
Cite us: citation('mixOmics')
Attaching package: 'mixOmics'
The following object is masked from 'package:ChemoSpec':
plotLoadings
Figure 6 shows the score plot; the results suggest that classification and modeling may be successful. The splsda function carries out a single sparse computation. One computation should not be considered the ideal answer; a better approach is to use cross-validation, for instance the bootsPLS function in the bootsPLS package (Rohart, Le Cao, and Wells (2018), which uses splsda under the hood). However, that computation is too time-consuming to demonstrate here.
X <- Worms2$data
Y <- Worms2$groups
splsda <- splsda(X, Y, ncomp = 8)
plotIndiv(splsda,
  col.per.group = c("#FB0D16FF", "#FFC0CBFF", "#511CFCFF", "#2E94E9FF"),
  title = "sPLSDA Score Plot", legend = TRUE, ellipse = TRUE)
To estimate the number of components needed, the perf function can be used. The results are in Figure 7 and suggest that five components are sufficient to describe the data.
perf.splsda <- perf(splsda, folds = 5, nrepeat = 5)
plot(perf.splsda)
At this point, we have several ideas of how to proceed. Going forward, one might choose to focus on accurate classification, or on determining which frequencies should be included in a predictive model. Any model will need to be refined and more details extracted. The reader is referred to the case study from the mixOmics folks which covers these tasks and explains the process.
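One concrete next step along those lines is cross-validated tuning with mixOmics' own tune.splsda function, which selects both the number of components and the number of variables to keep per component. This is only a sketch (slow to run, and the test.keepX grid is illustrative, not a recommendation):

```r
# A sketch of cross-validated tuning with mixOmics (slow; not run here).
# The test.keepX values are illustrative only.
X <- Worms2$data
Y <- Worms2$groups
tune <- tune.splsda(X, Y, ncomp = 8,
  validation = "Mfold", folds = 5, nrepeat = 5,
  test.keepX = c(10, 25, 50, 100))
tune$choice.keepX # suggested number of variables per component
```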
This post was created using ChemoSpec version 6.1.3 and ChemoSpecUtils version 1.0.0.
@online{hanson2022,
author = {Hanson, Bryan},
title = {Metabolic {Phenotyping} {Protocol} {Part} 3},
date = {2022-05-01},
url = {http://chemospec.org/posts/20220501ProtocolPt3/20220501ProtocolPt3.html},
langid = {en}
}
If you aren’t familiar with ChemoSpec, you might wish to look at the introductory vignette first.
In this series of posts we are following the protocol as described in the printed publication closely (Blaise et al. 2021). The authors have also provided a Jupyter notebook. This is well worth your time, even if Python is not your preferred language, as there are additional examples and discussion for study.
I saved the Spectra object we created in Part 1 so we can read it and remind ourselves of what’s in it. Due to the compression in R’s save function, the data takes up 4.9 Mb on disk. The original csv files total about 62 Mb.
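As an aside, the effect of save's compression is easy to see on synthetic data (illustrative only; the exact ratio depends on the data and your system):

```r
# Compare save() (gzip-compressed binary, by default) to plain csv.
m <- matrix(rnorm(1e6), ncol = 100)
save(m, file = "m.RData")
write.csv(m, "m.csv", row.names = FALSE)
file.size("m.csv") / file.size("m.RData") # csv is typically several times larger
unlink(c("m.RData", "m.csv"))             # clean up
```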
library("ChemoSpec")
load("Worms.Rdata") # restores the 'Worms' Spectra object
sumSpectra(Worms)
C. elegans metabolic phenotyping study (Blaise 2007)
There are 139 spectra in this set.
The yaxis unit is intensity.
The frequency scale runs from
8.9995 to 5e-04 ppm
There are 8600 frequency values.
The frequency resolution is
0.001 ppm/point.
This data set is not continuous
along the frequency axis.
Here are the data chunks:
beg.freq end.freq size beg.indx end.indx
1 8.9995 5.0005 3.999 1 4000
2 4.5995 0.0005 4.599 4001 8600
The spectra are divided into 4 groups:
group no. color symbol alt.sym
1 Mut_L2 32 #FB0D16FF 0 m2
2 Mut_L4 33 #FFC0CBFF 15 m4
3 WT_L2 34 #511CFCFF 1 w2
4 WT_L4 40 #2E94E9FF 16 w4
*** Note: this is an S3 object
of class 'Spectra'
We will follow the steps described in the published protocol closely.
Apply PQN normalization; scaling in ChemoSpec is applied at the PCA stage (next).
Worms <- normSpectra(Worms) # PQN is the default
Conduct classical PCA using autoscaling.^{1} Note that ChemoSpec includes several different variants of PCA, each with scaling options. See the introductory vignette for more details. For more about what PCA is and how it works, please see the LearnPCA package.
c_pca <- c_pcaSpectra(Worms, choice = "autoscale") # no scaling is the default
A key question at this stage is how many components are needed to describe the data set. Keep in mind that this depends on the choice of scaling. Figure 1 and Figure 2 are two different types of scree plots, which show the residual variance. This is the R^{2}_{x} value in the protocol (see protocol Figure 7a). Another approach to answering this question is to do a cross-validated PCA.^{2} The results are shown in Figure 3. These are the Q^{2}_{x} values in protocol Figure 7a. All of these ways of looking at the variance explained suggest that retaining three or possibly four PCs is adequate.
plotScree(c_pca)
plotScree(c_pca, style = "trad")
cv_pcaSpectra(Worms, choice = "autoscale", pcs = 10)
Next, examine the score plots (Figure 4, Figure 5). In these plots, each data point is colored by its group membership (keep in mind this is completely independent of the PCA calculation). In addition, robust confidence ellipses are shown for each group. Inspection of these plots is one way to identify potential outliers. The other use is of course to see if the sample classes separate, and by how much.
Examination of these plots shows that separation by classes has not really been achieved using autoscaling. In Figure 4 we see four clear outlier candidates (samples 37, 101, 107, and 118). In Figure 5 we see some of these samples and should probably add sample 114 for a total of five candidates.
p <- plotScores(Worms, c_pca, pcs = 1:2, ellipse = "rob", tol = 0.02)
p
p <- plotScores(Worms, c_pca, pcs = 2:3, ellipse = "rob", leg.loc = "topright", tol = 0.02)
p
To label more sample points, you can increase the value of the tol argument.
The protocol recommends plotting Hotelling’s T^{2} ellipse for the entire data set; this is not implemented in ChemoSpec but we can easily do it if we are using ggplot2 plots (which is the default in ChemoSpec). We need the ellipseCoord function from the HotellingsEllipse package.^{3}
source("ellipseCoord.R")
xy_coord <- ellipseCoord(as.data.frame(c_pca$x), pcx = 1, pcy = 2, conf.limit = 0.95,
  pts = 500)
p <- plotScores(Worms, c_pca, which = 1:2, ellipse = "none", tol = 0.02)
p <- p + geom_path(data = xy_coord, aes(x = x, y = y)) + scale_color_manual(values = "black")
p
We can see many of the same outliers by this approach as we saw in Figure 4 and Figure 5.
Another way to identify outliers is to use the approach described in Varmuza and Filzmoser (2009) section 3.7.3. Figure 7 and Figure 8 give the plots. Please see Filzmoser for the details, but any samples that are above the plotted threshold line are candidate outliers, and any samples above the threshold in both plots should be looked at very carefully. Though we are using classical PCA, Filzmoser recommends using these plots with robust PCA. These plots are a better approach than “eyeballing it” on the score plots.
p <- pcaDiag(Worms, c_pca, plot = "OD")
p
p <- pcaDiag(Worms, c_pca, plot = "SD")
p
Comparison of these plots suggests that samples 37, 101, 107, 114 and 118 are likely outliers. These spectra should be examined to see if the reason for their outlyingness can be deduced. If good reason can be found, they can be removed as follows.^{4}
Worms2 <- removeSample(Worms, rem.sam = c("37_", "101_", "107_", "114_", "118_"))
At this point one should repeat the PCA, score plots and diagnostic plots to get a good look at how removing these samples affected the results. Those tasks are left to the reader.
We will continue in the next post with a discussion of loadings.
This post was created using ChemoSpec version 6.1.3 and ChemoSpecUtils version 1.0.0.
@online{hanson2022,
author = {Hanson, Bryan},
title = {Metabolic {Phenotyping} {Protocol} {Part} 2},
date = {2022-03-24},
url = {http://chemospec.org/posts/20220324ProtocolPt2/20220324ProtocolPt2.html},
langid = {en}
}
Protip: These pages load slowly in some browsers. I had the best luck with Chrome. Try the reader view for a user-friendly version that prints well (if you are into that).
@online{hanson2022,
author = {Hanson, Bryan},
title = {Chemometrics in {Spectroscopy:} {Key} {References}},
date = {2022-02-18},
url = {http://chemospec.org/posts/20220218KeyReferences/20220218KeyReferences.html},
langid = {en}
}
I’ve been developing packages for R for over a decade now. When adding new features to a package, I often import functions from another package, and of course that package goes in the Imports: field of the DESCRIPTION file. Later, I might change my approach entirely and no longer need that package. Do I remember to remove it from DESCRIPTION? Generally not. The same thing happens when writing a new vignette, and it can happen with the Suggests: field as well. It can also happen when one splits a package into several smaller packages. If one forgets to delete a package from the DESCRIPTION file, the dependencies become bloated, because all the imported and suggested packages have to be available to install the package. This adds overhead to the project, and increases the possibility of a namespace conflict.
In fact this just happened to me again! The author of a package I had in Suggests: wrote to me and let me know their package would be archived. It was an easy enough fix for me, as it was a “stale” package in that I was no longer using it. I had added it for a vignette which I later deleted, as I decided a series of blog posts was a better approach.
So I decided to write a little function to check for such stale Suggests: and Imports: entries. This post is about that function. As far as I can tell there is no built-in function for this purpose, and CRAN does not check for stale entries. So it was worth my time to automate the process.^{1}
The first step is to read in the DESCRIPTION file for the package (so we want our working directory to be the top level of the package). There is a built-in function for this. We’ll use the DESCRIPTION file from the ChemoSpec package as a demonstration.
# setwd("...") # set to the top level of the package
desc <- read.dcf("DESCRIPTION", all = TRUE)
The argument all = TRUE is a bit odd in that it has a particular purpose (see ?read.dcf) which isn’t really important here, but it has the side effect of returning a data frame, which makes our job simpler. Let’s look at what is returned.
str(desc)
'data.frame': 1 obs. of 18 variables:
$ Package : chr "ChemoSpec"
$ Type : chr "Package"
$ Title : chr "Exploratory Chemometrics for Spectroscopy"
$ Version : chr "6.1.2"
$ Date : chr "2022-02-08"
$ Authors@R : chr "c(\nperson(\"Bryan A.\", \"Hanson\",\nrole = c(\"aut\", \"cre\"), email =\n\"hanson@depauw.edu\",\ncomment = c(" __truncated__
$ Description : chr "A collection of functions for topdown exploratory data analysis\nof spectral data including nuclear magnetic r" __truncated__
$ License : chr "GPL-3"
$ Depends : chr "R (>= 3.5),\nChemoSpecUtils (>= 1.0)"
$ Imports : chr "plyr,\nstats,\nutils,\ngrDevices,\nreshape2,\nreadJDX (>= 0.6),\npatchwork,\nggplot2,\nplotly,\nmagrittr"
$ Suggests : chr "IDPmisc,\nknitr,\njs,\nNbClust,\nlattice,\nbaseline,\nmclust,\npls,\nclusterCrit,\nR.utils,\nRColorBrewer,\nser" __truncated__
$ URL : chr "https://bryanhanson.github.io/ChemoSpec/"
$ BugReports : chr "https://github.com/bryanhanson/ChemoSpec/issues"
$ ByteCompile : chr "TRUE"
$ VignetteBuilder : chr "knitr"
$ Encoding : chr "UTF-8"
$ RoxygenNote : chr "7.1.2"
$ NeedsCompilation: chr "no"
We are interested in the Imports and Suggests elements. Let’s look more closely.
head(desc$Imports)
[1] "plyr,\nstats,\nutils,\ngrDevices,\nreshape2,\nreadJDX (>= 0.6),\npatchwork,\nggplot2,\nplotly,\nmagrittr"
You can see there are a bunch of newlines in there (\n), along with some version specifications in parentheses. We need to clean this up so we have a simple vector of the package names. For clean up we’ll use the following helper function.
clean_up <- function(string) {
  string <- gsub("\n", "", string) # remove newlines
  string <- gsub("\\(.+\\)", "", string) # remove parens & anything within them
  string <- unlist(strsplit(string, ",")) # split the long string into pieces
  string <- trimws(string) # remove any white space around words
}
After we apply this to the raw results, we have what we are after, a clean list of imported packages.
imp <- clean_up(desc$Imports)
imp
[1] "plyr" "stats" "utils" "grDevices" "reshape2" "readJDX"
[7] "patchwork" "ggplot2" "plotly" "magrittr"
Next, we can search the entire package looking for these package names to see if they are used in the package. They might appear in import statements, vignettes, code and so forth, so it’s not sufficient to just look at code. This is a job for grep, but we’ll call grep from within R so that we don’t have to use the command line and transfer the results to R; that gets messy and is error-prone.
if (length(imp) >= 1) { # Note 1
  imp_res <- rep(FALSE, length(imp)) # Boolean to keep track of whether we found a package or not
  for (i in 1:length(imp)) {
    args <- paste("-r -e '", imp[i], "' *", sep = "") # assemble arguments for grep
    g_imp <- system2("grep", args, stdout = TRUE)
    if (length(g_imp) > 1L) imp_res[i] <- TRUE # Note 2
  }
}
g_imp contains the results of the grep process. If there are imports in the package, each imported package name will be found by grep in the DESCRIPTION file. That’s not so interesting, so we don’t count it. For a package to be stale, it will be found in DESCRIPTION but nowhere else. We can do the same process for the Suggests: field of DESCRIPTION. And then it would be nice to present the results in a more usable form. At this point we can put it all together in an easy-to-use function.^{2}
# run from the package top level
check_stale_imports_suggests <- function() {
  # helper function: removes extra characters
  # from strings read by read.dcf
  clean_up <- function(string) {
    string <- gsub("\n", "", string)
    string <- gsub("\\(.+\\)", "", string)
    string <- unlist(strsplit(string, ","))
    string <- trimws(string)
  }
  desc <- read.dcf("DESCRIPTION", all = TRUE)
  # look for use of imported packages
  imp <- clean_up(desc$Imports)
  if (length(imp) == 0L) message("No Imports: entries found")
  if (length(imp) >= 1) {
    imp_res <- rep(FALSE, length(imp))
    for (i in 1:length(imp)) {
      args <- paste("-r -e '", imp[i], "' *", sep = "")
      g_imp <- system2("grep", args, stdout = TRUE)
      # always found once in DESCRIPTION, hence > 1
      if (length(g_imp) > 1L) imp_res[i] <- TRUE
    }
  }
  # look for use of suggested packages
  sug <- clean_up(desc$Suggests)
  if (length(sug) == 0L) message("No Suggests: entries found")
  if (length(sug) >= 1) {
    sug_res <- rep(FALSE, length(sug))
    for (i in 1:length(sug)) {
      args <- paste("-r -e '", sug[i], "' *", sep = "")
      g_sug <- system2("grep", args, stdout = TRUE)
      # always found once in DESCRIPTION, hence > 1
      if (length(g_sug) > 1L) sug_res[i] <- TRUE
    }
  }
  # arrange output in easy to read format
  role <- c(rep("Imports", length(imp)), rep("Suggests", length(sug)))
  return(data.frame(
    pkg = c(imp, sug),
    role = role,
    found = c(imp_res, sug_res)))
}
Applying this function to my ChemoSpec2D package (as of the date of this post), we see the following output. You can see a bunch of packages are listed but never used, so I have some work to do. This was the result of copying the DESCRIPTION file from ChemoSpec when I started ChemoSpec2D, and obviously I never went back and cleaned things up.
pkg role found
1 plyr Imports TRUE
2 stats Imports TRUE
3 utils Imports TRUE
4 grDevices Imports TRUE
5 reshape2 Imports TRUE
6 readJDX Imports TRUE
7 patchwork Imports TRUE
8 ggplot2 Imports TRUE
9 plotly Imports TRUE
10 magrittr Imports TRUE
11 IDPmisc Suggests TRUE
12 knitr Suggests TRUE
13 js Suggests TRUE
14 NbClust Suggests TRUE
15 lattice Suggests TRUE
16 baseline Suggests TRUE
17 mclust Suggests TRUE
18 pls Suggests TRUE
19 clusterCrit Suggests TRUE
20 R.utils Suggests TRUE
21 RColorBrewer Suggests TRUE
22 seriation Suggests FALSE
23 MASS Suggests FALSE
24 robustbase Suggests FALSE
25 grid Suggests TRUE
26 pcaPP Suggests FALSE
27 jsonlite Suggests FALSE
28 gsubfn Suggests FALSE
29 signal Suggests TRUE
30 speaq Suggests FALSE
31 tinytest Suggests FALSE
32 elasticnet Suggests FALSE
33 irlba Suggests FALSE
34 amap Suggests FALSE
35 rmarkdown Suggests TRUE
36 bookdown Suggests FALSE
37 chemometrics Suggests FALSE
38 hyperSpec Suggests FALSE
@online{hanson2022,
author = {Hanson, Bryan},
title = {Do {You} Have {Stale} {Imports} or {Suggests?}},
date = {2022-02-09},
url = {http://chemospec.org/posts/20220209ImportsSuggests/20220209ImportsSuggests.html},
langid = {en}
}
If you aren’t familiar with ChemoSpec, you might wish to look at the introductory vignette first.
Blaise et al. (2021) have published a detailed protocol for metabolomic phenotyping. They illustrate the protocol using a data set composed of 139 ^{1}H HRMAS SSNMR spectra (Blaise et al. 2007) of the model organism Caenorhabditis elegans. There are two genotypes, wild type and a mutant, and worms from two life stages.
This series of posts follows the published protocol closely in order to illustrate how to implement the protocol using ChemoSpec. As in any chemometric analysis, there are decisions to be made about how to process the data. In these posts we are interested in which functions to use, and how to examine the results. We are not exploring all possible data processing choices, and argument choices are not necessarily optimized.
The data set is large, over 30 Mb, so we will grab it directly from the Github repo where it is stored. We will use a custom function to grab the data (you can see the function in the source for this document if interested). The URLs given below point to the frequency scale, the raw data matrix and the variables that describe the sample classification by genotype and life stage (L2 are gravid adults, L4 are larvae).
urls <- c("https://raw.githubusercontent.com/Gscorreia89/chemometricstutorials/master/data/ppm.csv",
  "https://raw.githubusercontent.com/Gscorreia89/chemometricstutorials/master/data/X_spectra.csv",
  "https://raw.githubusercontent.com/Gscorreia89/chemometricstutorials/master/data/worm_yvars.csv")
raw <- get_csvs_from_github(urls, sep = ",") # a list of data sets
names(raw)
[1] "ppm.csv" "X_spectra.csv" "worm_yvars.csv"
The format of the data as provided in Github is not really suited to using either of the built-in import functions in ChemoSpec. Therefore we will construct the Spectra object by hand, a useful exercise in its own right. The requirements for a Spectra object are described in ?Spectra.
First, we’ll take the results in raw and convert them to the proper form. Each element of raw is a data frame.
# frequencies are in the 1st list element
freq <- unlist(raw[[1]], use.names = FALSE)
# intensities are in the 2nd list element
data <- as.matrix(raw[[2]])
dimnames(data) <- NULL # remove the default data frame col names
ns <- nrow(data) # ns = number of samples, used later
# get genotype & life stage, recode into something more readable
yvars <- raw[[3]]
names(yvars) <- c("genotype", "stage")
yvars$genotype <- ifelse(yvars$genotype == 1L, "WT", "Mut")
yvars$stage <- ifelse(yvars$stage == 1L, "L2", "L4")
table(yvars) # quick look at how many in each group
stage
genotype L2 L4
Mut 32 33
WT 34 40
Next we’ll construct some useful sample names, create the groups vector, assign the colors and symbols, and finally put it all together into a Spectra object.
# build up sample names to include the group membership
sample_names <- as.character(1:ns)
sample_names <- paste(sample_names, yvars$genotype, sep = "_")
sample_names <- paste(sample_names, yvars$stage, sep = "_")
head(sample_names)
[1] "1_WT_L4" "2_Mut_L4" "3_Mut_L4" "4_WT_L4" "5_Mut_L4" "6_WT_L4"
# use the sample names to create the groups vector
grp <- gsub("[0-9]+_", "", sample_names) # remove 1_ etc, leaving WT_L2 etc
groups <- as.factor(grp)
levels(groups)
[1] "Mut_L2" "Mut_L4" "WT_L2" "WT_L4"
# set up the colors based on group membership
data(Col12) # see ?colorSymbol for a swatch
colors <- grp
colors <- ifelse(colors == "WT_L2", Col12[1], colors)
colors <- ifelse(colors == "WT_L4", Col12[2], colors)
colors <- ifelse(colors == "Mut_L2", Col12[3], colors)
colors <- ifelse(colors == "Mut_L4", Col12[4], colors)
# set up the symbols based on group membership
sym <- grp # see ?points for the symbol codes
sym <- ifelse(sym == "WT_L2", 1, sym)
sym <- ifelse(sym == "WT_L4", 16, sym)
sym <- ifelse(sym == "Mut_L2", 0, sym)
sym <- ifelse(sym == "Mut_L4", 15, sym)
sym <- as.integer(sym)
# set up the alt symbols based on group membership
alt.sym <- grp
alt.sym <- ifelse(alt.sym == "WT_L2", "w2", alt.sym)
alt.sym <- ifelse(alt.sym == "WT_L4", "w4", alt.sym)
alt.sym <- ifelse(alt.sym == "Mut_L2", "m2", alt.sym)
alt.sym <- ifelse(alt.sym == "Mut_L4", "m4", alt.sym)
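As an aside, the chained ifelse() calls above can be written more compactly with named lookup vectors. This is just a base-R sketch, not part of the original protocol; the example labels and hex colors are stand-ins (in the post the colors come from Col12):

```r
# sketch: one named lookup vector per aesthetic replaces a chain of ifelse()
grp <- c("WT_L2", "Mut_L4", "WT_L4", "Mut_L2") # example group labels
# stand-in hex values; in the post these would be Col12[1:4]
color_map <- c(WT_L2 = "#511CFC", WT_L4 = "#2E94E9",
               Mut_L2 = "#FB0D16", Mut_L4 = "#FFC0CB")
sym_map <- c(WT_L2 = 1L, WT_L4 = 16L, Mut_L2 = 0L, Mut_L4 = 15L)
colors <- unname(color_map[grp]) # indexing by name does the recoding
sym <- unname(sym_map[grp])
```

Indexing a named vector by a character vector recodes every element in one step, and adding a new group only requires one new entry per map.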
# put it all together; see ?Spectra for requirements
Worms <- list()
Worms$freq <- freq
Worms$data <- data
Worms$names <- sample_names
Worms$groups <- groups
Worms$colors <- colors
Worms$sym <- sym
Worms$alt.sym <- alt.sym
Worms$unit <- c("ppm", "intensity")
Worms$desc <- "C. elegans metabolic phenotyping study (Blaise 2007)"
class(Worms) <- "Spectra"
chkSpectra(Worms) # verify we have everything correct
sumSpectra(Worms)
C. elegans metabolic phenotyping study (Blaise 2007)
There are 139 spectra in this set.
The y-axis unit is intensity.
The frequency scale runs from
8.9995 to 5e-04 ppm
There are 8600 frequency values.
The frequency resolution is
0.001 ppm/point.
This data set is not continuous
along the frequency axis.
Here are the data chunks:
beg.freq end.freq size beg.indx end.indx
1 8.9995 5.0005 3.999 1 4000
2 4.5995 0.0005 4.599 4001 8600
The spectra are divided into 4 groups:
group no. color symbol alt.sym
1 Mut_L2 32 #FB0D16FF 0 m2
2 Mut_L4 33 #FFC0CBFF 15 m4
3 WT_L2 34 #511CFCFF 1 w2
4 WT_L4 40 #2E94E9FF 16 w4
*** Note: this is an S3 object
of class 'Spectra'
Let’s look at one sample from each group to make sure everything looks reasonable (Figure 1). At least these four spectra look good. Note that we are using the latest ChemoSpec, which uses ggplot2 graphics by default (announced here).
p <- plotSpectra(Worms, which = c(35, 1, 34, 2), lab.pos = 7.5, offset = 0.008, amplify = 35,
  yrange = c(-0.05, 1.1))
p
In the next post we’ll continue with some basic exploratory data analysis.
This post was created using ChemoSpec version 6.1.3 and ChemoSpecUtils version 1.0.0.
@online{hanson2022,
  author = {Hanson, Bryan},
  title = {Metabolic {Phenotyping} {Protocol} {Part} 1},
  date = {2022-02-01},
  url = {http://chemospec.org/posts/20220201ProtocolPt1/20220201ProtocolPt1.html},
  langid = {en}
}
Thanks to Mr. Tejasvi Gupta and the support of GSOC, ChemoSpec and ChemoSpec2D were extended to produce ggplot2 graphics and plotly graphics! ggplot2 is now the default output, and the ggplot2 object is returned, so if one doesn’t like the choice of theme or any other aspect, one can customize the object to one’s desire. The ggplot2 graphics output is generally similar in layout and spirit to the base graphics output, but significant improvements have been made in labeling data points using the ggrepel package. The original base graphics are still available as well. Much of this work required changes in ChemoSpecUtils, which supports the common needs of both packages.
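Because the returned object is an ordinary ggplot2 object, restyling it follows standard ggplot2 idioms. A minimal sketch: the stand-in plot below takes the place of plotSpectra() output so the example runs without ChemoSpec, but the same `+` calls apply to what plotSpectra() returns:

```r
library(ggplot2)

# stand-in for the ggplot object that plotSpectra() would return
p <- ggplot(data.frame(x = 1:10, y = sin(1:10)), aes(x, y)) +
  geom_line()

# customize the object after the fact: swap the theme, add a title
p2 <- p + theme_bw() + labs(title = "Customized output")
```

Any theme, scale, or annotation layer can be added the same way, without touching the package code.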
Tejasvi did a really great job with this project, and I think users of these packages will really like the results. We have greatly expanded the pre-release testing of the graphics, and as far as we can see everything works as intended. Of course, please file an issue if you see any problems or unexpected behavior.
To see more about how the new graphics options work, take a look at GraphicsOptions. Here are the functions that were updated:
plotSpectra
surveySpectra
surveySpectra2
reviewAllSpectra (formerly loopThruSpectra)
plotScree (resides in ChemoSpecUtils)
plotScores (resides in ChemoSpecUtils)
plotLoadings (uses patchwork and hence plotly isn’t available)
plot2Loadings
sPlotSpectra
pcaDiag
plotSampleDist
aovPCAscores
aovPCAloadings (uses patchwork and hence plotly isn’t available)
Tejasvi and I are looking forward to your feedback. There are many other smaller changes that we’ll let users discover as they work. And there’s more work to be done, but other projects need attention and I need a little rest!
@online{hanson2021,
  author = {Hanson, Bryan},
  title = {GSOC 2021: {New} {Graphics} for {ChemoSpec}},
  date = {2021-10-13},
  url = {http://chemospec.org/posts/20211013GSOCCSGraphics/20211013GSOCCSGraphics.html},
  langid = {en}
}