Introduction

In this example we will process the example data fa_nmr to build a simple PC-LDA classificator. As this dataset includes only three spectra of each category don’t expect wonders, the aim of this vignette is simply to guide you through a workflow.

Pre processing NMR spectra

Creating a collection from files

This package comes with an example dataset, already loaded as a collection. This dataset contains H1-NMR spectra for fatty acids of egg yolks. The spectra were pre-processed using TopSpin: * Phasing * Baseline removal * Spectral alignement * Export as .txt files

Each exported file looked like this :

# File created = <DATE>
# Data set = <NAME>  10  1  <"Path/to/folder">
# Spectral Region:
# LEFT = 14.931500434875488 ppm. RIGHT = -5.284929038283497 ppm.
#
# SIZE = 262144 ( = number of points)
#
# In the following ordering is from the 'left' to the 'right' limits!
# Lines beginning with '#' must be considered as comment lines.
#
83549.0
83274.0
83057.0
...

Here is how this data was loaded as a collection:

require(purrr)

# First we get a list of the files in the spectra folder
files <- file.path(INPUT_FOLDER, list.files(INPUT_FOLDER))

# Then we create a simple parser to read the data form the files
parse_file <- function(path){
    meta_raw <- readLines(path, n = 10, ok = FALSE)
    dataset <- unlist(strsplit(meta_raw[2], " "))[5]
    left <- as.numeric(unlist(strsplit(meta_raw[4], " "))[4])
    right <- as.numeric(unlist(strsplit(meta_raw[4], " "))[8])
    vals <- read_tsv(path, col_names = c("values"), comment = "#", col_types= "d")
    return(list(id = dataset, left = left, right = right, values = vals$values))
}

# Finally we load each spectra and add it to a collection
fa_nmr <- collection()
    walk(files, ~ {parsed <- parse_file(.x)
                    fa_nmr <<- fa_nmr %>% add_spectrum(parsed$values,
                                                   parsed$left,
                                                   parsed$right,
                                                   parsed$id)})

# We can also load labels for a file to add the to the collection
labs <- read_tsv(LABELS)
fa_nmr <- add_labels(fa_nmr, labs, ids_from = "sample", labels_from = "label")

Preparing the spectra

Let’s set up things here. We will need the tidyverse and tidymodels packages for this example

library(tidyverse)
#> ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
#> ✔ ggplot2 3.3.3     ✔ purrr   0.3.4
#> ✔ tibble  3.1.0     ✔ dplyr   1.0.5
#> ✔ tidyr   1.1.3     ✔ stringr 1.4.0
#> ✔ readr   1.4.0     ✔ forcats 0.5.1
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
library(tidymodels)
#> ── Attaching packages ────────────────────────────────────── tidymodels 0.1.2 ──
#> ✔ broom     0.7.6      ✔ recipes   0.1.15
#> ✔ dials     0.0.9      ✔ rsample   0.0.9 
#> ✔ infer     0.5.4      ✔ tune      0.1.3 
#> ✔ modeldata 0.1.0      ✔ workflows 0.2.2 
#> ✔ parsnip   0.1.5      ✔ yardstick 0.0.8
#> ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
#> ✖ scales::discard() masks purrr::discard()
#> ✖ dplyr::filter()   masks stats::filter()
#> ✖ recipes::fixed()  masks stringr::fixed()
#> ✖ dplyr::lag()      masks stats::lag()
#> ✖ yardstick::spec() masks readr::spec()
#> ✖ recipes::step()   masks stats::step()
library(discrim)
#> 
#> Attaching package: 'discrim'
#> The following object is masked from 'package:dials':
#> 
#>     smoothness
library(tidySpectR)
#> 
#> Attaching package: 'tidySpectR'
#> The following object is masked from 'package:tidyr':
#> 
#>     extract

data(fa_nmr)

The fa_nmr toy dataset is provided ywith the package. Let’s have a look:


fa_nmr
#> Spectra collection containing 6 entries.
#> Number of bins: 2250 
#> Limits: -2.123166e-05 8.999979 
#> Labels: conventional organic 
#> 
#> Processing:

autoplot(fa_nmr, offset_y = 1e+09, offset_x = 1) +
    theme(legend.position = "none") +
    scale_x_reverse()

First remove the water and solvent (MeOD in this case) peaks and trim the edges of the spectra to the interessant region.

spectra <- fa_nmr

tidySpectR::extract(spectra, 4.6, 4.66) %>% 
    autoplot(type="average") +
    ggtitle("Water peak")


tidySpectR::extract(spectra, 3.33, 3.39) %>% 
    autoplot(type="average") +
    ggtitle("MeOD peak")

This looks good so we can proceed to masking the peaks, these will be effectivally removed from the dataset (not set to zero).

spectra <- spectra %>% 
    extract( 0.5, 7.2) %>%
    mask(4.6, 4.66) %>% 
    mask(3.33, 3.39)

Normalization

We can now proceed to normalizing the spectra using normalization to a given peak. You could also normalize with external values (Osmolarity, concentrations, …), check the normalize_factor method for that.

First we will check the normalization peak to be sure that we include all of it.

spectra %>%
    tidySpectR::extract(3.57, 3.65) %>%
    autoplot(type="average") +
    ggtitle("Normalization peak (raw)")

Looking good, let’s proceed and check the result.

spectra <- normalize_internalStandard(spectra, 3.57, 3.65)

spectra %>%
    tidySpectR::extract(3.57, 3.65) %>%
    autoplot(type="average") +
    ggtitle("Normalization peak (after normalization)")

Bucketting

We can now bucket (or “bin”) the spectra. In this example we will use a very simple uniform binning but more sophisticated methods are implemented.

spectra <- bucket_uniform(spectra, width = 0.04)

spectra %>% 
    autoplot(type = "label_average", offset_x = 0.5, offset_y = 5 ) +
    scale_x_reverse() +
    ggtitle("Processed spectra")

And here are the processed spectra!

Multivariate data analysis

Now that we have processed our spectra we can go on to the modelling part. For this we will rely on the tidymodels workflow. First we will scale the spectra using Pareto scaling and perform PCA on the predictors.

Note that this package implements only a couple of chemotrics-oriented steps to use with recipes. Check out the tidymodels website to learn more about recipes.

# Export the spectra in a tidy format 
df <- tidy(spectra)
print(df)
#> # A tibble: 6 x 168
#>   id         label      `0.51997876833874… `0.55997876833874… `0.59997876833874…
#>   <chr>      <fct>                   <dbl>              <dbl>              <dbl>
#> 1 201993059… conventio…              0.607              0.632               4.56
#> 2 201993062… conventio…              0.554              0.530               4.03
#> 3 201995046… conventio…              0.679              0.661               5.23
#> 4 201981241… organic                 0.644              0.627               4.42
#> 5 201981241… organic                 0.715              0.704               5.14
#> 6 201981241… organic                 0.560              0.518               3.71
#> # … with 163 more variables: 0.639978768338749 <dbl>, 0.679978768338749 <dbl>,
#> #   0.719978768338749 <dbl>, 0.759978768338749 <dbl>, 0.799978768338749 <dbl>,
#> #   0.839978768338749 <dbl>, 0.879978768338749 <dbl>, 0.919978768338749 <dbl>,
#> #   0.959978768338749 <dbl>, 0.999978768338749 <dbl>, 1.03997876833875 <dbl>,
#> #   1.07997876833875 <dbl>, 1.11997876833875 <dbl>, 1.15997876833875 <dbl>,
#> #   1.19997876833875 <dbl>, 1.23997876833875 <dbl>, 1.27997876833875 <dbl>,
#> #   1.31997876833875 <dbl>, 1.35997876833875 <dbl>, 1.39997876833875 <dbl>,
#> #   1.43997876833875 <dbl>, 1.47997876833875 <dbl>, 1.51997876833875 <dbl>,
#> #   1.55997876833875 <dbl>, 1.59997876833875 <dbl>, 1.63997876833875 <dbl>,
#> #   1.67997876833875 <dbl>, 1.71997876833875 <dbl>, 1.75997876833875 <dbl>,
#> #   1.79997876833875 <dbl>, 1.83997876833875 <dbl>, 1.87997876833875 <dbl>,
#> #   1.91997876833875 <dbl>, 1.95997876833875 <dbl>, 1.99997876833875 <dbl>,
#> #   2.03997876833875 <dbl>, 2.07997876833875 <dbl>, 2.11997876833875 <dbl>,
#> #   2.15997876833875 <dbl>, 2.19997876833875 <dbl>, 2.23997876833875 <dbl>,
#> #   2.27997876833875 <dbl>, 2.31997876833875 <dbl>, 2.35997876833875 <dbl>,
#> #   2.39997876833875 <dbl>, 2.43997876833875 <dbl>, 2.47997876833875 <dbl>,
#> #   2.51997876833875 <dbl>, 2.55997876833875 <dbl>, 2.59997876833875 <dbl>,
#> #   2.63997876833875 <dbl>, 2.67997876833875 <dbl>, 2.71997876833875 <dbl>,
#> #   2.75997876833875 <dbl>, 2.79997876833875 <dbl>, 2.83997876833875 <dbl>,
#> #   2.87997876833875 <dbl>, 2.91997876833875 <dbl>, 2.95997876833875 <dbl>,
#> #   2.99997876833875 <dbl>, 3.03997876833875 <dbl>, 3.07997876833875 <dbl>,
#> #   3.11997876833875 <dbl>, 3.15997876833875 <dbl>, 3.19997876833875 <dbl>,
#> #   3.23997876833875 <dbl>, 3.27997876833875 <dbl>, 3.31997876833875 <dbl>,
#> #   3.39997876833875 <dbl>, 3.43997876833875 <dbl>, 3.47997876833875 <dbl>,
#> #   3.51997876833875 <dbl>, 3.55997876833875 <dbl>, 3.59997876833875 <dbl>,
#> #   3.63997876833875 <dbl>, 3.67997876833875 <dbl>, 3.71997876833875 <dbl>,
#> #   3.75997876833875 <dbl>, 3.79997876833875 <dbl>, 3.83997876833875 <dbl>,
#> #   3.87997876833875 <dbl>, 3.91997876833875 <dbl>, 3.95997876833875 <dbl>,
#> #   3.99997876833875 <dbl>, 4.03997876833875 <dbl>, 4.07997876833875 <dbl>,
#> #   4.11997876833875 <dbl>, 4.15997876833875 <dbl>, 4.19997876833875 <dbl>,
#> #   4.23997876833875 <dbl>, 4.27997876833875 <dbl>, 4.31997876833875 <dbl>,
#> #   4.35997876833875 <dbl>, 4.39997876833875 <dbl>, 4.43997876833875 <dbl>,
#> #   4.47997876833875 <dbl>, 4.51997876833875 <dbl>, 4.55997876833875 <dbl>,
#> #   4.59997876833875 <dbl>, 4.67997876833875 <dbl>, …

rec <- recipe(df, label ~.) %>%
        update_role(id, new_role = "ID") %>% 
        step_pareto(all_predictors()) %>%
        step_nzv(all_predictors()) %>%
        step_pca(all_predictors(), threshold = 0.95)

mod <- discrim_linear(mode = "classification") %>%
        set_engine("mda")

wflow <- workflow() %>%
        add_recipe(rec) %>%
        add_model(mod)
        
fitted <- fit(wflow, df)

print(fitted)
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: discrim_linear()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 3 Recipe Steps
#> 
#> ● step_pareto()
#> ● step_nzv()
#> ● step_pca()
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────
#> Call:
#> mda::fda(formula = ..y ~ ., data = data, method = mda::gen.ridge, 
#>     keep.fitted = FALSE)
#> 
#> Dimension: 1 
#> 
#> Percent Between-Group Variance Explained:
#>  v1 
#> 100 
#> 
#> Degrees of Freedom (per dimension): 1.92434 
#> 
#> Training Misclassification Error: 0.33333 ( N = 6 )

It looks like we managed to missclassify one sample, this would probably improve by increasing the sample number and optimizing the preprocessing.

Process new samples

When you go to production, you’ll want to use the same pre-processing steps on new samples. While this is straigthforward for all the read masking operations it can become difficult when using more sophisticated bucketting or normalization algorithms.

This package provides a processing_template class to deal with reproducible processing. Each collection object contains a processing_template object that automatically saves processing steps and their arguments.

# See what happened to spectra already
print(spectra)
#> Spectra collection containing 6 entries.
#> Number of bins: 166 
#> Limits: 0.4999788 7.203979 
#> Labels: conventional organic 
#> 
#> Processing:
#> Step 1 / 5 : extract 
#> Step 2 / 5 : mask 
#> Step 3 / 5 : mask 
#> Step 4 / 5 : internalStandard_normalization 
#> Step 5 / 5 : uniform_binning

# Exporting the processing template from a processed collection 
processor <- export_processor(spectra)

print(processor)
#> Step 1 / 5 : extract 
#> Step 2 / 5 : mask 
#> Step 3 / 5 : mask 
#> Step 4 / 5 : internalStandard_normalization 
#> Step 5 / 5 : uniform_binning

processing_template objects also have a tidy method with relevant information and allow access to each single step:

tidy(processor)
#> # A tibble: 5 x 4
#>     num name                      method          id                            
#>   <int> <chr>                     <chr>           <chr>                         
#> 1     1 extract                   extract         extract_458834                
#> 2     2 mask                      mask            mask_155001                   
#> 3     3 mask                      mask            mask_936382                   
#> 4     4 internalStandard_normali… normalize_fact… internalStandard_normalizatio…
#> 5     5 uniform_binning           bucket_from_br… uniform_binning_352099

processor$steps[[1]] %>% tidy()
#> # A tibble: 1 x 6
#>   name    method   from    to overlaps id            
#>   <chr>   <chr>   <dbl> <dbl> <chr>    <chr>         
#> 1 extract extract   0.5   7.2 keep     extract_458834

processor$steps[[4]] %>% tidy()
#> # A tibble: 1 x 4
#>   name                   method        factors         id                       
#>   <chr>                  <chr>         <list>          <chr>                    
#> 1 internalStandard_norm… normalize_fa… <tibble[,2] [6… internalStandard_normali…

# One can access function arguments:
processor$steps[[4]] %>% tidy() %>% pull(factors)
#> [[1]]
#> # A tibble: 6 x 2
#>   id           factors
#>   <fct>          <dbl>
#> 1 20199305928 1.06e- 9
#> 2 20199306281 9.48e-10
#> 3 20199504645 1.41e- 9
#> 4 20198124123 1.10e- 9
#> 5 20198124124 1.35e- 9
#> 6 20198124125 9.62e-10

We can now apply the processing to a new collection object (or reuse the fa_nmr dataset) and compare to spectra:

new_data <- process(processor, fa_nmr)
print(new_data)
#> Spectra collection containing 6 entries.
#> Number of bins: 166 
#> Limits: 0.4999788 7.203979 
#> Labels: conventional organic 
#> 
#> Processing:
#> Step 1 / 5 : extract 
#> Step 2 / 5 : mask 
#> Step 3 / 5 : mask 
#> Step 4 / 5 : factor_normalization 
#> Step 5 / 5 : cutsom_bucketting

And done! Let’s check that we got the same thing out of both processes:

autoplot(spectra, type = "average")+ 
    scale_x_reverse()+
    ggtitle("Original data")


autoplot(new_data, type = "average")+ 
    scale_x_reverse() +
    ggtitle("New data")