In this example we will process the example data fa_nmr
to build a simple PC-LDA classificator. As this dataset includes only three spectra of each category don’t expect wonders, the aim of this vignette is simply to guide you through a workflow.
This package comes with an example dataset, already loaded as a collection. This dataset contains H1-NMR spectra for fatty acids of egg yolks. The spectra were pre-processed using TopSpin: * Phasing * Baseline removal * Spectral alignement * Export as .txt files
Each exported file looked like this :
# File created = <DATE>
# Data set = <NAME> 10 1 <"Path/to/folder">
# Spectral Region:
# LEFT = 14.931500434875488 ppm. RIGHT = -5.284929038283497 ppm.
#
# SIZE = 262144 ( = number of points)
#
# In the following ordering is from the 'left' to the 'right' limits!
# Lines beginning with '#' must be considered as comment lines.
#
83549.0
83274.0
83057.0
...
Here is how this data was loaded as a collection:
require(purrr)
# First we get a list of the files in the spectra folder
files <- file.path(INPUT_FOLDER, list.files(INPUT_FOLDER))
# Then we create a simple parser to read the data form the files
parse_file <- function(path){
meta_raw <- readLines(path, n = 10, ok = FALSE)
dataset <- unlist(strsplit(meta_raw[2], " "))[5]
left <- as.numeric(unlist(strsplit(meta_raw[4], " "))[4])
right <- as.numeric(unlist(strsplit(meta_raw[4], " "))[8])
vals <- read_tsv(path, col_names = c("values"), comment = "#", col_types= "d")
return(list(id = dataset, left = left, right = right, values = vals$values))
}
# Finally we load each spectra and add it to a collection
fa_nmr <- collection()
walk(files, ~ {parsed <- parse_file(.x)
fa_nmr <<- fa_nmr %>% add_spectrum(parsed$values,
parsed$left,
parsed$right,
parsed$id)})
# We can also load labels for a file to add the to the collection
labs <- read_tsv(LABELS)
fa_nmr <- add_labels(fa_nmr, labs, ids_from = "sample", labels_from = "label")
Let’s set up things here. We will need the tidyverse
and tidymodels
packages for this example
library(tidyverse)
#> ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
#> ✔ ggplot2 3.3.3 ✔ purrr 0.3.4
#> ✔ tibble 3.1.0 ✔ dplyr 1.0.5
#> ✔ tidyr 1.1.3 ✔ stringr 1.4.0
#> ✔ readr 1.4.0 ✔ forcats 0.5.1
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
library(tidymodels)
#> ── Attaching packages ────────────────────────────────────── tidymodels 0.1.2 ──
#> ✔ broom 0.7.6 ✔ recipes 0.1.15
#> ✔ dials 0.0.9 ✔ rsample 0.0.9
#> ✔ infer 0.5.4 ✔ tune 0.1.3
#> ✔ modeldata 0.1.0 ✔ workflows 0.2.2
#> ✔ parsnip 0.1.5 ✔ yardstick 0.0.8
#> ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
#> ✖ scales::discard() masks purrr::discard()
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ recipes::fixed() masks stringr::fixed()
#> ✖ dplyr::lag() masks stats::lag()
#> ✖ yardstick::spec() masks readr::spec()
#> ✖ recipes::step() masks stats::step()
library(discrim)
#>
#> Attaching package: 'discrim'
#> The following object is masked from 'package:dials':
#>
#> smoothness
library(tidySpectR)
#>
#> Attaching package: 'tidySpectR'
#> The following object is masked from 'package:tidyr':
#>
#> extract
data(fa_nmr)
The fa_nmr
toy dataset is provided ywith the package. Let’s have a look:
fa_nmr
#> Spectra collection containing 6 entries.
#> Number of bins: 2250
#> Limits: -2.123166e-05 8.999979
#> Labels: conventional organic
#>
#> Processing:
autoplot(fa_nmr, offset_y = 1e+09, offset_x = 1) +
theme(legend.position = "none") +
scale_x_reverse()
First remove the water and solvent (MeOD in this case) peaks and trim the edges of the spectra to the interessant region.
spectra <- fa_nmr
tidySpectR::extract(spectra, 4.6, 4.66) %>%
autoplot(type="average") +
ggtitle("Water peak")
This looks good so we can proceed to masking the peaks, these will be effectivally removed from the dataset (not set to zero).
We can now proceed to normalizing the spectra using normalization to a given peak. You could also normalize with external values (Osmolarity, concentrations, …), check the normalize_factor
method for that.
First we will check the normalization peak to be sure that we include all of it.
spectra %>%
tidySpectR::extract(3.57, 3.65) %>%
autoplot(type="average") +
ggtitle("Normalization peak (raw)")
Looking good, let’s proceed and check the result.
spectra <- normalize_internalStandard(spectra, 3.57, 3.65)
spectra %>%
tidySpectR::extract(3.57, 3.65) %>%
autoplot(type="average") +
ggtitle("Normalization peak (after normalization)")
We can now bucket (or “bin”) the spectra. In this example we will use a very simple uniform binning but more sophisticated methods are implemented.
spectra <- bucket_uniform(spectra, width = 0.04)
spectra %>%
autoplot(type = "label_average", offset_x = 0.5, offset_y = 5 ) +
scale_x_reverse() +
ggtitle("Processed spectra")
And here are the processed spectra!
Now that we have processed our spectra we can go on to the modelling part. For this we will rely on the tidymodels workflow. First we will scale the spectra using Pareto scaling and perform PCA on the predictors.
Note that this package implements only a couple of chemotrics-oriented steps to use with recipes. Check out the tidymodels website to learn more about recipes.
# Export the spectra in a tidy format
df <- tidy(spectra)
print(df)
#> # A tibble: 6 x 168
#> id label `0.51997876833874… `0.55997876833874… `0.59997876833874…
#> <chr> <fct> <dbl> <dbl> <dbl>
#> 1 201993059… conventio… 0.607 0.632 4.56
#> 2 201993062… conventio… 0.554 0.530 4.03
#> 3 201995046… conventio… 0.679 0.661 5.23
#> 4 201981241… organic 0.644 0.627 4.42
#> 5 201981241… organic 0.715 0.704 5.14
#> 6 201981241… organic 0.560 0.518 3.71
#> # … with 163 more variables: 0.639978768338749 <dbl>, 0.679978768338749 <dbl>,
#> # 0.719978768338749 <dbl>, 0.759978768338749 <dbl>, 0.799978768338749 <dbl>,
#> # 0.839978768338749 <dbl>, 0.879978768338749 <dbl>, 0.919978768338749 <dbl>,
#> # 0.959978768338749 <dbl>, 0.999978768338749 <dbl>, 1.03997876833875 <dbl>,
#> # 1.07997876833875 <dbl>, 1.11997876833875 <dbl>, 1.15997876833875 <dbl>,
#> # 1.19997876833875 <dbl>, 1.23997876833875 <dbl>, 1.27997876833875 <dbl>,
#> # 1.31997876833875 <dbl>, 1.35997876833875 <dbl>, 1.39997876833875 <dbl>,
#> # 1.43997876833875 <dbl>, 1.47997876833875 <dbl>, 1.51997876833875 <dbl>,
#> # 1.55997876833875 <dbl>, 1.59997876833875 <dbl>, 1.63997876833875 <dbl>,
#> # 1.67997876833875 <dbl>, 1.71997876833875 <dbl>, 1.75997876833875 <dbl>,
#> # 1.79997876833875 <dbl>, 1.83997876833875 <dbl>, 1.87997876833875 <dbl>,
#> # 1.91997876833875 <dbl>, 1.95997876833875 <dbl>, 1.99997876833875 <dbl>,
#> # 2.03997876833875 <dbl>, 2.07997876833875 <dbl>, 2.11997876833875 <dbl>,
#> # 2.15997876833875 <dbl>, 2.19997876833875 <dbl>, 2.23997876833875 <dbl>,
#> # 2.27997876833875 <dbl>, 2.31997876833875 <dbl>, 2.35997876833875 <dbl>,
#> # 2.39997876833875 <dbl>, 2.43997876833875 <dbl>, 2.47997876833875 <dbl>,
#> # 2.51997876833875 <dbl>, 2.55997876833875 <dbl>, 2.59997876833875 <dbl>,
#> # 2.63997876833875 <dbl>, 2.67997876833875 <dbl>, 2.71997876833875 <dbl>,
#> # 2.75997876833875 <dbl>, 2.79997876833875 <dbl>, 2.83997876833875 <dbl>,
#> # 2.87997876833875 <dbl>, 2.91997876833875 <dbl>, 2.95997876833875 <dbl>,
#> # 2.99997876833875 <dbl>, 3.03997876833875 <dbl>, 3.07997876833875 <dbl>,
#> # 3.11997876833875 <dbl>, 3.15997876833875 <dbl>, 3.19997876833875 <dbl>,
#> # 3.23997876833875 <dbl>, 3.27997876833875 <dbl>, 3.31997876833875 <dbl>,
#> # 3.39997876833875 <dbl>, 3.43997876833875 <dbl>, 3.47997876833875 <dbl>,
#> # 3.51997876833875 <dbl>, 3.55997876833875 <dbl>, 3.59997876833875 <dbl>,
#> # 3.63997876833875 <dbl>, 3.67997876833875 <dbl>, 3.71997876833875 <dbl>,
#> # 3.75997876833875 <dbl>, 3.79997876833875 <dbl>, 3.83997876833875 <dbl>,
#> # 3.87997876833875 <dbl>, 3.91997876833875 <dbl>, 3.95997876833875 <dbl>,
#> # 3.99997876833875 <dbl>, 4.03997876833875 <dbl>, 4.07997876833875 <dbl>,
#> # 4.11997876833875 <dbl>, 4.15997876833875 <dbl>, 4.19997876833875 <dbl>,
#> # 4.23997876833875 <dbl>, 4.27997876833875 <dbl>, 4.31997876833875 <dbl>,
#> # 4.35997876833875 <dbl>, 4.39997876833875 <dbl>, 4.43997876833875 <dbl>,
#> # 4.47997876833875 <dbl>, 4.51997876833875 <dbl>, 4.55997876833875 <dbl>,
#> # 4.59997876833875 <dbl>, 4.67997876833875 <dbl>, …
rec <- recipe(df, label ~.) %>%
update_role(id, new_role = "ID") %>%
step_pareto(all_predictors()) %>%
step_nzv(all_predictors()) %>%
step_pca(all_predictors(), threshold = 0.95)
mod <- discrim_linear(mode = "classification") %>%
set_engine("mda")
wflow <- workflow() %>%
add_recipe(rec) %>%
add_model(mod)
fitted <- fit(wflow, df)
print(fitted)
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: discrim_linear()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 3 Recipe Steps
#>
#> ● step_pareto()
#> ● step_nzv()
#> ● step_pca()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#> Call:
#> mda::fda(formula = ..y ~ ., data = data, method = mda::gen.ridge,
#> keep.fitted = FALSE)
#>
#> Dimension: 1
#>
#> Percent Between-Group Variance Explained:
#> v1
#> 100
#>
#> Degrees of Freedom (per dimension): 1.92434
#>
#> Training Misclassification Error: 0.33333 ( N = 6 )
It looks like we managed to missclassify one sample, this would probably improve by increasing the sample number and optimizing the preprocessing.
When you go to production, you’ll want to use the same pre-processing steps on new samples. While this is straigthforward for all the read masking operations it can become difficult when using more sophisticated bucketting or normalization algorithms.
This package provides a processing_template
class to deal with reproducible processing. Each collection
object contains a processing_template
object that automatically saves processing steps and their arguments.
# See what happened to spectra already
print(spectra)
#> Spectra collection containing 6 entries.
#> Number of bins: 166
#> Limits: 0.4999788 7.203979
#> Labels: conventional organic
#>
#> Processing:
#> Step 1 / 5 : extract
#> Step 2 / 5 : mask
#> Step 3 / 5 : mask
#> Step 4 / 5 : internalStandard_normalization
#> Step 5 / 5 : uniform_binning
# Exporting the processing template from a processed collection
processor <- export_processor(spectra)
print(processor)
#> Step 1 / 5 : extract
#> Step 2 / 5 : mask
#> Step 3 / 5 : mask
#> Step 4 / 5 : internalStandard_normalization
#> Step 5 / 5 : uniform_binning
processing_template
objects also have a tidy method with relevant information and allow access to each single step:
tidy(processor)
#> # A tibble: 5 x 4
#> num name method id
#> <int> <chr> <chr> <chr>
#> 1 1 extract extract extract_458834
#> 2 2 mask mask mask_155001
#> 3 3 mask mask mask_936382
#> 4 4 internalStandard_normali… normalize_fact… internalStandard_normalizatio…
#> 5 5 uniform_binning bucket_from_br… uniform_binning_352099
processor$steps[[1]] %>% tidy()
#> # A tibble: 1 x 6
#> name method from to overlaps id
#> <chr> <chr> <dbl> <dbl> <chr> <chr>
#> 1 extract extract 0.5 7.2 keep extract_458834
processor$steps[[4]] %>% tidy()
#> # A tibble: 1 x 4
#> name method factors id
#> <chr> <chr> <list> <chr>
#> 1 internalStandard_norm… normalize_fa… <tibble[,2] [6… internalStandard_normali…
# One can access function arguments:
processor$steps[[4]] %>% tidy() %>% pull(factors)
#> [[1]]
#> # A tibble: 6 x 2
#> id factors
#> <fct> <dbl>
#> 1 20199305928 1.06e- 9
#> 2 20199306281 9.48e-10
#> 3 20199504645 1.41e- 9
#> 4 20198124123 1.10e- 9
#> 5 20198124124 1.35e- 9
#> 6 20198124125 9.62e-10
We can now apply the processing to a new collection
object (or reuse the fa_nmr dataset) and compare to spectra
:
new_data <- process(processor, fa_nmr)
print(new_data)
#> Spectra collection containing 6 entries.
#> Number of bins: 166
#> Limits: 0.4999788 7.203979
#> Labels: conventional organic
#>
#> Processing:
#> Step 1 / 5 : extract
#> Step 2 / 5 : mask
#> Step 3 / 5 : mask
#> Step 4 / 5 : factor_normalization
#> Step 5 / 5 : cutsom_bucketting
And done! Let’s check that we got the same thing out of both processes:
autoplot(spectra, type = "average")+
scale_x_reverse()+
ggtitle("Original data")
autoplot(new_data, type = "average")+
scale_x_reverse() +
ggtitle("New data")