Multi-component background learning automates signal detection for spectroscopic data

ABSTRACT Automated experimentation has yielded data acquisition rates that supersede human processing capabilities. Artificial Intelligence offers new possibilities for automating data

interpretation to generate large, high-quality datasets. Background subtraction is a long-standing challenge, particularly in settings where multiple sources of the background signal

coexist, and automatic extraction of signals of interest from measured signals accelerates data interpretation. Herein, we present an unsupervised probabilistic learning approach that

analyzes large data collections to identify multiple background sources and establish the probability that any given data point contains a signal of interest. The approach is demonstrated on

X-ray diffraction and Raman spectroscopy data and is suitable for any type of data where the signal of interest is a positive addition to the background signals. While the model can

incorporate prior knowledge, it does not require knowledge of the signals since the shapes of the background signals, the noise levels, and the signal of interest are simultaneously learned

via a probabilistic matrix factorization framework. Automated identification of interpretable signals by unsupervised probabilistic learning avoids the injection of human bias and expedites

signal extraction in large datasets, a transformative capability with many applications in the physical sciences and beyond.

INTRODUCTION Data analysis and interpretation are

pervasive in physical sciences research and typically involve information extraction from noisy and background-containing signals.1,2,3 Examples from materials science include the

identification of crystal structures from X-ray diffraction patterns4 and chemical species from X-ray photoelectron spectra.5 Distinguishing the signal of interest from background signals

comprises a major hurdle, and any errors in making these distinctions can alter data interpretation.6,7 The identification of the signal of interest often requires expert knowledge8,9 and/or

application of empirical algorithms, motivating the establishment of a more principled approach. An example of principled background removal in physical sciences concerns the Bremsstrahlung

radiation observed in energy-dispersive X-ray spectroscopy (EDS),10,11 which provides an ideal situation for background identification because there is a single primary background source

whose shape can be derived from fundamental physics.10,11,12,13 On the other hand, measurements such as X-ray diffraction (XRD) typically involve a variety of background sources. The

background sources of measured X-ray intensities can include scattering by air, inelastic scattering by the sample, and scattering by the substrate or sample support, which appear in the

detector signal in combination with the desired elastic scattering from the sample of interest. Furthermore, a given background signal may be attenuated differently over a set of

measurements, but it always provides a non-negative contribution to the measured signal. Since the levels of these different background signals can vary independently, it is not possible to

identify a single characteristic background pattern, motivating the establishment of a multi-component model. Raman spectroscopy similarly involves a variety of background sources. Herein,

XRD and Raman data are used as specific examples in which the measured signal is the combination of positive intensities including the signal of interest and any number of background

signals. Empirical background subtraction models6,7,14,15 typically require manual fine tuning of parameters. For example, the XRD background subtraction algorithm from Sonneveld and Visser6

requires parameters for the smoothness of the data and the magnitude of the intensity gradients for peaks of interest. Though the algorithm can be implemented effectively, as reflected by

its incorporation into several commercial software packages for XRD analysis, users still need to fine-tune the parameters to avoid distortion of the peaks of interest and overestimation of

the background signal. Further, as is shown in the current work, there are complex background signals which defy approaches based on fitting a background model to a single spectrogram at a

time. More recently, background identification through analysis of a collection of measurements has been performed using methods such as principal component analysis (PCA)16 or polynomial

fitting,15 which still require expert knowledge in discriminating background from signal and do not guarantee non-negativity of the extracted signal. We introduce Multi-Component Background

Learning (MCBL), a fundamentally new approach to background subtraction and signal identification. First, MCBL leverages the power of big data by inferring background and signals of interest from

an entire dataset of spectrograms. Second, MCBL’s inference task is enabled by a novel probabilistic generative model of the spectroscopic data where the background components, the noise

variance, and the level of spectroscopic activity are all concomitantly learned from the data. The comprehensiveness of the learning model is key for achieving autonomous interpretation of

spectroscopic data, a goal of increasing practical importance for emerging technologies such as materials acceleration platforms.3 Third, MCBL provides the probability that any given data

point contains a (non-background) signal of interest. This probability is automatically inferred by the algorithm based on its unified probabilistic framework, and does not rely on human

parameter estimates. Furthermore, the MCBL model is flexible enough to incorporate prior knowledge of different types of background sources. For example, a common assumption is the

smoothness of the background signals, which the algorithm can incorporate by enforcing a user-defined smoothness constraint. Note, however, that the algorithm is less sensitive to these types

of human inputs than other algorithms, especially when the algorithm is given a large number of spectrograms. Providing prior knowledge is especially important in challenging cases where

there are many complex background signals and data are scarce. Last, the MCBL algorithm requires a noise model. We describe its principled design for XRD and Raman data, as well as the

physical meaning of each parameter, in the Methods section. In addition, the noise model’s parameters are not required to be chosen manually but can be learned from the data. MCBL is

demonstrated using large datasets from two common techniques in materials characterization: XRD and Raman spectroscopy. In both cases, the data were acquired using composition libraries that

were synthesized to measure and identify composition-structure-property relationships,17 a central tenet of combinatorial materials science.18 Automated inference of the crystal structures

from XRD or Raman characterization of the composition library, i.e. “Phase Mapping”, is a long-standing bottleneck in materials discovery.8 Phase Mapping algorithms have been plagued by both

insufficient background removal and incorrect labeling of signals of interest as background or noise. Unsupervised, principled background removal circumvents these issues to increase both

the speed and the quality of data interpretation. RESULTS X-RAY DIFFRACTION To demonstrate the performance of MCBL and illustrate some of the more subtle aspects of the model and its

deployment, we apply it to a particularly challenging XRD example in which there are multiple background sources, including a background source whose intensity is substantially higher than

the signal of interest. In this case, the strong background signal is from diffraction of the SnO2 in the substrate, introducing unwanted peaks into the dataset that are quite similar in

shape to those in the desired signal from the thin film sample. Furthermore, over a series of 186 reflection-geometry measurements on different thin film compositions, the variable density

and thickness of the thin film of interest alters the shape and intensity of the substrate signal. Provided that the set of 186 samples contains more variability in the signal of interest

than the background signal, which it does due to the variety of crystal structures in the 186 unique compositions, MCBL identifies the unique combination of background signals for each of

the 186 measured diffraction patterns. Note that we have prior knowledge that there are two distinct types of background sources: diffraction signals from the crystalline substrate and

smoothly varying signals from other sources including inelastic scattering and air scattering. We inject this knowledge into the model by allowing one type of background component to have

intensity only in the vicinity of known substrate diffraction peaks (scattering vector magnitudes 18.5–19.2, 23.6–24.1, and 26.2–26.8 nm−1), while the other type of background component is

enforced to be smoothly varying. As shown in Fig. 1, the MCBL model identifies the background signal, enabling retention of the desired signal even when the Bragg peak from the sample

strongly overlaps that of the substrate. The recovery of the desired signal from the shoulder of the much more intense background signal, as exemplified by the peak near 23 nm−1 in Fig. 1b,

is uniquely enabled by the model’s ability to learn the background signal from the collection of measurements. It is also worth noting that the background models in these 2 examples are

different in slight but important ways because the total background signal is unique to each measurement, which is illustrated further in the Raman example below. RAMAN SPECTROSCOPY

Continued demonstration of MCBL proceeds with a Raman spectroscopy dataset where 2121 metal oxide samples spanning 15 pseudo-quaternary metal oxide composition spaces (5 elements including

oxygen but systematic variation of the concentrations of only the 4 metals yields dimensionality of a quaternary composition space) were measured using a rapid Raman scanning technique

described previously.19 Similar to the XRD dataset, the Raman signal from the substrate varies in intensity with sample composition, and the high sensitivity of Raman detectors to

environmental factors such as room temperature introduces additional variability in background signal. Data acquisition proceeded over a week, during which time-dependent variations in signal

levels were observed. These occur, for example, due to day-to-night temperature variation in the laboratory. While we expect the background to be smooth, a closed mathematical expression is

not available, making this dataset well matched to the capabilities of the MCBL model. As discussed in the Methods section, limiting each of the background signals to be smooth makes the

results relatively insensitive to the number of background sources included in the model, provided this number is at least as large as the true number of background sources. Since we expect

that several sources may be present, 16 is a convenient upper bound and is a standard value to use for datasets where more specific knowledge of the background sources is unavailable. Since

peak shapes, in particular peak widths, are more variable in Raman measurements compared to XRD measurements, and the intensity of the Raman signal of interest is often comparable to the

measurement noise, even background-subtracted Raman signals are not readily interpretable without additional information. MCBL provides such additional information, in particular the probability that

each individual data point contains signal from the sample, i.e., intensity that is not explainable by the background and noise models. For each measured signal, the algorithm produces a

probability signal that can be used to reason about the data in subsequent analysis. Since single-point outliers in the measured signals can cause single-point outliers in the probability

signal, MCBL factors in the prior knowledge that any Raman feature of interest will span several data points by smoothing the probability signal via kernel regression20 with a Gaussian

kernel whose width (_σ_) is three data points. Thresholding the smoothed probability signals at 50% provides identification of each data point that likely contains signal from the sample of interest.
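
The smoothing and thresholding step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that the kernel regression is the Nadaraya-Watson estimator on evenly spaced data points, and the names `smooth_probability` and `likely_signal` and the toy probability signal are hypothetical.

```python
import math

def smooth_probability(prob, sigma=3.0):
    """Nadaraya-Watson kernel regression with a Gaussian kernel.

    `prob` is one probability signal (values in [0, 1]) on evenly spaced
    data points; `sigma` is the kernel width in units of data points.
    """
    n = len(prob)
    smoothed = []
    for j in range(n):
        # Gaussian weight of every data point k relative to point j.
        weights = [math.exp(-0.5 * ((j - k) / sigma) ** 2) for k in range(n)]
        total = sum(weights)
        smoothed.append(sum(w * p for w, p in zip(weights, prob)) / total)
    return smoothed

def likely_signal(prob, sigma=3.0, threshold=0.5):
    """Flag data points whose smoothed probability exceeds the threshold."""
    return [p > threshold for p in smooth_probability(prob, sigma)]

# Toy probability signal (assumed): one single-point outlier at index 5
# and one broad feature spanning indices 20-29.
prob = [0.0] * 40
prob[5] = 1.0
for k in range(20, 30):
    prob[k] = 1.0

flags = likely_signal(prob)
print([k for k, flagged in enumerate(flags) if flagged])
```

In this toy case the isolated outlier is diluted below the 50% threshold by its zero-probability neighbors, whereas the feature spanning ten data points survives the smoothing.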

Representative examples of background identification and removal are shown for three Raman measurements in Fig. 2a–c. Using MCBL with 16 background components yields background-subtracted

signals with a flat, near-zero baseline atop which the small signal peaks are far more evident than in the raw data. Since each net signal contains measurement noise, the visual

identification of peaks can be assessed in the context of this noise. The results of the probability signal analysis are shown with demarcation of each data point that likely contains signal

from the sample. It is worth noting that researchers often apply smoothing to assist in identification of such small peaks in the signal, although the propensity for modification of the

true signal and possibilities for both false positive and false negative peak detection highlight the benefits of identifying the background signal using a probabilistic model that

considers both the noise and the signal from the sample. Figure 2a includes examples of peaks from the sample that are notoriously difficult to identify. The peak near 480 cm−1 appears atop

a larger peak in the background signal, and the peak near 510 cm−1 lies on a strongly sloped portion of the background signal. The intensity at the right edge of the measured signal in

Fig. 2a is increasing, so inspection of this individual pattern alone could not definitively establish whether that portion of the signal contains signal from the sample. The

probabilistic model makes this assessment, where no signal from the sample is identified in this portion of Fig. 2a. A sample peak is detected in the analogous portion of the measurement in

Fig. 2b where the partial measurement of the peak atop a sloped background would be problematic for any peak fitting (regression)-based search for sample peaks. Figure 2b also demonstrates

the importance of the multi-component aspect of the background model. While the background signal is qualitatively similar to the other samples, the quantitative differences that are

emblematic of the unique mixture of the background sources render the single-component model unable to provide a clean background-subtracted signal. The model’s detection of two peaks in the

measured signal of Fig. 2c (near 480 and 680 cm−1) is particularly impressive as even an expert analyzing the data manually may hesitate to label these features as sample peaks due to the poor

signal-to-noise ratio. Their detection in the probabilistic model is aided by the appearance of the peaks in other measurements, including that of Fig. 2a. To highlight the quality of the

net signals produced by the MCBL model, the measurement of Fig. 2c is shown in Fig. 2d along with traditional polynomial baseline modeling. The lower-order polynomial yields a net signal

where the largest peak is actually from a background source, and increasing the polynomial order to capture this feature in the background model results in removal of practically all signal

from the sample. To further illustrate the background removal and peak identification process, Fig. 3 includes a series of ten of the Raman measurements with a variety of peak locations,

shapes, and relationships to the background signal. Since the signal probability is calculated for every data point, the probability signals can be plotted in the same manner as the measured

signals, as shown in Fig. 3b. The background-subtracted samples in Fig. 3c are shown with partial transparency where the probability signal is below the 50% threshold so that the regions of

each pattern that likely contain signal from the sample are highlighted. The sharp, intense peaks in the top two patterns may be easily identified by a variety of algorithms, although

identification of many of the broader, weaker features from each sample requires the excellent background identification and probabilistic reasoning of the MCBL model. PROBABILISTIC

CLASSIFICATION OF SAMPLE SIGNAL AND ENUMERATION OF BACKGROUND SOURCES Figure 2b also illustrates a subtle consequence of the model’s collective learning of the background signals,

measurement noise, and probabilities via the probabilistic framework. The rank 16 model identifies the appropriate background and consequently correctly learns the measurement noise to

identify three small peaks between 220 and 420 cm−1. The rank 1 background model is imperfect, and the collection of samples with incorrect background signals inflates the model’s estimation

of the noise level such that the resulting probability signals do not identify any of these three peaks as likely containing signal from the sample. The comprehensive probabilistic

framework enables simultaneous learning of multiple properties of the measured signals, but using a background rank smaller than the true number of background sources is deleterious not only

to background removal but also to automated detection of signals of interest. The classification of measured signals as lacking or containing a signal of interest has a variety of

applications ranging from materials discovery to characterization of the background sources. Using the rank 16 background model, 743 of the 2121 measured signals contain at least one

datapoint that is likely to contain signal of interest. Using this as the baseline classification of absence or presence of signal from the sample, the performance of lower-rank models can

be assessed via the recall (the fraction of the 743 patterns with signal that are correctly identified as having signal) and the precision (the fraction of signals with detected signal that

actually have signal). The results are summarized in Fig. 4a and demonstrate the poor performance of the rank 1 model for this classification task, which is due to a confluence of phenomena

including that noted above; non-removed background signal can be interpreted as signal of interest (false positive), and the inflated noise level in the noise model can fail to identify

small signals of interest (false negative). Increasing to rank 2 greatly improves the recall but not the precision, and increasing to rank 4 largely removes the disparity between recall and

precision. Since there is no substantial change upon increasing to rank 8, these results collectively indicate that the number of background sources is three or four. It is worth noting that

multiple components are needed to model a single background source if its signal varies in shape over the dataset, so this interpretation of rank as determining the number of sources includes

the number of unique physical phenomena that alter the shape of a background signal. The background sources can be further characterized using the wealth of information provided by the MCBL

model, such as the spatial or temporal variation in the intensity of each background source. Figure 4b includes a similar analysis for how the background-subtracted signals vary with rank.

Once again using the rank 16 results as the baseline for comparison, the difference of each background-subtracted signal is measured using both the $\ell _1$ and root mean squared (RMS)

loss. The average per-signal loss appears to follow a power law relationship with the model rank. Each pattern contains 1023 data points, so starting at rank 4 the $\ell _1$ value per data

point is about 1 CPS or lower, and comparison to the signals in Fig. 2 demonstrates that this is within the measurement noise, in agreement with the above observation that rank 3 or 4 is

sufficient to model the background in this dataset. Using a larger rank has no substantial influence on the resulting signals of interest. This stability in the model’s solution is an

important feature for unsupervised deployment. DISCUSSION The results above demonstrate not only successful background removal but also the generation of insightful probabilistic models for

both XRD and Raman data. While background removal is often considered a non-scientific aspect of data interpretation, consider instead the concept that the scientific merit of a chain of

analyses is only as strong as its weakest link. Artifacts injected from non-principled background subtraction are inherited by subsequent analyses and can contaminate the scientific

interpretation of the data. In general, any modification to measured data should be performed in a manner that reflects a fundamental understanding of the underlying physical processes that

give rise to the measured signals. In the present work, this understanding is incorporated with specificity through the establishment of a probability density model for the signal of

interest, yet through its parameterization the model retains generality for any measurements involving addition of non-negative sources. A desirable consequence of this principled

parameterization of the background model is that the learned parameters provide statistical characterizations of the data, which was demonstrated with analysis of the probability signals and

the identification of the number of background sources in the Raman dataset. While not discussed in the present work, after identification of the number of background sources, the

individual background signals can be analyzed to study the background sources themselves, and the activations of each of these sources in a dataset enables quantification of the variability

in each background source’s intensity. While one goal of the algorithm is the generation of background-free signals, these examples illustrate the broader application of the probabilistic

learning approach, that the optimized probabilistic model contains deep information about every component of the measured signal. A principled approach to the identification, removal and

statistical evaluation of background signals is established for any measurement type where each measured signal is a combination of non-negative contributions from multiple sources. Through

design of a parameterized probability density function for the measured intensities of a signal of interest, a probabilistic framework is established for unsupervised learning of background

signals, in particular when there are multiple sources of background whose contributions to the measured signal vary among the set of measurements. In addition to unsupervised operation, the

model provides a variety of methods for incorporating prior knowledge, which is demonstrated with an example XRD dataset in which the crystalline substrate produces more intense diffraction

patterns than the sample of interest. The probability signals, which indicate where the signal of interest is likely present, are demonstrated using a Raman dataset in which the ~4

background sources are identified and modeled for each measurement, providing signals for further analysis that contain negligible contributions from the background. The probability signals

and other parameters can be employed by subsequent reasoning and learning algorithms, making the algorithm a foundational advancement in the automation of data interpretation. METHODS MCBL

MODEL In a dataset with _N_ signals that were measured on a variety of samples, each signal _S__i_ is modeled as the sum of the signal _P__i_ from the sample, which typically involves a

series of peaks, and the total background signal _B__i_. For each data point _j_ in measurement _i_, $$S_{i,j} = P_{i,j} + B_{i,j}.$$ (1) Since in general _B__i_ is composed of a unique

mixture of _K_ background signals, the background patterns and sample-specific weights are determined using a matrix factorization (MF) approach. The MF construction of the background model

involves the matrix _V_, containing the collection of _K_ signals from the background sources, and the matrix _U_, containing the amount of each background signal in each measured signal.

The matrix product _UV_ is thus the collection of total background signals for each measurement: _B_ ≈ _UV_. To create a model that does not require measurement of each background signal,

which is typically not possible, the matrices _U_ and _V_ are learned from the measured spectra by considering the MF problem $$S \approx UV.$$ (2) In traditional implementations of matrix

factorization, the residuals of the model, _R_ = _S_ − _UV_, are minimized with respect to an $\ell _2$ or similar loss metric. However, given that _UV_ is the background model, _R_ contains

_P_, the signals of interest. In spectroscopic data, _P_ is positive and can be large. Critically, large deviations are penalized heavily by traditional loss functions like $\ell _2$.
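
A small numerical illustration of this point, under simplifying assumptions that are not from the original work: a single measured signal, a known background shape `v` with a true weight of 1.0, and one noise-free Gaussian peak standing in for the signal of interest. Only the background weight is fitted, using the closed-form least-squares ($\ell _2$) solution.

```python
import math

# Hypothetical toy signal: a smooth background shape v with true weight 1.0,
# plus one positive peak standing in for the signal of interest.
n = 200
x = [i / (n - 1) for i in range(n)]
v = [1.0 + 0.5 * xi for xi in x]
peak = [5.0 * math.exp(-0.5 * ((xi - 0.5) / 0.02) ** 2) for xi in x]
s = [vi + pk for vi, pk in zip(v, peak)]

# Closed-form least-squares weight for the model s ~ u * v.
u_l2 = sum(si * vi for si, vi in zip(s, v)) / sum(vi * vi for vi in v)
residual = [si - u_l2 * vi for si, vi in zip(s, v)]

print(f"l2 background weight: {u_l2:.3f} (true weight is 1.000)")
print(f"residual far from the peak: {residual[0]:.3f}")
```

Because the peak is treated as an error to be minimized, the fitted weight exceeds the true value, and the background-subtracted signal becomes negative away from the peak. This is the distortion that a loss tolerant of large positive residuals is meant to avoid.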

Therefore, this problem requires a novel approach to solving the matrix factorization problem, which allows for large deviations from the background model _UV_ where signals of interest are

present. If the signal of interest includes a peak (non-background signal) at the _j_th data point in measurement _i_, then _R__i_,_j_ will be large and positive. While _R__i_,_j_ should be

near zero for data points containing only background signal, the measurements _S_ and thus the residuals _R_ contain measurement noise. As a result, the distribution of _R__i_,_j_ values

will be different when the signal of interest is absent or present. When absent, the measurement noise is typically well modeled by a Gaussian distribution, ${\cal{N}}_{\mu ,\sigma }$.

When present, the large residual intensities (peaks) are modeled by an exponential distribution, which when combined with the Gaussian distribution for noise yields the exponentially

modified Gaussian (EMG) distribution: $${\mathrm{EMG}}_{\mu ,\sigma ,\lambda }(R_{ij}) = \frac{\lambda }{2}e^{\frac{\lambda }{2}(2\mu + \lambda \sigma ^2 - 2R_{ij})}\,{\mathrm{erfc}}\left( \frac{\mu + \lambda \sigma ^2 - R_{ij}}{\sqrt{2}\,\sigma } \right),$$ (3)

where erfc is the complementary error function, _λ_ is the rate parameter of the exponential random variable,

and _μ_ and _σ_ are the location and scale parameters of the Gaussian random variable, respectively. This distribution was previously used in biology,21 psychology,22 and finance.23
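
As a sanity check on Eq. (3), the density can be evaluated directly with the Python standard library. The sketch below is illustrative only; the function names and the parameter values (_μ_ = 0, _σ_ = 1, _λ_ = 0.5) are assumptions chosen for the example, not values from this work.

```python
import math

def gaussian_pdf(r, mu, sigma):
    """Gaussian density used for residuals without signal of interest."""
    return math.exp(-0.5 * ((r - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def emg_pdf(r, mu, sigma, lam):
    """Exponentially modified Gaussian density, written as in Eq. (3)."""
    exponent = 0.5 * lam * (2.0 * mu + lam * sigma ** 2 - 2.0 * r)
    z = (mu + lam * sigma ** 2 - r) / (math.sqrt(2.0) * sigma)
    return 0.5 * lam * math.exp(exponent) * math.erfc(z)

# Assumed example parameters.
mu, sigma, lam = 0.0, 1.0, 0.5

# Riemann-sum check that the density integrates to approximately one.
step = 0.01
grid = [-10.0 + step * i for i in range(7001)]  # covers [-10, 60]
total = sum(emg_pdf(r, mu, sigma, lam) for r in grid) * step
print(f"integral of EMG density: {total:.4f}")

# The EMG has a much heavier right tail than the Gaussian noise model.
print(emg_pdf(5.0, mu, sigma, lam) > gaussian_pdf(5.0, mu, sigma))
```

The heavy right tail is what allows large positive residuals, such as peaks, to remain likely under the model instead of being forced into the background. For very negative residuals or large _λσ_², the product of a large exponential and a tiny erfc can lose precision, so a practical implementation would likely work with log-densities.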

Furthermore, the values _λ_, _μ_, and _σ_ can vary along the measurement axis to increase the flexibility of the model, if required. In the present work we consider only a single _σ_ and _λ_

for a given dataset and fix the mean _μ_ of the Gaussian noise to zero. Since ${\cal{N}}_{\mu ,\sigma }$ and the EMG distribution of Eq. (3) describe the distribution of residual

intensities when the signal of interest is absent and present, respectively, a general expression for the distribution of residual intensities is their mixture:

$${\mathrm{EMGM}}_{\mu ,\sigma ,\lambda ,Z_{ij}}(R_{ij}) := (1 - Z_{ij})\,{\cal{N}}_{\mu ,\sigma }(R_{ij}) + Z_{ij}\,{\mathrm{EMG}}_{\mu ,\sigma ,\lambda }(R_{ij}),$$ (4)

where _Z__ij_ indicates whether signal of interest is absent (_Z__ij_ = 0) or present (_Z__ij_ = 1) in the residual _R__ij_. Optimization of the matrix factorization model Eq. (2)

corresponds to finding the background patterns, weights and distribution parameters such that the likelihood, corresponding to the product of Eq. (4) for all data points, is maximized. This

enables _U_, _V_, _Z_, _λ_, _μ,_ and _σ_ to be learned concomitantly. A standard procedure in machine learning is to regularize optimization problems to make them well posed. In particular,

to encourage the algorithm to find solutions with a small noise variance, we added a half normal prior on _σ_ to regularize the optimization with respect to the parameter. The prior

distribution has a variance $\sigma _0^2$ which can be used to control the strength of the regularization. This is necessary for the XRD dataset, since it does not include any substrate

measurements. Therefore, we used $\sigma _0^2 = 0.01$ for the XRD dataset. Because the exact optimization of the binary variables _Z__i_,_j_ is computationally intractable, we employ an

expectation-maximization algorithm.24,25 Instead of inferring _Z__i_,_j_ directly, the algorithm computes the expected value ${\mathbb{E}}[Z_{ij}]$, which is a continuous variable in the

interval [0, 1]. From the equality $${\mathbb{E}}[Z_{ij}]={\mathbb{P}}(Z_{ij} = 1),$$ (5) we also obtain the probability that the measured data point _S__i_,_j_ contains non-background

signal. The algorithm for solving this implementation of probabilistic matrix factorization is described in ref. 26. This approach to background identification enables unsupervised learning

of the background model after choosing the value of a single parameter, the rank _K_ of _V_, which corresponds to the number of background sources. While unsupervised methods for determining

an appropriate value of _K_ can be deployed,27 we instead further constrain the matrix factorization model such that the results are relatively insensitive to _K_. This enables users to

choose an upper bound for _K_ and retain unsupervised operation. The constraints to the matrix factorization also enable semi-supervised operation, which enables both injection of prior

knowledge of the background sources and deployment in data-starved situations where there are not enough examples of the background signals for the unsupervised model to robustly learn them.
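
To make the expectation in Eq. (5) concrete, the sketch below computes the posterior probability that a single residual contains signal of interest, given the two components of Eq. (4). It is an assumption-laden illustration rather than the algorithm of ref. 26: the prior weight `prior_signal`, the function names, and the parameter values are hypothetical, and the density functions of Eq. (3) are repeated only so that the example is self-contained.

```python
import math

def gaussian_pdf(r, mu, sigma):
    """Gaussian density: residual distribution when signal is absent."""
    return math.exp(-0.5 * ((r - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def emg_pdf(r, mu, sigma, lam):
    """EMG density of Eq. (3): residual distribution when signal is present."""
    exponent = 0.5 * lam * (2.0 * mu + lam * sigma ** 2 - 2.0 * r)
    z = (mu + lam * sigma ** 2 - r) / (math.sqrt(2.0) * sigma)
    return 0.5 * lam * math.exp(exponent) * math.erfc(z)

def signal_probability(r, sigma, lam, prior_signal=0.5, mu=0.0):
    """Posterior P(Z = 1 | R = r) by Bayes' rule over the two components.

    `prior_signal` is a hypothetical prior weight for Z = 1; how the prior
    is handled in the actual algorithm is not specified here.
    """
    present = prior_signal * emg_pdf(r, mu, sigma, lam)
    absent = (1.0 - prior_signal) * gaussian_pdf(r, mu, sigma)
    return present / (present + absent)

# Assumed noise scale and exponential rate for the example.
sigma, lam = 1.0, 0.5
for r in (0.0, 2.0, 5.0):
    print(f"residual {r:4.1f} -> P(signal) = {signal_probability(r, sigma, lam):.3f}")
```

A residual near zero is better explained by the Gaussian noise component and receives a low probability, while a residual several noise standard deviations above zero is assigned a probability close to one.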

The constraints are implemented by defining kernel functions for each component of _V_. The most commonly used kernel is the squared exponential (SE) kernel, which enforces smoothness of

each background signal. For example, the background model for the Raman data was obtained by using the SE kernel for all background components. Further, if the underlying physics of a given

background signal give rise to a functional form or another physics-based constraint, this too can be used to constrain components in _V_. In fact, for the XRD dataset, the SE kernel was

only used for two background signals. The other two background signals were constrained based on prior knowledge of the background signal from the crystalline substrate; the intensity of

these background signals was constrained to zero except for the regions indicated above in the X-ray diffraction section. This is done with a simple projection: All values outside of the

allowed ranges are set to zero in every gradient step of the optimization algorithm. Despite there only being one crystalline substrate, we used two vectors to express its signature to

accommodate any variations in this background signal over the set of measurements. LIBRARY SYNTHESIS The pseudo-ternary metal oxide composition gradient was fabricated using reactive

direct current magnetron co-sputtering of Cu, Ca, and V metal targets in a non-confocal geometry onto a 100 mm diameter × 2.2 mm thick soda lime glass substrate with FTO coating (Tec15,

Hartford Glass Company) in a sputter deposition system (Kurt J. Lesker, PVD75) at 10⁻⁵ Pa base pressure. The partial pressures of the deposition atmosphere containing inert sputtering gas Ar

and reactive gas O2 were 0.072 Pa and 0.008 Pa respectively. Deposition proceeded without active substrate heating, with the source powers set to 150 W, 11 W, and 95 W for the V, Cu, and Ca

sources respectively. Deposition time per source was varied in order to achieve a total film thickness of 200 nm. The as-deposited composition library was annealed in a Thermo Scientific

box oven in flowing air, with a 2 h ramp and 3 h soak at 550 °C, followed by passive cooling. The 2121 samples forming the 15 pseudo-quaternary composition space library were deposited via

inkjet printing onto 100 × 150 × 1.0 mm fluorine-doped tin oxide (FTO) coated boro-aluminosilicate glass (Corning Eagle XG Glass). The array of samples containing Mn, Fe, Ni, Cu, Co, and Zn

was synthesized as a discrete library with 10 atom% composition steps in each element, using a print resolution of 2880 × 1440 dpi, as described previously.28 Elemental precursor inks were

prepared by mixing 3.33 mmol of each metal precursor with 20 mL of stock solution. The stock solution of 500 mL 200 proof ethanol (Koptec), 16 mL glacial acetic acid (J.T. Baker, Inc.), 8

mL concentrated HNO3 (EMD), and 13 g Pluronic F127 (Aldrich) was prepared beforehand. The metal precursors Mn(NO3)2⋅4H2O (0.88 g, 99.8%, Alfa Aesar), Fe(NO3)3⋅9H2O (1.43 g, 99.95%, Sigma

Aldrich), Co(NO3)2⋅6H2O (0.93 g, 98%, Sigma Aldrich), Ni(NO3)2⋅6H2O (1.09 g, 98.5%, Sigma Aldrich), Cu(NO3)2⋅3H2O (0.83 g, 99–104%, Sigma Aldrich), and Zn(NO3)2⋅6H2O (1.00 g, 98%, Sigma

Aldrich) were used as-received from the distributor. After inkjet printing, the inks were dried and converted to metal oxides by calcination in 0.395 atm O2 at 450 °C for 10 h, followed by

0.395 atm O2 at 750 °C for 10 h. X-RAY DIFFRACTION XRD was performed on the pseudo-ternary metal oxide composition gradient using a Bruker DISCOVER D8 diffractometer, with a Bruker IμS

source emitting Cu K_α_ radiation. Using a 0.5 mm collimator, the measurement area was approximately 0.5 mm × 1 mm. Within this measurement area the composition is uniform to within about 1

at.%. Measurements were taken on an array of 186 evenly spaced positions across the continuous composition library. Two-dimensional diffraction images taken by the VÅNTEC-500 detector were

integrated into one-dimensional patterns using DIFFRAC.SUITE™ EVA software. RAMAN SPECTROSCOPY The 15 pseudo-quaternary composition space metal oxide sample library was characterized using

a Renishaw inVia Reflex Micro Raman spectrometer with WiRE 4.1 software as described previously.19 The instrument’s laser wavelength was 532 nm, and the diffraction grating had 2400

lines mm⁻¹ (visible). Spectra were taken over the range 67–1339.9 cm⁻¹ using a ×20 objective. The Renishaw Streamline™ mapping system was used to automate spectral image collection in which

a cylindrical lens-expanded 26 × 2 μm laser line was rastered over the measurement area. Spectra were acquired at 65 μm spatial resolution and 0.75 s exposure time. DATA AVAILABILITY The

datasets analyzed during the current study are available in the Caltech Data repository: XRD at https://doi.org/10.22002/D1.1178, https://data.caltech.edu/records/1178 and Raman at

https://doi.org/10.22002/D1.1179, https://data.caltech.edu/records/1179. CODE AVAILABILITY The codes pertaining to the current study will be available at

http://www.cs.cornell.edu/gomes/udiscoverit/.

REFERENCES

1. Alberi, K. et al. The 2019 materials by design roadmap. _J. Phys. D_ 52, 013001 (2019).
2. Aspuru-Guzik, P. K. A. Alán. Report of the Clean Energy Materials Innovation Challenge Expert Workshop January 2018, Mission Innovation http://mission-innovation.net/wp-content/uploads/2018/01/Mission-Innovation-IC6-Report-Materials-Acceleration-Platform-Jan-2018.pdf.
3. Tabor, D. P. et al. Accelerating the discovery of materials for clean energy in the era of smart automation. _Nat. Rev. Mater._ 3, 5 (2018).
4. Laue, M. Über die Interferenzerscheinungen an planparallelen Platten. _Ann. der Phys._ 318, 163–181 (1904).
5. Seah, M. P. The quantitative analysis of surfaces by XPS: a review. _Surf. Interface Anal._ 2, 222–239 (1980).
6. Sonneveld, E. J. & Visser, J. W. Automatic collection of powder data from photographs. _J. Appl. Crystallogr._ 8, 1–7 (1975).
7. Tougaard, S. Algorithm for automatic X-ray photoelectron spectroscopy data processing and X-ray photoelectron spectroscopy imaging. _J. Vac. Sci. Technol._ 23, 741–745 (2005).
8. Hattrick-Simpers, J. R., Gregoire, J. M. & Kusne, A. G. Perspective: composition–structure–property mapping in high-throughput experiments: turning data into knowledge. _APL Mater._ 4, 053211 (2016).
9. Stein, H. S., Jiao, S. & Ludwig, A. Expediting combinatorial data set analysis by combining human and algorithmic analysis. _ACS Comb. Sci._ 19, 1–8 (2017).
10. Tessier, F. & Kawrakow, I. Calculation of the electron–electron bremsstrahlung cross-section in the field of atomic electrons. _Nucl. Instr. Meth. Phys. Res. B_ 266, 625–634 (2008).
11. Kramers, H. A. XCIII. On the theory of X-ray absorption and of the continuous X-ray spectrum. _Lond. Edinb. Dublin Philos. Mag. J. Sci._ 46, 836–871 (1923).
12. Davies, H., Bethe, H. A. & Maximon, L. C. Theory of Bremsstrahlung and pair production. II. Integral cross section for pair production. _Phys. Rev._ 93, 788–795 (1954).
13. Bethe, H. A. & Maximon, L. C. Theory of Bremsstrahlung and pair production. I. Differential cross section. _Phys. Rev._ 93, 768–784 (1954).
14. Tougaard, S. & Jorgensen, B. Inelastic background intensities in XPS spectra. _Surf. Sci._ 143, 482–494 (1984).
15. Zhao, J., Lui, H., McLean, D. I. & Zeng, H. Automated autofluorescence background subtraction algorithm for biomedical Raman spectroscopy. _Appl. Spectrosc._ 61, 1225–1232 (2007).
16. Markus, G., Konstantinos, N., Frank, P., Christian, M. & Andreas, O. Multivariate characterization of a continuous soot monitoring system based on Raman spectroscopy. _Aerosol Sci. Technol._ 49, 997–1008 (2015).
17. Li, Z., Ludwig, A., Savan, A., Springer, H. & Raabe, D. Combinatorial metallurgical synthesis and processing of high-entropy alloys. _J. Mater. Res._ 33, 3156–3169 (2018).
18. Zhao, J. Combinatorial approaches as effective tools in the study of phase diagrams and composition–structure–property relationships. _Prog. Mater. Sci._ 51, 557–631 (2006).
19. Newhouse, P. F. et al. Solar fuel photoanodes prepared by inkjet printing of copper vanadates. _J. Mater. Chem. A_ 4, 7483–7494 (2016).
20. Wand, M. & Jones, M. _Kernel Smoothing._ (Chapman and Hall/CRC, New York, 1995).
21. Golubev, A. Exponentially modified Gaussian (EMG) relevance to distributions related to cell proliferation and differentiation. _J. Theor. Biol._ 262, 257–266 (2010).
22. Palmer, E. M., Horowitz, T. S., Torralba, A. & Wolfe, J. M. What are the shapes of response time distributions in visual search? _J. Exp. Psychol. Hum. Percept. Perform._ 37, 58–71 (2011).
23. Carr, P., Madan, D. & Smith, H. R. Saddle point methods for option pricing. _J. Comput. Financ._ 13, 49–61 (2009).
24. Dempster, A. P., Laird, N. M. & Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. _J. R. Stat. Soc. Ser. B_ 39, 1–38 (1977).
25. Neal, R. M. & Hinton, G. E. _Learning in Graphical Models._ (MIT Press, Cambridge, 1999).
26. Ament, S., Gregoire, J. & Gomes, C. Exponentially-modified Gaussian mixture model: applications in spectroscopy. Preprint at arXiv:1902.05601 (2019).
27. Neal, R. M. Markov chain sampling methods for Dirichlet process mixture models. _J. Comput. Graph. Stat._ 9, 249–265 (2000).
28. Haber, J. A. et al. Discovering Ce-rich oxygen evolution catalysts, from high throughput screening to water electrolysis. _Energy Environ. Sci._ 7, 682–688 (2014).

ACKNOWLEDGEMENTS The development of the MCBL algorithm, inkjet printing synthesis, and Raman measurements were supported by an Accelerated Materials Design and Discovery grant

from the Toyota Research Institute. Initial design of the algorithm and data procurement were supported by the NSF Expedition award for Computational Sustainability CCF-1522054 and by Army

Research Office (ARO) award W911-NF-14-1-0498. The implementation of the algorithm for automated, unsupervised operation was supported by MURI/AFOSR grant FA9550. Compute infrastructure was

provided by NSF award CNS-0832782 and by ARO DURIP award W911NF-17-1-0187. The sputter deposition and XRD measurements were supported through the Office of Science of the U.S. Department of

Energy under Award No. DE-SC0004993. The authors thank Edwin Soedarmadji for assistance with data management. AUTHOR INFORMATION AUTHORS AND AFFILIATIONS * Department of Computer Science,

Cornell University, Ithaca, NY, 14850, USA Sebastian E. Ament & Carla P. Gomes * Joint Center for Artificial Photosynthesis, California Institute of Technology, Pasadena, CA, 91125, USA

Helge S. Stein, Dan Guevarra, Lan Zhou, Joel A. Haber, David A. Boyd, Mitsutaro Umehara & John M. Gregoire * Future Mobility Research Department, Toyota Research Institute of North

America, Ann Arbor, MI, 48105, USA Mitsutaro Umehara

CONTRIBUTIONS C.G. and J.G. identified the problem to be solved. S.A. and C.G. conceptualized the model. S.A. developed the mathematical framework, designed

the algorithm, and implemented it. J.G., H.S. and D.G. inspected results. S.A., D.G. and J.G. created visualizations of the results. L.Z. performed materials synthesis and data acquisition

for XRD data. J.H. synthesized materials for Raman measurements. D.B. and M.U. acquired and provided the Raman data. S.A., J.G., C.G., H.S. and D.G. wrote the paper. CORRESPONDING AUTHORS

Correspondence to John M. Gregoire or Carla P. Gomes. ETHICS DECLARATIONS COMPETING INTERESTS The authors declare no competing interests. ADDITIONAL INFORMATION PUBLISHER’S NOTE: Springer

Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. RIGHTS AND PERMISSIONS OPEN ACCESS This article is licensed under a Creative

Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the

original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in

the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended

use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit

http://creativecommons.org/licenses/by/4.0/. ABOUT THIS ARTICLE CITE THIS ARTICLE Ament, S.E., Stein, H.S., Guevarra, D. _et al._ Multi-component background learning

automates signal detection for spectroscopic data. _npj Comput Mater_ 5, 77 (2019). https://doi.org/10.1038/s41524-019-0213-0 * Received: 18 February 2019 * Accepted: 26

June 2019 * Published: 19 July 2019 * DOI: https://doi.org/10.1038/s41524-019-0213-0
