

Journal of Chemometrics

Current research reports and a chronological list of recent articles.

The international scientific Journal of Chemometrics is devoted to the rapid publication of original scientific papers, reviews and short communications on fundamental and applied aspects of chemometrics.

The journal is published by Wiley, which holds the copyright and publishing rights for the products listed below and is responsible for the content shown.


For additional research articles, see Current Chemistry Research Articles. See also: information resources on chemometrics.

Journal of Chemometrics - Abstracts

Projection to latent structures with orthogonal constraints for metabolomics data

Multivariate techniques based on projection methods such as Principal Component Analysis and Partial Least Squares (PLS) regression are widely applied in metabolomics. However, the effects of confounding factors and the presence of specific clusters in the data could force the projection to produce inefficient representations in the latent space, preventing the identification of the most relevant data variation. To overcome this issue, we introduce a general framework for projection methods, allowing an easy integration of orthogonal constraints, which help in reducing the effect of uninformative variations. In particular, the discussed algorithms address different scenarios. When known confounding factors can be explicitly encoded into a proper constraint matrix, orthogonally Constrained Principal Component Analysis (oCPCA) and orthogonally Constrained PLS2 (oCPLS2) can be used. Orthogonal PLS (OPLS) and post-transformation of PLS2 (ptPLS2), instead, are suited to problems in which a constraint matrix cannot be defined. Finally, a data integration task is considered: Orthogonal two-block PLS (O2PLS) and Orthogonal Wold's two-block Mode A PLS (OPLS-W2A) are used to identify the common variation between two data sets.
Date: 20.02.2018
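The constraint idea in the abstract above can be illustrated in a few lines. The sketch below — a minimal stand-in, not the authors' oCPCA/oCPLS2 implementation — removes the variation spanned by a known confounder matrix before any projection step; numpy and the toy batch effect are assumptions of this example:

```python
import numpy as np

def deflate_confounders(X, Z):
    """Project X onto the orthogonal complement of the column space of Z.

    X : (n_samples, n_features) centred data matrix
    Z : (n_samples, n_confounders) centred encoding of known confounders
    """
    # Orthogonal projector onto span(Z): P = Z (Z'Z)^+ Z'
    P = Z @ np.linalg.pinv(Z.T @ Z) @ Z.T
    return X - P @ X          # variation along the confounders is removed

# Toy data: a strong two-batch effect (the confounder) plus random variation
rng = np.random.default_rng(0)
batch = np.repeat([[0.0], [1.0]], 10, axis=0)
X = batch @ rng.normal(size=(1, 5)) * 5 + rng.normal(size=(20, 5))
Xd = deflate_confounders(X - X.mean(0), batch - batch.mean(0))

# The deflated data are exactly orthogonal to the confounder encoding
print(np.abs((batch - batch.mean(0)).T @ Xd).max() < 1e-8)   # True
```

Any subsequent PCA or PLS on `Xd` then operates in the confounder-free subspace, which is the effect the orthogonal constraints aim for.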

Correlation and redundancy on machine learning performance for chemical databases

Variable reduction is an essential step for establishing a robust, accurate, and generalized machine learning model. Variable correlation and redundancy/total correlation are the primary considerations in many variable reduction methods, given that they directly impact model performance. However, their effects vary from one class of databases to another. To clarify their effects on regression models based on small chemical databases, a series of calculations are performed. Regression models are built on features with various correlation coefficients and redundancies by 4 machine learning methods: random forest, support vector machine, extreme learning machine, and multiple linear regression. The results suggest that the correlation is, as expected, closely related to the prediction accuracy; i.e., generally, the features with large correlation coefficients with respect to the response variables achieve better regression models than those with lower ones. However, for the redundancy, no trends in the performance of the regression models are disclosed. This may indicate that for these chemical molecular databases, the redundancy might not be a primary concern.
Date: 20.02.2018
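The correlation-based screening that the study compares against can be sketched as follows — a hedged, dependency-free illustration (the paper pairs such rankings with four regression methods; the toy data and the plain Pearson ranking here are assumptions of this example):

```python
import numpy as np

def rank_by_correlation(X, y, k):
    """Rank features by |Pearson correlation| with the response; keep the top k."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc * yc[:, None]).sum(0) / (
        np.sqrt((Xc ** 2).sum(0)) * np.sqrt((yc ** 2).sum()))
    order = np.argsort(-np.abs(r))
    return order[:k], r

# Toy set: feature 0 is y plus small noise, the other two are pure noise
rng = np.random.default_rng(1)
y = rng.normal(size=50)
X = np.column_stack([y + 0.1 * rng.normal(size=50),
                     rng.normal(size=50),
                     rng.normal(size=50)])
top, r = rank_by_correlation(X, y, k=1)
print(top)   # feature 0 should rank first
```

Redundancy screening would additionally inspect the correlations among the retained features themselves, which is the second effect the abstract discusses.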

Understanding the importance of process alarms based on the analysis of deep recurrent neural networks trained for fault isolation

The identification of process faults is a complex and challenging task due to the large number of alarms and warnings in control systems. To extract information about the relationships between these discrete events, we utilise multitemporal sequences of alarm and warning signals as inputs of a recurrent neural network–based classifier and visualise the network by principal component analysis. The similarity of the events and their applicability in fault isolation can be evaluated based on the linear embedding layer of the network, which maps the input signals into a continuous-valued vector space. The method is demonstrated on a simulated vinyl acetate production technology. The results illustrate that with the application of recurrent neural network–based sequence learning, not only can accurate fault classification solutions be developed, but the visualisation of the model can also give useful hints for hazard analysis.
Date: 20.02.2018

Uniform experimental design in chemometrics

Experimental designs and modeling are very important in chemometrics and chemical engineering. There are many kinds of experimental designs, which include the fractional factorial design (including the orthogonal design), the optimal regression design, and the uniform design. The uniform experimental design can be regarded as a fractional factorial design with model uncertainty, a space filling design for computer experiments, a robust design against model specification, and a supersaturated design. This paper gives a brief introduction to the recent theoretical developments on uniform experimental design as well as its applications in chemometrics.
Date: 14.02.2018
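One classical construction of a uniform design mentioned in this literature is the good-lattice-point method. The sketch below is an illustrative implementation under that construction (the choice of n = 7 and the generators is an arbitrary example, not one taken from the paper):

```python
from math import gcd

def glp_design(n, generators):
    """Good-lattice-point construction of a uniform design with n runs.

    Entry (i, j) is ((i * h_j - 1) mod n) + 1 for run i = 1..n, so every
    column is a permutation of 1..n whenever gcd(h_j, n) == 1.
    """
    for h in generators:
        if gcd(h, n) != 1:
            raise ValueError(f"generator {h} is not coprime with {n}")
    return [[((i * h - 1) % n) + 1 for h in generators]
            for i in range(1, n + 1)]

# A small 7-run, 2-factor design with generators 1 and 3
for row in glp_design(7, [1, 3]):
    print(row)
```

Each column visits every level exactly once, which is what spreads the design points uniformly over the experimental region.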

Systematic comparison and potential combination between multivariate curve resolution–alternating least squares (MCR-ALS) and band-target entropy minimization (BTEM)

This work presents a systematic comparative evaluation of 2 methods originating from different fields, both dedicated to the problem of curve resolution/unmixing: multivariate curve resolution–alternating least squares (MCR-ALS) and band-target entropy minimization (BTEM). MCR-ALS factorizes the data matrix into spectral and concentration profiles that satisfy constraints expressing physicochemical knowledge of the analyzed system. BTEM reconstructs the pure components' spectral profiles as linear combinations of singular vectors that minimize the spectral entropy and contain specific peaks. Both methods were applied to 40 simulated data sets and one real data set. The simulated data were generated from real spectral and concentration profiles covering different types of spectroscopy (mass spectrometry, Raman, and UV-visible), data structures (random mixtures, images, and reaction systems), and noise levels; the real data set was a Raman image of a kidney calculus. For most data sets, both methods yielded accurate solutions, with a correlation between reference and resolved profiles >0.99. However, MCR-ALS (here used with the nonnegativity constraint only) was affected by rotational ambiguity in the recovery of spectral profiles from systems with high correlation or overlap in the concentration direction, whereas BTEM tended to distort UV-visible spectra, a kind of measurement far in nature from low-entropy conditions. MCR-ALS solutions were more stable than BTEM against increasing noise levels. This work also explores the possibility of combining the 2 methods by performing them in sequence. The results show that this combination can significantly improve the outcome compared with either method applied alone.
Date: 13.02.2018
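The alternating least squares core of MCR-ALS can be sketched in a few lines. This is a bare-bones illustration, not the toolbox used in the paper: nonnegativity is imposed by simply clipping the unconstrained least squares solution, a crude stand-in for the constrained solvers used in practice, and the toy data are an assumption of this example:

```python
import numpy as np

def mcr_als(D, C0, n_iter=200):
    """Minimal MCR-ALS: factor D ≈ C @ S.T under nonnegativity (by clipping)."""
    C = C0.copy()
    for _ in range(n_iter):
        # Fix C, solve for spectra S; then fix S, solve for concentrations C
        S = np.clip(np.linalg.lstsq(C, D, rcond=None)[0].T, 0, None)
        C = np.clip(np.linalg.lstsq(S, D.T, rcond=None)[0].T, 0, None)
    return C, S

# Two-component toy data built from known nonnegative profiles
rng = np.random.default_rng(2)
C_true = np.abs(rng.normal(size=(30, 2)))
S_true = np.abs(rng.normal(size=(40, 2)))
D = C_true @ S_true.T
C, S = mcr_als(D, np.abs(rng.normal(size=(30, 2))))
print(np.linalg.norm(D - C @ S.T) / np.linalg.norm(D))   # small relative residual
```

The rotational ambiguity discussed in the abstract shows up here too: `C` and `S` reproduce `D` well but need not match `C_true` and `S_true` up to more than a feasible rotation and scaling.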

Data augmentation in food science: Synthesising spectroscopic data of vegetable oils for performance enhancement

Generating more accurate, efficient, and robust classification models in chemometrics, able to address real-world problems in food analysis, is intrinsically related to the amount of available calibration samples. In this paper, we propose a data augmentation solution that increases the performance of a classification model by generating realistic augmented samples. The feasibility of this solution has been evaluated in 3 main experiments in which Fourier transform mid-infrared (FT-IR) spectroscopic data of vegetable oils were used for the identification of vegetable oil species in oil admixtures. Results demonstrate that data augmented samples improved the classification rate by around 19% in a single-instrument validation and provided a significant 38% improvement in classification when testing on more than 10 spectroscopic instruments different from the calibration one.
Date: 09.02.2018
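A common way to synthesise spectroscopic calibration samples is to perturb measured spectra with the kinds of variation instruments introduce. The sketch below is one such scheme under assumed perturbation magnitudes — not the augmentation model used in the paper:

```python
import numpy as np

def augment_spectrum(x, rng, noise=0.002, offset=0.02, slope=0.02):
    """Generate one augmented copy of a spectrum.

    Additive baseline shift, multiplicative gain, wavelength-dependent tilt,
    and noise mimic instrument-to-instrument variability; the magnitudes here
    are arbitrary illustrative choices.
    """
    n = x.size
    t = np.linspace(-1, 1, n)
    gain = 1 + slope * rng.uniform(-1, 1)        # multiplicative scatter
    tilt = offset * rng.uniform(-1, 1) * t       # wavelength-dependent drift
    base = offset * rng.uniform(-1, 1)           # constant baseline shift
    return gain * x + tilt + base + noise * rng.normal(size=n)

rng = np.random.default_rng(3)
x = np.sin(np.linspace(0, 6, 200)) ** 2          # stand-in for an FT-IR band shape
augmented = np.stack([augment_spectrum(x, rng) for _ in range(5)])
print(augmented.shape)   # (5, 200)
```

Appending such copies to the calibration set is what lets a classifier see realistic between-instrument variability it would otherwise only meet at prediction time.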

Decision risk approach to the application of biological indicators in vapor phased hydrogen peroxide bio-decontamination

Ensuring sterile conditions during pharmaceutical production is of vital importance. Restricted access barrier systems may not be sterilized in the usual way. Instead, vaporous hydrogen peroxide decontamination is used. The success of the decontamination is checked by placing biological indicators (BIs) and, upon their removal, putting them individually into broth for incubation. This technique does not allow counting the number of microorganisms; it only detects whether any of them survived, yielding a count of negative/positive answers. The Poisson distribution gives the connection between the number of viable spores and the number of positive (growth) BIs. Evaluation is based on the number of surviving BIs, which is probabilistic by nature. Acceptance limits are based on good engineering assumptions (“current industrial experience”), not on sound decision theory statistics (probability of errors of the first and second kind). A surviving BI results either from a failure of process lethality or from the limited penetrative capability of vaporous hydrogen peroxide; the latter phenomenon, called a “rogue” BI, should be distinguished. The method of evaluating the success of the decontamination is consistent with attribute sampling plans. In this paper, relations are explored to calculate the 2 types of error that can be applied in general during the qualification of a bio-decontamination procedure. This is, to the best knowledge of the authors, the first application of attribute sampling plans to evaluate the results of a BI-based bio-decontamination process.
Date: 08.02.2018
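The Poisson link and the attribute-sampling acceptance probability described above can be written down directly. The sketch below assumes a simple accept-on-at-most-c-positives plan with illustrative numbers; it is not taken from the paper:

```python
from math import exp, comb

def p_positive(mean_survivors):
    """Poisson link: probability that a BI shows growth (>= 1 viable spore)."""
    return 1.0 - exp(-mean_survivors)

def p_accept(n_bi, c_max, mean_survivors):
    """Attribute sampling: probability of accepting the cycle, i.e. of
    observing at most c_max positive BIs out of n_bi (binomial in the
    Poisson-derived per-BI positive probability)."""
    p = p_positive(mean_survivors)
    return sum(comb(n_bi, k) * p**k * (1 - p)**(n_bi - k)
               for k in range(c_max + 1))

# 20 BIs, accept if no more than 1 is positive
print(round(p_accept(20, 1, 0.01), 4))   # good cycle (λ = 0.01): acceptance near 1
print(round(p_accept(20, 1, 1.0), 4))    # poor cycle (λ = 1.0): acceptance near 0
```

The two error types then fall out of the same curve: the producer's risk is 1 minus the acceptance probability at an acceptably low λ, and the consumer's risk is the acceptance probability at an unacceptably high λ.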

Trajectory-based phase partition and multiphase multilinear models for monitoring and quality prediction of multiphase batch processes

New process monitoring and quality prediction methods are proposed for batch processes with multiple operation phases. First, a trajectory-based phase partition method is developed to divide a batch process into different operation phases by clustering the time slices of reference batches using the warped K-means algorithm. Multilinear modeling methods, e.g., parallel factor analysis and N-way partial least squares (NPLS), are then used to model the 3-way batch data in each operation phase. An online process monitoring method is proposed based on the multiphase parallel factor analysis models. An online quality prediction method is developed based on 2-level quality prediction models, consisting of first-level multiphase multiway partial least squares models and second-level multiphase NPLS models. The first-level multiway partial least squares models carry out real-time quality prediction at each sampling time in different operation phases. At the end of each operation phase, the second-level multiphase NPLS model is used to compute a more accurate quality prediction by taking into account the phase accumulative effect on the final product quality. The implementation, effectiveness, and advantages of the proposed methods are illustrated with a case study on a penicillin fermentation process.
Date: 08.02.2018

Detectability of concentration-dependent factors by application of PCA. An indicator curve for the determination of important principal components and a post-correction for transformation of principal components to factors

A semiempirical model of light absorption for binary liquid mixtures, which includes linear, parabolic, and periodic terms of concentration-dependent factors, has been developed and applied to investigate the revealability of the factors. Concentration-dependent near-infrared spectra of ethanol-water mixtures and a two-component model were decomposed by principal component analysis. An indicator generated from the principal component analysis results, called the mean coefficient of determination, is introduced for separating the important or systematic principal components carrying deterministic information (factor PCs) from stochastic principal components originating from spectral noise (error PCs). Moreover, a post-correction method is proposed to pull the concentration-dependent factor effects out of the systematic principal components. The first PC of ethanol-water NIR mixture spectra defines the contributions of clusters of both water and ethanol molecules in the solution to the resultant absorbance signals. The second PC includes partial absorptions from ethanol-water dimers and ethanol-water-ethanol trimers. The third PC is assumed to reflect concentration-dependent restructuring of the mixture.
Date: 08.02.2018

Screening for linearly and nonlinearly related variables in predictive cheminformatic models

For a long time, feature selection has been a hot topic in the statistical literature and has become increasingly frequent and important in various research fields. Feature screening methods using marginal correlation show potential problems. Another issue that hinders the selection of an important variable is the shading effect of a highly influential variable on variables of lower importance. Feature selection can be even more complex in the presence of nonlinear relations. To overcome these limitations, an innovative method for selecting linearly and nonlinearly correlated variables is presented. It works by coupling nonparametric variable ranking methods with nonparametric regression methods through an iterative regression based on residuals. Here, the maximal information coefficient and distance correlation are used to rank the variables. The algorithm starts by modeling the relationship between the response and the top-ranking variable using the multivariate adaptive regression splines method. In the next iterations, the top-ranking variables are selected based on their relationship with the subsequent residuals. The validation of the method is discussed using 2 nonlinear simulated data sets. The method is further validated by analysis of 2 real cheminformatic data sets comprising the toxicity of 1571 industrial chemicals and the aqueous solubility of a diverse set of 1708 organic molecules.
Date: 08.02.2018
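The residual-iteration idea can be demonstrated with a greatly simplified stand-in: the paper ranks with the maximal information coefficient / distance correlation and fits MARS models, whereas the sketch below uses plain Pearson correlation and ordinary least squares purely to keep the example dependency-free; the toy data are also an assumption:

```python
import numpy as np

def iterative_residual_selection(X, y, n_select):
    """Repeatedly pick the variable most correlated with the current residual,
    regress it out, and continue — a greedy residual-based screening loop."""
    resid = y - y.mean()
    selected = []
    for _ in range(n_select):
        Xc = X - X.mean(axis=0)
        r = np.abs(Xc.T @ resid) / (np.linalg.norm(Xc, axis=0)
                                    * np.linalg.norm(resid))
        r[selected] = -1                      # never pick a variable twice
        selected.append(int(np.argmax(r)))
        # Refit on all selected variables and update the residual
        A = np.column_stack([np.ones(len(y))] + [X[:, k] for k in selected])
        beta = np.linalg.lstsq(A, y, rcond=None)[0]
        resid = y - A @ beta
    return selected

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 4] + X[:, 7] + 0.1 * rng.normal(size=200)
print(iterative_residual_selection(X, y, 2))   # variables 4 and 7 expected
```

Because the second pick is scored against the residual rather than the raw response, the weaker variable 7 is no longer shaded by the dominant variable 4 — the effect the abstract describes.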

Seriation, the method out of a chemist's mind

Seriation is the practice of performing row and column permutations on data matrices to reveal clusters and hidden patterns within and between them. Seriation has been a known problem in the literature for over a century, except for chemical data, where fewer than a handful of researchers have knowingly worked on this technique. The aim of this paper is to give seriation examples on chemical data and to propose the systematic use of seriation as a possible intuitive tool between data preprocessing and exploratory data analysis. Seriation itself performs descriptive and exploratory data evaluations at a qualitative level, with more limited efficiency than the specialized methods, but it may suggest the use of clustering, variable selection, principal component analysis, and modeling. Different databases were used to check the versatility of seriation methods: benchmark ones on iris and wine, radioactivity data of sand mines, coin metal compositions, bioactivity of anticancer compounds, features of essential oils, and a reaction kinetic mechanism of biofuel combustion.
Date: 06.02.2018
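One standard way to compute such a row/column ordering is spectral seriation: sort the objects by the Fiedler vector of the similarity graph's Laplacian, which recovers the hidden order (up to reversal) for well-behaved banded similarity matrices. The sketch below illustrates this on a shuffled band matrix; it is one common seriation heuristic, not necessarily the one used in the paper:

```python
import numpy as np

def spectral_seriation(S):
    """Order objects by the Fiedler vector (eigenvector of the 2nd-smallest
    eigenvalue) of the Laplacian of the similarity matrix S."""
    S = np.asarray(S, dtype=float)
    L = np.diag(S.sum(axis=1)) - S
    vals, vecs = np.linalg.eigh(L)
    return np.argsort(vecs[:, 1])

# A path-structured (banded) similarity matrix with rows/columns shuffled alike
n = 8
band = (np.abs(np.subtract.outer(np.arange(n), np.arange(n))) <= 1).astype(float)
perm = np.random.default_rng(6).permutation(n)
order = spectral_seriation(band[np.ix_(perm, perm)])
print(perm[order])   # the original 0..7 sequence, possibly reversed
```

Applying `order` to both rows and columns gathers the large entries back around the diagonal, which is exactly the "hidden pattern revealed by permutation" effect the abstract describes.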

A novel ranking distance measure combining Cayley and Spearman footrule metrics

Defining appropriate distance measures among rankings is a classic area of study. The goal of our work is to identify a combination of methodologies proven capable of determining a proper ranking system. In our study, we used 3 well-established metrics—Kendall tau, Spearman footrule, and Cayley distance—and a novel metric created by combining the Cayley and Spearman footrule metrics. The results of the newly introduced metric depend on how fast a permutation of items can be traded to the reference permutation according to the Spearman footrule. On the other hand, the distance also depends on the number of cycles and the inversions in the cycles. Two case studies—chemometric data of phytonutrients of tomato varieties and sensometric data of orange juices—were used to test the performance of the studied ranking distance metrics. The properties of the new metric were compared to the traditional metrics regarding the normality of their distributions, the number of significant differences between the rated objects, and the quality of the rankings. Results were validated by leave-one-out cross-validation, and significant differences by the Wilcoxon matched pairs test.
Date: 01.02.2018
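The two ingredients of the proposed combination are easy to state concretely. Below is a minimal sketch of the Spearman footrule (total positional displacement) and the Cayley distance (minimum number of transpositions, i.e. n minus the number of cycles of the relative permutation); the example rankings are illustrative only:

```python
def footrule(a, b):
    """Spearman footrule: sum of positional displacements between rankings."""
    pos_b = {item: i for i, item in enumerate(b)}
    return sum(abs(i - pos_b[item]) for i, item in enumerate(a))

def cayley(a, b):
    """Cayley distance: n minus the number of cycles of the permutation
    mapping ranking a onto ranking b."""
    pos_b = {item: i for i, item in enumerate(b)}
    perm = [pos_b[item] for item in a]     # a expressed in b's coordinates
    seen, cycles = set(), 0
    for start in range(len(perm)):
        if start in seen:
            continue
        cycles += 1
        j = start
        while j not in seen:
            seen.add(j)
            j = perm[j]
    return len(perm) - cycles

a, b = ["A", "B", "C", "D"], ["B", "A", "D", "C"]
print(footrule(a, b), cayley(a, b))   # 4 2
```

The novel metric of the paper combines these two views — displacement on the one hand, cycle structure on the other — into a single distance.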

Ensemble learning model based on selected diverse principal component analysis models for process monitoring

Principal component analysis (PCA) is extensively applied in industrial process monitoring. For optimal monitoring performance, different faults may require different principal components (PCs). At present, however, only the PCs of highest variance are selected to create a single PCA model, leading to information loss and poor monitoring performance. To solve this problem, a method based on ensemble learning and Bayesian inference is presented in this paper. First, numerous models are generated from randomly selected PCs. Next, the model with the lowest false alarm rate is retained to ensure good model performance. A novel pruning algorithm is then employed to obtain several models of great mutual difference (“great difference” meaning low similarity between the PCs selected when building the different PCA models). This enables the identification of models that can effectively predict various faults, thereby improving the monitoring performance of the ensemble model. Bayesian inference is adopted to determine the final monitoring indicator. Finally, a numerical example and the Tennessee Eastman benchmark process are used to evaluate the monitoring effectiveness and illustrate the excellent performance of ensemble learning and Bayesian inference.
Date: 01.02.2018
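The building block of such sub-models — a Hotelling T² statistic computed on an arbitrary subset of PCs rather than only the top-variance ones — can be sketched as follows. The training data, fault magnitude, and PC subset are assumptions of this toy example, not values from the paper:

```python
import numpy as np

def t2_statistic(X_train, X_new, pc_idx):
    """Hotelling T² of new samples on a chosen subset of principal components."""
    mu = X_train.mean(axis=0)
    U, s, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
    P = Vt[pc_idx].T                            # loadings of the chosen PCs
    lam = s[pc_idx] ** 2 / (len(X_train) - 1)   # variances along those PCs
    t = (X_new - mu) @ P                        # scores of the new samples
    return np.sum(t ** 2 / lam, axis=1)

rng = np.random.default_rng(8)
X_train = rng.normal(size=(500, 6))
X_train[:, 0] *= 5                  # dominant direction, captured by PC 1
X_ok = rng.normal(size=(5, 6)); X_ok[:, 0] *= 5
X_fault = X_ok.copy()
X_fault[:, 0] += 25                 # drift along the monitored direction
print(t2_statistic(X_train, X_ok, [0, 1]).mean(),
      t2_statistic(X_train, X_fault, [0, 1]).mean())
```

An ensemble in the spirit of the abstract would evaluate many such sub-models built on random `pc_idx` subsets, prune the similar ones, and fuse the surviving T² indicators (e.g. by Bayesian inference) into a single monitoring decision.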

Linear discriminant analysis, partial least squares discriminant analysis, and soft independent modeling of class analogy of experimental and simulated near-infrared spectra of a cultivation medium for mammalian cells

Currently, the qualification and control of medium formulations are performed based on simple methods (e.g., pH and osmolality measurements of medium solutions), expensive and time-consuming cell culture tests, and the quantification of certain critical compounds by liquid chromatography. In addition to traditional medium qualification tools, relatively new spectroscopic techniques, such as fluorescence spectroscopy, nuclear magnetic resonance, Raman and near-infrared spectroscopies, and combinations of these techniques, are increasingly being applied to medium powder investigation. A chemically defined medium powder for Chinese hamster ovary cell cultivation was investigated in this study to determine its response to heat treatments at different temperatures (30°C, 50°C, and 70°C). Because the low availability and high costs of medium powders limit the sample sizes for such experiments, 5 groups of simulated data sets were generated based on the experimental spectra to compare the efficiencies of 3 classification methods: linear discriminant analysis (LDA) based on principal component analysis (PCA), partial least squares discriminant analysis (PLS-DA), and soft independent modeling of class analogy (SIMCA). For these data sets, PCA-LDA showed better results for the classification of experimental spectra than PLS-DA and SIMCA. Moreover, the PLS-DA and SIMCA models yielded different results for different training set groups, while the PCA-LDA model yielded similar results for all training sets.
Date: 01.02.2018

Spectral clustering in eye-movement research

Eye tracking is a widely used technology to capture the eye movements of participants completing different tasks. Several eye-tracking parameters are measured, which can later be used to characterize the gazing pattern of individuals. Clustering based on the path walked by the participants' gaze may enable researchers to create clusters based on unconscious personality and thinking style. Common clustering methods are generally unable to handle path data; hence, new dynamic variables are needed. Spectral clustering can handle these types of data well. It treats clustering as a graph partitioning problem without making specific assumptions about the form of the clusters and uses eigenvectors of matrices derived from the data. This way, the data are mapped to a low-dimensional space, which can be easily clustered. Different food choice tasks were presented, and each of the 149 participants had to choose 1 product of the 4 presented and later from 8 alternatives. A new measure was introduced based on all 3 consecutive points from the fixations, and the areas of the triangles formed by these 3 points were computed. The new eye-movement index captures the temporal variation and also considers the orientation of the fixation points. Spectral clustering resulted in 5 balanced clusters, as defined by the Dunn, Silhouette, and C-indices. Results were compared to the most widely applied hierarchical and centroid-based (k-means) clustering methods. Spectral clustering achieved the best results in the clustering indices, and its cluster sizes proved to be more balanced; hence, it outperforms the commonly applied hierarchical and k-means methods.
Date: 01.02.2018
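The triangle-area index described above has a direct formulation via the shoelace formula. The sketch below uses absolute areas; keeping the signed value would additionally encode the orientation of each triple, as the abstract mentions. The example path is illustrative only:

```python
import numpy as np

def triangle_areas(fixations):
    """Area of the triangle formed by every 3 consecutive fixation points
    (shoelace formula on the sliding triple)."""
    f = np.asarray(fixations, dtype=float)
    p, q, r = f[:-2], f[1:-1], f[2:]
    cross = (q[:, 0] - p[:, 0]) * (r[:, 1] - p[:, 1]) \
          - (q[:, 1] - p[:, 1]) * (r[:, 0] - p[:, 0])
    return 0.5 * np.abs(cross)

path = [(0, 0), (2, 0), (2, 2), (0, 2)]
print(triangle_areas(path))   # [2. 2.]
```

The resulting sequence of areas is a dynamic variable of the kind the abstract calls for — it follows the gaze path itself rather than summary fixation statistics, so it can feed a similarity matrix for spectral clustering.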

Application of SAR methods toward inhibition of bacterial peptidoglycan metabolizing enzymes

Structure-activity relationship (SAR) methods are applied to a study of the inhibition of peptidoglycan metabolizing enzymes, which could represent new antibacterial targets. In this study, we exploit experimental data on the inhibition of the Mur A and Mur B enzymes for the classification of a large set of chemicals. Based on the inhibitory potency of compounds and their structures from the literature, we developed classification models for new, potential inhibitors of the Mur A and Mur B enzymes. The best model for Mur A has the following performance measures for the validation set: 0.85, 0.75, and 0.80 for sensitivity, specificity, and normalized Matthews correlation coefficient, respectively. The same measures for the best Mur B model are 0.94, 0.75, and 0.86. Such models could represent valuable computational tools for theoretical predictions of compounds' activities against specific targets. Additionally, the application of such models, like any other computational tools, significantly reduces time and costs in the early phase of drug design.
Date: 31.01.2018

Is it possible to improve the quality of predictions from an “intelligent” use of multiple QSAR/QSPR/QSTR models?

Quantitative structure-activity/property/toxicity relationship (QSAR/QSPR/QSTR) models are effectively employed to fill data gaps by predicting a given response from known structural features or physicochemical properties of new query compounds. The performance of a model should be assessed based on the quality of predictions checked through diverse validation metrics, which confirm the reliability of the developed QSAR models along with the acceptability of their prediction quality for untested compounds. There is an ongoing effort by QSAR modelers to improve the quality of predictions by lowering the predicted residuals for query compounds. In this endeavor, consensus models integrating all validated individual models were found in many previous studies to be more externally predictive than individual models. The objective of this work has been to explore whether the quality of predictions for external compounds can be enhanced through an "intelligent" selection of multiple models. The consensus predictions used in this study are not a simple average of predictions from multiple models. We consider that a particular QSAR model may not be equally effective for the prediction of all query compounds in the list. Our approach differs from previous ones in that none of the previously reported methods selected predictive models in a query compound–specific way while at the same time using all or most of the valid models for the total set of query chemicals. We have implemented our approach in a software tool that is freely available on the web at http://teqip.jdvu.ac.in/QSAR_Tools/ and http://dtclab.webs.com/software-tools.
Date: 30.01.2018

Significance of variables for discrimination: Applied to the search of organic ions in mass spectra measured on cometary particles

The instrument Cometary Secondary Ion Mass Analyzer (COSIMA), on board the European Space Agency's Rosetta mission to comet 67P/Churyumov-Gerasimenko, is a secondary ion mass spectrometer with a time-of-flight mass analyzer. Near the comet, it collected several thousand particles, imaged them, and analyzed the elemental and chemical compositions of their surfaces. In this study, variables have been generated from the spectral data covering the mass ranges of potential C-, H-, N-, and O-containing ions. The variable importance in binary discriminations between spectra measured on cometary particles and those measured on the target background has been estimated by the univariate t test and by the multivariate methods discriminant partial least squares, random forest, and a robust method based on the log ratios of all variable pairs. The results confirm the presence of organic substances in cometary matter—probably a complex macromolecular mixture.
Date: 25.01.2018

A frequency-localized recursive partial least squares ensemble for soft sensing

We report the use of a frequency-localized adaptive soft sensor ensemble using the wavelet coefficients of the responses from the physical sensors. The proposed method is based on building recursive, partial least squares soft sensor models on each of the wavelet coefficient matrices representing different frequency content of the signals from the physical sensors, combining the predictions from these models via static weights determined from an inverse-variance weighting approach, and recursively adapting each of the soft sensor models in the ensemble when new data are received. Wavelet-induced boundary effects are handled by using the undecimated wavelet transform with the Haar wavelet, an approach that is not subject to wavelet boundary effects that would otherwise arise on the most recent sensor data. An additional advantage of the undecimated wavelet transform is that the wavelet function is defined for a signal of arbitrary length, thus avoiding the need to either trim or pad the training signals to dyadic length, which is required with the basic discrete wavelet transform. The new method is tested against a standard recursive partial least squares soft sensor on 3 soft-sensing applications from 2 real industrial processes. For the datasets we examined, we show that results from the new method appear to be statistically superior to those from a soft sensor based only on a recursive partial least squares model with additional advantages arising from the ability to examine performance of each localized soft sensor in the ensemble.
Date: 25.01.2018
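The static-weight fusion step of such an ensemble — inverse-variance weighting of the member predictions — can be written down compactly. This is a sketch of that one ingredient only (the wavelet decomposition and the recursive PLS updates are omitted), with illustrative residuals:

```python
import numpy as np

def inverse_variance_weights(errors):
    """Static ensemble weights from each member's validation-error variance.

    errors : (n_models, n_samples) residuals of each soft sensor model.
    Weights are proportional to 1/variance and sum to one.
    """
    w = 1.0 / np.var(errors, axis=1)
    return w / w.sum()

def ensemble_predict(predictions, weights):
    """Combine member predictions with the fixed weights."""
    return np.asarray(weights) @ np.asarray(predictions)

residuals = np.array([[0.1, -0.1, 0.2, -0.2],    # accurate member
                      [1.0, -1.0, 2.0, -2.0]])   # noisy member
w = inverse_variance_weights(residuals)
yhat = ensemble_predict(np.array([[1.0], [3.0]]), w)
print(w.round(3), yhat)   # nearly all weight goes to the accurate member
```

The design rationale is the same as in the paper: members that track their frequency band well (small residual variance) dominate the combined prediction, while noisy members are almost silenced.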

Nonparametric algorithm for identification of outliers in environmental data

Outliers that can significantly affect data analysis are frequently present in environmental data sets. Most methods suggested for the detection of outliers impose restrictions on the distribution of the analysed variables. However, in many environmental areas, the observed variable is influenced by many different factors, and its distribution is often difficult to find or cannot be estimated. Therefore, an approach for the identification of outliers in environmental time series based on nonparametric statistical techniques is presented. The core principle of the algorithm is to smooth the data using nonparametric regression with variable bandwidth and subsequently analyse the residuals by nonparametric statistical methods. For the case that the distribution of the analysed variable is normal, an efficient statistical method based on normality assumptions is presented as well. The proposed procedure is applied to the identification of outliers in hourly concentrations of particulate matter and verified by simulations.
Date: 25.01.2018

Chemoinformatic design of amphiphilic molecules for methane hydrate inhibition

Cationic surfactants and other low molecular weight compounds are known to inhibit the nucleation and agglomeration of methane hydrates. In particular, tetraalkylammonium salts are kinetic hydrate inhibitors; i.e., they reduce the rate of hydrate formation. This work concerns the in silico determination of structural features of molecules modulating methane hydrate formation, as found experimentally, and the prediction of novel structures to be tested as candidate inhibitors. The experimental datum for each molecule is the amount of absorbed methane. By inserting these numerical values into a chemoinformatic model, it was possible to find a mutual correlation between structural features and inhibition properties. A maximum amount of information is extracted from the structural features and experimental variables, and a model is generated to explain the relationship between them. Chemometric analysis was performed using the software package Volsurf+ with the aim of finding a primary correlation between surfactant structures and their properties. Experimental parameters (pressure, temperature, and concentration) were further processed through an optimization procedure. A careful study of the chemometric analysis responses and the numerical descriptors of the tested surfactants made it possible to define the features of a good inhibitor, as far as the amount of absorbed gas is concerned. An external prediction is finally made by projecting external compounds, whose structures and critical micellar concentrations are known, into the statistical model to predict the inhibition properties of a particular molecule in advance of synthesis and testing. This method allowed us to find novel amphiphilic molecules for testing as candidate inhibitors in flow assurance.
Date: 25.01.2018

Determination of optimum number of components in partial least squares regression from distributions of the root-mean-squared error obtained by Monte Carlo resampling

Monte Carlo resampling is utilized to determine the number of components in partial least squares (PLS) regression. The data are randomly and repeatedly divided into calibration and validation samples. For each repetition, the root-mean-squared error (RMSE) is determined for the validation samples for a = 1, 2, …, A PLS components to provide a distribution of RMSE values for each number of PLS components. These distributions are used to determine the median RMSE for each number of PLS components. The component (A_min) having the lowest median RMSE is located. The fraction p of the RMSE values of A_min exceeding the median RMSE of the preceding component is determined. This fraction p represents a probability measure that can be used to decide whether the RMSE for A_min PLS components is significantly lower than the RMSE for the preceding component, given a preselected threshold (p_upper). If so, it defines the optimum number of PLS components. If not, the process is repeated for the previous components until significance is achieved. Setting p_upper = 0.5 implies that the median is used for selecting the optimum number of components. The RMSE is approximately normally distributed for the smallest components. This can be utilized to relate p to a fraction of a standard deviation. For instance, p = 0.308 corresponds to half a standard deviation if the RMSE is normally distributed. The approach is demonstrated for the calibration of metabolomics measurements and spectroscopic mixture data.
Datum: 25.01.2018

Variable selection and chemometric models for discriminating symptomatic gout based on a metabolic target analysis

In clinical practice, uric acid is frequently used as a diagnostic criterion for gout. However, gout is commonly confused with other diseases, including rheumatoid arthritis, soft tissue joint injury, and hyperuricosuric calcium oxalate urolithiasis. Two new strategies—graphical index of separation and subwindow permutation analysis—were applied to understand the metabolic changes induced by gout. Metabolic target analysis was performed using high-performance liquid chromatography with a diode array detector. Compared with the nongout samples, the concentrations of uric acid, uracil, inosine, adenosine, and tryptophan differed in gout samples, and these metabolites could be used as important diagnostic markers. Moreover, the uric acid, uracil, phenylalanine, tryptophan, and adenine concentrations differed between acute and chronic gout. We confirmed a metabolic disorder of uracil during the development of gout. For the gout and nongout groups, the recognition rate of the model reached 0.98, whereas it was only 0.79 when uric acid was used as a single variable. For the acute and chronic gout classes, the recognition rate of the model was 0.90 and that of uric acid was only 0.62. Variable selection combined with chemometric models can thus be used as a supplementary method for the diagnosis and prognosis of gout in clinical practice.
Datum: 25.01.2018

Mapping of Activity through Dichotomic Scores (MADS): A new chemoinformatic approach to detect activity-rich structural regions

A new chemoinformatic approach, called Mapping of Activity through Dichotomic Scores, is introduced. Its goal is the supervised projection of molecules, represented with strings of binary digits expressing the presence or absence of selected structural features, onto a novel 2-dimensional space, which highlights regions of active (inactive) molecules of interest. At the same time, variables are projected onto a second 2-dimensional space, which highlights those structural features that are more related to the molecular activity of interest. Unlike the classical weighting schemes used in substructural analysis, which consider the substructures independently of each other, the Mapping of Activity through Dichotomic Scores approach considers the interactions between pairs of substructures, that is, their frequencies of cooccurrence in the molecules. In this work, the theory is presented and elucidated, with an example dataset and in comparison with a benchmark fragment-based scoring scheme.
Datum: 19.01.2018

Fault detection and diagnosis strategy based on a weighted and combined index in the residual subspace associated with PCA

Process monitoring and diagnosis are crucial for efficient and optimal operation of a chemical plant. Most multivariate statistical process monitoring strategies, such as principal component analysis, kernel principal component analysis, and dynamic principal component analysis, use the squared prediction error statistic to monitor the state of samples in a residual subspace (RS). Squared prediction error is defined as the squared 2-norm of the residual vector, i.e., the sum of squares of the residual components. When the distributions of variables in an RS differ considerably from one another, the detection ability of squared prediction error declines noticeably. To accurately monitor the faults occurring in the RS, a new fault detection index based on a weighted combination of Hotelling's T2 and the squared Euclidean distance is developed in this paper. Principal component analysis is first introduced to divide the original input space into a principal component subspace and an RS. Next, the weighted and combined index is used to monitor the variability of samples in the RS. In addition, a corresponding fault diagnosis strategy based on the contribution plot is also developed. The proposed method is tested on a numerical example and the Tennessee Eastman process. Simulation results show that the new index is effective in both fault detection and diagnosis.
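The SPE statistic defined above is straightforward to compute from a PCA split of the input space. The numpy sketch below illustrates only that baseline definition on invented data (a 2-component subspace and a perturbation injected into the residual subspace); the paper's weighted combined index itself is not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(1)
# training data whose variance lives mostly in a 2-dimensional latent subspace
scores = rng.standard_normal((200, 2)) * np.array([3.0, 2.0])
loadings, _ = np.linalg.qr(rng.standard_normal((5, 2)))   # orthonormal (5, 2)
X = scores @ loadings.T + 0.01 * rng.standard_normal((200, 5))

mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
P = Vt[:2].T                                  # loadings of the PC subspace

def spe(x):
    """SPE: squared 2-norm of the residual after removing the PC-subspace part."""
    xc = x - mean
    r = xc - P @ (P.T @ xc)                   # component lying in the residual subspace
    return float(r @ r)

normal_spe = float(np.median([spe(x) for x in X]))
rs_dir = np.eye(5)[0] - P @ (P.T @ np.eye(5)[0])   # a direction inside the RS
fault = X[0] + rs_dir                         # sample perturbed within the RS
```

A fault that lies in the residual subspace inflates SPE well above its in-control level, which is exactly the variability the RS indices are meant to catch.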
Datum: 12.01.2018

Modeling of Hansen's solubility parameters of aripiprazole, ziprasidone, and their impurities: A nonparametric comparison of models for prediction of drug absorption sites

Aripiprazole and ziprasidone are atypical antipsychotic drugs effective against positive and negative symptoms of schizophrenia, mania, and mixed states of bipolar disorder. Hansen's solubility parameters, δd, δp, and δh, which account for the dispersive, polar, and hydrogen-bonding contributions to the overall cohesive energy of a compound, are often used to assess the pharmacokinetic properties of drugs. However, no solubility parameter data exist for the drugs of interest in this study. Therefore, in the present study, partial least squares (PLS) regression, artificial neural networks (ANNs), regression trees (RT), boosted trees (BT), and random forests (RF) were applied to estimate Hansen's solubility parameters of ziprasidone, aripiprazole, and their impurities/metabolic derivatives, targeting their biopharmaceutical classes and absorption routes. A training set of 47 structurally diverse and pharmacologically active compounds and 290 molecular descriptors and pharmaceutically important properties were used to build the prediction models. The modeling approaches were compared by the sum of ranking differences, using the consensus values as a reference for the unknowns and the experimentally determined values as a gold standard for the calibration set. In both instances, the PLS models, together with ANNs, demonstrated better performance than RT, BT, and especially RF. Based on the best-scored models, we were able to pinpoint the most probable absorption sites for each drug and the corresponding metabolite, i.e., the upper parts of the gastrointestinal tract, the small intestine, or absorption along the entire length of the gastrointestinal tract.
Datum: 12.01.2018

Chemometrics in laser-induced breakdown spectroscopy

Laser-induced breakdown spectroscopy (LIBS) is an emerging elemental analysis technique with the advantages of real-time, online, and noncontact operation, as well as the ability to analyze multiple elements simultaneously. It has become a frontier analytical technique in spectral analysis. However, how to improve the accuracy of qualitative and quantitative analyses by extracting useful information from large amounts of complex LIBS data remains the main challenge for the LIBS technique. Chemometrics is an interdisciplinary subdiscipline of chemistry; it offers advantages in data processing, signal analysis, and pattern recognition, and it can solve complicated problems that are difficult for traditional chemical methods. In this paper, we review the progress of chemometrics methods in LIBS for spectral data preprocessing as well as for qualitative and quantitative analyses over the most recent 5 years (2012-2016).
Datum: 12.01.2018

Confidence ellipsoids for ASCA models based on multivariate regression theory

In analysis of variance simultaneous component analysis, permutation testing is the standard way of assessing uncertainty of effect level estimates. This article introduces an analytical solution to the assessment of uncertainty through classical multivariate regression theory. We visualize the uncertainty as ellipsoids, contrasting these to data ellipsoids. This is further extended to multiple testing of effect level differences. Confirmatory and intuitive results are observed when applying the theory to previously published data and simulations.
Datum: 12.01.2018

Chemometric methods applied in the image plane to correct striping noise in hyperspectral chemical images of biomaterials

Array detectors improve data collection speed in hyperspectral chemical imaging, yet they are prone to striping noise patterns in the image plane, which are difficult to remove. This type of noise affects spectral features and disturbs the visual impression of the scene. We found that this noise depends on the material composition and setting parameters, i.e., pixel size, and that it also varies with the signal intensity at the observed wavelength. To address this, we propose a new correction method based on the application of chemometric techniques in the image plane of each wavelength. To verify the effectiveness of this method, infrared transmission images of the 2″ × 2″ positive 1951 USAF Hi-Resolution Target and of biomaterial samples were obtained with a 16-element (8 × 2) pixel array detector. Point detector images of some samples were also acquired and used as reference images. The proposed correction method produced substantial improvements in the visual impression of intensity images. Principal component analysis was performed to inspect spectral changes after preprocessing, and the results suggested that the major spectral features were not altered while the stripes on intensity images were removed. Inspection of spectral profiles and principal component analysis loadings confirmed the smoothing ability of this correction method. Traditional preprocessing techniques, such as standard normal variate and derivative transformation, were not able to remove the line artifacts, especially in the biomaterial images. Overall, the proposed method was effective for removing striping noise patterns from infrared images with minimal alteration of the valuable hyperspectral image information.
Datum: 28.12.2017

Molecular modeling and chemometric analysis of osteoporosis calmodulin-TRPV1 binding affinity by statistically characterizing complex interactions

The intermolecular interaction between the calcium sensor calmodulin (CaM) and the C-terminal domain of its cognate partner, transient receptor potential vanilloid 1 (TRPV1), plays a potential role in bone absorption and loss, and it has been recognized as a new druggable target for osteoporosis therapy. Here, a synthetic strategy that integrates molecular modeling and chemometric analysis is used to statistically model and quantitatively predict the binding behavior of TRPV1 to CaM. The atomic-level complex structures of the CaM protein with congeneric sequences of the TRPV1 C-terminus are modeled by virtual mutagenesis and structural refinement. Inter-residue nonbonded interactions in the computationally modeled complex structures are analyzed, characterized, and correlated with the experimentally measured affinity of CaM binding to a series of TRPV1 C-terminal derivatives via both linear and nonlinear regression approaches. The resulting statistical predictors are then used to systematically investigate the independent residue-pair interactions across the CaM-TRPV1 complex interface. Consequently, a few TRPV1 C-terminal residues are identified as potential hot spots that are primarily responsible for the CaM-TRPV1 binding. Visual examination of the CaM-TRPV1 complex architecture reveals that these hot spot residues are evenly distributed along the core helical region of the TRPV1 C-terminus and are involved in short-range nonbonded interactions such as hydrogen bonds and salt bridges, which confer specificity to the complex recognition and association.
Datum: 28.12.2017

Chromatographic and in silico assessment of logP measures for new spirohydantoin derivatives with anticancer activity

Lipophilicity has long been recognized as a meaningful parameter in structure-activity relationships. It is also the single most informative physicochemical property, revealing a wealth of information on intermolecular forces, intramolecular interactions, and molecular structure in the broadest sense. In this paper, a total of 14 chromatographic measures of lipophilicity (thin-layer chromatography and high-performance liquid chromatography) and 11 computationally estimated logP values for 21 newly synthesized 3-(4-substituted benzyl)-cycloalkylspiro-5-hydantoin derivatives have been investigated. Similarities among the investigated compounds, as well as among the lipophilicity measures, were examined by multivariate exploratory analysis: principal component analysis, hierarchical cluster analysis, and sum of ranking differences. These chemometric approaches reveal the arrangement of the investigated compounds into clusters according to lipophilicity. Chemometric consideration of lipophilicity shows principal component scores to be entirely unsuitable lipophilicity measures. Furthermore, the logP values estimated from a calibration graph using a set of standard reference compounds were equivalent to the corresponding chromatographic descriptors of the hydantoins extrapolated from the linear relationship between retention parameters and mobile phase composition. Comparison of the 2 chromatographic techniques places the high-performance liquid chromatography lipophilicity indices slightly ahead of those from thin-layer chromatography.
Datum: 28.12.2017

Contribution of image processing for analyzing the cellular structure of cork

The alveolar structure of cork confers on this natural material specific physical properties such as low permeability to liquids and gases, effective thermal and acoustic insulation, and high elasticity. In this paper, a morphological analysis of natural cork cells is presented, including statistical distributions of structural quantities. These results were obtained from the study of scanning electron microscopy images of natural cork stoppers. After automation of the image processing analysis, the cell area and perimeter distributions were measured in different cork sections: axial, tangential, and radial. Perpendicular to the radial direction, we also focused on growth rings, which are characterized by smaller cell sizes. This systematic image analysis offers new possibilities for investigating the material at the cell scale and provides useful statistical information about the cork structure.
Datum: 27.12.2017

Hyperspectral image analysis. When space meets Chemistry

Hyperspectral images provide both spatial and structural information on samples. This article aims to provide an overview of research trends in hyperspectral image analysis, mainly focusing on aspects specific to this kind of analytical measurement, such as the global/local and spectral/spatial dualities, the hurdles associated with image fusion strategies, and the design of tools devoted to enhancing the spatial definition provided by the instrumental measurement. The complexity of the measurement and the wealth of possibilities for using image properties and information ensure a very lively field of research for the coming years.
Datum: 27.12.2017

Application of near-infrared spectroscopy combined with chemometrics for online monitoring of Moluodan extraction

In this study, the application of near-infrared (NIR) spectroscopy for online monitoring of the Moluodan extraction process was investigated. Paeoniflorin, the main active component of Moluodan, was chosen as the quality index. Samples were partitioned into calibration and validation sets by random selection and by the sample set partitioning based on joint X-Y distances (SPXY) algorithm, respectively. Wavelengths for modeling were selected by a manual method and by the competitive adaptive reweighted sampling (CARS) algorithm. Particle swarm optimization-based least squares support vector machine (PSO-LS-SVM) and partial least squares models were both established for quantitative analysis to determine the content of paeoniflorin online. In total, 8 models were obtained from the combinations of these algorithms, and the SPXY-CARS-PSO-LS-SVM model had the best quantitative analysis performance of the 8. Specifically, for the SPXY-CARS-PSO-LS-SVM model, the determination coefficients of the calibration (Rc2) and validation (Rv2) sets were 0.99 and 0.95, the root mean square errors of the calibration and validation sets were 0.012 and 0.024 mg/mL, and the relative standard errors of the calibration and validation sets were 2.84% and 6.34%, respectively. These results suggest that the appropriate sample partition, wavelength selection, and regression analysis methods in this study, namely, the SPXY-CARS-PSO-LS-SVM algorithm combined with NIR spectroscopy, could provide an effective, real-time approach for online monitoring of the Moluodan extraction process.
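SPXY partitioning as used above follows a Kennard-Stone-type selection in which distances computed separately in X and y are each normalised by their maxima and summed. The sketch below is a minimal illustration of that idea; the function name and the toy data are hypothetical, not taken from the study:

```python
import numpy as np

def spxy_split(X, y, n_cal):
    """Kennard-Stone-style calibration-set selection on joint X-Y distances (SPXY)."""
    dx = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    dy = np.abs(y[:, None] - y[None, :])
    d = dx / dx.max() + dy / dy.max()          # joint, max-normalised distance

    i, j = np.unravel_index(np.argmax(d), d.shape)
    selected = [int(i), int(j)]                # start with the two most distant samples
    remaining = [k for k in range(len(X)) if k not in selected]
    while len(selected) < n_cal:
        # for each candidate, its distance to the closest already-selected sample
        min_d = d[np.ix_(remaining, selected)].min(axis=1)
        selected.append(remaining.pop(int(np.argmax(min_d))))
    return selected, remaining                 # calibration and validation indices

X = np.array([[0.0], [1.0], [2.0], [10.0]])
y = np.array([0.0, 1.0, 2.0, 10.0])
cal, val = spxy_split(X, y, 3)
```

On this toy set, the two extreme samples are selected first, then the sample farthest (in the joint metric) from both, leaving the most redundant sample for validation.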
Datum: 01.12.2017

An improved plant-wide fault detection scheme based on PCA and adaptive threshold for reliable process monitoring: Application on the new revised model of Tennessee Eastman process

An improved process monitoring scheme based on the integration of multivariate and univariate statistical analysis methods is presented in this paper. Instead of conventional fixed control limits, adaptive thresholds are developed for the common fault detection indices used with principal component analysis, including the Hotelling T2 statistic and the sum of squared prediction errors known as the Q statistic. The thresholds are updated based on a modified exponentially weighted moving average chart with a limited window length. The primary goal of this strategy is to enhance the performance of principal component analysis-based process monitoring and overcome its shortcomings by increasing the fault detection rate, thereby improving monitoring sensitivity, and by eliminating false alarms to ensure higher robustness and reliability. Fault detection on the revised model of the Tennessee Eastman process benchmark is also investigated. The developed monitoring scheme is tested and compared with the conventional fixed threshold technique, and its performance is evaluated across various types of process faults. The obtained results demonstrate the promising capabilities of the developed scheme.
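The idea of an EWMA-updated, windowed control limit can be sketched as follows. This is a generic illustration under assumed settings (smoothing constant, window length, width factor, and a chi-square-like in-control statistic are all invented), not the paper's modified chart:

```python
import numpy as np

def ewma_threshold(stat_history, lam=0.2, width=3.0, window=50):
    """Adaptive control limit from an EWMA over a limited window of a
    monitoring statistic (e.g. T2 or Q), instead of a fixed limit."""
    w = np.asarray(stat_history[-window:], dtype=float)
    ewma = w[0]
    for v in w[1:]:
        ewma = lam * v + (1 - lam) * ewma        # exponentially weighted update
    # asymptotic EWMA standard deviation from the window's sample std
    sigma = w.std(ddof=1) * np.sqrt(lam / (2 - lam))
    return ewma + width * sigma

rng = np.random.default_rng(2)
normal = list(rng.chisquare(2, size=200))        # in-control Q-like statistic
thr = ewma_threshold(normal)                     # limit tracks recent behaviour
fault_value = 30.0                               # a clearly abnormal statistic
```

Because the limit is recomputed from recent data, it follows slow in-control drift while a genuinely faulty statistic still lands far above it.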
Datum: 01.12.2017

Group-wise partial least square regression

This paper introduces group-wise partial least squares (GPLS) regression. GPLS is a new sparse PLS technique in which the sparsity structure is defined in terms of groups of correlated variables, similarly to what is done in the related group-wise principal component analysis. These groups are found in correlation maps derived from the data to be analyzed. GPLS is especially useful for exploratory data analysis, because suitable values for its metaparameters can be inferred upon visualization of the correlation maps. Following this approach, we show that GPLS solves an inherent problem of sparse PLS: its tendency to confound the data structure when its metaparameters are set using standard approaches for optimizing prediction, such as cross-validation. Results are shown for both simulated and experimental data.
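GPLS itself is not reproduced here, but the preliminary step of deriving variable groups from a correlation map can be illustrated with a simple (invented) greedy rule: each still-unassigned variable joins the group of the first seed with which its absolute correlation exceeds a threshold. This is only a sketch of the concept, not the grouping algorithm of the paper:

```python
import numpy as np

def correlation_groups(X, threshold=0.7):
    """Greedy grouping of variables from an absolute-correlation map."""
    R = np.abs(np.corrcoef(X, rowvar=False))
    unassigned = list(range(X.shape[1]))
    groups = []
    while unassigned:
        seed = unassigned[0]
        # collect variables strongly correlated with the seed (seed included, R=1)
        group = [v for v in unassigned if R[seed, v] >= threshold]
        groups.append(group)
        unassigned = [v for v in unassigned if v not in group]
    return groups

rng = np.random.default_rng(3)
t1, t2 = rng.standard_normal((2, 300))
X = np.column_stack([t1, t1, t2, t2]) + 0.1 * rng.standard_normal((300, 4))
print(correlation_groups(X))   # two blocks of correlated variables
```

The two latent factors t1 and t2 generate two visible blocks in the correlation map, which the rule recovers as two variable groups.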
Datum: 01.12.2017

NIR hyperspectral imaging spectroscopy and chemometrics for the discrimination of roots and crop residues extracted from soil samples

Roots play a major role in plant development. Their study under field conditions is important for identifying suitable soil management practices for sustainable crop production. Soil coring, a common method of root production measurement, is limited in sampling frequency by the hand-sorting step. This step, needed to separate roots from the other elements extracted from soil cores, such as crop residues, is time consuming, tedious, and vulnerable to operator ability and subjectivity. To eliminate the cumbersome hand-sorting step, avoid confusion between these elements, and reduce the time needed to quantify roots, a new procedure based on near-infrared hyperspectral imaging spectroscopy and chemometrics has been proposed. It was tested for discriminating roots of winter wheat (Triticum aestivum L.) from crop residues and soil particles. Two algorithms (support vector machine and partial least squares discriminant analysis) were compared for discrimination analysis. Models constructed with both algorithms allowed the discrimination of roots from the other elements, but the best results were reached with models based on the support vector machine. The two model validation approaches, using selected spectra or whole hyperspectral images, provided different but complementary kinds of information. This new root discrimination procedure is a first step toward root quantification in soil samples with near-infrared hyperspectral imaging. The results indicate that the methodology could be an interesting tool for improving the understanding of the effect of tillage or fertilization, for example, on root system development.
Datum: 29.11.2017

Current challenges in second-order calibration of hyphenated chromatographic data for analysis of highly complex samples

Coupling multiway and multiset modeling methods with hyphenated chromatographic data for second-order calibration allows the quantification of multiple target analytes in highly complex samples, which would otherwise be impossible, or at least very hard, in a univariate calibration scenario. In this regard, some chromatographic challenges complicate attaining the best quantification efficiency despite the highlighted advantages, such as increased sensitivity and selectivity and the well-known second-order advantage. The present paper overviews these issues and addresses the most useful strategies for handling them, with a special focus on relevant recent studies in the quantitation field.
Datum: 29.11.2017

Combination of heuristic optimal partner bands for variable selection in near-infrared spectral analysis

Variable selection plays a critical role in the analysis of near-infrared (NIR) spectra. A method for variable selection based on the principle of the successive projection algorithm (SPA) and optimal partner wavelength combination (OPWC) was proposed for NIR spectral analysis. The method determines a number of knot variables with sufficient independence by SPA, and candidate variable bands with a definite width are defined. The cooperative effect of the bands is then evaluated with the partial least squares regression model by using the method of OPWC. The performance of the proposed method was compared with those of SPA, OPWC, randomization test, competitive adaptive reweighted sampling, and Monte Carlo uninformative variable elimination by using NIR datasets for pharmaceutical tablets, corn, and soil. The results show that the proposed method can select informative variable bands with a cooperative effect and improves the model for quantitative analysis.
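The SPA step referred to above selects variables with minimal collinearity by successive orthogonal projections. A minimal numpy sketch of that classic deflation scheme is given below; the function name, the starting column, and the toy matrix are illustrative assumptions, and the band-evaluation stage of the proposed method is not reproduced:

```python
import numpy as np

def spa(X, k, start=0):
    """Successive projections: pick k column indices with minimal collinearity
    by repeatedly choosing the column whose projection onto the orthogonal
    complement of the already-selected columns is largest."""
    Xp = X.astype(float).copy()
    selected = [start]
    for _ in range(k - 1):
        v = Xp[:, selected[-1]]
        v = v / np.linalg.norm(v)
        Xp = Xp - np.outer(v, v @ Xp)       # deflate the last selected direction
        norms = np.linalg.norm(Xp, axis=0)
        norms[selected] = -1.0              # exclude already-selected variables
        selected.append(int(np.argmax(norms)))
    return selected

# column 1 duplicates column 0; column 2 is orthogonal to both
X = np.array([[1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 2.0]])
print(spa(X, 2))
```

Starting from column 0, the duplicate column is deflated to zero and the orthogonal column is selected next, which is exactly the "sufficient independence" behaviour the abstract describes for the knot variables.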
Datum: 29.11.2017

Integrating spatial, morphological, and textural information for improved cell type differentiation using Raman microscopy

Raman microscopy is a well-established tool for distinguishing different cell types in cell biological or cytopathological applications, since it can provide maps that show the specific distribution of biochemical components in the cell with high lateral and spatial resolution. Currently established data analysis approaches for differentiating cell types mostly rely on conventional chemometrics approaches, which tend not to systematically utilise the advantages provided by Raman microscopic data sets. To address this, we propose 2 approaches that explicitly exploit the large number of spectra as well as the morphological and textural information available in Raman microscopic data sets. Spatial bagging, our first approach, is based on a statistical analysis of the majority vote over classification results obtained from individual pixel spectra. Based on Condorcet's jury theorem, this approach raises the accuracy of a relatively weak classifier for individual spectra to nearly perfect accuracy at the level of characterising whole cells. Our second approach extracts morphological and textural (morpho-textural) features from Raman microscopic images to differentiate cell types. While using only a few wavenumbers of the Raman spectrum, our results indicate on a quantitative basis that Raman microscopic images carry more morphological and textural information than haematoxylin and eosin (H&E) stained images, the current gold standard in cytopathology. Our 2 approaches promise improved protocols for the fast acquisition of Raman imaging data, for instance, for the morphological analysis of coherent anti-Stokes Raman spectroscopy microscopic imaging data or for improving the accuracy of fibre optical probe systems by resampling spectra and utilising spatial bagging.
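The Condorcet effect behind spatial bagging is easy to demonstrate by simulation: with many independent per-pixel votes, even a modest per-spectrum accuracy yields a near-perfect cell-level decision. The numbers below (per-pixel accuracy, pixel and cell counts) are invented for illustration and do not come from the paper:

```python
import numpy as np

rng = np.random.default_rng(4)
pixel_acc = 0.65                     # relatively weak per-spectrum classifier

def classify_cell(n_pixels):
    """Majority vote over the pixel-spectrum classifications of one cell."""
    correct_votes = rng.random(n_pixels) < pixel_acc
    return correct_votes.mean() > 0.5        # True = cell labelled correctly

cells_correct = [classify_cell(401) for _ in range(200)]
cell_acc = float(np.mean(cells_correct))
print(cell_acc)
```

With 401 independent pixel votes at 65% accuracy, the probability of a wrong majority is vanishingly small, so cell-level accuracy is essentially perfect despite the weak pixel classifier.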
Datum: 28.11.2017

Coupling 2D-wavelet decomposition and multivariate image analysis (2D WT-MIA)

The use of the 2D discrete wavelet transform in the feature enhancement phase of multivariate image analysis is discussed and implemented, and compared with previous publications. In the proposed approach, all the subimages obtained by discrete wavelet transform decomposition are unfolded pixel-wise and mid-level data fused into a feature matrix that is used for the feature analysis phase. Congruent subimages can be obtained either by reconstructing each decomposition block to the original pixel dimensions or by using the stationary wavelet transform decomposition scheme. The main advantage is that all possible relationships among blocks, decomposition levels, and channels are assessed in a single multivariate analysis step (feature analysis). This is particularly useful in a monitoring context where the aim is to build multivariate control charts based on images. Moreover, the approach is versatile for contexts where several images are analyzed at a time, as well as for multispectral image analysis. Both a set of simple artificial images and a set of real images, representative of the online quality monitoring context, are used to highlight the details of the methodology and to show how the wavelet transform allows extracting features that indicate how strong the texture of the image is and in which direction it varies.
Datum: 27.11.2017

Comparison of latent variable-based and artificial intelligence methods for impurity detection in PET recycling from NIR hyperspectral images

In polyethylene terephthalate (PET) recycling processes, separation from polyvinyl chloride (PVC) is of primary relevance because of PVC's toxicity, which degrades the final quality of recycled PET. Moreover, the potential presence of some polymers in mixed plastics (such as PVC in PET) is a key aspect for the use of recycled plastic in products such as medical equipment, toys, or food packaging. Many works have dealt with plastic classification by hyperspectral imaging, although only some of them have focused directly on PET sorting and very few on its separation from PVC. These works use different classification models and preprocessing techniques and show their performance for the problem at hand. However, there is still a lack of methodology for comparing the approaches and finding the best model and preprocessing technique. Thus, this paper presents a design-of-experiments-based methodology for comparing and selecting, for the problem at hand, the best preprocessing technique and the best latent variable-based and/or artificial intelligence classification method when using NIR hyperspectral images.
Datum: 27.11.2017

One-dimensional convolutional neural networks for spectroscopic signal regression

This paper proposes a novel approach for chemometric analysis of spectroscopic data based on a convolutional neural network (CNN) architecture. For this purpose, the well-known 2-D CNN is adapted to the one-dimensional nature of spectroscopic data. In particular, the filtering and pooling operations, as well as the training equations, are revisited. We also propose an alternative way to train the resulting 1D-CNN by means of particle swarm optimization. The trained CNN architecture is subsequently exploited to extract features from a given 1D spectral signature to feed any regression method. In this work, we resorted to 2 advanced and effective methods: support vector machine regression and Gaussian process regression. Experimental results on 3 real spectroscopic datasets show the interesting capabilities of the proposed 1D-CNN methods.
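The 1-D filtering and pooling operations that the adaptation revolves around can be sketched in plain numpy. The kernel values and the toy spectrum below are invented for illustration; real CNN filters would be learned, and the paper's particle-swarm training is not shown:

```python
import numpy as np

def conv1d(signal, kernel):
    """Valid-mode 1-D filtering (cross-correlation) of a spectrum."""
    n, k = len(signal), len(kernel)
    return np.array([signal[i:i + k] @ kernel for i in range(n - k + 1)])

def max_pool1d(x, size=2):
    """Non-overlapping 1-D max pooling; a trailing remainder is dropped."""
    m = (len(x) // size) * size
    return x[:m].reshape(-1, size).max(axis=1)

spectrum = np.array([0.0, 1.0, 0.0, 0.0, 2.0, 0.0])
edge_kernel = np.array([1.0, -1.0])                 # crude derivative-like filter
features = np.maximum(conv1d(spectrum, edge_kernel), 0.0)   # filtering + ReLU
pooled = max_pool1d(features, 2)                    # downsampled feature vector
```

Stacking such filter-activation-pooling stages, with learned kernels, yields the compact feature vectors that are then passed to a regressor such as SVM or Gaussian process regression.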
Datum: 27.11.2017

Machine learning-based genetic evolution of antitumor proteins containing unnatural amino acids by integrating chemometric modeling and cytotoxicity analysis

Antitumor proteins (ATPs) are small oligoproteins or peptides that have been recognized as new and promising therapeutics against a variety of human tumors and cancers. In order to extend the structural diversity space of ATPs, unnatural amino acids were incorporated into naturally occurring ATPs by using a chemometrics-based genetic evolution strategy. Based on hundreds of ATPs derived from animals, plants, and microbes, statistical regression models were developed, optimized, and validated with a systematic combination of 5 widely used machine learning methods and 3 sophisticated unnatural amino acid descriptors. The best regression predictor was employed to guide the genetic evolution of a large oligoprotein population. In the evolution procedure, a number of unnatural amino acids with desired physicochemical properties were introduced, resulting in an evolution-improved population, from which a few top-scoring oligoprotein candidates containing 1 to 3 unnatural amino acids and having diverse structures were successfully prepared, and their antitumor potency against 2 cancer cell lines was analyzed with biological assays. It was found that the high-activity ATPs are preferentially structured in a partial α-helix or β-sheet with an alternating sequence pattern of polar, charged, and hydrophobic amino acids, whereas the intrinsically disordered oligoproteins usually have low or no antitumor activity against the tested cancer cell lines.
Datum: 27.11.2017

Hydration of hydrogels studied by near-infrared hyperspectral imaging

Hydrogels are an important class of biomaterials that can absorb large quantities of water. In this study, changes in hydration of natural hydrogels (agar, chitosan, gelatin, starch, and blends of each with chitosan) during storage and rehydration were studied by using near-infrared hyperspectral imaging (NIR-HSI). Moisture content was calculated based on changes in sample weight during hydration. The NIR-HSI data were acquired by using a push-broom system operating in diffuse reflectance in the wavelength range 943 to 1650 nm. A novel synthesis method was developed to enable common preparation of each hydrogel. Mean spectra obtained from the hyperspectral images were analyzed, and predictive models for moisture content were developed by using partial least squares regression. Models were compared in predictive performance by using an independent validation set of data. The optimal model in predictive performance was a 1 latent variable partial least squares regression model developed on second derivative and mean centered pseudo-absorbance data in the wavelength range 943 to 1272 nm. This model was applied to pixel spectra from samples in the validation set to inspect spatial variations during dehydration and rehydration. Challenges associated with NIR-HSI of hydrogels with a large variation in moisture content are discussed.
Datum: 27.11.2017

A multiobjective approach in constructing a predictive model for Fischer-Tropsch synthesis

Fischer-Tropsch synthesis (FTS) is an important chemical process that produces a wide range of hydrocarbons. The exact mechanism of FTS is not yet fully understood, so predicting the FTS product distribution is not a trivial task. So far, artificial neural networks (ANNs) have been successfully applied to modeling a variety of chemical processes whenever sufficient and well-distributed training patterns are available. However, for most chemical processes, such as FTS, acquiring such an amount of data is very time-consuming and expensive. In such cases, a neural network ensemble (NNE) has shown significant generalization ability. An NNE is a set of diverse and accurate ANNs trained for the same task, and its output is a combination of the outputs of these ANNs. This paper proposes a new NNE approach, called NNE-NSGA-II, that prunes this set with a modified nondominated sorting genetic algorithm to achieve an optimum subset according to 2 conflicting objectives: minimizing the root-mean-square error on the training and on the unseen data sets. Finally, a comparative study is performed on a single best ANN, a regular NNE, NNE-NSGA-II, and 3 popular ensembles of decision trees: random forest, stochastic gradient boosting, and AdaBoost.R2. The results show that on the training data set, stochastic gradient boosting and AdaBoost.R2 fitted the samples better; however, for the predicted FTS products on the unseen data set, the NNE methods, especially NNE-NSGA-II, considerably improved the generalization ability in comparison with the competing approaches.
Datum: 27.11.2017

Closure constraint in multivariate curve resolution

Multivariate curve resolution techniques try to estimate physically and/or chemically meaningful profiles underlying a set of chemical or related measurements. However, the estimated profiles are generally not unique, and the estimation is often complicated by intensity and rotational ambiguities. Constraints, as additional information about the chemical entities, can be imposed to reduce the extent of these ambiguities. Not only has a long list of constraints been introduced, but some of them can also be applied in different ways. Investigating the effect of a constraint on the extent of rotational ambiguity, and how it can be applied during curve resolution, can both shed light on curve resolution studies. The motivation behind this contribution is to pave the way to a clarification of the closure constraint. Using simulated equilibrium and kinetic spectrophotometric data sets, different approaches to closure implementation were applied to demonstrate the geometrical interpretation of the closure constraint and its effect on multivariate curve resolution-alternating least squares results. In addition, the closure constraint is compared with normalization, and it is proved that closure is a Borgen norm and has the same effect as other Borgen norms in multivariate curve resolution. Finally, to further examine the closure constraint, a real data set was investigated.
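A minimal numpy sketch of how a closure constraint can be imposed inside an alternating least squares loop; the two-component equilibrium system and all settings are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated two-component closed system: D = C @ S.T + noise,
# with concentration rows summing to a constant total (here 1).
t = np.linspace(0, 1, 50)
x = np.linspace(0, 1, 80)
C_true = np.column_stack([1 - t, t])
S_true = np.column_stack([np.exp(-((x - 0.3) ** 2) / 0.01),
                          np.exp(-((x - 0.7) ** 2) / 0.01)])
D = C_true @ S_true.T + 1e-4 * rng.standard_normal((50, 80))

# MCR-ALS with non-negativity and a closure constraint on C
C = np.clip(D[:, [24, 56]], 1e-6, None)       # pure-variable-like initial guess
for _ in range(200):
    S = np.linalg.lstsq(C, D, rcond=None)[0].T
    S = np.clip(S, 0, None)                    # non-negativity on spectra
    C = np.linalg.lstsq(S, D.T, rcond=None)[0].T
    C = np.clip(C, 1e-9, None)                 # non-negativity on concentrations
    C /= C.sum(axis=1, keepdims=True)          # closure: total concentration fixed

residual = np.linalg.norm(D - C @ S.T) / np.linalg.norm(D)
```

Here closure is implemented as a row-wise rescaling of the concentration estimates after each ALS step; the abstract's point is that this is only one of several ways closure can be enforced, with geometrically distinct consequences.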
Datum: 23.11.2017

Diagnostics of sintering processes on the basis of PCA and two-level neural network model

The application of chemometric methods to continuous monitoring and diagnostics of sintering process faults, for improving iron-ore sinter quality, is considered in this article. The sintering process is a complex multivariate process. A number of agglomeration process faults often have similar symptoms, resulting in late fault detection by an operator and, as a consequence, wrong process control decisions. To support efficient operational decision making, a process fault monitoring and diagnostics system is proposed. The system uses a two-level neural network (NN) diagnostic model: the high-level NN localizes the process faults, whereas their causes are determined by the low-level NNs. To substantially reduce the time needed for training and retraining the high-level NN, the dimensionality of the task is first reduced with principal component analysis (PCA), so that the scores obtained from the initial data are fed into the high-level NN inputs. The use of PCA also allows detection of sintering process faults with the T2 and Q statistics. Only upon detection of a fault does the NN diagnostic model start working to determine its cause. The system algorithm provides special measures to prevent the NN from possibly "losing" the identified fault due to operator inactivity. To increase the diagnostic depth for fault symptoms that are evident on the sinter cake surface, optical digital cameras are installed, and their images are processed with proposed algorithms based on fuzzy clustering to take into account uncertainties in the initial information.
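The PCA-based fault detection mentioned above rests on the Hotelling T2 and Q (squared prediction error) statistics. A compact sketch on simulated "normal operation" data, with an artificial additive fault; the control limits that would normally come from the statistics' reference distributions are omitted:

```python
import numpy as np

rng = np.random.default_rng(2)

# Normal-operation training data: 10 process variables driven by 2 latent factors
n, p, k = 500, 10, 2
X = rng.standard_normal((n, k)) @ rng.standard_normal((k, p)) \
    + 0.1 * rng.standard_normal((n, p))

# PCA model of normal operation
mu = X.mean(axis=0)
Xc = X - mu
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:k].T                      # retained loadings
lam = (s[:k] ** 2) / (n - 1)      # variances of the retained scores

def t2_q(x):
    """Hotelling T2 (variation inside the PCA model) and Q (residual) statistics."""
    xc = x - mu
    t = xc @ P
    t2 = np.sum(t ** 2 / lam)
    resid = xc - t @ P.T
    return t2, resid @ resid

t2_normal, q_normal = t2_q(X[0])
fault = X[0] + 5.0                # gross additive fault on all variables
t2_fault, q_fault = t2_q(fault)
```

A sample would be flagged when T2 or Q exceeds its control limit; here the faulty sample produces a Q statistic far above the normal one.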
Datum: 20.11.2017

Multimodal image analysis in tissue diagnostics for skin melanoma

Early diagnosis is a cornerstone of the successful treatment of most diseases, including melanoma, and cannot always be achieved by traditional histopathological inspection. In this respect, multimodal imaging, the combination of two-photon excited fluorescence (TPEF) and second-harmonic generation (SHG), offers high diagnostic potential as an alternative approach. Multimodal imaging generates molecular contrast, but to use this technique in clinical practice, the optical signals must be translated into diagnostically relevant information. This translation requires automatic image analysis techniques. In this contribution, we established an analysis pipeline for multimodal images to achieve melanoma diagnostics of skin tissue. The first step of the image analysis was preprocessing, in which mosaicking artifacts were corrected and a standardization was performed. Afterwards, local histogram-based first-order texture features and local gray-level co-occurrence matrix (GLCM) texture features were extracted at multiple scales. Thereafter, we constructed a local hierarchical statistical model to distinguish melanoma, normal epithelium, and other tissue types. The results demonstrated the capability of multimodal imaging combined with image analysis to differentiate tissue types. Furthermore, we compared the histogram-based and GLCM-based texture feature sets according to the Fisher discriminant ratio (FDR) and classification performance, which showed that the histogram-based texture features are superior to the GLCM features for the given task. Finally, we performed a global classification to achieve patient-level diagnostics with the clinical diagnosis as ground truth. The agreement between the predictions and the clinical results demonstrates the great potential of multimodal imaging for melanoma diagnostics.
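GLCM texture features of the kind used above can be computed in a few lines of numpy; this toy version works on images already quantized to a handful of gray levels and derives two classic co-occurrence features (contrast and homogeneity) for a smooth versus a noisy patch:

```python
import numpy as np

def glcm(img, levels, dx=1, dy=0):
    """Normalized co-occurrence counts for pixel pairs at offset (dy, dx)."""
    P = np.zeros((levels, levels))
    h, w = img.shape
    for i in range(h - dy):
        for j in range(w - dx):
            P[img[i, j], img[i + dy, j + dx]] += 1
    return P / P.sum()

def contrast_homogeneity(P):
    """Two classic Haralick-style features derived from a GLCM."""
    i, j = np.indices(P.shape)
    contrast = np.sum(P * (i - j) ** 2)
    homogeneity = np.sum(P / (1.0 + np.abs(i - j)))
    return contrast, homogeneity

# A smooth gradient patch versus a noisy patch, both with 4 gray levels
smooth = np.tile(np.repeat(np.arange(4), 4), (16, 1))
rng = np.random.default_rng(8)
noisy = rng.integers(0, 4, (16, 16))
c_s, h_s = contrast_homogeneity(glcm(smooth, 4))
c_n, h_n = contrast_homogeneity(glcm(noisy, 4))
```

As expected, the noisy patch has much higher contrast and lower homogeneity than the smooth one; in the paper such features are computed locally at several scales and fed into a statistical classifier.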
Datum: 16.11.2017

Baseline and interferent correction by the Tikhonov regularization framework for linear least squares modeling

Spectroscopic data are usually perturbed by noise from various sources that should be removed prior to model calibration. After conducting a preprocessing step to eliminate unwanted multiplicative effects (effects that scale the pure signal in a multiplicative manner), we discuss how to correct a model for unwanted additive effects in the spectra. Our approach is described within the Tikhonov regularization (TR) framework for linear regression model building, and our focus is on ignoring the influence of noninformative polynomial trends. This is obtained by including an additional criterion in the TR problem penalizing the resulting regression coefficients away from a selected set of possibly disturbing directions in the sample space. The presented method builds on the extended multiplicative signal correction, and we compare the two approaches on several real data sets showing that the suggested TR-based method may improve the predictive power of the resulting model. We discuss the possibilities of imposing smoothness in the calculation of regression coefficients as well as imposing selection of wavelength regions within the TR framework. To implement TR efficiently in the model building, we use an algorithm that is heavily based on the singular value decomposition. Because of some favorable properties of the singular value decomposition, it is possible to explore the models (including their generalized cross-validation error estimates) associated with a large number of regularization parameter values at low computational cost.
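The computational point at the end, a single SVD enabling cheap exploration of many regularization parameter values, can be sketched as follows (a plain ridge-type Tikhonov penalty on synthetic data; the paper's additional penalty directions for baseline and interferent correction are not reproduced):

```python
import numpy as np

rng = np.random.default_rng(3)

# Ill-posed linear calibration problem: y = X b + noise, with p > n
n, p = 40, 120
X = rng.standard_normal((n, p)) @ np.diag(1.0 / np.arange(1, p + 1))
b_true = np.zeros(p)
b_true[:5] = 1.0
y = X @ b_true + 0.01 * rng.standard_normal(n)

# One SVD up front, then coefficients for any ridge parameter at trivial cost
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Uty = U.T @ y

def ridge_coef(lmbda):
    filt = s / (s ** 2 + lmbda)          # Tikhonov filter factors
    return Vt.T @ (filt * Uty)

lambdas = np.logspace(-8, 2, 50)
coefs = [ridge_coef(l) for l in lambdas]
# Effective degrees of freedom, the kind of quantity GCV-style criteria use
edf = [np.sum(s ** 2 / (s ** 2 + l)) for l in lambdas]
```

Because the SVD is computed once, sweeping 50 (or 5000) regularization values costs only vector operations, which is exactly the favorable property the abstract alludes to for generalized cross-validation.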
Datum: 14.11.2017

The hybrid of semisupervised manifold learning and spectrum kernel for classification

Manifold learning classification, an advanced semisupervised learning approach, has gained great popularity in recent years in a variety of fields. Kernel methods, moreover, are a group of algorithms for pattern analysis whose task is to find and study general types of relations in datasets. Thus, under the framework of kernel methods, a manifold learning classifier is introduced and explored to directly detect the intrinsic similarity from the local and global information hidden in datasets. Two validation approaches were used to evaluate the performance of our models. The experiments indicate that the proposed model can be considered an effective alternative modeling algorithm and could be further applied in biochemical science, environmental analysis, clinical research, and related areas.
Datum: 10.11.2017

Application of image moments in MIA-QSAR

Because they encode significant chemical information, conventional images of molecular structures have been used in quantitative structure-activity relationship studies by multivariate image analysis (MIA-QSAR). In this contribution, we propose using Tchebichef moments (TMs), calculated directly from grayscale images of molecular structures, as molecular descriptors to build linear QSAR models by stepwise regression. The proposed approach was applied to QSAR research on a series of HIV-1 non-nucleoside reverse transcriptase inhibitors, and satisfactory results were obtained. Compared with several published methods, the results indicate that the TM method possesses higher accuracy and reliability. The TMs effectively decompose the image information of molecular structures at different levels without any pretreatment, owing to their very favorable multiresolution, holographic, and inherent invariance properties. Our study successfully extends the application of image moments to MIA-QSAR research.
Datum: 10.11.2017

Impact of time and temperature of storage on the spoilage of swordfish and the evolution of biogenic amines through a multiway model

A new multiway/multivariate approach is proposed to study and model the spoilage of swordfish with time and temperature of storage through the profiles of putrescine, spermidine, histamine, tyramine, tryptamine, cadaverine, spermine, and 2-phenylethylamine. The evolution of these biogenic amines in food is a complex process that cannot be characterized by a single parameter but rather by a modification of the amine profiles. An experimental strategy is designed to determine these profiles in such a way that the data are structurally 3-way. Modeling the joint evolution of the biogenic amines with a PARAFAC model, which explains 97.8% of the variability (CORCONDIA index equal to 100%), makes it possible to estimate the storage time, storage temperature, and biogenic amine profiles. A multiple regression (coefficient of determination of 0.98) based on the loadings of the 2 factors of the time profile of the PARAFAC model enables estimation of the storage time with an error of 0.5 days.
Datum: 10.11.2017

Structure-based statistical modeling and analysis of peptide affinity and cross-reactivity to human senile osteoporosis OSF SH3 domain

Human osteoclast-stimulating factor (OSF) induces osteoclast formation and bone resorption in senile osteoporosis by recruiting multiple signaling complexes with cognate interacting partners through its N-terminal Src homology 3 (SH3) peptide-recognition domain. The domain can recognize and bind to the polyproline regions of its partner proteins, rendering a broad ligand specificity and cross-reactivity. Here, the structural basis and physicochemical property of peptide affinity and cross-reactivity to OSF SH3 domain were investigated systematically by using an integration of statistical analysis and molecular modeling. A structure-based quantitative structure-activity relationship method called cross-nonbonded interaction characterization and statistical regression was used to characterize the intermolecular interactions involved in computationally modeled domain-peptide complex structures and then to correlate the interactions with affinity for a panel of collected SH3-binding peptide samples. Both the structural stability and generalization ability of obtained quantitative structure-activity relationship regression models were examined rigorously via internal cross-validation and external test, confirming that the models can properly describe even single-residue mutations at domain-peptide complex interface and give a reasonable extrapolation for the mutation effect on peptide affinity. Subsequently, the best model was used to investigate the promiscuity and cross-reactivity of OSF SH3 domain binding to its various peptide ligands. It is found that few key residues in peptide ligands are primarily responsible for the domain affinity and selectivity, while most other residues only play a minor role in domain-peptide binding affinity and stability. The peptide residues can be classified into 3 groups in terms of their contribution to ligand selectivity: key, assistant, and marginal residues. 
Because the key residues are so few that many domain-interacting partners share a similar binding profile, additional factors such as in vivo environments and biological contexts are also likely to contribute to the specificity and cross-reactivity of the OSF SH3 domain.
Datum: 09.11.2017

Quantitative structure-property relationship modeling of small organic molecules for solar cells applications

Despite the need for reliable solar energy harvesting technology, research on new materials for third-generation photovoltaics is slowed down by the widespread use of trial and error rather than rational material-design approaches. The present study investigates alternative material-discovery strategies inspired by drug design and molecular modeling. In particular, a training set and a test set (for validation purposes) comprising well-known small-molecule bulk-heterojunction organic photovoltaics were built. Molecules were characterized by semiempirically calculated descriptors and descriptors based on 3D molecular interaction fields. The partial least squares algorithm was then applied to rationalize structure-photovoltaic activity relationships, and the coefficients were investigated to clarify the contributions of the different molecular properties to the final performance. In addition, a photovoltaic desirability function (PhotD) is proposed as a versatile novel tool for ranking potential candidates. The partial least squares model and the PhotD function were both internally and externally validated, demonstrating their ability to estimate the performance of new candidates. The proposed approach demonstrates that, in the context of computational materials science, chemometrics and molecular modeling tools could effectively boost the discovery of promising new candidates for photovoltaic applications.
Datum: 09.11.2017

Automated data mining of secondary ion mass spectrometry spectra

Time-of-flight secondary ion mass spectrometry (ToF-SIMS) allows the reliable analytical determination of organic and polymeric materials. Since a typical raw dataset may contain thousands of peaks, the amount of information to deal with is correspondingly large, so data reduction techniques become indispensable for extracting the most significant information from a given dataset. Here, wavelet-principal component analysis-based signal processing of the giant raw datasets acquired during ToF-SIMS experiments is presented. The proposed procedure provides a straightforwardly "manageable" dataset without any binning procedure or detailed peak integration. By studying the principal component analysis results, detailed and reliable information about the chemical composition of polymeric samples has been gathered.
Datum: 09.11.2017

Blessing of randomness against the curse of dimensionality

Modern hyperspectral images, especially those acquired in remote sensing and from on-field measurements, can easily contain from hundreds of thousands to several million pixels. This often leads to quite long computation times when, eg, the images are decomposed by Principal Component Analysis (PCA) or similar algorithms. In this paper, we show how randomization can tackle this problem. The main idea, described in detail by Halko et al in 2011, can be used to speed up most low-rank matrix decomposition methods. The paper explains this approach using visual interpretations of its main steps and shows how the use of randomness influences the speed and accuracy of PCA decomposition of hyperspectral images.
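The randomized approach of Halko et al can be sketched in a few lines of numpy: project onto a small random sketch of the column space, optionally sharpen it with power iterations, then take the SVD of the much smaller projected matrix. The example data are an exactly low-rank synthetic "image":

```python
import numpy as np

rng = np.random.default_rng(4)

def randomized_pca(X, k, oversample=10, n_iter=2, rng=rng):
    """Randomized low-rank PCA following the scheme of Halko et al (2011)."""
    Xc = X - X.mean(axis=0)
    # Range finder: sketch the column space with a random projection
    Omega = rng.standard_normal((Xc.shape[1], k + oversample))
    Y = Xc @ Omega
    for _ in range(n_iter):            # power iterations sharpen the sketch
        Y = Xc @ (Xc.T @ Y)
    Q, _ = np.linalg.qr(Y)
    # Exact SVD of the small projected matrix
    B = Q.T @ Xc
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    scores = (Q @ Ub)[:, :k] * s[:k]
    loadings = Vt[:k]
    return scores, loadings

# "Image" with many pixels but low chemical rank
n_pix, n_wl, rank = 20000, 50, 3
X = rng.standard_normal((n_pix, rank)) @ rng.standard_normal((rank, n_wl))
scores, loadings = randomized_pca(X, k=rank)
recon = scores @ loadings + X.mean(axis=0)
err = np.linalg.norm(X - recon) / np.linalg.norm(X)
```

The expensive SVD is performed on the small (k + oversample)-column matrix B rather than on the full pixel matrix, which is where the speed-up for million-pixel images comes from.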
Datum: 09.11.2017

Hybrid central composite design for simultaneous optimization of removal of methylene blue and alizarin red S from aqueous solutions using Vitis tree leaves

Powdered Vitis tree leaves were used for the efficient removal of dyes (eg, alizarin red and methylene blue) from water samples in binary batch systems. The influence of various parameters, such as initial pH, initial dye concentration, and sorbent mass, on the biosorption process was investigated. Statistical experimental design was utilized to optimize the biosorption process. A regression model was derived using response surface methodology by performing the 416B model of hybrid central composite design. Model adequacy was checked by means of analysis of variance, a lack-of-fit test, and examination of the residual distribution. The quadratic model resulting from the hybrid design approach fitted the experimental data very well. The optimal conditions for dye biosorption were as follows: pH = 3.0, sorbent mass = 0.05 g, initial alizarin red concentration (CAR) = 999.6 mg L−1, and initial methylene blue concentration (CMB) = 878.5 mg L−1. Evaluation of the biosorption data with the Langmuir and Freundlich isotherms showed that the Langmuir model gave the best fit to the equilibrium data, with maximum adsorption capacities of 66.4 and 53.5 mg g−1 in the single system and 54.6 and 43.9 mg g−1 in the binary system for AR and MB, respectively. Moreover, the kinetics of the biosorption process was also investigated.
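Fitting the Langmuir isotherm reported above is a small nonlinear regression; the sketch below recovers the parameters from data generated with the paper's reported single-system AR capacity (the concentration grid and the KL value are invented for illustration):

```python
import numpy as np
from scipy.optimize import curve_fit

def langmuir(Ce, qmax, KL):
    """Langmuir isotherm: qe = qmax * KL * Ce / (1 + KL * Ce)."""
    return qmax * KL * Ce / (1.0 + KL * Ce)

# Illustrative equilibrium data (Ce in mg/L, qe in mg/g), noise-free here
Ce = np.array([5, 10, 25, 50, 100, 200, 400, 800], dtype=float)
qe = langmuir(Ce, qmax=66.4, KL=0.02)

popt, _ = curve_fit(langmuir, Ce, qe, p0=[50.0, 0.01])
qmax_fit, KL_fit = popt
```

With real, noisy sorption data the same call works unchanged; comparison against the Freundlich model would then be done on the fitted residuals or an information criterion.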
Datum: 23.10.2017

Ensemble calibration for the spectral quantitative analysis of complex samples

Ensemble strategies have gained increasing attention in multivariate calibration for the quantitative analysis of complex samples. The aim of ensemble calibration is to obtain a more accurate, stable, and robust prediction by combining the predictions of multiple submodels. The generation of the training subsets, the calibration of the submodels, and the integration of the submodels are the three keys to the success of ensemble calibration. Many strategies for generating training subsets and integrating submodels have been developed, forming numerous ensemble calibration methods that improve the performance of the basic calibration method. This contribution focuses on recent ensemble strategies in relation to calibration, especially ensemble modeling for the quantitative analysis of complex samples. The limitations and perspectives of ensemble strategies are also discussed.
Datum: 17.10.2017

Comparative chemometric analysis for classification of acids and bases via a colorimetric sensor array

With the increasing availability of digital imaging devices, colorimetric sensor arrays are rapidly becoming a simple yet effective tool for the identification and quantification of various analytes. Colorimetric arrays combine data from many colorimetric sensors, and the multidimensional nature of the resulting data necessitates chemometric analysis. Herein, an 8-sensor colorimetric array was used to analyze selected acidic and basic samples (0.5 to 10 M) to determine which chemometric methods are best suited for classification and for quantification of analytes within clusters. PCA, HCA, and LDA were used to visualize the data set. All three methods showed well-separated clusters for each of the acid or base analytes and moderate separation between analyte concentrations, indicating that the sensor array can be used to identify and quantify samples. Furthermore, PCA could be used to determine which sensors provided the most effective analyte identification. LDA, KNN, and HQI were used for identification of analyte and concentration. HQI and KNN correctly identified the analytes in all cases, while LDA identified 95 of 96 analytes correctly. Additional studies demonstrated that controlling for solvent and image effects was unnecessary for all chemometric methods utilized in this study.
Datum: 13.10.2017

Accurate model based on artificial intelligence for prediction of carbon dioxide solubility in aqueous tetra-n-butylammonium bromide solutions

This study highlights the application of radial basis function (RBF) neural networks, adaptive neuro-fuzzy inference systems (ANFIS), and gene expression programming (GEP) to the estimation of the solubility of CO2 in aqueous solutions of tetra-n-butylammonium bromide (TBAB). The experimental data were gathered from a published work in the literature. The proposed RBF network was coupled with a genetic algorithm (GA) to achieve better predictive performance. The structure of the ANFIS model was trained using a hybrid method. The input parameters of the models were temperature, pressure, mass fraction of TBAB in the feed aqueous solution (wTBAB), and mole fraction of TBAB in the aqueous phase (xTBAB); the solubility of CO2 (xCO2) was the output parameter. Statistical and graphical analyses of the results showed that the proposed GA-RBF, hybrid-ANFIS, and GEP models are robust and precise in the estimation of the literature solubility data.
Datum: 13.10.2017

To correlate and predict the potential and new functions of traditional Chinese medicine formulas based on similarity indices

A typical traditional Chinese medicine (TCM) formula (or prescription) is composed of one or several single herbs. The number of possible TCM formulas is nearly as large as the number of chemical structures, so developing quantitative formula-activity relationship models is as appealing as building quantitative structure-activity relationship models. In this work, a formula descriptor system based on the TCM holistic medical model is generated to correlate and predict formula functions by using similarity indices. First, 73 general descriptors of 78 formulas from the Chinese Pharmacopeia (2010) are computed. Second, 6 different similarity indices are used to evaluate the similarities among the 78 formulas. As the main functions of the 78 formulas are known and annotated, a significant similarity implies that a formula is likely to have some new functions possessed by its "analogue." Finally, the different similarity measures are compared with reference to the results of experimental and clinical studies. The consistency between some predictions and the literature results indicates that the proposed method can provide clues for mining and investigating the unknown functions of TCM formulas.
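Similarity indices over binary descriptor vectors are straightforward to compute; the sketch below uses the Tanimoto (Jaccard) index, one common choice, on two entirely hypothetical formula descriptor vectors (the paper's 73 descriptors and 6 indices are not reproduced here):

```python
import numpy as np

def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity for binary descriptor vectors."""
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    inter = np.sum(a & b)           # descriptors present in both
    union = np.sum(a | b)           # descriptors present in either
    return inter / union if union else 1.0

# Hypothetical binary descriptor vectors for two formulas
f1 = [1, 1, 0, 1, 0, 1, 0, 0]
f2 = [1, 0, 0, 1, 0, 1, 1, 0]
sim = tanimoto(f1, f2)              # 3 shared descriptors out of 5 present
```

A high similarity between an annotated formula and an unannotated one would, per the paper's logic, suggest that the latter may share functions with its "analogue."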
Datum: 28.09.2017

Introducing special issue on chemical image analysis

Datum: 15.09.2017

Post-modified non-negative matrix factorization for deconvoluting the gene expression profiles of specific cell types from heterogeneous clinical samples based on RNA-sequencing data

The application of supervised algorithms in clinical practice has been limited by the lack of information on pure cell types. Several supervised algorithms have been proposed to estimate the gene expression patterns of specific cell types from heterogeneous samples. Post-modified non-negative matrix factorization (NMF), the unsupervised algorithm proposed here, is capable of estimating the gene expression profiles and contents of the major cell types in cancer samples without any prior reference knowledge. Post-modified NMF was first evaluated using simulated data sets and then applied to the deconvolution of gene expression profiles of cancer samples. It exhibited satisfactory performance with both the validation and application data. In applications to 3 types of cancer, the differentially expressed genes (DEGs) identified from the deconvoluted gene expression profiles of tumor cells were highly associated with cancer-related gene sets. Moreover, the estimated proportions of tumor cells showed a significant difference in clinical endpoints between the 2 compared patient groups. Our results indicate that post-modified NMF can efficiently extract the gene expression patterns of specific cell types from heterogeneous samples for subsequent analysis and prediction, which will greatly benefit clinical prognosis.
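The core of the deconvolution idea, factorizing a mixed expression matrix into non-negative cell-type profiles and mixing proportions, can be sketched with scikit-learn's plain NMF (the paper's post-modification steps are not reproduced; the data are simulated):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(5)

# Mixed "bulk" expression: genes x samples = (genes x cell types) @ (types x samples)
n_genes, n_samples, n_types = 300, 40, 2
profiles_true = rng.gamma(2.0, 1.0, (n_genes, n_types))   # cell-type expression
props_true = rng.dirichlet([1.0, 1.0], n_samples).T       # mixing proportions
V = profiles_true @ props_true

model = NMF(n_components=n_types, init='nndsvda', max_iter=500, random_state=0)
W = model.fit_transform(V)          # estimated cell-type expression profiles
H = model.components_               # estimated (unnormalized) mixing weights
H_prop = H / H.sum(axis=0)          # rescale columns to sum to 1, as proportions
recon_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

Plain NMF recovers the factorization only up to scaling and permutation, which is precisely the kind of ambiguity the paper's post-modification step is designed to resolve.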
Datum: 31.08.2017

Calculation of topological indices from molecular structures and applications

This mini review presents a brief description of the research efforts for new topological indices of organic molecular structures undertaken in the authors' laboratory at Changchun Institute of Applied Chemistry, Chinese Academy of Sciences. They were used for the processing of chemical information, as highly selective topological indices for uniqueness determination, as highly selective atomic chiral indices for chiral center recognition, in the exhaustive generation of isomers, in a stereo code for the exhaustive generation of stereoisomers, in the prediction of C-13 nuclear magnetic resonance spectra, and in studies on rare earth extractions. The topological indices Ami, 3D descriptors, and chiral descriptors are described, as well as their applications in quantitative structure activity/property relationship studies.
Datum: 24.08.2017

Sampling error profile analysis (SEPA) for model optimization and model evaluation in multivariate calibration

A novel method called sampling error profile analysis (SEPA), based on Monte Carlo sampling and error profile analysis, is proposed for outlier detection, cross validation, pretreatment-method and wavelength selection, and model evaluation in multivariate calibration. With the Monte Carlo sampling in SEPA, a number of submodels are prepared, and the subsequent error profile analysis yields a median and a standard deviation of the root-mean-square error (RMSE) over the submodels. The median coupled with the standard deviation is an estimate of the RMSE that is more predictive and robust because it uses representative submodels produced by Monte Carlo sampling, unlike the usual approach, which uses only 1 model. The error profile analysis also calculates skewness and kurtosis as auxiliary judgments of the estimated RMSE, which is useful for model optimization and model evaluation. The proposed method is evaluated with 3 near-infrared datasets for wheat, corn, and tobacco. The results show that SEPA can diagnose outliers with more parameters, select a more reasonable pretreatment method and wavelength points, and evaluate the model more accurately and precisely. Compared with the results reported in published papers, a better model could be obtained with SEPA in terms of RMSECV, RMSEC, and RMSEP estimated with an independent prediction set.
Datum: 24.08.2017

Robust variable selection based on bagging classification tree for support vector machine in metabonomic data analysis

In metabonomics, metabolic profiles of high complexity pose tremendous challenges to existing chemometric methods. Variable selection (ie, biomarker discovery) and pattern recognition (ie, classification) are two important chemometric tasks in metabonomics, especially biomarker discovery, which can potentially be used for disease diagnosis and pathology discovery. Typically, the informative variables are elicited from a single classifier; however, this is often unreliable in practice. To rectify this, in the current study, bagging and classification trees (CTs) were combined into a general framework (BAGCT) for robustly selecting the informative variables, exploiting the advantages of CTs in automatically carrying out variable selection and measuring variable importance, and the ability of bagging to improve the reliability and robustness of a single model. In BAGCT, a set of parallel CT models is established by bagging, each CT providing information such as its splitting variables and their corresponding importance values. The informative variables can be identified by inspecting the variable importance values over all CTs in BAGCT. Taking the promising properties of the support vector machine (SVM) into account, we used the informative variables identified by BAGCT as the inputs of an SVM, forming a new classification tool abbreviated as BAGCT-SVM. A metabonomic dataset acquired by hydrogen-1 nuclear magnetic resonance from patients with lung cancer and healthy controls was used to validate BAGCT-SVM, with CT and SVM as comparisons. The results showed that BAGCT-SVM, with a smaller number of variables, gave better predictive ability than CT and SVM.
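The BAGCT idea, bagged classification trees whose accumulated importance values rank the variables before an SVM is trained on the survivors, can be sketched with scikit-learn; the data and all settings below are synthetic and illustrative, not the paper's NMR dataset:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(7)

# Synthetic "metabonomic" data: 200 variables, only the first 5 informative
n, p = 120, 200
y = rng.integers(0, 2, n)
X = rng.standard_normal((n, p))
X[:, :5] += 2.0 * y[:, None]          # class-dependent shift on 5 variables

# Bagging of classification trees; accumulate impurity-based importances
n_trees = 50
importance = np.zeros(p)
for _ in range(n_trees):
    boot = rng.integers(0, n, n)      # bootstrap resample of the samples
    tree = DecisionTreeClassifier(random_state=0).fit(X[boot], y[boot])
    importance += tree.feature_importances_
importance /= n_trees

# Train an SVM on the top-ranked variables only
selected = np.argsort(importance)[::-1][:5]
svm = SVC(kernel='linear').fit(X[:, selected], y)
acc = svm.score(X[:, selected], y)
```

On real data the selection size and the SVM settings would be chosen by cross-validation, and the accuracy would of course be assessed on held-out samples rather than on the training set as in this sketch.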
Datum: 19.07.2017

Quantitative analysis based on spectral shape deformation: A review of the theory and its applications

Most of the commonly used calibration methods in quantitative spectroscopic analysis are established on, or derived from, the assumption of a linear relationship between the concentrations of the analytes of interest and the corresponding absolute spectral intensities. They are not applicable to heterogeneous samples, where potential uncontrolled variations in optical path length due to changes in the samples' physical properties undermine this basic assumption. About a decade ago, a unique calibration strategy was proposed to extract chemical information from spectral data contaminated with multiplicative light scattering effects. Since then, this calibration strategy has been carefully examined, modified, and used by its developers. After more than 10 years of development, some important features of the calibration strategy have been identified, and it has been shown to solve many complex problems in quantitative spectroscopic analysis. However, because awareness of the strategy in the chemometrics community is relatively low, its potential has not yet been fully exploited. This paper reviews the theory of the calibration strategy and its applications, with a view to introducing this unique and powerful calibration strategy to a wider audience.
Datum: 15.06.2017

Design matrices and modelling

Datum: 02.06.2017

The O-PLS methodology for orthogonal signal correction—is it correcting or confusing?

The separation of predictive and nonpredictive (or orthogonal) information in linear regression problems is considered an important issue in chemometrics. Approaches including net analyte preprocessing methods and various orthogonal signal correction (OSC) methods have been studied in a considerable number of publications. In the present paper, we focus on the simplest single-response versions of some of the early OSC approaches, including Fearn's OSC, orthogonal projections to latent structures, the target projection (TP), and projections to latent structures (PLS) postprocessing by similarity transformation. These methods are claimed to yield improved model building and interpretation compared with ordinary PLS by filtering "off" the response-orthogonal parts of the samples in a dataset. We point out some fundamental misconceptions made in the justification of the PLS-related OSC algorithms and explain the key properties of the resulting models.
Datum: 11.04.2017


Category: Current Chemistry Research

Last update: 04.01.2018.

© 1996 - 2018 Internetchemistry
