School of Mathematical Sciences

Advanced Inference Methods for High-Dimensional and Functional Data

Supervisor: Dr Nicolás Hernández

Project description:

In recent years there has been a deluge of population-level data arising from wide-ranging fields. Near-infrared (NIR) spectra are one example: each spectrum consists of numerous overlapping absorption bands, each corresponding to different vibrational modes of the molecular components. These vibrations are highly sensitive to the physical and chemical properties of the compounds involved. As a result, spectroscopic data exhibit a strongly correlated structure due to the complex nature of spectral absorption bands, with the underlying information changing smoothly across wavelengths. This characteristic distinguishes spectral data from typical high-dimensional statistical data.

These large collections of complex data, popularly known as Big Data, must be represented in high-dimensional coordinate spaces and require sophisticated analytical techniques to be transformed into valuable information. Data of this type can be treated as what is now called functional data. In this context, one promising research avenue is the development of novel methodologies for robust inference, that is, statistical tools that can handle the inherent complexity and variability of functional data.

Traditionally, spectral data analysis has mainly relied on multivariate techniques such as partial least squares (PLS) regression, given its ability to deal effectively with high-dimensional and correlated datasets. However, this method treats spectra as a series of discrete variables rather than as a continuous function. From a physical standpoint, it is more meaningful to view the spectrum as a smooth function, composed of absorption peaks that reflect the various chemical constituents in the sample, where the absorbance at nearby wavelengths is strongly correlated. In this sense, Functional Partial Least Squares (FPLS) regression models extend PLS regression to functional data. In FPLS, the predictor and/or response variables are not scalars or vectors but functions (e.g., curves, surfaces). The key idea behind FPLS is to generalize the PLS approach to a functional setting, allowing the extraction of relevant components from high-dimensional functional data.
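To illustrate the difference between the discrete and the functional view, the following is a minimal sketch, not the project's method. It assumes a scalar response and spectra sampled on a common wavelength grid, and it computes a first FPLS-type component by replacing vector inner products with integrals approximated by the trapezoidal rule. The function names (`trapezoid`, `first_fpls_component`) and the synthetic data are hypothetical.

```python
import numpy as np


def trapezoid(values, grid):
    """Trapezoidal-rule integral of sampled values along the last axis."""
    widths = np.diff(grid)
    return np.sum(0.5 * (values[..., 1:] + values[..., :-1]) * widths, axis=-1)


def first_fpls_component(X, y, grid):
    """First PLS-type weight function and scores for curves X (n x p) and scalar y.

    Assumption: the weight function is taken proportional to the pointwise
    covariance between X(t) and y, normalised to unit L2 norm on the grid.
    """
    Xc = X - X.mean(axis=0)                    # centre each wavelength
    yc = y - y.mean()                          # centre the response
    w = Xc.T @ yc / len(y)                     # pointwise covariance Cov(X(t), y)
    w = w / np.sqrt(trapezoid(w ** 2, grid))   # unit L2 norm, not unit Euclidean norm
    scores = trapezoid(Xc * w, grid)           # integral of X_i(t) w(t) dt per curve
    return w, scores


# Synthetic "spectra": one smooth peak whose amplitude drives the response.
rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 101)
amplitudes = rng.normal(size=50)
X = amplitudes[:, None] * np.sin(np.pi * grid)[None, :]
y = 2.0 * amplitudes + rng.normal(scale=0.01, size=50)

w, scores = first_fpls_component(X, y, grid)
```

Because the inner product is an integral over the wavelength domain, the component does not depend on how densely the spectrum happens to be sampled, which is one practical reason to prefer the functional view.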

The main objectives of this project are:

  1. Develop computationally efficient models for domain selection in the FPLS context. Interval Partial Least Squares regression (iPLS) is an adaptation of PLS tailored for high-dimensional spectral data, such as near-infrared spectra. Spectrometric data are expressed over a continuous domain, so interval selection is a more viable alternative for feature extraction than variable selection. Despite its potential, a primary challenge in iPLS remains the selection of optimal intervals. Traditional approaches, such as forward and backward selection, have practical benefits but rely heavily on heuristics. This task aims to propose a novel approach to interval selection in iPLS via history matching, a statistical method for calibrating complex computer models, combined with uncertainty quantification techniques. Gaussian process regression or stochastic partial differential equations (SPDE) could be used as an emulator, given their flexibility in modelling and their provision of uncertainty estimates. This integration aims to optimise the accuracy of interval selection by using implausibility measures to highlight discrepancies between model predictions and observations. This work will contribute to the evolving dialogue on improving spectral data analysis techniques, with an application to the spectrometric field and the publication of an R package.
  2. Ordinal data is a specific type of categorical data where the categories have a natural order. This type of data is common in real-world applications, such as food quality assessment. In food quality control, NIR spectroscopy is used to measure properties of products like wine or olive oil. The spectrometric data could be classified into ordinal quality categories such as Premium (High Quality), Standard (Medium Quality) and Below Standard (Low Quality). However, practitioners often mishandle this data, either treating it as quantitative by assigning integer values to the categories or ignoring the order altogether and treating it as nominal data. Given the importance of FPLS in the analysis of spectroscopic data and the relevance of ordinal data in the field, it is key to develop an appropriate model for this setting.
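For the first objective, the implausibility measure used in history matching can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's method: the emulator means and variances for each candidate interval are taken as given (hypothetical numbers standing in for the output of a Gaussian process), the target value and variance terms are invented for the example, and the cut-off of 3 is a commonly used convention rather than something specified above.

```python
import numpy as np


def implausibility(emulator_mean, emulator_var, target,
                   obs_var=0.0, discrepancy_var=0.0):
    """Standardised distance between a target value and the emulator prediction.

    The denominator pools emulator variance, observation variance and
    model-discrepancy variance.
    """
    total_sd = np.sqrt(emulator_var + obs_var + discrepancy_var)
    return np.abs(target - emulator_mean) / total_sd


# Hypothetical emulator output for three candidate wavelength intervals:
# predicted cross-validated error of a PLS fit restricted to each interval.
emulator_mean = np.array([0.10, 0.50, 0.12])
emulator_var = np.array([0.0004, 0.0004, 0.01])

target_error = 0.08   # hypothetical error level we would like to match
obs_var = 0.0001      # hypothetical variance of the observed error

scores = implausibility(emulator_mean, emulator_var, target_error, obs_var=obs_var)

# Intervals above the cut-off are ruled out; the rest stay in play for the
# next wave. The cut-off of 3 is an assumed convention.
not_ruled_out = scores <= 3.0
```

Note that the third interval is retained largely because the emulator is uncertain about it, not because it is known to be good; in a history-matching scheme such intervals would be re-evaluated in later waves as the emulator is refined.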

This is a methodological and applied project in one of the current hot research topics in statistics, with a strong computational focus and the potential for significant impact in the Food, Pharmaceutical, and Textile industries.

