Supervisor: Dr Nicolás Hernández
Project description:
In recent years there has been a deluge of population-level data arising from far-ranging fields. Near-infrared (NIR) spectra consist of numerous overlapping absorption bands, each corresponding to a different vibrational mode of the molecular components. These vibrations are highly sensitive to the physical and chemical properties of the compounds involved. As a result, spectroscopic data exhibit a strongly correlated structure due to the complex nature of spectral absorption bands, with the underlying information changing smoothly across wavelengths. This characteristic distinguishes spectral data from typical high-dimensional statistical data.
These large collections of complex data, popularly known as Big Data, must necessarily be represented in high-dimensional coordinate spaces and require sophisticated analytical techniques to be transformed into valuable information. Data of this type can be embedded in what is nowadays called Functional Data. In this context, one promising research avenue is the development of novel methodologies for robust inference on this type of data. This involves creating robust statistical tools that can handle the inherent complexity and variability of functional data.
Traditionally, spectral data analysis has relied mainly on multivariate techniques such as partial least squares (PLS) regression, given its ability to deal effectively with high-dimensional, correlated datasets. However, this method treats a spectrum as a series of discrete variables rather than as a continuous function. From a physical standpoint, it is more meaningful to view the spectrum as a smooth function, composed of absorption peaks that reflect the various chemical constituents in the sample, where the absorbance at nearby wavelengths is strongly correlated. In this sense, Functional Partial Least Squares (FPLS) regression models are an extension of PLS regression designed to handle functional data. In FPLS, the predictor and/or response variables are not scalar- or vector-valued but functions (e.g., curves, surfaces). The key idea behind FPLS is to generalize the PLS approach to a functional setting, allowing the extraction of relevant components from high-dimensional functional data.
The main objectives of this project are:
Food Quality Assessment. In food quality control, NIR spectroscopy is used to measure properties of products such as wine or olive oil. The spectrometric data can be classified into ordinal quality categories such as Premium (High Quality), Standard (Medium Quality) and Below Standard (Low Quality). However, practitioners often misinterpret such data, either treating them as quantitative by assigning integer values to the categories, or ignoring the order altogether and treating them as nominal. Given the importance of FPLS in the analysis of spectroscopic data and the relevance of ordinal data in the field, it is key to develop an appropriate model for this type of setting.
This is a methodological and applied project in one of the current hot research topics in statistics, with a strong computational focus and the potential for significant impact in the Food, Pharmaceutical, and Textile industries.
Functional Time Series (FTS) approaches offer an effective framework for analyzing data that exhibit continuous, time-indexed patterns, often represented as random curves. Despite their value, pure FTS data—smoothly evolving curves over time—are rarely directly observable in practical scenarios. As a result, a common technique is to segment univariate time series into regular intervals, thereby transforming discrete observations into continuous curves that approximate FTS. This segmentation technique has proven valuable in various domains, such as environmental monitoring, traffic flow analysis, energy demand forecasting, stock market trend analysis, and temperature prediction. However, segmenting time series data brings its own challenges, primarily determining the optimal slicing of intervals and mitigating any artificial dependencies that may emerge at the segment boundaries.
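The segmentation step itself is mechanically simple, as the following sketch on synthetic hourly data shows: a univariate series is cut into fixed-length segments, each of which becomes one discretized curve of the functional time series. The period of 24 observations per curve is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(2)
hours = np.arange(24 * 60)                      # 60 days of hourly observations
daily = np.sin(2 * np.pi * hours / 24)          # daily periodic pattern
series = daily + 0.1 * rng.standard_normal(hours.size)

# Slice the univariate series into one curve per day.
seg_len = 24
n_curves = series.size // seg_len
curves = series[: n_curves * seg_len].reshape(n_curves, seg_len)
print(curves.shape)                             # one row per daily curve
```

Everything that follows — interval selection, boundary correction, inference — operates on the rows of this curve matrix (or on smooth functions fitted to them).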
The first challenge, determining where to segment the time series, is crucial because each slicing approach can impact the resulting FTS model's reliability. If the interval length or position is poorly chosen, it may lead to inconsistent representations that degrade clustering and classification performance. To address this, we propose selecting the interval that minimizes within-group variance, enhancing the consistency of each segment's characteristics and improving clustering performance. The second challenge is the artificial dependencies that slicing can introduce at the boundaries of each segment. These dependencies can distort autocorrelation function estimates and, consequently, affect model specification and parameter inference. For example, Gaussian bridge processes demonstrate how boundary artifacts may skew dependency structures in FTS models. To mitigate this, we could explore domain selection based on divergence criteria, removing boundary portions until the artificially introduced dependencies are minimized. Additionally, applying a whitening transformation to conditional segments of the domain could further reduce boundary-induced autocorrelation, ultimately enhancing the robustness and accuracy of FTS-based modeling.
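Both ideas can be sketched under strong simplifying assumptions. Below, the candidate segment length that minimizes the cross-curve (within-group) variance is selected — for a series with a true daily period, misaligned lengths mix phases across curves and inflate that variance — and a fixed number of points is then trimmed from each curve's boundaries as a crude stand-in for the divergence-based domain selection. The candidate lengths and trim width are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.arange(24 * 90)                              # 90 days of hourly data
series = np.sin(2 * np.pi * t / 24) + 0.1 * rng.standard_normal(t.size)

def within_group_variance(series, seg_len):
    """Average pointwise variance across the curves produced by slicing."""
    n = series.size // seg_len
    curves = series[: n * seg_len].reshape(n, seg_len)
    return curves.var(axis=0).mean()

# Misaligned lengths (20, 22, 26, 30) drift in phase from curve to curve,
# so the pointwise variance across curves picks up the full signal range.
candidates = [20, 22, 24, 26, 30]
best = min(candidates, key=lambda L: within_group_variance(series, L))
print("segment length minimizing within-group variance:", best)

# Boundary mitigation: drop a few points at each end of every curve.
trim = 2
n = series.size // best
curves = series[: n * best].reshape(n, best)[:, trim: best - trim]
print("trimmed curve grid size:", curves.shape[1])
```

In the project itself the trimming amount would be data-driven (divergence criteria, or a whitening transformation near the boundaries) rather than fixed in advance.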
A third key challenge in FTS analysis lies in conducting statistical inference, especially in building reliable predictive intervals for future curves. Inference for FTS models is crucial for forecasting applications where uncertainties must be quantified to provide actionable insights. Traditional approaches to inference may be limited in handling the high-dimensional, complex structures often present in FTS data. Bayesian inference presents a promising alternative, offering a structured approach to incorporate prior knowledge and estimate the distribution of future curves. By leveraging Bayesian methods, it becomes possible to construct predictive credible bands—intervals that provide a probabilistic range for future observations, accounting for the inherent uncertainty and variability in FTS models.
Constructing credible bands for FTS models through Bayesian methods allows for more flexible and informative forecasts. Unlike frequentist confidence intervals, Bayesian credible bands incorporate both data-driven likelihood and prior knowledge, producing intervals that reflect the actual uncertainty around future curves rather than relying on asymptotic approximations. This approach is particularly advantageous when dealing with sparse data or when underlying assumptions about distributional properties may not hold. Bayesian credible bands enhance FTS applications by providing a range of plausible future values, offering not just point estimates but a spectrum of possible outcomes that can inform risk assessments, decision-making processes, and resource planning across domains such as environmental monitoring, financial forecasting, and energy consumption prediction. This Bayesian perspective on inference aligns naturally with the inherent randomness and structural complexity of FTS models, ultimately contributing to a more robust, adaptable framework for functional time series analysis.
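A minimal sketch of the credible-band construction, under deliberately strong assumptions: curves treated as i.i.d. Gaussian at each grid point, known noise variance, and a conjugate normal prior on the mean curve. Real FTS inference would model the temporal dependence between curves; this only illustrates how posterior predictive draws turn into pointwise bands.

```python
import numpy as np

rng = np.random.default_rng(4)
n_curves, grid = 50, 24
mean_curve = np.sin(2 * np.pi * np.arange(grid) / grid)
sigma = 0.2                                   # assumed-known noise sd
curves = mean_curve + sigma * rng.standard_normal((n_curves, grid))

# Conjugate normal update at each grid point (prior mean 0, prior sd tau).
tau = 1.0
post_var = 1.0 / (1.0 / tau**2 + n_curves / sigma**2)
post_mean = post_var * curves.sum(axis=0) / sigma**2

# Posterior predictive for a future curve: N(post_mean, post_var + sigma^2);
# pointwise 95% credible band from Monte Carlo draws.
draws = post_mean + np.sqrt(post_var + sigma**2) * rng.standard_normal((4000, grid))
lo, hi = np.quantile(draws, [0.025, 0.975], axis=0)

new_curve = mean_curve + sigma * rng.standard_normal(grid)
coverage = np.mean((new_curve >= lo) & (new_curve <= hi))
print(f"pointwise coverage of a new curve: {coverage:.2f}")
```

Note these are pointwise bands; simultaneous credible bands for whole curves, which the project would target, require controlling coverage jointly across the grid.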
This is a methodological and applied project in one of the current hot research topics in statistics, with a strong computational focus and the potential for significant impact in the Finance and Energy sectors.