Output list
Journal article
Bayesian sequential ensemble learning
Published 2025
International journal of data science and analytics, 21, 1, 53
Efficient methodologies capable of capturing influential effects count as a key concept of medical prognosis when predicting the time until the occurrence of an event of interest. As a major concern, high-dimensional feature spaces need to be addressed in survival analysis. To this end, we propose a novel ensemble framework based on semi-stochastic selection of features that can be used for survival prediction in the presence of high-dimensional feature spaces. A sequential ensemble learning scheme is presented in which each survival tree is trained on a subspace of features. We incorporate a Bayesian framework with a conjugate prior for selecting features in each step by a semi-stochastic procedure. To explore the most influential features in the learning process, the Beta-Bernoulli bandit scheme is utilized. For interpretation, posterior means provided by the Bayesian framework are used as a measure of variable importance. We evaluate the performance of our proposed method based on real data analysis, and the efficiency of the novel variable importance measure by simulation studies.
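The Beta-Bernoulli bandit step mentioned in this abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the Thompson-sampling selection rule, the toy reward, and the names `select_features`, `update_posterior`, `posterior_means` and `run_demo` are all assumptions made for illustration. Each feature carries a Beta(alpha, beta) posterior, a subspace is chosen from posterior draws, and a binary reward for the subspace updates the posteriors of the selected features. In the paper the reward would come from a survival tree's performance, which is replaced here by a simulated coin flip.

```python
import random


def select_features(alpha, beta, k, rng):
    """Draw one Beta sample per feature and keep the k largest draws."""
    draws = [rng.betavariate(a, b) for a, b in zip(alpha, beta)]
    order = sorted(range(len(draws)), key=lambda j: draws[j], reverse=True)
    return order[:k]


def update_posterior(alpha, beta, selected, reward):
    """Conjugate update: a success adds 1 to alpha, a failure adds 1 to beta."""
    for j in selected:
        if reward:
            alpha[j] += 1
        else:
            beta[j] += 1


def posterior_means(alpha, beta):
    """Posterior mean of each feature's success probability."""
    return [a / (a + b) for a, b in zip(alpha, beta)]


def run_demo(n_features=20, k=5, n_steps=300, seed=0):
    rng = random.Random(seed)
    informative = {0, 1, 2}        # hypothetical ground truth for the toy reward
    alpha = [1.0] * n_features     # Beta(1, 1) prior, i.e. uniform
    beta = [1.0] * n_features
    for _ in range(n_steps):
        selected = select_features(alpha, beta, k, rng)
        # Toy stand-in for "the learner trained on this subspace did well":
        # success is more likely when more informative features are included.
        hits = len(informative.intersection(selected))
        reward = rng.random() < 0.2 + 0.25 * hits
        update_posterior(alpha, beta, selected, reward)
    return posterior_means(alpha, beta)


if __name__ == "__main__":
    for j, m in enumerate(run_demo()):
        print(f"feature {j:2d}: posterior mean {m:.3f}")
```

In this sketch the posterior means play the role of the variable importance measure described in the abstract; how closely they track the informative features depends on the assumed reward model.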
Journal article
Published 2025
Neurocomputing (Amsterdam), 651, 130848
Feature selection aims to improve predictive performance and interpretability in the analysis of datasets with high-dimensional feature spaces. An imbalanced class distribution can make feature selection more difficult, and robust methodologies are essential for dealing with this case. Therefore, we present a filter method based on ensemble learning, in which each classifier is built on randomly selected subspaces of features. A variable importance measure is computed by a class-wise procedure within each classifier, and a feature weighting procedure is subsequently applied. The performance of the classifiers is taken into account in the combination phase of the ensemble learning. Different choices of hyperparameters, consisting of the subspace size and the number of classification trees, are investigated through simulation studies to determine their effects on predictive performance. The efficiency of the proposed method is evaluated with respect to predictive performance under different selection strategies, based on real data analysis in the presence of class imbalance.
Journal article
The effects of climatic and soil properties on soil water repellency
Published 2025
Catena (Giessen), 258, 109218
Soil water repellency (SWR) is a major agro-ecological soil management issue caused by hydrophobic organic compounds that hinder soil water absorption and affect soil function. Recent modelling studies indicate that climate change will increase the severity of SWR, compounding these effects. This study investigated the effects of climatic and soil factors on SWR in surface (0–10 cm) soils from 355 sites under uniform land-use across an area of 60,000 km² in south-western Australia, a region with a Mediterranean climate. There were marked gradients in mean minimum temperature (Meanmin, 7.7–12.2 °C), mean maximum temperature (Meanmax, 19.0–22.9 °C), rainfall (507–1443 mm/year) and pan evaporation (Evap, 1169–1772 mm/year) across the sites. SWR was measured in the laboratory on oven-dried samples using the ethanol droplet test. Boosted regression tree analysis showed that 10 soil variables explained 78 % of the variance in SWR, with clay, silt and OC contents the main contributors. Incorporating the four climatic variables explained 84 % of the variance of SWR, with Meanmax the major contributing factor. Thus, while soil properties dominated the expression of SWR, climate had a secondary impact. Meanmax, however, was inversely related to SWR, suggesting that rising temperatures due to climate change could result in a reduction in SWR. Furthermore, given the strong relationship between SWR and OC content, climate mitigation projects aimed at enhancing soil OC storage may inadvertently increase the expression and severity of SWR. Recognition of this should be included in soil carbon mitigation project protocols.
Journal article
An intelligent maintenance policy for a latent degradation system
Published 2024
Reliability engineering & system safety, 242, 109739
This paper looks at the challenge of making maintenance decisions for deteriorating systems when the degradation process leading to failure cannot be directly observed or measured. In this scenario, the system’s health is monitored by observing the progression of a degradation-related marker index, which can be obtained through inspections. To model this configuration, a bivariate gamma process is employed. One component represents the marker process, while the other represents the degradation process, which dictates the time of failure. Two condition-based maintenance (CBM) policies are proposed and analyzed. The first policy is based on a conventional decision structure, utilizing a fixed preventive threshold directly applied to the measured process. The second policy relies on monitoring data related to the marker process to estimate the level of latent degradation at inspections. We demonstrate that the second policy is equivalent to a policy employing an adaptive preventive threshold that sequentially evolves. We provide insights into some key properties associated with this approach. The expected cost rate is calculated and employed for policy optimization. Additionally, a numerical study is presented that showcases the practical implementation of the method and highlights the effectiveness of the second approach, even when the correlation between degradation and the marker process is low.
Journal article
Mixture cure model methodology in survival analysis: Some recent results for the one-sample case
Published 2024
Statistics surveys, 18, 82 - 138
The mixture cure model in survival analysis has received large and growing attention in the last few decades. Restricting ourselves mainly to the one-sample case, we present here an overview drawing together some recent significant advances and earlier results, and pointing out areas where further work is needed. New results presented include a discussion of testing for the presence of long-term survivors in the null case (when there are no cures present), the probability that an individual is cured (when cures are present), and further analysis of the idea of sufficient follow-up. Extreme value methods play a key role. We draw attention to some challenging open problems.
Preprint
A Gibbs Sampling Scheme for a Generalised Poisson-Kingman Class
Posted to a preprint site 2024
ArXiv.org
A Bayesian nonparametric method of James, Lijoi & Prünster (2009), used to predict future values of observations from normalized random measures with independent increments, is modified to a class of models based on negative binomial processes for which the increments are not independent, but are independent conditional on an underlying gamma variable. As in James et al., the new algorithm is formulated in terms of two variables, one a function of the past observations, and the other an updating by means of a new observation. We outline an application of the procedure to population genetics, for the construction of realisations of genealogical trees and coalescents from samples of alleles.
Journal article
Learning from high dimensional data based on weighted feature importance in decision tree ensembles
Published 2024
Computational statistics, 39, 1, 313 - 342
Learning from high-dimensional data has been utilized in various applications such as computational biology, image classification, and finance. Most classical machine learning algorithms fail to give accurate predictions in high-dimensional settings due to the enormous feature space. In this article, we present a novel ensemble of classification trees based on weighted random subspaces that aims to adjust the distribution of selection probabilities. In the proposed algorithm, base classifiers are built on random feature subspaces in which the probability that influential features will be selected for the next subspace is updated by incorporating grouping information based on previous classifiers through a weighting function. As an interpretation tool, we show that variable importance measures computed by the new method can identify influential features efficiently. We provide theoretical reasoning for the different elements of the proposed method, and we evaluate the usefulness of the new method based on simulation studies and real data analysis.
Journal article
Asymptotics of the allele frequency spectrum and the number of alleles
Published 2024
Journal of applied probability, Early View
We derive large-sample and other limiting distributions of components of the allele frequency spectrum vector, $\mathbf{M}_n$, joint with the number of alleles, $K_n$, from a sample of $n$ genes. Models analysed include those constructed from gamma and $\alpha$-stable subordinators by Kingman (thus including the Ewens model), the two-parameter extension by Pitman and Yor, and a two-parameter version constructed by omitting large jumps from an $\alpha$-stable subordinator. In each case the limiting distribution of a finite number of components of $\mathbf{M}_n$ is derived, joint with $K_n$. New results include that in the Poisson–Dirichlet case, $\mathbf{M}_n$ and $K_n$ are asymptotically independent after centering and norming for $K_n$, and it is notable, especially for statistical applications, that in other cases the limiting distribution of a finite number of components of $\mathbf{M}_n$, after centering and an unusual $n^{\alpha/2}$ norming, conditional on that of $K_n$, is normal.
Journal article
Fluid biomarkers in cerebral amyloid angiopathy
Published 2024
Frontiers in neuroscience, 18, 1347320
Cerebral amyloid angiopathy (CAA) is a type of cerebrovascular disorder characterised by the accumulation of amyloid within the leptomeninges and small/medium-sized cerebral blood vessels. Typically, cerebral haemorrhages are one of the first clinical manifestations of CAA, posing a considerable challenge to the timely diagnosis of CAA as the bleedings only occur during the later disease stages. Fluid biomarkers may change prior to imaging biomarkers, and therefore, they could be the future of CAA diagnosis. Additionally, they can be used as primary outcome markers in prospective clinical trials. Among fluid biomarkers, blood-based biomarkers offer a distinct advantage over cerebrospinal fluid biomarkers as they do not require a procedure as invasive as a lumbar puncture. This article aimed to provide an overview of the present clinical data concerning fluid biomarkers associated with CAA and point out the direction of future studies. Among all the biomarkers discussed, amyloid β, neurofilament light chain, matrix metalloproteinases, complement 3, uric acid, and lactadherin demonstrated the most promising evidence. However, the field of fluid biomarkers for CAA is an under-researched area, and in most cases, there are only one or two studies on each of the biomarkers mentioned in this review. Additionally, a small sample size is a common limitation of the discussed studies. Hence, it is hard to reach a solid conclusion on the clinical significance of each biomarker at different stages of the disease or in various subpopulations of CAA. In order to overcome this issue, larger longitudinal and multicentered studies are needed.
Journal article
Prediction of protein aggregation propensity employing SqFt-based logistic regression model
Published 2023
International journal of biological macromolecules, 249, 126036
Here we present a novel machine-learning approach, based on logistic regression (LR), to predict protein aggregation propensity (PAP), which is a key factor in the formation of amyloid fibrils. Amyloid fibrils are associated with various neurodegenerative diseases (ND) such as Alzheimer's disease (AD) and Parkinson's disease (PD), which are caused by oxidative stress and impaired protein homeostasis. Accordingly, the paper uses a dataset of hexapeptides with known aggregation tendencies and eight physicochemical features to train and test the LR model. It also evaluates the performance of the LR model using the F-measure and the Matthews correlation coefficient (MCC) as metrics and compares it with other existing methods. Moreover, it investigates the effect of combining sequence and feature information in the prediction. In conclusion, the LR model with sequence and feature information achieves a high F-measure (0.841) and MCC (0.6692), outperforming other methods and demonstrating its efficiency and reliability for PAP prediction. In addition, the overall performance of the proposed method was higher than that of other known servers, for instance, Aggrescan, Metamyl, Foldamyloid, and PASTA 2.0. The LR model can be accessed at: https://github.com/KatherineEshari/Protein-aggregation-prediction.
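For readers unfamiliar with the two metrics reported in this abstract, the standard definitions of the F-measure (F1) and the Matthews correlation coefficient can be written directly from confusion-matrix counts. The sketch below is a generic illustration only: the function names and the example counts are assumptions, not values or code from the paper or its repository.

```python
import math


def f_measure(tp, fp, fn):
    """F1 score: harmonic mean of precision and recall, from raw counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0 when a margin is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical confusion-matrix counts, chosen only to exercise the formulas.
    tp, tn, fp, fn = 40, 45, 5, 10
    print(f"F-measure: {f_measure(tp, fp, fn):.4f}")
    print(f"MCC:       {mcc(tp, tn, fp, fn):.4f}")
```

Unlike the F-measure, MCC also uses the true-negative count, which is one reason the two are often reported together for binary classifiers.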